GPTQ Quantization in Keras
Author: Jyotinder Singh
Date created: 2025/10/16
Last modified: 2025/10/16
Description: How to run weight-only GPTQ quantization for Keras & KerasHub models.
What is GPTQ?
GPTQ ("Generative Pre-Training Quantization") is a post-training, weight-only quantization method that uses a second-order approximation of the loss (via a Hessian estimate) to minimize the error introduced when compressing weights to lower precision, typically 4-bit integers.
Unlike standard post-training techniques, GPTQ keeps activations in higher precision and only quantizes the weights. This often preserves model quality at low bit widths while still providing large storage and memory savings.
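As a rough sketch of what this means (notation loosely following the GPTQ paper; W is a layer's weight matrix and X its calibration inputs), GPTQ quantizes each layer by approximately solving

```latex
\hat{W} = \arg\min_{\hat{W}} \left\lVert W X - \hat{W} X \right\rVert_2^2,
\qquad H = 2 X X^{\top}
```

where H is the Hessian of this per-layer reconstruction objective, estimated from the calibration data; the error introduced by quantizing each group of weights is redistributed across the not-yet-quantized weights using H.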
Keras supports GPTQ quantization for KerasHub models via the keras.quantizers.GPTQConfig class.
Load a KerasHub model
This guide uses the Gemma3CausalLM model from KerasHub, a small (1B parameter) causal language model.
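For reference, a minimal loading sketch is shown below; the gemma3_1b preset name matches the one used in the benchmarks later in this guide, and any other KerasHub causal LM preset can be substituted.

```python
import keras_hub

# Load the 1B-parameter Gemma 3 causal LM from its KerasHub preset.
model = keras_hub.models.Gemma3CausalLM.from_preset("gemma3_1b")
```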
Configure & run GPTQ quantization
You can configure GPTQ quantization via the keras.quantizers.GPTQConfig class.
The GPTQ configuration requires a calibration dataset and tokenizer, which it uses to estimate the Hessian and quantization error. Here, we use a small slice of the WikiText-2 dataset for calibration.
You can tune several parameters to trade off speed, memory, and accuracy. The most important of these are weight_bits (the bit-width to quantize weights to) and group_size (the number of consecutive weights that share a quantization scale). The group size controls the granularity of quantization: smaller groups typically yield better accuracy but are slower to quantize and may use more memory. A good starting point is group_size=128 for 4-bit quantization (weight_bits=4).
In this example, we first prepare a tiny calibration set, and then run GPTQ on the model using the .quantize(...) API.
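The sketch below illustrates both steps under a few assumptions: the WikiText-2 slice is fetched with the Hugging Face datasets library (any source of representative text works), the dataset and tokenizer keyword names and the config= argument to .quantize(...) are illustrative, and only the weight_bits and group_size knobs discussed above are set explicitly.

```python
import keras
from datasets import load_dataset  # assumption: calibration text via Hugging Face `datasets`

# 1) Prepare a tiny calibration set: a small slice of WikiText-2 raw text.
#    (Illustrative only; use a larger, representative sample in practice.)
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calibration_texts = [t for t in wikitext["text"] if t.strip()][:128]

# 2) Configure GPTQ: 4-bit weights, groups of 128 weights per quantization scale.
#    The `dataset` / `tokenizer` keyword names are assumptions; `weight_bits`
#    and `group_size` are the parameters discussed above.
gptq_config = keras.quantizers.GPTQConfig(
    dataset=calibration_texts,
    tokenizer=model.preprocessor.tokenizer,
    weight_bits=4,
    group_size=128,
)

# 3) Run weight-only GPTQ quantization in place via the `.quantize(...)` API
#    (the `config=` keyword name is an assumption).
model.quantize("gptq", config=gptq_config)
```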
Model Export
The GPTQ quantized model can be saved to a preset and reloaded elsewhere, just like any other KerasHub model.
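For instance (the directory name here is arbitrary; save_to_preset and from_preset are the standard KerasHub preset APIs):

```python
# Save the GPTQ-quantized model as a KerasHub preset directory.
model.save_to_preset("./gemma3_1b_gptq_w4")

# Reload it elsewhere, exactly like any other KerasHub preset.
restored = keras_hub.models.Gemma3CausalLM.from_preset("./gemma3_1b_gptq_w4")
```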
Performance & Benchmarking
Micro-benchmarks were collected on a single NVIDIA RTX 4070 Ti Super (16 GB). Baselines are FP32.
Dataset: WikiText-2.
| Model (preset) | Perplexity Increase % (↓ better) | Disk Storage Δ % (↓ better) | VRAM Δ % (↓ better) | First-token Latency Δ % (↓ better) | Throughput Δ % (↑ better) |
|---|---|---|---|---|---|
| GPT2 (gpt2_base_en_cnn_dailymail) | 1.0% | -50.1% ↓ | -41.1% ↓ | +0.7% ↑ | +20.1% ↑ |
| OPT (opt_125m_en) | 10.0% | -49.8% ↓ | -47.0% ↓ | +6.7% ↑ | -15.7% ↓ |
| Bloom (bloom_1.1b_multi) | 7.0% | -47.0% ↓ | -54.0% ↓ | +1.8% ↑ | -15.7% ↓ |
| Gemma3 (gemma3_1b) | 3.0% | -51.5% ↓ | -51.8% ↓ | +39.5% ↑ | +5.7% ↑ |
Detailed benchmarking numbers and scripts are available here.
Analysis
There is a notable reduction in disk and VRAM usage across all models, with disk savings of around 50% and VRAM savings ranging from 41% to 54%. The reported disk savings understate the true weight compression because presets also include non-weight assets.
Perplexity increases only marginally, indicating model quality is largely preserved after quantization.
Practical tips
GPTQ is a post-training technique; training after quantization is not supported.
Always use the model's own tokenizer for calibration.
Use a representative calibration set; small slices are only for demos.
Start with 4-bit weights (weight_bits=4) and group_size=128; tune per model and task.