Quantization • Pruning • Knowledge Distillation • Distributed Training • TPU/GPU Optimization
Deep learning models can be very large and slow. They often need a lot of memory, strong GPUs, and long training time.
To make models faster and lighter, we use model optimization and scaling techniques.
These techniques help us run models on smaller devices like mobile phones, laptops, drones, robots, and even tiny microcontrollers.
In this chapter, you will learn five important techniques:
- Quantization
- Pruning
- Knowledge Distillation
- Distributed Training
- TPU/GPU Optimization
All concepts are explained in simple, clear English with real-world examples and code.
1. Quantization
Quantization reduces the size of a deep learning model by converting high-precision numbers (float32) into lower-precision numbers (int8 or float16).
Why It Helps
- Model becomes much smaller
- Runs faster
- Uses less energy
- Works well on mobile devices
Real Example
A 100 MB model can shrink to 25 MB after quantization, making it run smoothly on smartphones.
Simple Explanation
Instead of storing every weight as a 32-bit float like:
0.123456789 (float32)
Quantization stores a lower-precision version, for example:
0.12 (float16), or a small integer together with a shared scale factor (int8)
The model becomes lighter but still works well.
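As a rough sketch of what int8 quantization does under the hood (the exact scheme varies between toolkits), each float is replaced by a small integer plus a shared scale factor. The numbers below are made up purely for illustration:
import numpy as np
# Hypothetical float32 weights from some layer.
weights = np.array([0.1234, -0.56, 0.9, -1.2], dtype=np.float32)
# Choose a scale so the largest weight maps to the edge of the int8 range [-127, 127].
scale = np.max(np.abs(weights)) / 127.0
quantized = np.round(weights / scale).astype(np.int8)    # stored as 8-bit integers
dequantized = quantized.astype(np.float32) * scale       # recovered at inference time
print(quantized)      # [  13  -59   95 -127]
print(dequantized)    # close to the original values, at a quarter of the storage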
Code Example (TensorFlow Lite Quantization)
import tensorflow as tf
# Load the saved model and enable the default post-training quantization.
converter = tf.lite.TFLiteConverter.from_saved_model("my_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
# Write the quantized model to disk.
with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
2. Pruning
Pruning removes unnecessary neurons or connections from a model.
Simple Explanation
Think of pruning like trimming extra branches from a tree.
Removing weak or unused neurons makes the model faster.
Why It Helps
- Reduces model size
- Improves speed
- Reduces overfitting
Real Example
An image classifier with 5 million parameters may only need 3 million.
Pruning removes the extra ones.
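Before the library API, here is a toy sketch of the idea behind magnitude pruning: the weights with the smallest absolute values are set to zero. The matrix is invented purely for illustration.
import numpy as np
# A made-up 2x3 weight matrix.
w = np.array([[0.8, -0.05, 0.3],
              [0.01, -0.9, 0.02]])
# Threshold at the median magnitude, so roughly half the weights are removed.
threshold = np.percentile(np.abs(w), 50)
pruned = np.where(np.abs(w) >= threshold, w, 0.0)
print(pruned)                 # small weights become exact zeros
print((pruned == 0).mean())   # fraction of weights pruned: 0.5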
Code Example (TensorFlow Model Pruning)
import tensorflow_model_optimization as tfmot
# Wrap the model so that low-magnitude weights are gradually zeroed out during training.
prune = tfmot.sparsity.keras.prune_low_magnitude
pruned_model = prune(
    original_model,  # your existing Keras model
    pruning_schedule=tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0,   # start with no weights removed
        final_sparsity=0.5,     # end with 50% of the weights set to zero
        begin_step=0,
        end_step=1000
    )
)
# Train pruned_model with the tfmot.sparsity.keras.UpdatePruningStep() callback, then call
# tfmot.sparsity.keras.strip_pruning(pruned_model) to export the final small model.
3. Knowledge Distillation
Knowledge Distillation transfers knowledge from a large model (teacher model) into a smaller model (student model).
Simple Explanation
Imagine a smart teacher teaching a younger student.
The student learns the important patterns but uses fewer resources.
Why It Helps
- Creates compact, fast models
- Small model performs almost like the big one
Real Example
Large models like BERT can be distilled into TinyBERT or DistilBERT, which are faster and lighter.
Code Example (Basic Distillation Logic)
teacher_output = teacher_model(input_data)     # predictions from the large teacher
student_output = student_model(input_data)     # predictions from the small student
loss = 0.5 * mse(teacher_output, student_output) + 0.5 * ce(labels, student_output)
This combines teacher knowledge with the true training labels (here mse stands for a mean squared error loss and ce for a cross-entropy loss).
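A slightly fuller sketch of the same idea, assuming teacher_model and student_model are Keras models that output raw logits; the temperature and alpha values are common illustrative choices, not fixed rules:
import tensorflow as tf
def distillation_loss(labels, teacher_logits, student_logits,
                      temperature=3.0, alpha=0.5):
    # Soft targets: the student mimics the teacher's softened probability distribution.
    soft_loss = tf.keras.losses.KLDivergence()(
        tf.nn.softmax(teacher_logits / temperature),
        tf.nn.softmax(student_logits / temperature)
    )
    # Hard targets: the student still learns from the true labels.
    hard_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(
        labels, student_logits
    )
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
In a custom training loop, the teacher's weights stay frozen and only the student is updated with this loss.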
4. Distributed Training
Distributed training means training a model across multiple GPUs, multiple TPUs, or even multiple computers.
This makes training much faster.
Simple Explanation
Training a huge deep learning model on one computer can take days.
Using 8 GPUs splits the work into parts—like 8 students solving a big assignment together.
Where It's Used
- Big companies (Google, Meta, OpenAI)
- Large language models (LLMs)
- Vision models with millions of images
Code Example (TensorFlow Multi-GPU Training)
import tensorflow as tf
# MirroredStrategy creates one model replica per visible GPU and keeps them in sync.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()   # build_model() is your own model-construction function
    model.compile(optimizer='adam', loss='categorical_crossentropy')
After this, calling model.fit(...) trains the model across all available GPUs automatically.
5. TPU/GPU Optimization
Modern AI training uses specialized hardware like:
- GPU (Graphics Processing Unit)
- TPU (Tensor Processing Unit)
Why They Help
- Run massive matrix operations quickly
- Increase model training speed
- Reduce training cost
GPU Optimization Techniques
- Use mixed precision (float16 + float32)
- Use batched operations
- Keep data flowing into GPU memory (a small tf.data sketch follows this list)
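A small sketch of the batching and data-feeding points above, using tf.data; the tensors here are random placeholders standing in for a real dataset:
import tensorflow as tf
# Placeholder data standing in for real images and labels.
images = tf.random.uniform((1024, 224, 224, 3))
labels = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .batch(64)                    # process whole batches instead of single samples
    .prefetch(tf.data.AUTOTUNE)   # prepare the next batch while the GPU is still busy
)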
TPU Optimization Techniques
- Use fixed batch sizes
- Use the XLA compiler (a small sketch follows this list)
- Structure the model in parallel blocks
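A minimal sketch of the XLA point: in recent TensorFlow versions, jit_compile=True asks TensorFlow to compile the function with XLA, which fuses operations and is also how TPU programs are typically compiled. The function below is just a placeholder computation.
import tensorflow as tf
@tf.function(jit_compile=True)   # compile this computation with XLA
def fused_step(x, w):
    return tf.nn.relu(tf.matmul(x, w))
x = tf.random.uniform((128, 256))
w = tf.random.uniform((256, 64))
out = fused_step(x, w)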
Code Example (Mixed Precision for GPU)
from tensorflow.keras import mixed_precision
# Compute in float16 while keeping model variables in float32 for numerical stability.
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)
This improves speed on modern GPUs; on TPUs the equivalent policy is 'mixed_bfloat16'.
6. Combining All Techniques
A modern deep learning workflow usually includes:
| Technique | Purpose |
|---|---|
| Quantization | Make the model small |
| Pruning | Remove unnecessary weights |
| Distillation | Transfer knowledge from big → small |
| Distributed Training | Train large models quickly |
| GPU/TPU Optimization | Speed up computation |
Together, these techniques make AI models fast, affordable, and deployable on real devices.
7. Mini Practice Codes
A. Save a compressed model
model.save("compressed_model.h5")
B. Convert to TensorFlow Lite
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()
C. Enable GPU training
print("GPU Available:", tf.config.list_physical_devices("GPU"))