Adam • RMSProp • SGD • Batch Normalization • Dropout • Learning Rate Scheduling • Weight Initialization
Training a deep learning model is like teaching a student.
If the teaching method is slow or confusing, the student learns poorly.
Optimization techniques are "smart training methods" that help neural networks learn faster, more accurately, and more efficiently.
In this chapter, you will learn the most important optimization concepts used in modern deep learning: Adam, RMSProp, SGD, Batch Normalization, Dropout, Learning Rate Scheduling, and Weight Initialization.
1. Optimizers: Adam, RMSProp, and SGD
Optimizers are mathematical tools that update a model's weights during training.
They adjust the weights step by step to reduce the loss, which in turn improves accuracy.
A. SGD (Stochastic Gradient Descent)
SGD updates weights a little at a time using small batches of data.
Simple Explanation
Imagine climbing down a hill by taking small steps.
SGD takes one step at a time, checking direction after each step.
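One step of plain SGD simply moves every weight a small amount against its gradient. The lines below are a rough NumPy sketch of a single step; the array values are made up for illustration.
import numpy as np
learning_rate = 0.01
weights = np.array([0.5, -0.3])               # current weights (illustrative)
gradient = np.array([0.2, -0.1])              # gradient of the loss for this mini-batch
weights = weights - learning_rate * gradient  # one SGD step: move against the gradient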
Advantages
- Simple
- Works well for small problems
Disadvantages
- Slow to converge compared with adaptive optimizers
- Can get stuck in flat regions or oscillate without momentum
- Uses one learning rate for every weight, which makes large deep networks harder to tune
Code Example
import tensorflow as tf
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
B. RMSProp
RMSProp adjusts the learning rate automatically for each weight, making training smoother.
Simple Explanation
RMSProp keeps a running average of each weight's squared gradients.
Weights whose gradients have been large and unstable take smaller steps, while weights with small, stable gradients take relatively larger steps.
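The lines below sketch one RMSProp-style update on made-up NumPy arrays; the variable names are illustrative, not part of any library API.
import numpy as np
lr, rho, eps = 0.001, 0.9, 1e-7
weights = np.array([0.5, -0.3])                      # illustrative weights
gradient = np.array([0.2, -0.1])                     # gradient for this mini-batch
avg_sq = np.zeros_like(weights)                      # running average of squared gradients
avg_sq = rho * avg_sq + (1 - rho) * gradient**2      # update the running average
weights -= lr * gradient / (np.sqrt(avg_sq) + eps)   # larger average -> smaller step for that weight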
Advantages
- Faster learning
- Great for RNNs and sequential data
Code Example
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
C. Adam (Most Popular Optimizer)
Adam is the most widely used optimizer in deep learning.
Simple Explanation
Adam combines momentum (as in SGD with momentum) with RMSProp's per-weight adaptive learning rates (a minimal sketch of the update follows this list). In practice, Adam:
- Learns fast
- Adapts to different types of data
- Works well on almost every task
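A rough NumPy sketch of a single Adam update with made-up values: m plays the role of momentum, v plays the role of RMSProp's squared-gradient average, and both are bias-corrected before the step.
import numpy as np
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-7
weights = np.array([0.5, -0.3])
gradient = np.array([0.2, -0.1])
m = np.zeros_like(weights)                       # first moment (momentum)
v = np.zeros_like(weights)                       # second moment (squared gradients)
t = 1                                            # step counter
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient**2
m_hat = m / (1 - beta1**t)                       # bias correction for early steps
v_hat = v / (1 - beta2**t)
weights -= lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive, momentum-smoothed step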
Advantages
- Converges quickly with little tuning
- Usually reaches strong accuracy with default settings
- Works well across tasks: vision, NLP, audio
Code Example
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
2. Batch Normalization
Batch Normalization (BatchNorm) is a technique that stabilizes and speeds up training.
Simple Explanation
When students learn, their performance improves if the environment is stable (good lighting, no noise).
BatchNorm creates a "stable environment" for a neural network by normalizing the values flowing between layers so they keep a controlled mean and variance.
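Concretely, BatchNorm normalizes each feature across the current batch to zero mean and unit variance, then rescales it with learnable parameters. A rough NumPy sketch on a made-up batch:
import numpy as np
batch = np.array([[1.0, 200.0],
                  [2.0, 220.0],
                  [3.0, 180.0]])                    # 3 samples, 2 features on very different scales
mean = batch.mean(axis=0)
var = batch.var(axis=0)
normalized = (batch - mean) / np.sqrt(var + 1e-5)   # each feature: zero mean, unit variance
gamma, beta = 1.0, 0.0                              # learnable scale and shift (per feature in a real layer)
output = gamma * normalized + beta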
Benefits of BatchNorm
- Faster training
- Higher accuracy
- Mild regularization effect (slightly reduces overfitting)
- Makes deep networks easier to train
Code Example
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization
model = Sequential()
model.add(Dense(64, activation='relu'))
model.add(BatchNormalization())  # normalize this layer's outputs before they reach the next layer
3. Dropout
Dropout is a technique used to reduce overfitting.
Simple Explanation
During training, Dropout randomly "turns off" some neurons temporarily.
This forces the model to learn in different ways and not rely too much on a few neurons.
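A rough NumPy sketch of the idea (inverted dropout, as used by Keras): a random mask temporarily zeroes some activations, and the survivors are scaled up so the expected output stays the same.
import numpy as np
rate = 0.5                                       # fraction of neurons to drop
activations = np.array([0.8, 0.1, 0.5, 0.9])     # made-up outputs of a layer
mask = np.random.rand(activations.size) >= rate  # keep roughly half of the neurons
dropped = activations * mask / (1 - rate)        # scale survivors to preserve the expected value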
Real Example
If a student relies too much on one friend for answers, he learns nothing.
Dropout prevents the network from "cheating" by relying on certain neurons.
Benefits
- Prevents overfitting
- Improves generalization
- Good for large models
Code Example
from tensorflow.keras.layers import Dense, Dropout
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))  # randomly drop 50% of this layer's outputs during training
4. Learning Rate Scheduling
Learning Rate Scheduling adjusts the learning rate during training.
Simple Explanation
- Start with a higher learning rate to learn quickly
- Later reduce it to fine-tune the learning
This is like learning a new skill:
- First, learn broadly
- Later, focus on details
Popular Learning Rate Schedules
- Step decay
- Exponential decay
- Reduce on plateau (lowers the LR when a monitored metric, such as validation loss, stops improving; see the sketch after this list)
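The "reduce on plateau" schedule is available as a built-in Keras callback. The sketch below halves the learning rate whenever validation loss has not improved for three epochs; the monitored metric, factor, and patience are example choices.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',   # watch validation loss
    factor=0.5,           # halve the learning rate when it plateaus
    patience=3,           # wait 3 epochs without improvement first
    min_lr=1e-6           # never go below this value
)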
Code Example
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 0.001 * 0.95 ** epoch  # shrink the LR by 5% each epoch (exponential decay)
)
5. Weight Initialization
Weight Initialization means setting the starting values of weights before training begins.
Why Initialization Matters
Good initialization:
- Helps the model learn faster
- Prevents vanishing or exploding values
- Improves accuracy
Poor initialization:
- Makes the model learn slowly
- Causes the model to get stuck
Popular Initializers
- Random Normal
- Xavier (Glorot) Initialization (good for sigmoid/tanh)
- He Initialization (best for ReLU networks; see the sketch after this list)
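He initialization draws starting weights with standard deviation sqrt(2 / fan_in), where fan_in is the number of inputs feeding the layer; this keeps signal magnitudes roughly steady through ReLU layers. The lines below are a rough NumPy sketch of that rule, with made-up layer sizes.
import numpy as np
fan_in, fan_out = 128, 64                          # example layer: 128 inputs, 64 units
std = np.sqrt(2.0 / fan_in)                        # He standard deviation
weights = np.random.randn(fan_in, fan_out) * std   # starting weight matrix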
Code Example
from tensorflow.keras.initializers import HeNormal
model.add(Dense(64, activation='relu', kernel_initializer=HeNormal()))
6. Putting All Techniques Together
A modern deep learning model often uses:
- Adam optimizer for fast learning
- BatchNorm for stable training
- Dropout to prevent overfitting
- LR scheduling to fine-tune learning
- He initialization to start correctly
These tools help build strong and reliable neural networks.
7. Complete Example Model Using All Techniques
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
model = Sequential([
    # The input shape is an example assumption (e.g., 784 features for flattened 28x28 images);
    # it is needed so that model.summary() can report layer shapes.
    Dense(128, activation='relu', kernel_initializer='he_normal', input_shape=(784,)),
    BatchNormalization(),
    Dropout(0.3),
    Dense(64, activation='relu', kernel_initializer='he_normal'),
    BatchNormalization(),
    Dropout(0.3),
    Dense(10, activation='softmax')
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
model.summary()  # summary() prints the architecture itself, so wrapping it in print() is unnecessary
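To tie the learning rate schedule from section 4 into training, the callback is passed to model.fit. The snippet below is a minimal sketch that assumes hypothetical training arrays X_train and y_train; the array names and training settings are placeholders, not part of any real dataset.
# X_train and y_train are hypothetical placeholders: inputs shaped (num_samples, 784)
# and integer class labels from 0 to 9.
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 0.001 * 0.95 ** epoch
)
model.fit(
    X_train, y_train,
    epochs=20,
    batch_size=32,
    validation_split=0.1,      # hold out 10% of the data to monitor progress
    callbacks=[lr_schedule]    # apply the learning rate schedule each epoch
)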