Self-Attention Mechanism • Encoder-Decoder Architecture • BERT • GPT • Vision Transformers (ViT)
Transformers are one of the most important inventions in modern artificial intelligence.
They power today's most advanced systems including ChatGPT, Google Translate, BERT, GPT-4, Vision Transformers, and many others.
Transformers are designed to understand relationships in data, especially in sequences like text, audio, or even image patches.
This chapter explains how they work in simple, easy-to-understand language.
1. What Are Transformers?
Transformers are deep learning models that use a mechanism called self-attention to understand which parts of the input are important.
Unlike older models such as RNNs and LSTMs:
- Transformers do not read data one step at a time
- They look at the entire sequence at once
- They focus attention on the most important words or parts
This makes Transformers fast to train on parallel hardware and good at capturing relationships between distant parts of the input.
2. Self-Attention Mechanism
Self-attention is the heart of a transformer.
Simple Explanation
Imagine you are reading this sentence:
"The cat sat on the mat because it was tired."
To understand the word "it", your brain looks back and realizes "it" refers to the cat.
Transformers do the same using self-attention.
They find connections between words, no matter how far apart they are.
Why Self-Attention Is Powerful
- It can understand long sentences
- It can focus on important words
- It works in parallel → makes training very fast
Analogy
Think of a group project where everyone can instantly talk to everyone else.
This is how self-attention works — every word talks to every other word.
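To make the "every word talks to every other word" idea concrete, here is a small sketch in plain Python of how raw relevance scores become attention weights. The scores below are made-up numbers chosen for illustration; in a real transformer they come from dot products of learned vectors.

```python
import math

# Hypothetical relevance scores for the query word "it" against earlier words.
# These values are invented for illustration, not produced by a trained model.
words = ["The", "cat", "sat", "on", "the", "mat"]
scores = [0.1, 2.0, 0.3, 0.0, 0.1, 1.0]

def softmax(values):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax(scores)
for word, weight in zip(words, weights):
    print(f"{word:>4}: {weight:.2f}")

# The word with the largest weight contributes most to the meaning of "it".
most_attended = words[weights.index(max(weights))]
print("'it' attends most to:", most_attended)
```

The softmax step is what lets the model "focus": a higher score gets a disproportionately larger share of the attention.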
✅ Minimal Self-Attention Code (PyTorch)
import torch
import torch.nn.functional as F
# Example: self-attention for a sequence of 4 tokens
x = torch.rand(4, 8) # 4 tokens, each with 8 features
# Weight matrices
W_q = torch.rand(8, 8)
W_k = torch.rand(8, 8)
W_v = torch.rand(8, 8)
# Compute Q, K, V
Q = x @ W_q
K = x @ W_k
V = x @ W_v
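# Attention scores, scaled by sqrt(d_k) as in standard scaled dot-product attention
d_k = K.shape[-1]
att_scores = (Q @ K.T) / d_k ** 0.5
att_weights = F.softmax(att_scores, dim=-1)  # each row sums to 1
# Output: for each token, a weighted mix of the value vectors
output = att_weights @ V
print(output.shape)  # torch.Size([4, 8])
3. Encoder–Decoder Architecture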
Transformers originally used two parts:
A. Encoder
- Reads input
- Understands meaning
- Generates features
Used in BERT, Vision Transformers, etc.
B. Decoder
- Takes encoder output
- Generates predictions (words, images, etc.)
Used in translation models. GPT is a decoder-only variant: it has no encoder and works from the prompt plus its own previous output.
Example: English → French Translation
| Encoder | Decoder |
|---|---|
| Reads English sentence | Generates French sentence |
The encoder learns what the sentence means.
The decoder learns how to express it in the target language.
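The data flow between the two parts can be sketched with a toy example. This is not a real model: the lookup table, `encode`, and `decode` are hypothetical stand-ins, and a real decoder attends over all encoder outputs instead of reading one position at a time.

```python
# Toy English-to-French table standing in for learned weights (hypothetical).
TOY_DICTIONARY = {"the": "le", "cat": "chat", "sleeps": "dort"}

def encode(sentence):
    """Encoder: read the whole input and produce one feature per token."""
    return [word.lower() for word in sentence.split()]

def decode(memory):
    """Decoder: emit one output token at a time, consulting the encoder output."""
    generated = []
    while len(generated) < len(memory):
        source_feature = memory[len(generated)]  # look at the encoder output
        generated.append(TOY_DICTIONARY.get(source_feature, "<unk>"))
    return " ".join(generated)

memory = encode("The cat sleeps")   # encoder runs once over the input
translation = decode(memory)        # decoder generates step by step
print(translation)                  # le chat dort
```

The point to notice is the division of labor: the encoder runs once over the full input, while the decoder produces its output incrementally.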
4. BERT (Bidirectional Encoder Representations from Transformers)
BERT is a popular encoder-only transformer.
Why It's Special
- Reads text from both directions (left and right)
- Understands deep context
- Excellent for understanding tasks
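"Both directions" can be pictured as an attention mask. The sketch below, in plain Python, contrasts the encoder-style pattern, where every token sees every token, with the left-to-right (causal) pattern used by decoder-style models. The helper names are illustrative and not part of any library.

```python
def bidirectional_mask(n):
    """Encoder-style (BERT): every token may attend to every token."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style: token i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def show(mask):
    """Print 'x' where attention is allowed and '.' where it is blocked."""
    for row in mask:
        print(" ".join("x" if allowed else "." for allowed in row))

tokens = ["The", "bank", "raised", "rates"]
print("Bidirectional:")
show(bidirectional_mask(len(tokens)))
print("Causal:")
show(causal_mask(len(tokens)))
```

In the bidirectional grid, the row for "bank" can see "raised" and "rates", which is the context that helps it settle on the financial sense of the word.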
Used For
- Text classification
- Sentiment analysis
- Question answering
- Named Entity Recognition (NER)
Example
If the text is: "The bank raised interest rates."
BERT understands "bank" = financial bank.
✅ Mini BERT-like Usage Example (HuggingFace)
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Deep learning is amazing!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
5. GPT (Generative Pre-trained Transformer)
GPT models are decoder-only transformers.
Why GPT Is Special
- Generates text one token (roughly a word or word piece) at a time
- Learns from huge amounts of data
- Can write essays, stories, code, and more
Uses
- Chatbots
- Code generation
- Creative writing
- Answering questions
How GPT Works
- Reads the prompt
- Predicts the next word
- Repeats until completion
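The three steps above form a loop, sketched here in plain Python. The `NEXT_WORD` table and `predict_next` are hypothetical stand-ins for a trained model, which would score every word in its vocabulary using the whole context, not only the last word.

```python
# Hypothetical next-word table standing in for a trained model.
NEXT_WORD = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": "<end>",
}

def predict_next(tokens):
    """Stand-in for the model: this toy looks only at the last word."""
    return NEXT_WORD.get(tokens[-1], "<end>")

def generate(prompt, max_new_tokens=10):
    tokens = prompt.lower().split()        # 1. read the prompt
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)  # 2. predict the next word
        if next_token == "<end>":          # stop when the model signals the end
            break
        tokens.append(next_token)          # 3. append it and repeat
    return " ".join(tokens)

result = generate("The cat")
print(result)  # the cat sat down
```

Each new word is appended to the input before the next prediction, which is why generation is sequential even though training is parallel.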
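✅ Tiny GPT-like Code Example (PyTorch)
import torch
import torch.nn as nn
class TinyGPT(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.attn = nn.MultiheadAttention(dim, 4)  # expects (seq, batch, dim)
        self.fc = nn.Linear(dim, vocab)
    def forward(self, x):
        seq_len = x.shape[0]
        # Causal mask: True marks future positions a token must not attend to
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        x = self.embed(x)
        x, _ = self.attn(x, x, x, attn_mask=mask)
        return self.fc(x)  # next-token scores at every position
model = TinyGPT()
x = torch.randint(0, 1000, (5, 1))  # 5-token prompt, batch of 1
print(model(x).shape)  # torch.Size([5, 1, 1000])
6. Vision Transformers (ViT)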
Transformers are not just for text. They can also work on images.
How ViT Works
- Break image into patches
- Turn patches into embeddings
- Use self-attention to understand the image
- Classify the image
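The first step, breaking an image into patches, can be sketched in plain Python. The 4x4 grid is a made-up stand-in for an image, and the helper assumes the image dimensions divide evenly by the patch size.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened square patches.

    Assumes height and width are multiples of patch_size.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# A hypothetical 4x4 grayscale "image" with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, patch_size=2)
print(len(patches))  # 4
print(patches[0])    # [0, 1, 4, 5]
```

Each flattened patch plays the role a word plays in text: it is mapped to an embedding and then passed through self-attention alongside the other patches.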
Why ViT Is Powerful
- Learns long-range patterns
- Competes with CNNs
- Works best when trained on large datasets
Example
If the image is a cat:
- Patch 1 → ear
- Patch 2 → eye
- Patch 3 → fur
- Patch 4 → whiskers
ViT combines these features to recognize that the image shows a cat.
7. Real-World Applications of Transformers
NLP
- ChatGPT
- Google Translate
- Email summarization
- Grammar correction
Vision
- Image classification
- Object detection
- Medical imaging
Audio
- Speech recognition
- Music generation
Multimodal AI
- Image + text models (CLIP)
- Text-to-image models (DALL·E, Stable Diffusion)
Transformers now appear in most areas of AI, from language to vision, audio, and multimodal systems.
8. Summary Table
| Transformer Type | Examples | Best Use |
|---|---|---|
| Encoder | BERT, ViT | Understanding |
| Decoder | GPT | Text generation |
| Encoder–Decoder | T5, Translation models | Input → Output tasks |