Self-Attention Mechanism • Encoder-Decoder Architecture • BERT • GPT • Vision Transformers (ViT)
Transformers are one of the most important inventions in modern deep learning.
They are the basis of today's most capable AI systems, from products like Google Translate and ChatGPT to model families such as BERT, GPT, and Vision Transformers.
Transformers can understand long sentences, translate languages, summarize text, analyze images, and even generate new content.
What makes Transformers so powerful is their ability to focus on important information using a process called self-attention.
This chapter explains Transformers in simple, clear language for grade 10–11 students.
1. Self-Attention Mechanism
The self-attention mechanism is the heart of all Transformer models.
Simple Explanation
Self-attention helps a model decide which words (or pixels) are important in a sentence (or image).
Example
Sentence: "The cat sat on the mat, and it was warm."
To understand "it," the model must know:
- "it" refers to "the mat," not "the cat"
Self-attention allows the model to focus on the correct word without forgetting earlier words.
Why Self-Attention Is Better Than RNNs
- Faster to train, because all words are processed in parallel rather than one at a time
- Remembers information across long sequences
- Connects distant words directly, with no step-by-step recurrence
- Handles long sentences without losing earlier context
Intuition
Self-attention works like students reading a paragraph and highlighting important parts.
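In symbols, this is the scaled dot-product attention used in all Transformer models (the same computation as the code below):

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Here $Q$, $K$, and $V$ are query, key, and value matrices computed from the input, and $d_k$ is the size of each key vector; dividing by $\sqrt{d_k}$ keeps the softmax scores from becoming too extreme.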
Mini Code Example (Self-Attention in PyTorch)
```python
import torch
import torch.nn.functional as F

# Example: 3 tokens, each represented by 4 features
x = torch.rand(3, 4)

# Project the input into Query, Key, and Value vectors
# (random projection matrices here, just for illustration)
W_q, W_k, W_v = torch.rand(4, 4), torch.rand(4, 4), torch.rand(4, 4)
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Compare every Query with every Key, scaled by the square root of the feature size
attention_scores = Q @ K.T / (4 ** 0.5)

# Turn the scores into weights that sum to 1 for each token
attention_weights = F.softmax(attention_scores, dim=-1)

# Each output is a weighted mix of the Value vectors
output = attention_weights @ V
print(output)
```
2. Encoder–Decoder Architecture
Transformers often use two main components:
A. Encoder
- Reads the input
- Understands meaning and relationships
- Produces a context-rich representation
B. Decoder
- Takes encoder output
- Generates predictions (words, answers, translations)
Where It Is Used
- Machine translation
- Text summarization
- Chatbots
- Question answering
Simple Example
For English → Urdu translation:
- Encoder understands the English sentence
- Decoder generates the Urdu translation
Diagram (Conceptual)
Input → Encoder → Representation → Decoder → Output
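This pipeline can be sketched with PyTorch's built-in nn.Transformer module. The sketch below uses toy sizes and random tensors in place of real sentence embeddings; a real translator would add tokenization, positional encoding, and training.

```python
import torch
import torch.nn as nn

# A tiny encoder-decoder Transformer (toy sizes, random inputs)
model = nn.Transformer(d_model=16, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.rand(1, 5, 16)  # "English" input: 5 tokens, 16 features each
tgt = torch.rand(1, 7, 16)  # "Urdu" output so far: 7 tokens

out = model(src, tgt)       # decoder output, conditioned on the encoder
print(out.shape)            # torch.Size([1, 7, 16])
```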
3. BERT (Bidirectional Encoder Representations from Transformers)
BERT is a famous model created by Google.
It uses only the encoder part of a Transformer.
What Makes BERT Special
- Reads text in both directions (left and right)
- Understands context deeply
- Great for understanding tasks
BERT Is Used For
- Sentiment analysis
- Text classification
- Named Entity Recognition (NER)
- Question answering
Simple Example
Sentence: "The bank is near the river."
BERT captures that "bank" is a river bank, not a financial bank.
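Because it reads in both directions, BERT can also fill in a missing word using the context around it. Here is a quick sketch with the Hugging Face fill-mask pipeline (the sentence is just an illustrative example; exact predictions will vary):

```python
from transformers import pipeline

# BERT predicts the masked word using context from BOTH sides of the blank
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The boat was tied up at the river [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```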
Code Example (Getting contextual embeddings from BERT)
```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize a sentence and run it through the encoder
inputs = tokenizer("Deep learning is amazing!", return_tensors="pt")
outputs = model(**inputs)

# One context-aware vector per token: [batch, tokens, hidden size]
print(outputs.last_hidden_state.shape)  # torch.Size([1, 7, 768])
```
4. GPT (Generative Pretrained Transformer)
GPT models use only the decoder part of a Transformer.
They are designed for text generation.
Why GPT Is Powerful
- Predicts the next word in a sequence
- Learns grammar, facts, and writing style
- Generates paragraphs, code, stories, and explanations
GPT Is Used For
- ChatGPT
- Code generation
- Essay writing
- Conversation bots
- Creative storytelling
Example
Input: "In the future, robots will"
GPT may continue: "…work with humans to create smarter cities."
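Under the hood, GPT assigns a probability to every possible next word, and the most likely ones are used to continue the text. A small sketch of that single prediction step (top-3 choices shown; outputs will vary):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tokenizer.encode("In the future, robots will", return_tensors="pt")
with torch.no_grad():
    logits = model(ids).logits           # a score for every vocabulary word
probs = logits[0, -1].softmax(dim=-1)    # probabilities for the next word only

top = probs.topk(3)
print([tokenizer.decode(int(i)) for i in top.indices])
```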
Mini Code Example (Text generation using GPT-2)
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode a prompt, then let the model continue it token by token
inputs = tokenizer.encode("Artificial intelligence will", return_tensors="pt")
outputs = model.generate(inputs, max_length=30,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
```
5. Vision Transformers (ViT)
Transformers were first created for text, but now they are used for images too.
How ViT Works
Instead of scanning an image with small sliding filters like a CNN, ViT:
- Cuts the image into small patches
- Treats each patch like a "word"
- Uses self-attention to understand the full image
Why ViT Is Important
- Works extremely well on large image datasets
- Competes with CNNs (like ResNet)
- Used in medical imaging, satellite vision, and object detection
Simple Example
If you give ViT a picture of a dog:
- Patch 1 = fur
- Patch 2 = nose
- Patch 3 = ear
Self-attention connects all patches to understand the full object.
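The "patches as words" idea takes only a couple of tensor operations to see (a sketch assuming a 224x224 RGB image and 16x16 patches, as in ViT-Base):

```python
import torch

# Cut a 224x224 RGB image into 16x16 patches, the way ViT does
img = torch.rand(3, 224, 224)
patches = img.unfold(1, 16, 16).unfold(2, 16, 16)  # a 14 x 14 grid of patches
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * 16 * 16)
print(patches.shape)  # torch.Size([196, 768]) -> 196 patch "words"
```

Each flattened patch is then linearly projected and fed to the Transformer, which is why the model output below has 196 patch tokens plus one [CLS] summary token.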
Mini Code Example (Vision Transformer)
```python
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Download a sample 224x224 image
img = Image.open(requests.get("https://picsum.photos/224", stream=True).raw)

# ViTImageProcessor (the current replacement for the deprecated
# ViTFeatureExtractor) resizes and normalizes the image for the model
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
inputs = processor(images=img, return_tensors="pt")

model = ViTModel.from_pretrained("google/vit-base-patch16-224")
outputs = model(**inputs)

# [1, 197, 768]: 196 patch tokens plus one [CLS] token
print(outputs.last_hidden_state.shape)
```
6. Why Transformers Changed Deep Learning Forever
Transformers solved many limitations of earlier models like RNNs and CNNs.
Advantages
- Understand long sequences
- Train faster
- Work in parallel
- Achieve higher accuracy on many tasks
- Can be used for text, images, and audio
Transformers Power Today's AI
- Google Search
- YouTube recommendations
- ChatGPT
- Siri and Alexa
- Medical image analysis
- Self-driving car perception
7. Summary Table
| Model | Uses | Strength |
|---|---|---|
| Encoder-only (BERT) | Understanding | Deep context learning |
| Decoder-only (GPT) | Text generation | Creativity and prediction |
| Encoder–Decoder (T5, BART) | Translation & summarization | High accuracy |
| Vision Transformers (ViT) | Image tasks | Strong long-range reasoning |