Self-Attention Mechanism • Encoder-Decoder Architecture • BERT • GPT • Vision Transformers (ViT)
Transformers are one of the most important inventions in modern deep learning.
They are the basis of today's most capable AI systems, from products like Google Translate and ChatGPT to model families such as BERT, GPT, and Vision Transformers.
Transformers can understand long sentences, translate languages, summarize text, analyze images, and even generate new content.
What makes Transformers so powerful is their ability to focus on important information using a process called self-attention.
This chapter explains Transformers in simple, clear language for grade 10–11 students.
1. Self-Attention Mechanism
The self-attention mechanism is the heart of all Transformer models.
Simple Explanation
Self-attention helps a model decide which words (or pixels) are important in a sentence (or image).
Example
Sentence: "The cat sat on the mat, and it was warm."
To understand "it," the model must know:
- "it" refers to "the mat," not "the cat"
Self-attention allows the model to focus on the correct word without forgetting earlier words.
Why Self-Attention Is Better Than RNNs
- Faster to train, because all words are processed in parallel rather than one at a time
- Remembers information across long sequences
- Connects distant words directly, with no step-by-step recurrence
- Handles long sentences without losing earlier context
Intuition
Self-attention works like students reading a paragraph and highlighting important parts.
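In symbols, this is the scaled dot-product attention used in all Transformer models (the same computation as the code below):

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Here $Q$, $K$, and $V$ are query, key, and value matrices computed from the input, and $d_k$ is the size of each key vector; dividing by $\sqrt{d_k}$ keeps the softmax scores from becoming too extreme.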
Mini Code Example (Self-Attention in PyTorch)
```python
import torch
import torch.nn.functional as F

# Example: 3 tokens, each represented by 4 features
x = torch.rand(3, 4)

# Project the input into Query, Key, and Value vectors
# (random projection matrices here, just for illustration)
W_q, W_k, W_v = torch.rand(4, 4), torch.rand(4, 4), torch.rand(4, 4)
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Compare every Query with every Key, scaled by the square root of the feature size
attention_scores = Q @ K.T / (4 ** 0.5)

# Turn the scores into weights that sum to 1 for each token
attention_weights = F.softmax(attention_scores, dim=-1)

# Each output is a weighted mix of the Value vectors
output = attention_weights @ V
print(output)
```
2. Encoder–Decoder Architecture
Transformers often use two main components:
A. Encoder
- Reads the input
- Understands meaning and relationships
- Produces a context-rich representation
B. Decoder
- Takes encoder output
- Generates predictions (words, answers, translations)
Where It Is Used
- Machine translation
- Text summarization
- Chatbots
- Question answering
Simple Example
For English → Urdu translation:
- Encoder understands the English sentence
- Decoder generates the Urdu translation
Diagram (Conceptual)
Input → Encoder → Representation → Decoder → Output
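This pipeline can be sketched with PyTorch's built-in nn.Transformer module. The sketch below uses toy sizes and random tensors in place of real sentence embeddings; a real translator would add tokenization, positional encoding, and training.

```python
import torch
import torch.nn as nn

# A tiny encoder-decoder Transformer (toy sizes, random inputs)
model = nn.Transformer(d_model=16, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.rand(1, 5, 16)  # "English" input: 5 tokens, 16 features each
tgt = torch.rand(1, 7, 16)  # "Urdu" output so far: 7 tokens

out = model(src, tgt)       # decoder output, conditioned on the encoder
print(out.shape)            # torch.Size([1, 7, 16])
```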
3. BERT (Bidirectional Encoder Representations from Transformers)
BERT is a famous model created by Google.
It uses only the encoder part of a Transformer.
What Makes BERT Special
- Reads text in both directions (left and right)
- Understands context deeply
- Great for understanding tasks
BERT Is Used For
- Sentiment analysis
- Text classification
- Named Entity Recognition (NER)
- Question answering
Simple Example
Sentence: "The bank is near the river."
BERT captures that "bank" is a river bank, not a financial bank.
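Because it reads in both directions, BERT can also fill in a missing word using the context around it. Here is a quick sketch with the Hugging Face fill-mask pipeline (the sentence is just an illustrative example; exact predictions will vary):

```python
from transformers import pipeline

# BERT predicts the masked word using context from BOTH sides of the blank
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The boat was tied up at the river [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```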
Code Example (Getting contextual embeddings from BERT)
```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize a sentence and run it through the encoder
inputs = tokenizer("Deep learning is amazing!", return_tensors="pt")
outputs = model(**inputs)

# One context-aware vector per token: [batch, tokens, hidden size]
print(outputs.last_hidden_state.shape)  # torch.Size([1, 7, 768])
```
4. GPT (Generative Pretrained Transformer)
GPT models use only the decoder part of a Transformer.
They are designed for text generation.
Why GPT Is Powerful
- Predicts the next word in a sequence
- Learns grammar, facts, and writing style
- Generates paragraphs, code, stories, and explanations
GPT Is Used For
- ChatGPT
- Code generation
- Essay writing
- Conversation bots
- Creative storytelling
Example
Input: "In the future, robots will"
GPT may continue: "…work with humans to create smarter cities."
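Under the hood, GPT assigns a probability to every possible next word, and the most likely ones are used to continue the text. A small sketch of that single prediction step (top-3 choices shown; outputs will vary):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tokenizer.encode("In the future, robots will", return_tensors="pt")
with torch.no_grad():
    logits = model(ids).logits           # a score for every vocabulary word
probs = logits[0, -1].softmax(dim=-1)    # probabilities for the next word only

top = probs.topk(3)
print([tokenizer.decode(int(i)) for i in top.indices])
```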
Mini Code Example (Text generation using GPT-2)
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode a prompt, then let the model continue it token by token
inputs = tokenizer.encode("Artificial intelligence will", return_tensors="pt")
outputs = model.generate(inputs, max_length=30,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
```
5. Vision Transformers (ViT)
Transformers were first created for text, but now they are used for images too.
How ViT Works
Instead of scanning an image with small sliding filters like a CNN, ViT:
- Cuts the image into small patches
- Treats each patch like a "word"
- Uses self-attention to understand the full image
Why ViT Is Important
- Works extremely well on large image datasets
- Competes with CNNs (like ResNet)
- Used in medical imaging, satellite vision, and object detection
Simple Example
If you give ViT a picture of a dog:
- Patch 1 = fur
- Patch 2 = nose
- Patch 3 = ear
Self-attention connects all patches to understand the full object.
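The "patches as words" idea takes only a couple of tensor operations to see (a sketch assuming a 224x224 RGB image and 16x16 patches, as in ViT-Base):

```python
import torch

# Cut a 224x224 RGB image into 16x16 patches, the way ViT does
img = torch.rand(3, 224, 224)
patches = img.unfold(1, 16, 16).unfold(2, 16, 16)  # a 14 x 14 grid of patches
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * 16 * 16)
print(patches.shape)  # torch.Size([196, 768]) -> 196 patch "words"
```

Each flattened patch is then linearly projected and fed to the Transformer, which is why the model output below has 196 patch tokens plus one [CLS] summary token.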
Mini Code Example (Vision Transformer)
```python
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Download a sample 224x224 image
img = Image.open(requests.get("https://picsum.photos/224", stream=True).raw)

# ViTImageProcessor (the current replacement for the deprecated
# ViTFeatureExtractor) resizes and normalizes the image for the model
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
inputs = processor(images=img, return_tensors="pt")

model = ViTModel.from_pretrained("google/vit-base-patch16-224")
outputs = model(**inputs)

# [1, 197, 768]: 196 patch tokens plus one [CLS] token
print(outputs.last_hidden_state.shape)
```
6. Why Transformers Changed Deep Learning Forever
Transformers solved many limitations of earlier models like RNNs and CNNs.
Advantages
- Understand long sequences
- Train faster
- Work in parallel
- Achieve higher accuracy on many tasks
- Can be used for text, images, and audio
Transformers Power Today's AI
- Google Search
- YouTube recommendations
- ChatGPT
- Siri and Alexa
- Medical image analysis
- Self-driving car perception
7. Summary Table
| Model | Uses | Strength |
|---|---|---|
| Encoder-only (BERT) | Understanding | Deep context learning |
| Decoder-only (GPT) | Text generation | Creativity and prediction |
| Encoder–Decoder (T5, BART) | Translation & summarization | High accuracy |
| Vision Transformers (ViT) | Image tasks | Strong long-range reasoning |