Learn Data Analysis with Python
Introduction
Imagine you have a spreadsheet with rows and columns—like a list of students, their grades, or a shop's sales. Pandas is a Python tool that lets you handle these "tables" easily. It helps you read, clean, explore, and change data with just a few lines of code. In this lesson, you'll learn the basics to get started quickly and confidently.
Audience and Goal
- For Grade 8–9 students learning Python for real-world data tasks
- Goal: Learn what pandas is, how to load and explore a table of data, make simple changes, and save your work
What You Need
- Python installed (3.8+ is great)
- pandas library (install using: pip install pandas)
- A code editor or notebook (IDLE, VS Code, Jupyter, etc.)
Step-by-Step Guide
1) Meet Pandas
Pandas is a Python library for working with data in tables.
- A DataFrame is pandas' word for "a table with rows and columns," like a spreadsheet.
- A Series is one column of that table.
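To see the two words in action, here is a minimal sketch. The table and its column names are made up for illustration; only the DataFrame and Series types come from pandas itself.

```python
import pandas as pd

# A tiny made-up table: two columns, three rows
table = pd.DataFrame({
    "student": ["Ava", "Ben", "Cara"],
    "grade": [8, 9, 8],
})

# Picking one column out of the table gives a Series
grades = table["grade"]

print(type(table).__name__)   # DataFrame
print(type(grades).__name__)  # Series
```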
2) Install and Import Pandas
Open your terminal/command prompt and run:
pip install pandas
In your Python file or notebook, start with:
import pandas as pd
3) Make Your First Table (DataFrame)
You can create a table from a Python dictionary (key-value pairs).
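As a rough sketch of how the dictionary maps onto the table: each key becomes a column name, and each list becomes that column's values. The pet data below is invented for this example.

```python
import pandas as pd

# Each key becomes a column name; each list holds that column's values.
# All lists need the same length, one entry per row.
pets = {
    "pet": ["Milo", "Luna", "Rex"],
    "kind": ["cat", "cat", "dog"],
    "age_years": [3, 5, 2],
}

pets_df = pd.DataFrame(pets)
print(pets_df)
print(list(pets_df.columns))  # ['pet', 'kind', 'age_years']
print(len(pets_df))           # 3
```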
4) Read Data from a CSV File
CSV files are simple text files that store tables. Each row is a line, and commas separate the values. Pandas can load CSV files with one function call.
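Here is one way this could look with a real file on disk. So that the sketch runs anywhere, it first writes a small CSV into a temporary folder; the file name students.csv and its contents are assumptions made up for this example.

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV to a temporary folder so the example runs anywhere
# (file name and contents are invented for this sketch)
csv_lines = "name,grade,score\nAva,8,88\nBen,9,92\nCara,8,79\n"
folder = tempfile.mkdtemp()
path = os.path.join(folder, "students.csv")
with open(path, "w") as f:
    f.write(csv_lines)

# One function call turns the file into a DataFrame;
# the first line of the file becomes the column names
students = pd.read_csv(path)
print(students)
print(students.shape)  # (3, 3)
```

With your own file, you would skip the writing part and pass your file's name straight to pd.read_csv.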
5) Explore Your Data
- head(): shows the first few rows
- info(): shows columns and data types
- describe(): gives simple stats (counts, min, max, average for numbers)
- shape: tells you how many rows and columns
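The four tools above can be tried on any table. This is a small sketch with made-up temperature data. Notice that shape is written without parentheses, while the other three are called with parentheses.

```python
import pandas as pd

# A made-up table with 8 rows
temps = pd.DataFrame({
    "day": [1, 2, 3, 4, 5, 6, 7, 8],
    "temp_c": [21.5, 22.0, 19.5, 18.0, 20.5, 23.0, 24.5, 22.5],
})

print(temps.head())      # first 5 rows (the default)
print(temps.head(3))     # first 3 rows
print(temps.shape)       # (8, 2)
temps.info()             # prints column names, types, and non-missing counts
print(temps.describe())  # count, mean, std, min, quartiles, max per number column
```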
6) Select and Filter
- Select a column by name (like picking one column from a spreadsheet).
- Filter rows to keep only the ones you care about (like "only items with price > 10").
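Filtering can feel like magic at first, so this sketch splits it into two steps. The item names and prices are invented for illustration.

```python
import pandas as pd

items = pd.DataFrame({
    "item": ["Pen", "Backpack", "Eraser", "Calculator"],
    "price": [1.50, 25.00, 0.75, 12.00],
})

# Step 1: the comparison gives one True/False answer per row
mask = items["price"] > 10
print(mask.tolist())  # [False, True, False, True]

# Step 2: putting the mask inside items[...] keeps only the True rows
pricey = items[mask]
print(pricey)
```

Writing items[items["price"] > 10] on one line does the same two steps at once.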
7) Add New Columns and Handle Missing Values
- Create new columns from other columns (like "total price = price x quantity").
- Missing values can be filled with a default value (like 0 or "Unknown") or removed.
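A minimal sketch of both ideas, using a made-up orders table. The column names, including total_price, are assumptions chosen for this example.

```python
import pandas as pd

orders = pd.DataFrame({
    "item": ["Pen", "Notebook", "Ruler"],
    "color": ["blue", None, "green"],   # None marks a missing value
    "price": [1.5, 2.0, 1.0],
    "quantity": [4, 3, 2],
})

# New column built from two existing columns, row by row
orders["total_price"] = orders["price"] * orders["quantity"]

# Fill the missing text value with a default label
filled = orders.copy()
filled["color"] = filled["color"].fillna("Unknown")

# Or remove any row that has a missing value
dropped = orders.dropna()

print(filled)
print(len(dropped))  # 2
```

Filling keeps every row but adds a guess; dropping keeps only complete rows but loses data. Which one fits depends on the question you want to answer.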
8) Save Your Work
Save your cleaned or updated table back to a CSV file using to_csv.
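One way to check that saving worked is to read the file straight back. This sketch saves into a temporary folder, and the file name scores.csv is made up for the example.

```python
import os
import tempfile

import pandas as pd

scores = pd.DataFrame({"name": ["Ava", "Ben"], "score": [88, 92]})

# Save into a temporary folder (the file name is made up for this sketch)
path = os.path.join(tempfile.mkdtemp(), "scores.csv")

# index=False leaves out the row numbers (0, 1, ...) so the file
# contains only the real columns
scores.to_csv(path, index=False)

# Read the file back to check that the columns and values survived
loaded = pd.read_csv(path)
print(loaded)
print(list(loaded.columns))  # ['name', 'score']
```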
Python Code Examples
Example 1: Create and Explore a DataFrame
What it does: builds a small table in code and explores it.
import pandas as pd
# Create a small table (DataFrame) from a dictionary
data = {
"name": ["Ava", "Ben", "Cara", "Don", "Eva"],
"age": [14, 15, 14, 16, 15],
"score": [88, 92, 79, 85, 90]
}
df = pd.DataFrame(data)
# Look at the first few rows
print("First rows:")
print(df.head())
# Check the size (rows, columns)
print("Shape (rows, columns):", df.shape)
# See basic info (column types, non-missing counts)
print("\nInfo:")
df.info()  # info() prints its own report and returns None, so print() is not needed
# Summary stats for number columns (count, mean, min, max, etc.)
print("\nDescribe:")
print(df.describe())
Example 2: Read, Select, Filter, Sort, and Add a Column
What it does: reads a CSV (we'll simulate a CSV in memory), selects columns, filters rows, sorts, and adds a new column.
import pandas as pd
from io import StringIO # Lets us pretend a string is a file
# Simulated CSV content
csv_text = """item,category,price,quantity
Pencil,Stationery,0.99,10
Notebook,Stationery,2.49,5
Apple,Food,0.50,12
Granola Bar,Food,1.20,8
Water Bottle,Other,3.00,3
"""
# Read CSV from the string (in real life, use pd.read_csv("filename.csv"))
df = pd.read_csv(StringIO(csv_text))
print("Original data:")
print(df)
# Select a single column (a Series)
print("\nPrices column:")
print(df["price"])
# Select multiple columns (a new DataFrame)
print("\nItem and price columns:")
print(df[["item", "price"]])
# Filter rows: keep only items that cost at least $1.00
expensive = df[df["price"] >= 1.00]
print("\nItems costing at least $1.00:")
print(expensive)
# Add a new column: total value = price * quantity
df["total_value"] = df["price"] * df["quantity"]
print("\nAdded total_value column:")
print(df)
# Sort by total_value (descending: biggest first)
sorted_df = df.sort_values(by="total_value", ascending=False)
print("\nSorted by total_value (biggest first):")
print(sorted_df)
Tip:
If you have a real CSV file named shop.csv in the same folder, you can do:
df = pd.read_csv("shop.csv")
Example 3: Handling Missing Values (NaN)
What it does: shows how to find, fill, and drop missing data.
import pandas as pd
import numpy as np
data = {
"name": ["Ava", "Ben", "Cara", "Don", "Eva"],
"age": [14, np.nan, 14, 16, np.nan], # np.nan represents a missing value
"score": [88, 92, None, 85, 90] # None also becomes a missing value in pandas
}
df = pd.DataFrame(data)
print("Original data with missing values:")
print(df)
# Check how many missing values each column has
print("\nMissing value counts:")
print(df.isna().sum())
# Option 1: Fill missing ages with a default value (e.g., 15)
df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(15)
# Fill missing scores with the average score (mean)
mean_score = df_filled["score"].mean()
df_filled["score"] = df_filled["score"].fillna(mean_score)
print("\nAfter filling missing values:")
print(df_filled)
# Option 2: Drop rows with any missing values (use carefully!)
df_dropped = df.dropna()
print("\nAfter dropping rows with missing values:")
print(df_dropped)
Saving Your Results
To save your cleaned table to a new CSV file:
df.to_csv("cleaned_data.csv", index=False)
Small Practical Exercise: Mini Movie Rentals 🎬
Scenario: You help a small movie rental kiosk look at simple data.
Starter Data
(you can copy this into your code with StringIO like in Example 2):
item,genre,price_per_day,days_rented
Movie A,Action,3.5,2
Movie B,Comedy,2.0,5
Movie C,Action,4.0,1
Movie D,Drama,2.5,3
Movie E,Comedy,2.0,2
Tasks:
- Load the CSV into a DataFrame.
- Add a new column total_cost = price_per_day * days_rented.
- Show only the rows where total_cost is at least 7.0.
- Sort the whole table by total_cost from highest to lowest.
- Save the sorted table to a CSV file named rentals_report.csv (no index).
Hints:
- Use pd.read_csv with StringIO (or a real file).
- For filtering, use df[df["total_cost"] >= 7.0].
- For sorting, use df.sort_values(by="total_cost", ascending=False).
- For saving, use df.to_csv("rentals_report.csv", index=False).
Challenge (optional):
Which genre has the most movies in this tiny dataset? Try df["genre"].value_counts(), which counts how many rows each genre has.
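If value_counts() is new to you, here is a small sketch on different, made-up snack data, so the challenge itself stays unsolved. It counts how many rows share each value in a column.

```python
import pandas as pd

# Made-up data, different from the exercise
snacks = pd.DataFrame({
    "snack": ["Chips", "Apple", "Cookie", "Banana", "Pretzel"],
    "type": ["Salty", "Fruit", "Sweet", "Fruit", "Salty"],
})

# One count per different value in the "type" column
counts = snacks["type"].value_counts()
print(counts)
print(counts["Fruit"])  # 2
```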
Recap
- Pandas helps you work with table-like data (DataFrames) in Python.
- You learned how to create a DataFrame, read CSVs, explore data (head, info, describe), select and filter rows, add new columns, handle missing values, sort, and save your work.
- With these basics, you can start analyzing real-world data sets confidently.