Getting Started with Pandas for Data Handling

Learn Data Analysis with Python

Introduction

Imagine you have a spreadsheet with rows and columns—like a list of students, their grades, or a shop's sales. Pandas is a Python tool that lets you handle these "tables" easily. It helps you read, clean, explore, and change data with just a few lines of code. In this lesson, you'll learn the basics to get started quickly and confidently.

Audience and Goal

  • For Grade 8–9 students learning Python for real-world data tasks
  • Goal: Learn what pandas is, how to load and explore a table of data, make simple changes, and save your work

What You Need

  • Python installed (3.8+ is great)
  • pandas library (install using: pip install pandas)
  • A code editor or notebook (IDLE, VS Code, Jupyter, etc.)

Step-by-Step Guide

1) Meet Pandas

Pandas is a Python library for working with data in tables.

  • DataFrame is pandas' word for "a table with rows and columns," like a spreadsheet.
  • Series is one column of that table.
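
To see the difference between the two, here is a minimal sketch. The table and its names (table, scores) are made up just for this illustration.

```python
import pandas as pd

# A tiny made-up table: two columns, three rows
table = pd.DataFrame({"name": ["Ava", "Ben", "Cara"], "score": [88, 92, 79]})

# Picking one column out of the table gives back a Series
scores = table["score"]

print(type(table).__name__)   # DataFrame
print(type(scores).__name__)  # Series
```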

2) Install and Import Pandas

Open your terminal/command prompt and run:

pip install pandas

In your Python file or notebook, start with:

import pandas as pd

3) Make Your First Table (DataFrame)

You can create a table from a Python dictionary (key-value pairs).
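
A small sketch of that idea follows. The pets data and the variable names are invented for illustration; any dictionary whose lists have the same length would work the same way.

```python
import pandas as pd

# Each key becomes a column name; each list holds that column's values.
# All lists must have the same length (one entry per row).
pets = {
    "pet": ["Dog", "Cat", "Fish"],
    "legs": [4, 4, 0],
}

pets_df = pd.DataFrame(pets)
print(pets_df)
```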

4) Read Data from a CSV File

CSV files are simple text files that store tables. Each row is a line, and commas separate the values. Pandas can load CSV files with one function call.
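
Here is a rough sketch of loading CSV data with pd.read_csv. To keep it runnable without a file on disk, it assumes a made-up CSV stored in a string and uses StringIO to treat that string like a file. The file name weather.csv in the comment is hypothetical.

```python
import pandas as pd
from io import StringIO  # lets a string stand in for a file

# Made-up CSV text: the first line holds the column names
csv_text = """city,temp_c
Oslo,4
Cairo,29
Lima,19
"""

# With a real file you would pass its name instead,
# for example pd.read_csv("weather.csv") (hypothetical file name)
weather = pd.read_csv(StringIO(csv_text))
print(weather)
```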

5) Explore Your Data

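  • head(): shows the first few rows (five by default)
  • info(): prints the column names, data types, and how many values are not missing
  • describe(): gives summary stats for number columns (count, mean, standard deviation, min, quartiles, max)
  • shape: tells you how many rows and columns there are, as (rows, columns); it is an attribute, so you write it without parentheses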
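
The four tools above can be tried together in a short sketch like this one. The city and temperature data is invented for illustration.

```python
import pandas as pd

# Invented sample data
df = pd.DataFrame({
    "city": ["Oslo", "Cairo", "Lima", "Paris"],
    "temp_c": [4, 29, 19, 12],
})

print(df.head(2))      # first 2 rows (head() with no number shows up to 5)
df.info()              # prints its own report, so no print() is needed
print(df.describe())   # summary stats for the number column
print(df.shape)        # (4, 2): 4 rows, 2 columns
```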

6) Select and Filter

  • Select a column by name (like picking one column from a spreadsheet).
  • Filter rows to keep only the ones you care about (like "only items with price > 10").
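
As a minimal sketch of both ideas, assume a made-up shop table with item and price columns:

```python
import pandas as pd

# Made-up shop data
shop = pd.DataFrame({
    "item": ["Pen", "Bag", "Lamp"],
    "price": [2.5, 18.0, 12.0],
})

# Select one column by name
prices = shop["price"]

# shop["price"] > 10 gives a True/False answer for every row;
# putting it inside shop[...] keeps only the rows marked True
over_ten = shop[shop["price"] > 10]
print(over_ten)
```

The original shop table is not changed by filtering; over_ten is a new, smaller table.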

7) Add New Columns and Handle Missing Values

  • Create new columns from other columns (like "total price = price × quantity").
  • Missing values can be filled with a default value (like 0 or "Unknown") or removed.
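
The sketch below tries both choices on a made-up orders table. The column names and the default value of 0 are assumptions picked for this illustration; the right default depends on your data.

```python
import pandas as pd

# Made-up orders; None marks a missing quantity
raw = pd.DataFrame({
    "item": ["Pen", "Bag", "Lamp"],
    "price": [2.5, 18.0, 12.0],
    "quantity": [4, None, 2],
})

# Choice A: fill the missing quantity with a default (0 here)
orders = raw.copy()
orders["quantity"] = orders["quantity"].fillna(0)

# New column built from two existing columns
orders["total_price"] = orders["price"] * orders["quantity"]
print(orders)

# Choice B: remove any row that has a missing value
complete = raw.dropna()
print(complete)
```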

8) Save Your Work

Save your cleaned or updated table back to a CSV file using to_csv.
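
Here is a small sketch of saving, using made-up data. To avoid creating files while you experiment, it first calls to_csv without a file name, which returns the CSV text as a string so you can preview it. The file name in the last comment is hypothetical.

```python
import pandas as pd

# Made-up results to save
results = pd.DataFrame({"name": ["Ava", "Ben"], "score": [88, 92]})

# With no file name, to_csv returns the CSV text instead of writing a file.
# index=False leaves out the row numbers (0, 1, ...) that pandas adds.
preview = results.to_csv(index=False)
print(preview)

# To write a real file, pass a name (hypothetical here):
# results.to_csv("results.csv", index=False)
```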

Python Code Examples

Example 1: Create and Explore a DataFrame

What it does: builds a small table in code and explores it.

import pandas as pd

# Create a small table (DataFrame) from a dictionary
data = {
    "name": ["Ava", "Ben", "Cara", "Don", "Eva"],
    "age": [14, 15, 14, 16, 15],
    "score": [88, 92, 79, 85, 90],
}
df = pd.DataFrame(data)

# Look at the first few rows
print("First rows:")
print(df.head())

# Check the size (rows, columns)
print("Shape (rows, columns):", df.shape)

# See basic info (column types, non-missing counts)
# info() prints its report itself; wrapping it in print() would add an extra "None"
print("\nInfo:")
df.info()

# Summary stats for number columns (count, mean, min, max, etc.)
print("\nDescribe:")
print(df.describe())

Example 2: Read, Select, Filter, Sort, and Add a Column

What it does: reads a CSV (we'll simulate a CSV in memory), selects columns, filters rows, sorts, and adds a new column.

import pandas as pd
from io import StringIO  # Lets us pretend a string is a file

# Simulated CSV content
csv_text = """item,category,price,quantity
Pencil,Stationery,0.99,10
Notebook,Stationery,2.49,5
Apple,Food,0.50,12
Granola Bar,Food,1.20,8
Water Bottle,Other,3.00,3
"""

# Read CSV from the string (in real life, use pd.read_csv("filename.csv"))
df = pd.read_csv(StringIO(csv_text))
print("Original data:")
print(df)

# Select a single column (a Series)
print("\nPrices column:")
print(df["price"])

# Select multiple columns (a new DataFrame)
print("\nItem and price columns:")
print(df[["item", "price"]])

# Filter rows: keep only items that cost at least $1.00
expensive = df[df["price"] >= 1.00]
print("\nItems costing at least $1.00:")
print(expensive)

# Add a new column: total value = price * quantity
df["total_value"] = df["price"] * df["quantity"]
print("\nAdded total_value column:")
print(df)

# Sort by total_value (descending: biggest first)
sorted_df = df.sort_values(by="total_value", ascending=False)
print("\nSorted by total_value (biggest first):")
print(sorted_df)

Tip:

If you have a real CSV file named shop.csv in the same folder, you can do:

df = pd.read_csv("shop.csv")

Example 3: Handling Missing Values (NaN)

What it does: shows how to find, fill, and drop missing data.

import pandas as pd
import numpy as np

data = {
    "name": ["Ava", "Ben", "Cara", "Don", "Eva"],
    "age": [14, np.nan, 14, 16, np.nan],  # np.nan represents a missing value
    "score": [88, 92, None, 85, 90],      # None also becomes a missing value in pandas
}
df = pd.DataFrame(data)
print("Original data with missing values:")
print(df)

# Check how many missing values each column has
print("\nMissing value counts:")
print(df.isna().sum())

# Option 1: Fill missing ages with a default value (e.g., 15)
df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(15)

# Fill missing scores with the average score (mean)
mean_score = df_filled["score"].mean()
df_filled["score"] = df_filled["score"].fillna(mean_score)
print("\nAfter filling missing values:")
print(df_filled)

# Option 2: Drop rows with any missing values (use carefully!)
df_dropped = df.dropna()
print("\nAfter dropping rows with missing values:")
print(df_dropped)

Saving Your Results

To save your cleaned table to a new CSV file:

df.to_csv("cleaned_data.csv", index=False)

Small Practical Exercise: Mini Movie Rentals 🎬

Scenario: You help a small movie rental kiosk look at simple data.

Starter Data

(you can copy this into your code with StringIO like in Example 2):

item,genre,price_per_day,days_rented
Movie A,Action,3.5,2
Movie B,Comedy,2.0,5
Movie C,Action,4.0,1
Movie D,Drama,2.5,3
Movie E,Comedy,2.0,2

Tasks:

  1. Load the CSV into a DataFrame.
  2. Add a new column total_cost = price_per_day * days_rented.
  3. Show only the rows where total_cost is at least 7.0.
  4. Sort the whole table by total_cost from highest to lowest.
  5. Save the sorted table to a CSV file named rentals_report.csv (no index).

Hints:

  • Use pd.read_csv with StringIO (or a real file).
  • For filtering, use df[df["total_cost"] >= 7.0].
  • For sorting, use df.sort_values(by="total_cost", ascending=False).
  • For saving, use df.to_csv("rentals_report.csv", index=False).

Challenge (optional):

Which genre shows up most often in this tiny dataset? Try df["genre"].value_counts(), which counts how many rows each genre has.

Recap

  • Pandas helps you work with table-like data (DataFrames) in Python.
  • You learned how to create a DataFrame, read CSVs, explore data (head, info, describe), select and filter rows, add new columns, handle missing values, sort, and save your work.
  • With these basics, you can start analyzing real-world data sets confidently.
