What you'll learn
- Why NumPy arrays are faster and more powerful than Python lists for math
- How to build, slice, and compute with NumPy arrays (including broadcasting)
- How to use Pandas Series and DataFrames to clean, filter, and summarize real data
- How NumPy and Pandas work together for analysis
- A mini project that analyzes a small dataset with both libraries
Setup
Make sure you have Python 3.9+ installed.
Install the libraries:
pip install numpy pandas

Part 1 — NumPy: Fast math with arrays
Plain Python lists are great for storing values, but they're slow for math. NumPy arrays store numbers in a compact block of memory and use fast C code under the hood.
Key ideas
- Array: like a list, but typed and fast. Has shape (size of each dimension) and dtype (data type).
- Vectorization: do math on whole arrays without loops.
- Broadcasting: NumPy automatically stretches arrays with size 1 to match shapes during operations.
- Axes: axis=0 means "down the rows," axis=1 means "across the columns" for 2D arrays.
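To make the vectorization idea concrete, here is a minimal standalone sketch (not one of the tutorial's own listings) computing the same thing with and without a Python loop:

```python
import numpy as np

nums = np.array([1.0, 2.0, 3.0, 4.0])

# Loop version: one Python-level operation per element
looped = np.array([v * 2 + 1 for v in nums])

# Vectorized version: one array-level expression, executed in fast C code
vectorized = nums * 2 + 1

print(np.array_equal(looped, vectorized))  # prints True
```

Both produce the same values; the vectorized form is shorter and, on large arrays, much faster.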
Code: Creating arrays and basic info
import numpy as np
# 1D and 2D arrays
a = np.array([1, 2, 3], dtype=np.int32)
b = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=float)
print("a:", a, "dtype:", a.dtype, "shape:", a.shape)
print("b:\n", b, "\ndtype:", b.dtype, "shape:", b.shape, "ndim:", b.ndim)
# Quick arrays
zeros = np.zeros((2, 3))
ones = np.ones((3, 2))
r = np.arange(0, 10, 2) # 0,2,4,6,8
lin = np.linspace(0, 1, 5) # 5 evenly spaced numbers from 0 to 1
print("zeros:\n", zeros)
print("r:", r)
print("lin:", lin)

Vectorized math and ufuncs
x = np.array([10, 20, 30, 40], dtype=np.float64)
print("x * 0.1:", x * 0.1) # Multiply every element
print("x + 5:", x + 5) # Add to every element
print("sqrt:", np.sqrt(x)) # Universal function
print("mean:", np.mean(x))  # Aggregate

Aggregations and axes
M = np.array([[1, 2, 3],
              [4, 5, 6]])
print("sum all:", M.sum())
print("sum by column (axis=0):", M.sum(axis=0)) # [1+4, 2+5, 3+6]
print("sum by row (axis=1):", M.sum(axis=1))

Broadcasting (automatic shape matching)
# Column vector (shape 3x1) + 1D row (length 4) -> full 3x4 grid
col = np.array([[1],
                [2],
                [3]])
row = np.array([10, 20, 30, 40])
print("Broadcast result:\n", col + row)

Indexing, slicing, boolean masks
arr = np.arange(1, 13).reshape(3, 4) # 3 rows, 4 cols
print("arr:\n", arr)
print("Element at row 1, col 2:", arr[1, 2]) # zero-based indexing
print("First row:", arr[0, :])
print("Last two columns:\n", arr[:, -2:])
# Boolean mask: pick even numbers
mask = (arr % 2 == 0)
print("Mask:\n", mask)
print("Even numbers:", arr[mask])

Random numbers (reproducible)
np.random.seed(42) # same random values every run
rand_ints = np.random.randint(0, 100, size=(5, 3))
rand_norm = np.random.normal(loc=0, scale=1, size=5)
print("Random ints:\n", rand_ints)
print("Random normal:", rand_norm)

Part 2 — Pandas: Data tables you can filter and summarize
Pandas sits on top of NumPy and adds labels (row index, column names) and lots of data tools. Think of a DataFrame like a spreadsheet you can control with Python.
Key ideas
- Series: one column (1D, labeled).
- DataFrame: many columns (2D table), each column is a Series.
- loc selects by labels; iloc selects by position.
- GroupBy summarizes data by categories.
- Missing values: NaN. Use isna, fillna, dropna.
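As a quick preview of the missing-value tools named above (a small standalone sketch with made-up values; the tutorial's own DataFrame example follows below):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.isna())       # True where values are missing
print(s.fillna(0.0))  # replace NaN with a chosen value
print(s.dropna())     # drop the missing entries entirely
```

`fillna` and `dropna` return new objects; the original Series is unchanged.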
Create a DataFrame and explore
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "name": ["Ava", "Ben", "Cara", "Dan", "Elle"],
    "class": ["Red", "Blue", "Red", "Blue", "Red"],
    "math": [88, 92, np.nan, 75, 85],
    "science": [91, 84, 89, 90, 88]
})
print(df.head()) # first rows
print(df.info()) # column types, non-null counts
print(df.describe())  # numeric summary

Selecting and filtering
# Select columns
print(df["math"]) # Series
print(df[["name", "science"]]) # DataFrame
# loc: filter rows with a condition, and choose columns
smart = df.loc[df["math"] >= 85, ["name", "math"]]
print("Math >= 85:\n", smart)
# iloc: select by position [rows, cols]
print("First 3 rows, cols 1..2:\n", df.iloc[0:3, 1:3])

New columns and vectorized operations
df["average"] = df[["math", "science"]].mean(axis=1)
# Grade letters using NumPy where
df["grade"] = np.where(df["average"] >= 90, "A",
              np.where(df["average"] >= 80, "B",
              np.where(df["average"] >= 70, "C", "D")))
print(df[["name", "average", "grade"]])

GroupBy and aggregation
grouped = df.groupby("class").agg(
    math_mean=("math", "mean"),
    sci_mean=("science", "mean"),
    count=("name", "count")
)
print(grouped)

Handle missing values (NaN)
print("Missing per column:\n", df.isna().sum())
# Fill math NaN with the column mean
df["math"] = df["math"].fillna(df["math"].mean())
print("After fill:\n", df)

Reading from and writing to CSV
df.to_csv("students.csv", index=False)
loaded = pd.read_csv("students.csv")
print("Loaded:\n", loaded.head())

NumPy + Pandas together
- Pandas columns are NumPy arrays inside, so NumPy functions work on them.
- You can convert DataFrame parts to NumPy with .to_numpy() when needed.
Examples
# Use NumPy ufuncs directly on Series
loaded["log_math"] = np.log(loaded["math"])
print(loaded[["math", "log_math"]].head())
# Convert to NumPy for custom computation (row-wise norm)
scores = loaded[["math", "science"]].to_numpy()
row_norm = np.sqrt((scores**2).sum(axis=1))
loaded["score_norm"] = row_norm
print(loaded[["name", "score_norm"]])

Practical project — Student Scores Analyst
Goal: Use Pandas to load, clean, and summarize a small dataset, and NumPy to compute z-scores to spot unusually high or low scores.
What you'll build
- A CSV of student test scores across subjects and dates
- A cleaned DataFrame with missing values filled by subject mean
- Summaries: top students, subject stats
- Z-scores for each score to find outliers
- Saved results as CSVs
Step 0: Create a sample dataset (or replace with your own CSV)
import numpy as np
import pandas as pd
np.random.seed(7)
students = ["Ava", "Ben", "Cara", "Dan", "Elle", "Finn", "Gia", "Hugo"]
subjects = ["Math", "Science", "History"]
dates = pd.date_range("2025-01-01", periods=10, freq="W")
rows = []
for d in dates:
    for s in subjects:
        for st in students:
            score = np.random.randint(50, 101)  # 50..100
            # Randomly make ~8% of scores missing
            if np.random.rand() < 0.08:
                score = np.nan
            rows.append({"date": d, "student": st, "subject": s, "score": score})
data = pd.DataFrame(rows)
data.to_csv("scores.csv", index=False)
print("Saved scores.csv with", len(data), "rows")

Step 1: Load and inspect
df = pd.read_csv("scores.csv", parse_dates=["date"])
print(df.head())
print(df.info())
print("Missing per column:\n", df.isna().sum())

Step 2: Clean missing scores (fill with subject mean)
# Compute mean score per subject, aligned to each row
subject_means = df.groupby("subject")["score"].transform("mean")
df["score_filled"] = df["score"].fillna(subject_means)
# Check no missing remain in score_filled
print("Missing in score_filled:", df["score_filled"].isna().sum())

Step 3: Summaries
# Top 5 students by average score
student_avg = df.groupby("student")["score_filled"].mean().sort_values(ascending=False)
print("Top students:\n", student_avg.head(5))
# Subject-level stats
subject_stats = df.groupby("subject")["score_filled"].agg(["count", "mean", "median", "std", "min", "max"])
print("Subject stats:\n", subject_stats)

Step 4: Z-scores per subject (using NumPy)
Explanation: z = (value - mean) / std. This compares a score to others in the same subject.
mean_by_subj = df.groupby("subject")["score_filled"].transform("mean")
std_by_subj = df.groupby("subject")["score_filled"].transform("std")
df["zscore"] = (df["score_filled"] - mean_by_subj) / std_by_subj
# Find unusually high/low scores (|z| >= 2)
outliers = df.loc[df["zscore"].abs() >= 2, ["date", "student", "subject", "score_filled", "zscore"]]
print("Outliers:\n", outliers.sort_values("zscore"))

Step 5: Pivot table for a dashboard-like view
pivot = pd.pivot_table(
    df, index="student", columns="subject",
    values="score_filled", aggfunc="mean"
)
print("Average score per student per subject:\n", pivot.round(1))

Step 6: Save results
student_avg.to_csv("student_averages.csv", header=["avg_score"])
subject_stats.to_csv("subject_stats.csv")
outliers.to_csv("outliers.csv", index=False)
pivot.to_csv("student_subject_matrix.csv")
print("Saved analysis CSVs.")

Optional challenges
- Filter by a date range to compare early vs. late performance.
- Add a new feature: consistency = std per student (lower is steadier).
- Use NumPy to compute the 90th percentile score overall:
p90 = np.percentile(df["score_filled"], 90)
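If you attempt the consistency challenge, one possible sketch looks like the following. The tiny DataFrame here is a stand-in for the project's df (which comes from scores.csv and has a score_filled column), just so the snippet runs on its own:

```python
import pandas as pd

# Stand-in for the project's df; values are invented for illustration
df = pd.DataFrame({
    "student": ["Ava", "Ava", "Ben", "Ben"],
    "score_filled": [80.0, 90.0, 85.0, 85.0],
})

# Standard deviation per student: lower means steadier performance
consistency = df.groupby("student")["score_filled"].std().sort_values()
print(consistency)  # Ben (0.0) is steadier than Ava (~7.07)
```

Sorting ascending puts the most consistent students first.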
Common tips and gotchas
- Use df.loc[condition, columns] to filter and update safely.
- When combining conditions, use parentheses and & (and), | (or). Example: df.loc[(df.math > 80) & (df.science > 80)]
- Pay attention to shapes in NumPy; shape mismatches cause errors. Use reshape or keep dimensions aligned.
- Set a random seed for reproducible results when using np.random.
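To illustrate the shape tip above, here is a small sketch (the values are invented) where reshaping makes an operation line up the way you intend:

```python
import numpy as np

data = np.arange(6)                        # shape (6,)
weights = np.array([[0.5], [1.0], [2.0]])  # shape (3, 1)

# Reshape data to (3, 2) so each row pairs with one weight;
# (3, 2) * (3, 1) broadcasts the weight across each row -> (3, 2)
matrix = data.reshape(3, 2)
scaled = matrix * weights
print(scaled)
```

Without the reshape, data * weights would broadcast (6,) against (3, 1) into a (3, 6) grid, which is a common source of silent shape surprises.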
Summary
- NumPy gives you fast, vectorized arrays with powerful math, broadcasting, and aggregation tools.
- Pandas builds labeled tables (DataFrames) that make real-world data cleaning, filtering, and summarizing easy.
- They work beautifully together: Pandas columns are NumPy arrays, so you can use NumPy functions directly.
- You practiced loading data, fixing missing values, grouping and aggregating, computing z-scores, and saving results.
Next step ideas
- Add plots with matplotlib or pandas.plot() to visualize your results.
- Try merging two DataFrames (pd.merge) like "scores" with "student_info" to enrich your analysis.
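If you try the merge idea, a minimal sketch might look like this (the student_info table is invented here purely for illustration):

```python
import pandas as pd

scores = pd.DataFrame({"student": ["Ava", "Ben"], "score": [88, 92]})
student_info = pd.DataFrame({"student": ["Ava", "Ben"], "grade_level": [7, 8]})

# Join on the shared "student" column; how="left" keeps every scores row
enriched = pd.merge(scores, student_info, on="student", how="left")
print(enriched)
```

Each scores row picks up the matching student's extra columns, ready for richer grouping and filtering.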