What you'll learn
- Why NumPy arrays are faster and more powerful than Python lists for math
- How to build, slice, and compute with NumPy arrays (including broadcasting)
- How to use Pandas Series and DataFrames to clean, filter, and summarize real data
- How NumPy and Pandas work together for analysis
- A mini project that analyzes a small dataset with both libraries
Setup
Make sure you have Python 3.9+ installed.
Install the libraries:
pip install numpy pandas

Part 1 — NumPy: Fast math with arrays
Plain Python lists are great for storing values, but they're slow for math. NumPy arrays store numbers in a compact block of memory and use fast C code under the hood.
Key ideas
- Array: like a list, but typed and fast. Has shape (size of each dimension) and dtype (data type).
- Vectorization: do math on whole arrays without loops.
- Broadcasting: NumPy automatically stretches arrays with size 1 to match shapes during operations.
- Axes: axis=0 means "down the rows," axis=1 means "across the columns" for 2D arrays.
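To make the vectorization idea concrete, here is a minimal standalone sketch (not one of the tutorial's own listings) computing the same thing with and without a Python loop:

```python
import numpy as np

nums = np.array([1.0, 2.0, 3.0, 4.0])

# Loop version: one Python-level operation per element
looped = np.array([v * 2 + 1 for v in nums])

# Vectorized version: one array-level expression, executed in fast C code
vectorized = nums * 2 + 1

print(np.array_equal(looped, vectorized))  # prints True
```

Both produce the same values; the vectorized form is shorter and, on large arrays, much faster.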
Code: Creating arrays and basic info
import numpy as np
# 1D and 2D arrays
a = np.array([1, 2, 3], dtype=np.int32)
b = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=float)
print("a:", a, "dtype:", a.dtype, "shape:", a.shape)
print("b:\n", b, "\ndtype:", b.dtype, "shape:", b.shape, "ndim:", b.ndim)
# Quick arrays
zeros = np.zeros((2, 3))
ones = np.ones((3, 2))
r = np.arange(0, 10, 2) # 0,2,4,6,8
lin = np.linspace(0, 1, 5) # 5 evenly spaced numbers from 0 to 1
print("zeros:\n", zeros)
print("r:", r)
print("lin:", lin)

Vectorized math and ufuncs
x = np.array([10, 20, 30, 40], dtype=np.float64)
print("x * 0.1:", x * 0.1) # Multiply every element
print("x + 5:", x + 5) # Add to every element
print("sqrt:", np.sqrt(x)) # Universal function
print("mean:", np.mean(x))  # Aggregate

Aggregations and axes
M = np.array([[1, 2, 3],
              [4, 5, 6]])
print("sum all:", M.sum())
print("sum by column (axis=0):", M.sum(axis=0)) # [1+4, 2+5, 3+6]
print("sum by row (axis=1):", M.sum(axis=1))

Broadcasting (automatic shape matching)
# Column vector (shape 3x1) + 1D row (length 4) -> full 3x4 grid
col = np.array([[1],
                [2],
                [3]])
row = np.array([10, 20, 30, 40])
print("Broadcast result:\n", col + row)

Indexing, slicing, boolean masks
arr = np.arange(1, 13).reshape(3, 4) # 3 rows, 4 cols
print("arr:\n", arr)
print("Element at row 1, col 2:", arr[1, 2]) # zero-based indexing
print("First row:", arr[0, :])
print("Last two columns:\n", arr[:, -2:])
# Boolean mask: pick even numbers
mask = (arr % 2 == 0)
print("Mask:\n", mask)
print("Even numbers:", arr[mask])

Random numbers (reproducible)
np.random.seed(42) # same random values every run
rand_ints = np.random.randint(0, 100, size=(5, 3))
rand_norm = np.random.normal(loc=0, scale=1, size=5)
print("Random ints:\n", rand_ints)
print("Random normal:", rand_norm)

Part 2 — Pandas: Data tables you can filter and summarize
Pandas sits on top of NumPy and adds labels (row index, column names) and lots of data tools. Think of a DataFrame like a spreadsheet you can control with Python.
Key ideas
- Series: one column (1D, labeled).
- DataFrame: many columns (2D table), each column is a Series.
- loc selects by labels; iloc selects by position.
- GroupBy summarizes data by categories.
- Missing values: NaN. Use isna, fillna, dropna.
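As a quick preview of the missing-value tools named above (a small standalone sketch with made-up values; the tutorial's own DataFrame example follows below):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.isna())       # True where values are missing
print(s.fillna(0.0))  # replace NaN with a chosen value
print(s.dropna())     # drop the missing entries entirely
```

`fillna` and `dropna` return new objects; the original Series is unchanged.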
Create a DataFrame and explore
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "name": ["Ava", "Ben", "Cara", "Dan", "Elle"],
    "class": ["Red", "Blue", "Red", "Blue", "Red"],
    "math": [88, 92, np.nan, 75, 85],
    "science": [91, 84, 89, 90, 88]
})
print(df.head()) # first rows
print(df.info()) # column types, non-null counts
print(df.describe())  # numeric summary

Selecting and filtering
# Select columns
print(df["math"]) # Series
print(df[["name", "science"]]) # DataFrame
# loc: filter rows with a condition, and choose columns
smart = df.loc[df["math"] >= 85, ["name", "math"]]
print("Math >= 85:\n", smart)
# iloc: select by position [rows, cols]
print("First 3 rows, cols 1..2:\n", df.iloc[0:3, 1:3])

New columns and vectorized operations
df["average"] = df[["math", "science"]].mean(axis=1)
# Grade letters using NumPy where
df["grade"] = np.where(df["average"] >= 90, "A",
              np.where(df["average"] >= 80, "B",
              np.where(df["average"] >= 70, "C", "D")))
print(df[["name", "average", "grade"]])

GroupBy and aggregation
grouped = df.groupby("class").agg(
    math_mean=("math", "mean"),
    sci_mean=("science", "mean"),
    count=("name", "count")
)
print(grouped)

Handle missing values (NaN)
print("Missing per column:\n", df.isna().sum())
# Fill math NaN with the column mean
df["math"] = df["math"].fillna(df["math"].mean())
print("After fill:\n", df)

Reading from and writing to CSV
df.to_csv("students.csv", index=False)
loaded = pd.read_csv("students.csv")
print("Loaded:\n", loaded.head())

NumPy + Pandas together
- Pandas columns are NumPy arrays inside, so NumPy functions work on them.
- You can convert DataFrame parts to NumPy with .to_numpy() when needed.
Examples
# Use NumPy ufuncs directly on Series
loaded["log_math"] = np.log(loaded["math"])
print(loaded[["math", "log_math"]].head())
# Convert to NumPy for custom computation (row-wise norm)
scores = loaded[["math", "science"]].to_numpy()
row_norm = np.sqrt((scores**2).sum(axis=1))
loaded["score_norm"] = row_norm
print(loaded[["name", "score_norm"]])

Practical project — Student Scores Analyst
Goal: Use Pandas to load, clean, and summarize a small dataset, and NumPy to compute z-scores to spot unusually high or low scores.
What you'll build
- A CSV of student test scores across subjects and dates
- A cleaned DataFrame with missing values filled by subject mean
- Summaries: top students, subject stats
- Z-scores for each score to find outliers
- Saved results as CSVs
Step 0: Create a sample dataset (or replace with your own CSV)
import numpy as np
import pandas as pd
np.random.seed(7)
students = ["Ava", "Ben", "Cara", "Dan", "Elle", "Finn", "Gia", "Hugo"]
subjects = ["Math", "Science", "History"]
dates = pd.date_range("2025-01-01", periods=10, freq="W")
rows = []
for d in dates:
    for s in subjects:
        for st in students:
            score = np.random.randint(50, 101)  # 50..100
            # Randomly make ~8% of scores missing
            if np.random.rand() < 0.08:
                score = np.nan
            rows.append({"date": d, "student": st, "subject": s, "score": score})
data = pd.DataFrame(rows)
data.to_csv("scores.csv", index=False)
print("Saved scores.csv with", len(data), "rows")

Step 1: Load and inspect
df = pd.read_csv("scores.csv", parse_dates=["date"])
print(df.head())
print(df.info())
print("Missing per column:\n", df.isna().sum())

Step 2: Clean missing scores (fill with subject mean)
# Compute mean score per subject, aligned to each row
subject_means = df.groupby("subject")["score"].transform("mean")
df["score_filled"] = df["score"].fillna(subject_means)
# Check no missing remain in score_filled
print("Missing in score_filled:", df["score_filled"].isna().sum())

Step 3: Summaries
# Top 5 students by average score
student_avg = df.groupby("student")["score_filled"].mean().sort_values(ascending=False)
print("Top students:\n", student_avg.head(5))
# Subject-level stats
subject_stats = df.groupby("subject")["score_filled"].agg(["count", "mean", "median", "std", "min", "max"])
print("Subject stats:\n", subject_stats)

Step 4: Z-scores per subject (using NumPy)
Explanation: z = (value - mean) / std. This compares a score to others in the same subject.
mean_by_subj = df.groupby("subject")["score_filled"].transform("mean")
std_by_subj = df.groupby("subject")["score_filled"].transform("std")
df["zscore"] = (df["score_filled"] - mean_by_subj) / std_by_subj
# Find unusually high/low scores (|z| >= 2)
outliers = df.loc[df["zscore"].abs() >= 2, ["date", "student", "subject", "score_filled", "zscore"]]
print("Outliers:\n", outliers.sort_values("zscore"))

Step 5: Pivot table for a dashboard-like view
pivot = pd.pivot_table(
    df, index="student", columns="subject",
    values="score_filled", aggfunc="mean"
)
print("Average score per student per subject:\n", pivot.round(1))

Step 6: Save results
student_avg.to_csv("student_averages.csv", header=["avg_score"])
subject_stats.to_csv("subject_stats.csv")
outliers.to_csv("outliers.csv", index=False)
pivot.to_csv("student_subject_matrix.csv")
print("Saved analysis CSVs.")

Optional challenges
- Filter by a date range to compare early vs. late performance.
- Add a new feature: consistency = std per student (lower is steadier).
- Use NumPy to compute the 90th percentile score overall:
p90 = np.percentile(df["score_filled"], 90)
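If you attempt the consistency challenge, one possible sketch looks like the following. The tiny DataFrame here is a stand-in for the project's df (which comes from scores.csv and has a score_filled column), just so the snippet runs on its own:

```python
import pandas as pd

# Stand-in for the project's df; values are invented for illustration
df = pd.DataFrame({
    "student": ["Ava", "Ava", "Ben", "Ben"],
    "score_filled": [80.0, 90.0, 85.0, 85.0],
})

# Standard deviation per student: lower means steadier performance
consistency = df.groupby("student")["score_filled"].std().sort_values()
print(consistency)  # Ben (0.0) is steadier than Ava (~7.07)
```

Sorting ascending puts the most consistent students first.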
Common tips and gotchas
- Use df.loc[condition, columns] to filter and update safely.
- When combining conditions, use parentheses and & (and), | (or). Example: df.loc[(df.math > 80) & (df.science > 80)]
- Pay attention to shapes in NumPy; shape mismatches cause errors. Use reshape or keep dimensions aligned.
- Set a random seed for reproducible results when using np.random.
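To illustrate the shape tip above, here is a small sketch (the values are invented) where reshaping makes an operation line up the way you intend:

```python
import numpy as np

data = np.arange(6)                        # shape (6,)
weights = np.array([[0.5], [1.0], [2.0]])  # shape (3, 1)

# Reshape data to (3, 2) so each row pairs with one weight;
# (3, 2) * (3, 1) broadcasts the weight across each row -> (3, 2)
matrix = data.reshape(3, 2)
scaled = matrix * weights
print(scaled)
```

Without the reshape, data * weights would broadcast (6,) against (3, 1) into a (3, 6) grid, which is a common source of silent shape surprises.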
Summary
- NumPy gives you fast, vectorized arrays with powerful math, broadcasting, and aggregation tools.
- Pandas builds labeled tables (DataFrames) that make real-world data cleaning, filtering, and summarizing easy.
- They work beautifully together: Pandas columns are NumPy arrays, so you can use NumPy functions directly.
- You practiced loading data, fixing missing values, grouping and aggregating, computing z-scores, and saving results.
Next step ideas
- Add plots with matplotlib or pandas.plot() to visualize your results.
- Try merging two DataFrames (pd.merge) like "scores" with "student_info" to enrich your analysis.
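If you try the merge idea, a minimal sketch might look like this (the student_info table is invented here purely for illustration):

```python
import pandas as pd

scores = pd.DataFrame({"student": ["Ava", "Ben"], "score": [88, 92]})
student_info = pd.DataFrame({"student": ["Ava", "Ben"], "grade_level": [7, 8]})

# Join on the shared "student" column; how="left" keeps every scores row
enriched = pd.merge(scores, student_info, on="student", how="left")
print(enriched)
```

Each scores row picks up the matching student's extra columns, ready for richer grouping and filtering.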