PyVerse Lesson: Cleaning and Organizing Data
Introduction
Imagine your school club collects sign-up info: names, grades, emails. Some students type " aLiCe ", some forget their email, and others type "Grade Nine" instead of 9. Messy! Cleaning and organizing data helps turn messy information into something clear and dependable, so you can answer real questions like: "How many members do we have?" or "Who still needs a permission form?"
What You'll Learn
- How to load and peek at a dataset
- How to fix common messes: extra spaces, weird capitalization, missing values, duplicates, and wrong data types
- How to sort and save clean data
What You Need
- Python 3
- The pandas library
- Tip: If you don't have pandas, install it in a terminal or in a notebook cell:
pip install pandas
Step-by-Step: Cleaning and Organizing Data
- Peek at your data (look before you clean) — Load the data. Look at a few rows. Check what columns you have and if anything is missing.
- Fix column names — Clean column names so they are easy to read and use: lowercase, no spaces, no weird symbols.
- Clean text values — Remove extra spaces. Make text consistent (e.g., "YES", "Yes", "yes" → all become "Yes" or "yes").
- Handle missing values — Decide what to do if something is blank (missing). You can fill with a placeholder (like "Unknown") or drop the row if it's unusable.
- Fix data types — Make sure numbers are numbers (not text). Make sure dates are dates.
- Remove duplicates — If the same person/product appears twice by mistake, keep just one.
- Sort and save — Put the data in a helpful order (e.g., by age or price). Save the cleaned data to a new file.
Python Code Examples
Code Example 1: Load and Peek at Messy Data
We'll make a small, messy dataset right in the code (no file needed).
# Code Example 1: Load and peek at messy data
import pandas as pd
# Create a small, messy dataset
data = {
" Name ": [" alice ", "Bob", "BOB ", "Charlie", "dora "],
"Age": ["14", "15", None, "15", " 16 "],
"Email": ["alice@example.com", "", "bob@example.com", "charlie@example.com", None],
"Member": ["yes", "Yes", "NO", "yes ", " yes"],
"Notes": ["", "likes chess", None, " ", "likes chess"] # for duplicates test later
}
df = pd.DataFrame(data)
# Peek at the data
print("First few rows:")
print(df.head(), "\n")
# Check columns and data types
print("Info about the data:")
df.info()  # info() prints its report directly and returns None, so it doesn't need a print() wrapper
print()
# Count missing values per column
print("Missing values per column:")
print(df.isna().sum())

What's happening:
- df.head() shows the top rows.
- df.info() shows column types and non-missing counts.
- df.isna().sum() shows how many values are missing in each column.
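By the way, if your sign-up data lived in a CSV file instead of being built in code, you could load it with pd.read_csv. The filename below is just a placeholder for illustration, not a file this lesson provides.
# Hypothetical alternative: load the same kind of data from a file
import pandas as pd
df = pd.read_csv("members.csv")  # replace "members.csv" with your real file name
print(df.head())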
Code Example 2: Clean Column Names and Text
We'll make columns easier to use and fix text formatting.
# Code Example 2: Clean column names and text
# 1) Clean column names: lowercase, strip spaces, replace spaces with underscores
df.columns = (
df.columns
.str.strip() # remove spaces around column names
.str.lower() # make lowercase
.str.replace(" ", "_") # replace spaces with underscores
)
# 2) Clean text values in specific columns
# Strip spaces from 'name' and set to Title Case (e.g., 'alice' -> 'Alice')
df["name"] = df["name"].str.strip().str.title()
# Standardize yes/no in 'member' column
# - Strip spaces
# - Lowercase
# - Map to consistent "Yes" or "No"
df["member"] = (
df["member"]
.str.strip()
.str.lower()
.map({"yes": "Yes", "no": "No"})
)
# Clean email: strip spaces; turn empty strings into actual missing values (NaN)
df["email"] = df["email"].astype(str).str.strip()
df["email"] = df["email"].replace({"": pd.NA, "none": pd.NA, "nan": pd.NA})
# Show results so far
print("After cleaning column names and text:")
print(df.head())

What's happening:
- Column names become simple: name, age, email, member, notes.
- Text is standardized so you don't have to remember every possible spelling.
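A quick check (not part of the original example, assuming you just ran Code Example 2) to confirm the standardization worked:
# Peek at the distinct values after cleaning
print(df["member"].unique())  # should contain only "Yes" and "No"
print(df["name"].unique())    # spaces stripped, Title Case applied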
Code Example 3: Missing Values, Data Types, Duplicates, Sort, Save
Now let's fix missing values, make sure numbers are numbers, remove duplicates, and save.
# Code Example 3: Handle missing values, types, duplicates, sort, and save
# 1) Convert age to numbers (some were strings with spaces)
df["age"] = df["age"].astype(str).str.strip() # remove spaces
df["age"] = pd.to_numeric(df["age"], errors="coerce") # convert to number; bad values -> NaN
# 2) Fill missing text with a helpful placeholder
df["email"] = df["email"].fillna("no-email@pyverse.com")
# 3) If age is missing, decide what to do:
# For this example, we will drop rows missing age (depends on your data needs)
df = df.dropna(subset=["age"])
# 4) Remove duplicates if all key columns match
# Choose columns that define a "duplicate" for your situation
df = df.drop_duplicates(subset=["name", "email", "member", "age"])
# 5) Optional: treat 'notes' that are empty or just spaces as missing
# First convert to strings, then turn pure-space notes into missing values (NA)
df["notes"] = df["notes"].astype(str)
df["notes"] = df["notes"].where(df["notes"].str.strip() != "", pd.NA)
# 6) Sort by age (youngest first)
df = df.sort_values(by="age")
# 7) Save to a new CSV file
df.to_csv("clean_members.csv", index=False)
print("Cleaned data:")
print(df)
print("\nSaved to clean_members.csv")What's happening:
pd.to_numericturns age into numbers safely.- We fill missing emails with a clear placeholder.
- We drop rows with missing age (your choice may differ).
- We remove duplicated entries to avoid counting someone twice.
- We sort and save the final clean dataset.
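As a final sanity check (a small addition, assuming you just ran Code Example 3), you can confirm the size of the cleaned table and answer the intro question about membership:
# How many rows and columns are left after cleaning?
print(df.shape)
# "How many members do we have?" -> count each value in the member column
print(df["member"].value_counts())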
Small Practical Exercise 🛒
Your task: You are helping a small online store clean their product list. Fix the messy values and save the clean data.
Dataset (copy into your code):
You can paste this into your script and read it using pandas and io.StringIO.
import pandas as pd
import io
csv_text = """Product , Price , City , Email
" water bottle " , "$12.99" , " new york " , "store@example.com"
"Headphones" , "$ 29.50 " , "NEW YORK" , ""
"Phone Case" , "$9.00" , " los angeles", "support@example.com"
"phone case" , "$9.00" , " Los Angeles " , "support@example.com"
"Sticker Pack" , "$3" , "seattle",
"""
# Read the CSV from the text above (no file needed)
# skipinitialspace=True skips the stray spaces after each comma so the quoted values parse cleanly
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print("Original data:")
print(df)

Your cleaning goals:
- Clean column names (lowercase, underscores, no extra spaces).
- City: strip spaces and use Title Case (e.g., "new york" → "New York").
- Price: make it a number (remove the $ sign and extra spaces, convert to float).
- Email: replace empty or missing emails with "no-email@pyverse.com".
- Remove duplicate product rows (a duplicate is same product name, price, and city).
- Sort by price from highest to lowest.
- Save to clean_products.csv.
Starter code (one possible solution, with the TODO filled in; try each step yourself first):
# 1) Clean column names
df.columns = (
df.columns
.str.strip()
.str.lower()
.str.replace(" ", "_")
)
# 2) Clean text columns
df["product"] = df["product"].astype(str).str.strip().str.title()
df["city"] = df["city"].astype(str).str.strip().str.title()
# 3) Clean price: remove '$' and spaces, convert to float
df["price"] = (
df["price"]
.astype(str)
.str.replace("$", "", regex=False)
.str.strip()
)
# TODO: convert price to a number (hint: pd.to_numeric with errors="coerce")
df["price"] = pd.to_numeric(df["price"], errors="coerce")
# 4) Clean email: empty strings and missing -> "no-email@pyverse.com"
df["email"] = df["email"].astype(str).str.strip()
df["email"] = df["email"].replace({"": pd.NA, "nan": pd.NA, "None": pd.NA})
df["email"] = df["email"].fillna("no-email@pyverse.com")
# 5) Remove duplicates by product, price, city
df = df.drop_duplicates(subset=["product", "price", "city"])
# 6) Sort by price, highest to lowest
df = df.sort_values(by="price", ascending=False)
# 7) Save the clean file
df.to_csv("clean_products.csv", index=False)
print("Cleaned data:")
print(df)

Challenge extension (optional):
Add a new column price_with_tax that is price * 1.10 (10% tax), then round to 2 decimals.
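One possible solution, if you want to check your work afterwards (price_with_tax is the new column name from the challenge):
# Possible challenge solution: add 10% tax and round to 2 decimals
df["price_with_tax"] = (df["price"] * 1.10).round(2)
print(df[["product", "price", "price_with_tax"]])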
Tips and Common Gotchas
- If a number won't convert, check for hidden spaces or symbols like "$".
- If your dates look weird, try pd.to_datetime(df["date"], errors="coerce") (there is a short sketch after this list).
- Decide on one style for text (Title Case or lowercase) and stick to it.
- Keep the original file safe; save cleaned data to a new file.
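Dates don't appear in the datasets above, so here is a minimal, self-contained sketch of the date tip; the "joined" column and its values are made up for illustration:
import pandas as pd
# Made-up column: one good date, one impossible date, one non-date, one missing value
raw = pd.Series(["2024-01-15", "2024-02-30", "not a date", None], name="joined")
# errors="coerce" turns anything that can't be parsed into NaT (a missing date)
joined = pd.to_datetime(raw, errors="coerce")
print(joined)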
Recap
- Cleaning data means making it clear, consistent, and ready to use.
- Steps: peek, fix column names, clean text, handle missing values, fix types, remove duplicates, sort, and save.
- With pandas, a few lines of code can turn messy data into clean, organized info for real-world decisions.