PyVerse Lesson: Cleaning and Organizing Data

Introduction

Imagine your school club collects sign-up info: names, grades, emails. Some students type " aLiCe ", some forget their email, and others type "Grade Nine" instead of 9. Messy! Cleaning and organizing data helps turn messy information into something clear and dependable, so you can answer real questions like: "How many members do we have?" or "Who still needs a permission form?"

What You'll Learn

  • How to load and peek at a dataset
  • How to fix common messes: extra spaces, weird capitalization, missing values, duplicates, and wrong data types
  • How to sort and save clean data

What You Need

  • Python 3
  • The pandas library
  • Tip: If you don't have pandas, install it from a terminal with pip install pandas (in a notebook cell, use %pip install pandas)

Step-by-Step: Cleaning and Organizing Data

  1. Peek at your data (look before you clean) — Load the data. Look at a few rows. Check what columns you have and if anything is missing.
  2. Fix column names — Clean column names so they are easy to read and use: lowercase, no spaces, no weird symbols.
  3. Clean text values — Remove extra spaces. Make text consistent (e.g., "YES", "Yes", "yes" → all become "Yes" or "yes").
  4. Handle missing values — Decide what to do if something is blank (missing). You can fill with a placeholder (like "Unknown") or drop the row if it's unusable.
  5. Fix data types — Make sure numbers are numbers (not text). Make sure dates are dates.
  6. Remove duplicates — If the same person/product appears twice by mistake, keep just one.
  7. Sort and save — Put the data in a helpful order (e.g., by age or price). Save the cleaned data to a new file.

Python Code Examples

Code Example 1: Load and Peek at Messy Data

We'll make a small, messy dataset right in the code (no file needed).

# Code Example 1: Load and peek at messy data
import pandas as pd

# Create a small, messy dataset
data = {
    " Name ": [" alice ", "Bob", "BOB ", "Charlie", "dora "],
    "Age": ["14", "15", None, "15", " 16 "],
    "Email": ["alice@example.com", "", "bob@example.com", "charlie@example.com", None],
    "Member": ["yes", "Yes", "NO", "yes ", " yes"],
    "Notes": ["", "likes chess", None, " ", "likes chess"],  # for duplicates test later
}
df = pd.DataFrame(data)

# Peek at the data
print("First few rows:")
print(df.head(), "\n")

# Check columns and data types
print("Info about the data:")
df.info()  # df.info() prints its summary itself, so we don't wrap it in print()
print()

# Count missing values per column
print("Missing values per column:")
print(df.isna().sum())

What's happening:

  • df.head() shows the top rows.
  • df.info() shows column types and non-missing counts.
  • df.isna().sum() shows how many values are missing in each column.

Code Example 2: Clean Column Names and Text

We'll make columns easier to use and fix text formatting.

# Code Example 2: Clean column names and text

# 1) Clean column names: lowercase, strip spaces, replace spaces with underscores
df.columns = (
    df.columns
    .str.strip()            # remove spaces around column names
    .str.lower()            # make lowercase
    .str.replace(" ", "_")  # replace spaces with underscores
)

# 2) Clean text values in specific columns
# Strip spaces from 'name' and set to Title Case (e.g., 'alice' -> 'Alice')
df["name"] = df["name"].str.strip().str.title()

# Standardize yes/no in 'member' column
# - Strip spaces
# - Lowercase
# - Map to consistent "Yes" or "No"
df["member"] = (
    df["member"]
    .str.strip()
    .str.lower()
    .map({"yes": "Yes", "no": "No"})
)

# Clean email: strip spaces; turn empty strings into actual missing values (NaN)
# Note: astype(str) turns missing values into the text "None", so we replace that too
df["email"] = df["email"].astype(str).str.strip()
df["email"] = df["email"].replace({"": pd.NA, "None": pd.NA, "nan": pd.NA})

# Show results so far
print("After cleaning column names and text:")
print(df.head())

What's happening:

  • Column names become simple: name, age, email, member, notes.
  • Text is standardized so you don't have to remember every possible spelling.

Code Example 3: Missing Values, Data Types, Duplicates, Sort, Save

Now let's fix missing values, make sure numbers are numbers, remove duplicates, and save.

# Code Example 3: Handle missing values, types, duplicates, sort, and save

# 1) Convert age to numbers (some were strings with spaces)
df["age"] = df["age"].astype(str).str.strip()           # remove spaces
df["age"] = pd.to_numeric(df["age"], errors="coerce")   # convert to number; bad values -> NaN

# 2) Fill missing text with a helpful placeholder
df["email"] = df["email"].fillna("no-email@pyverse.com")

# 3) If age is missing, decide what to do:
#    For this example, we drop rows missing age (depends on your data needs)
df = df.dropna(subset=["age"])

# 4) Remove duplicates if all key columns match
#    Choose columns that define a "duplicate" for your situation
df = df.drop_duplicates(subset=["name", "email", "member", "age"])

# 5) Optional: treat notes that are empty or just spaces as missing values
df["notes"] = df["notes"].astype(str)
df["notes"] = df["notes"].where(df["notes"].str.strip() != "", pd.NA)

# 6) Sort by age (youngest first)
df = df.sort_values(by="age")

# 7) Save to a new CSV file
df.to_csv("clean_members.csv", index=False)

print("Cleaned data:")
print(df)
print("\nSaved to clean_members.csv")

What's happening:

  • pd.to_numeric turns age into numbers safely.
  • We fill missing emails with a clear placeholder.
  • We drop rows with missing age (your choice may differ).
  • We remove duplicated entries to avoid counting someone twice.
  • We sort and save the final clean dataset.

Small Practical Exercise 🛒

Your task: You are helping a small online store clean their product list. Fix the messy values and save the clean data.

Dataset (copy into your code):

You can paste this into your script and read it using pandas and io.StringIO.

import pandas as pd
import io

csv_text = """Product , Price , City , Email
" water bottle " , "$12.99" , " new york " , "store@example.com"
"Headphones" , "$ 29.50 " , "NEW YORK" , ""
"Phone Case" , "$9.00" , " los angeles", "support@example.com"
"phone case" , "$9.00" , " Los Angeles " , "support@example.com"
"Sticker Pack" , "$3" , "seattle",
"""

# Read the CSV from the text above (no file needed)
# skipinitialspace=True skips the spaces after each comma so the quoted values parse correctly
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print("Original data:")
print(df)

Your cleaning goals:

  • Clean column names (lowercase, underscores, no extra spaces).
  • City: strip spaces and use Title Case (e.g., "new york" → "New York").
  • Price: make it a number (remove the $ sign and extra spaces, convert to float).
  • Email: replace empty or missing emails with "no-email@pyverse.com".
  • Remove duplicate product rows (a duplicate is same product name, price, and city).
  • Sort by price from highest to lowest.
  • Save to clean_products.csv.

Starter code (fill the TODOs):

# 1) Clean column names
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

# 2) Clean text columns
df["product"] = df["product"].astype(str).str.strip().str.title()
df["city"] = df["city"].astype(str).str.strip().str.title()

# 3) Clean price: remove '$' and spaces, convert to float
df["price"] = (
    df["price"]
    .astype(str)
    .str.replace("$", "", regex=False)
    .str.strip()
)
# TODO: convert price to a number (hint: pd.to_numeric with errors="coerce")
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# 4) Clean email: empty strings and missing -> "no-email@pyverse.com"
df["email"] = df["email"].astype(str).str.strip()
df["email"] = df["email"].replace({"": pd.NA, "nan": pd.NA, "None": pd.NA})
df["email"] = df["email"].fillna("no-email@pyverse.com")

# 5) Remove duplicates by product, price, city
df = df.drop_duplicates(subset=["product", "price", "city"])

# 6) Sort by price, highest to lowest
df = df.sort_values(by="price", ascending=False)

# 7) Save the clean file
df.to_csv("clean_products.csv", index=False)

print("Cleaned data:")
print(df)

Challenge extension (optional):

Add a new column price_with_tax that is price * 1.10 (10% tax), then round to 2 decimals.
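
If you want to check your work on the challenge, here is one possible way to do it. This is a minimal sketch that assumes your cleaned DataFrame is still named df and has a numeric price column:

# Challenge sketch: add a price_with_tax column (10% tax), rounded to 2 decimals
df["price_with_tax"] = (df["price"] * 1.10).round(2)

print(df[["product", "price", "price_with_tax"]])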

Tips and Common Gotchas

  • If a number won't convert, check for hidden spaces or symbols like "$".
  • If your dates look weird, try: pd.to_datetime(df["date"], errors="coerce"). Values that can't be parsed become NaT (a missing date) instead of raising an error (see the short sketch after these tips).
  • Decide on one style for text (Title Case or lowercase) and stick to it.
  • Keep the original file safe; save cleaned data to a new file.
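
Here is a tiny sketch of the date tip above. The "date" values are made up just for illustration; errors="coerce" turns anything that can't be parsed into NaT, pandas' missing-date value:

import pandas as pd

# Made-up example: two good dates and one bad value
dates = pd.Series(["2024-01-05", "2024-01-06", "not a date"], name="date")

# Values that can't be parsed become NaT instead of raising an error
clean_dates = pd.to_datetime(dates, errors="coerce")
print(clean_dates)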

Recap

  • Cleaning data means making it clear, consistent, and ready to use.
  • Steps: peek, fix column names, clean text, handle missing values, fix types, remove duplicates, sort, and save.
  • With pandas, a few lines of code can turn messy data into clean, organized info for real-world decisions.
