Regular Expressions in Python (Regex) 

Goal

  • Understand what regular expressions (regex) are and why they're powerful.
  • Learn the core regex building blocks and how to use Python's re module.
  • Practice with clear examples.
  • Build a small, practical project: a Contact Extractor and Cleaner.

What is a Regular Expression?

A regular expression (regex) is a tiny pattern language for finding, checking, or changing text. Think of it like a super-smart "find" tool. Instead of searching for exactly "cat," you can search for "any three-letter word," "an email," or "a phone number," and more.

In Python, you write regex with the re module. Important: always write patterns as raw strings (prefix with r) so backslashes behave correctly.

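Example: r"\d+" means "one or more digits." Without the r prefix, "\d" is an invalid Python string escape (recent Python versions warn about it), and an escape such as "\b" silently becomes a backspace character instead of a word boundary.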
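To see the difference concretely, here is a small sketch comparing a normal string with a raw string. The variable names plain and raw are just for illustration.

```python
import re

# In a normal string, "\b" is a single backspace character (ASCII 8).
plain = "\bcat\b"
# In a raw string, the backslash survives, so the regex engine sees a word boundary.
raw = r"\bcat\b"

print(len(plain), len(raw))            # 5 7
print(re.findall(plain, "a cat sat"))  # []
print(re.findall(raw, "a cat sat"))    # ['cat']
```

The first pattern looks for literal backspace characters around "cat", which the text does not contain, so it finds nothing.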

Regex cheat sheet (most-used)

  • Literals: cat matches "cat".
  • Character classes:
    • [abc] any one of a, b, or c
    • [a-z] any lowercase letter a to z
    • [^a-z] any character NOT a to z
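    • \d digit, \w word character (letters, digits, underscore), \s whitespace; \D, \W, \S are the opposites. On str patterns these also match Unicode digits and letters; pass re.ASCII to limit \d to [0-9] and \w to [A-Za-z0-9_]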
  • Quantifiers:
    • ? 0 or 1 times
    • * 0 or more
    • + 1 or more
    • {m} exactly m times; {m,n} between m and n
  • Anchors and boundaries:
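    • ^ start of string (or of each line with re.MULTILINE)
    • $ end of string (or of each line with re.MULTILINE)
    • \b word boundary (between a word character and a non-word character; underscore counts as a word character)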
  • Groups and choices:
    • (...) capture group
    • (?:...) non-capturing group
    • (?P<name>...) named group
    • | OR (alternation)
  • Greedy vs lazy:
    • * and + are greedy (grab as much as possible)
    • *? and +? are lazy (grab as little as possible)
  • Lookarounds (advanced, read-only checks):
    • (?=...) positive lookahead (must be followed by …)
    • (?!...) negative lookahead (must NOT be followed by …)
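A few cheat-sheet entries in action. This is only a sketch on a made-up sentence; the variable names are illustrative.

```python
import re

text = "Order A12 shipped 2024-03-07; order b7 pending, total $45."

# Character class + quantifier + boundary: one letter followed by 1-2 digits
codes = re.findall(r"[A-Za-z]\d{1,2}\b", text)
print(codes)     # ['A12', 'b7']

# Exact counts with {m}: a YYYY-MM-DD date
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)
print(dates)     # ['2024-03-07']

# Non-capturing group with alternation
states = re.findall(r"\b(?:shipped|pending)\b", text)
print(states)    # ['shipped', 'pending']

# Positive lookahead: digits that are followed by a period
before_dot = re.findall(r"\d+(?=\.)", text)
print(before_dot)  # ['45']
```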

Python's re module essentials

  • re.search(pat, text): first match anywhere
  • re.match(pat, text): match at start only (use rarely; prefer fullmatch if validating)
  • re.fullmatch(pat, text): match the whole string (best for validation)
  • re.findall(pat, text): list of all matches (strings or tuples)
  • re.finditer(pat, text): iterator of match objects (gives spans and groups)
  • re.sub(pat, repl, text): replace matches
  • re.split(pat, text): split on a pattern
  • re.compile(pat, flags=...): pre-compile a pattern for speed/readability
  • Common flags: re.IGNORECASE (case-insensitive), re.MULTILINE (^/$ match per line), re.DOTALL (. matches newline), re.VERBOSE (allow comments and spaces in pattern)
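Here is a quick side-by-side sketch of the functions listed above, all using the same compiled pattern. The sample strings are made up.

```python
import re

pat = re.compile(r"\d+")  # one or more digits

print(pat.search("abc 123 def").group())  # 123  (first match anywhere)
print(pat.match("abc 123"))               # None (string does not start with digits)
print(pat.match("123abc").group())        # 123  (match only checks the start)
print(pat.fullmatch("123abc"))            # None (whole string must match)
print(pat.findall("1, 22, 333"))          # ['1', '22', '333']

spans = [m.span() for m in pat.finditer("1, 22")]
print(spans)                              # [(0, 1), (3, 5)]

print(pat.sub("#", "a1b22"))              # a#b#
print(re.split(r",\s*", "a, b,c"))        # ['a', 'b', 'c']
```

Compiled patterns expose the same methods as the module-level functions, so you pass only the text.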

1) Getting started: basic matches

    import re

    text = "My cat has 9 lives."

    # Search anywhere
    m1 = re.search(r"cat", text)
    print(m1.group(), m1.span())  # cat (3, 6)

    # Full string validation: only digits?
    print(bool(re.fullmatch(r"\d+", "12345")))  # True
    print(bool(re.fullmatch(r"\d+", "123a5")))  # False

    # Find all numbers
    print(re.findall(r"\d+", text))  # ['9']

2) Character classes and boundaries

    text = "Colors: red, reed, read; feed; lead."

    # Spelled out: each word wrapped in word boundaries
    print(re.findall(r"\bred\b|\breed\b|\bread\b", text))  # ['red', 'reed', 'read']

    # Pitfall: with a capture group, findall returns the group, not the whole match
    # ('red' matches the second branch, so its group is empty)
    print(re.findall(r"\br(e(?:e|a))d\b|\bred\b", text))  # ['', 'ee', 'ea']

    # Simpler: r followed by (ed|eed|ead), using a non-capturing group
    print(re.findall(r"\br(?:ed|eed|ead)\b", text))  # ['red', 'reed', 'read']

    # Boundary example: match 'cat' as a whole word only
    print(re.findall(r"\bcat\b", "bobcat cat scatter catalog"))  # ['cat']

3) Quantifiers and greedy vs lazy

    text = 'He said "hello" and then "bye".'

    # Greedy: grabs from first " to last "
    print(re.findall(r'"(.*)"', text))  # ['hello" and then "bye']

    # Lazy: stop at the first closing "
    print(re.findall(r'"(.*?)"', text))  # ['hello', 'bye']

4) Groups, capturing, and named groups

    text = "Date: 2025-10-05. Backup: 05/10/2025."

    # Named groups for readable access
    iso = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)
    print(iso.group('year'), iso.group('month'), iso.group('day'))  # 2025 10 05

    # Multiple date formats using alternation and named groups
    pattern = re.compile(r"""
        (?:
            (?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})       # YYYY-MM-DD
          | (?P<d2>\d{2})/(?P<m2>\d{2})/(?P<y2>\d{4})    # DD/MM/YYYY
        )
    """, re.VERBOSE)
    for m in pattern.finditer(text):
        print(m.group(), m.groupdict())

5) Replacement (sub) and split

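    text = "Call me at 555-123-4567 or 555.999.0000!"

    # Replace every dot or whitespace character with a dash (spaces between words too)
    print(re.sub(r"[.\s]", "-", text))  # Call-me-at-555-123-4567-or-555-999-0000!

    # Narrower: only replace a separator that sits between two digits
    print(re.sub(r"(?<=\d)[.\s](?=\d)", "-", text))  # Call me at 555-123-4567 or 555-999-0000!

    # Extract all words by splitting on non-word characters
    print(re.split(r"\W+", "This_is a test! Cool? Yes."))  # ['This_is', 'a', 'test', 'Cool', 'Yes', '']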

6) Flags: IGNORECASE, MULTILINE, DOTALL, VERBOSE

    text = "Apple\nbanana\nApricot"

    # Case-insensitive find, with ^ matching at the start of every line
    print(re.findall(r"^a\w+", text, flags=re.IGNORECASE | re.MULTILINE))  # ['Apple', 'Apricot']

    # DOTALL makes . match newlines
    print(re.findall(r"A.*t", text, flags=re.DOTALL))  # ['Apple\nbanana\nApricot']

7) Lookarounds (quick peek checks)

    text = "apple pie, apple tart, apple juice"

    # 'apple' not followed by ' pie'
    print(re.findall(r"apple(?! pie)", text))  # ['apple', 'apple'] for tart and juice

    # Letters followed by a digit (\w+ would also swallow digits: 'y1' and '12' would match)
    print(re.findall(r"\b[A-Za-z]+(?=\d)", "x9 y10 z 123 abc"))  # ['x', 'y']

8) Safe dynamic patterns with re.escape

If you insert user text into a regex, escape it first to avoid accidental special characters.

    user = "file.name[1].txt"
    safe = re.escape(user)
    print(safe)  # file\.name\[1\]\.txt
    print(bool(re.search(safe, "open file.name[1].txt now")))  # True

Practical Project: Contact Extractor and Cleaner

Goal: From messy text, extract emails and US-style phone numbers, then normalize them. This simulates cleaning data from logs, chats, or copy-pasted documents.

What you'll practice:

  • Designing and compiling readable patterns
  • Using named groups and finditer
  • Using sub with a function to normalize
  • Deduplicating results

Step 1: Sample messy text

    sample = """
    Team:
    - Alex: alex.miller@example.com, Phone: (555) 123-4567
    - SAM: SAM99@Example.COM, Phone: 555.987.0000 ext 23
    - Lina: lina+school@edu.co | phone: +1 555 111 2222
    Backup: Contact support@example.co.uk or 555-222-3333.
    """

Step 2: Build patterns

    import re

    EMAIL_RE = re.compile(r"""
        \b
        [A-Za-z0-9._%+-]+      # username
        @
        [A-Za-z0-9.-]+         # domain
        \.[A-Za-z]{2,}         # TLD
        \b
    """, re.VERBOSE)

    PHONE_RE = re.compile(r"""
        (?<!\w)                        # not preceded by a word character (a plain \b would skip a leading ( or +)
        (?:\+?1[\s.-]*)?               # optional country code +1
        (?:\(?(\d{3})\)?[\s.-]*)       # area code (captured)
        (\d{3})[\s.-]*                 # prefix
        (\d{4})(?!\d)                  # line number, not followed by another digit
        (?:\s*(?:ext\.?|x)\s*(\d+))?   # optional extension
    """, re.VERBOSE | re.IGNORECASE)

Step 3: Extraction and normalization

    def normalize_email(e):
        return e.lower()

    def normalize_phone(match):
        area, pref, line, ext = match.groups()
        # Assume +1 if 10-digit US pattern
        base = f"+1-{area}-{pref}-{line}"
        if ext:
            base += f" x{ext}"
        return base

    def extract_contacts(text):
        # A set removes duplicates; lowercasing makes emails compare equal regardless of case
        emails = {normalize_email(m.group()) for m in EMAIL_RE.finditer(text)}
        # Normalize each phone match and collect the results in a set
        normalized_phones = set()
        for m in PHONE_RE.finditer(text):
            normalized_phones.add(normalize_phone(m))
        return sorted(emails), sorted(normalized_phones)

    emails, phones = extract_contacts(sample)
    print("Emails:")
    for e in emails:
        print(" -", e)
    print("Phones:")
    for p in phones:
        print(" -", p)

Expected output (both lists are sorted):

  • Emails:
    • alex.miller@example.com
    • lina+school@edu.co
    • sam99@example.com
    • support@example.co.uk
  • Phones:
    • +1-555-111-2222
    • +1-555-123-4567
    • +1-555-222-3333
    • +1-555-987-0000 x23
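Step 3 collects the normalized numbers into a set. If you would rather rewrite the numbers inside the text itself, re.sub accepts a function as the replacement and calls it once per match. The sketch below uses a deliberately simplified, hypothetical pattern (SIMPLE_PHONE_RE) and helper (to_canonical) so that it runs on its own; it is not the project's PHONE_RE.

```python
import re

# Hypothetical simplified pattern: three digit groups separated by a dot, dash, or space.
SIMPLE_PHONE_RE = re.compile(r"(?<!\d)(\d{3})[\s.-](\d{3})[\s.-](\d{4})(?!\d)")

def to_canonical(match):
    # Called once per match; the return value replaces the matched text.
    area, prefix, line = match.groups()
    return f"+1-{area}-{prefix}-{line}"

messy = "Call 555.987.0000 or 555 111 2222."
clean = SIMPLE_PHONE_RE.sub(to_canonical, messy)
print(clean)  # Call +1-555-987-0000 or +1-555-111-2222.
```

Text that does not match the pattern is left untouched, which makes this approach handy for cleaning a document in place.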

Step 4 (optional): Redact sensitive info in text

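    def redact(text):
        # Keep the first character of the email, mask the rest
        text = EMAIL_RE.sub(lambda m: m.group()[0] + "***" + "@***", text)
        # Keep "+1-" and the area code, mask the remaining digits
        text = PHONE_RE.sub(lambda m: normalize_phone(m)[:7] + "***-****", text)
        return text

    print(redact(sample))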

Advanced polishing (optional)

  • Use re.VERBOSE to comment complex patterns.
  • Pre-compile patterns you reuse for speed.
  • If you accept user input to build patterns, always use re.escape.
  • Write unit tests for patterns using a list of "should match" and "should not match" cases.
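The last tip can be sketched as a tiny table-driven check. Everything here is illustrative: DATE_RE, the two case lists, and the check_pattern helper are made-up names, and a real project might use unittest or pytest instead.

```python
import re

# Hypothetical pattern under test: an ISO-style date, YYYY-MM-DD
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

SHOULD_MATCH = ["2025-10-05", "1999-01-31"]
SHOULD_NOT_MATCH = ["2025-10-5", "25-10-05", "2025/10/05", "2025-10-05 ", "x2025-10-05"]

def check_pattern(pattern, good, bad):
    """Return a list of human-readable failures (empty means all cases passed)."""
    failures = []
    for s in good:
        if pattern.fullmatch(s) is None:
            failures.append(f"expected match: {s!r}")
    for s in bad:
        if pattern.fullmatch(s) is not None:
            failures.append(f"unexpected match: {s!r}")
    return failures

failures = check_pattern(DATE_RE, SHOULD_MATCH, SHOULD_NOT_MATCH)
print(failures)  # []
```

Keeping the cases in plain lists makes it cheap to add a new example whenever a pattern surprises you.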

Common pitfalls and pro tips

  • Always use raw strings: r"\d{3}" rather than "\\d{3}" (which works but is harder to read) or "\d{3}" (an invalid string escape).
  • For validation, use re.fullmatch so extra characters don't sneak in.
  • Greedy vs lazy matters a lot with .* and .+; try .*? first when matching "inside" delimiters like quotes.
  • In multiline text, add re.MULTILINE if you want ^ and $ to match at each line's edges.
  • If . needs to match across line breaks, use re.DOTALL.
  • Prefer named groups (?P<name>...) for clarity.
  • Keep patterns readable with re.VERBOSE and comments.
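Three of these pitfalls are easy to reproduce. The snippet below is a small sketch with made-up strings.

```python
import re

# "$" also matches just before a trailing newline; fullmatch does not allow it.
dollar_ok = bool(re.search(r"^\d{3}$", "123\n"))
full_ok = bool(re.fullmatch(r"\d{3}", "123\n"))
print(dollar_ok, full_ok)  # True False

# Without MULTILINE, ^ and $ only anchor at the ends of the whole string.
log = "ok: 1\nerr: 2\nok: 3"
whole = re.findall(r"^ok: (\d)$", log)
per_line = re.findall(r"^ok: (\d)$", log, flags=re.MULTILINE)
print(whole, per_line)     # [] ['1', '3']

# Without DOTALL, "." stops at a newline.
html = "<b>two\nlines</b>"
no_dotall = re.search(r"<b>(.*)</b>", html)
with_dotall = re.search(r"<b>(.*)</b>", html, flags=re.DOTALL)
print(no_dotall)                   # None
print(repr(with_dotall.group(1)))  # 'two\nlines'
```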

Mini practice exercises

  • Write a regex to validate a simple username: starts with a letter, then letters/digits/underscores, length 3–16.
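    • Hint: r"^[A-Za-z]\w{2,15}$" works with re.search; with re.fullmatch you can drop ^ and $, which also rejects a trailing newline.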
  • Extract all hashtags from a sentence like "I love #coding and #Python3!".
    • Hint: r"#\w+"
  • Replace repeated spaces and tabs with a single space.
    • Hint: r"[ \t]+"
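Try the exercises yourself first. Afterwards, here is one possible set of solutions as a sketch; the function names are made up, and the username check spells out [A-Za-z0-9_] instead of \w so that only ASCII characters pass.

```python
import re

def is_valid_username(name):
    # A letter first, then 2-15 letters/digits/underscores (3-16 characters total)
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9_]{2,15}", name) is not None

def find_hashtags(sentence):
    # "#" followed by one or more word characters
    return re.findall(r"#\w+", sentence)

def squeeze_spaces(s):
    # Collapse each run of spaces and tabs into one space
    return re.sub(r"[ \t]+", " ", s)

print(is_valid_username("py_fan42"))                  # True
print(is_valid_username("9lives"))                    # False
print(find_hashtags("I love #coding and #Python3!"))  # ['#coding', '#Python3']
print(squeeze_spaces("a  \t b\t\tc"))                 # a b c
```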

Summary

  • Regex is a compact language to find, validate, and transform text.
  • In Python, use the re module with raw strings.
  • Master the basics: character classes, quantifiers, anchors, groups, and lookarounds.
  • Practice with findall/finditer for extraction, fullmatch for validation, and sub/split for cleaning.
  • The Contact Extractor project showed how to design real patterns, normalize results, and redact sensitive info.

Keep experimenting. Small, focused patterns are easier to test and combine. Build confidence with examples, then use VERBOSE and named groups for bigger patterns.
