Regular Expressions

Goal

Understand what regular expressions (regex) are and why they're powerful.
Learn the core regex building blocks and how to use Python's re module.
Practice with clear examples.
Build a small, practical project: a Contact Extractor and Cleaner.

What is a Regular Expression?

A regular expression (regex) is a tiny pattern language for finding, checking, or changing text. Think of it like a super-smart "find" tool. Instead of searching for exactly "cat," you can search for "any three-letter word," "an email," or "a phone number," and more.

In Python, you write regex with the re module. Important: always write patterns as raw strings (prefix with r) so backslashes behave correctly.

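Example: r"\d+" means "one or more digits." Without the r prefix, Python processes backslash escapes before the regex engine sees the pattern: "\b" becomes a backspace character, and unrecognized escapes such as "\d" produce a warning in recent Python versions.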
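One way to see the effect of the r prefix is to compare string lengths. The sketch below uses "\b", which Python itself treats as a backspace escape; the sample sentence is made up for illustration.

```python
import re

# "\b" in a normal string is one character (backspace, \x08);
# r"\b" is two characters: a backslash and the letter b.
plain = "\bcat\b"
raw = r"\bcat\b"
print(len(plain), len(raw))  # 5 7

sentence = "my cat naps"
print(re.search(raw, sentence) is not None)    # True: word-boundary match
print(re.search(plain, sentence) is not None)  # False: looks for backspace characters
```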

Regex cheat sheet (most-used)

Literals: cat matches "cat".
Character classes:
- [abc] any one of a, b, or c
- [a-z] any lowercase letter a to z
- [^a-z] any character NOT a to z
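- \d digit, \w word character (letter, digit, or underscore), \s whitespace; \D, \W, \S are the opposites. For str patterns these are Unicode-aware; add re.ASCII to limit \d to [0-9] and \w to [A-Za-z0-9_]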
Quantifiers:
- ? 0 or 1 times
- * 0 or more
- + 1 or more
- {m} exactly m times; {m,n} between m and n
Anchors and boundaries:
- ^ start of string/line
- $ end of string/line
- \b word boundary (between a \w character and a non-\w character; note that underscore counts as a word character)
Groups and choices:
- (...) capture group
- (?:...) non-capturing group
- (?P<name>...) named group
- | OR (alternation)
Greedy vs lazy:
- * and + are greedy (grab as much as possible)
- *? and +? are lazy (grab as little as possible)
Lookarounds (advanced, read-only checks):
- (?=...) positive lookahead (must be followed by …)
- (?!...) negative lookahead (must NOT be followed by …)

Python's re module essentials

re.search(pat, text): first match anywhere
re.match(pat, text): match at start only (use rarely; prefer fullmatch if validating)
re.fullmatch(pat, text): match the whole string (best for validation)
re.findall(pat, text): list of all matches (whole-match strings if the pattern has no groups, the group's text if it has one group, tuples if it has several)
re.finditer(pat, text): iterator of match objects (gives spans and groups)
re.sub(pat, repl, text): replace matches
re.split(pat, text): split on a pattern
re.compile(pat, flags=...): pre-compile a pattern for speed/readability
Common flags: re.IGNORECASE (case-insensitive), re.MULTILINE (^/$ match per line), re.DOTALL (. matches newline), re.VERBOSE (allow comments and spaces in pattern)
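The difference between search, match, and fullmatch is easiest to see on one string. A minimal sketch, assuming a made-up input "ID-2048 shipped" and a hypothetical compiled pattern named id_re:

```python
import re

code = "ID-2048 shipped"

# search scans the whole string; match only tries position 0;
# fullmatch requires the pattern to cover the entire string.
print(re.search(r"\d+", code).group())    # 2048
print(re.match(r"\d+", code))             # None (the string starts with "ID")
print(re.match(r"ID-\d+", code).group())  # ID-2048
print(re.fullmatch(r"ID-\d+", code))      # None (" shipped" is left over)

# A compiled pattern exposes the same methods and carries its flags with it.
id_re = re.compile(r"id-(\d+)", re.IGNORECASE)
print(id_re.findall("ID-7, id-8, Id-9"))  # ['7', '8', '9']
```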

1) Getting started: basic matches

import re

text = "My cat has 9 lives."

# Search anywhere
m1 = re.search(r"cat", text)
print(m1.group(), m1.span())  # cat (3, 6)

# Full string validation: only digits?
print(bool(re.fullmatch(r"\d+", "12345")))  # True
print(bool(re.fullmatch(r"\d+", "123a5")))  # False

# Find all numbers
print(re.findall(r"\d+", text))  # ['9']

2) Character classes and boundaries

text = "Colors: red, reed, read; feed; lead."

# Three whole-word alternatives, spelled out
print(re.findall(r"\bred\b|\breed\b|\bread\b", text))  # ['red', 'reed', 'read']
# Pitfall: with a capture group, findall returns the group, not the whole match
print(re.findall(r"\br(e(?:e|a))d\b|\bred\b", text))   # ['', 'ee', 'ea'] ('red' matched, but the group was empty)

# Simpler: r followed by (ed|eed|ead), using a non-capturing group
print(re.findall(r"\br(?:ed|eed|ead)\b", text))  # ['red', 'reed', 'read']

# Boundary example: match 'cat' as a whole word only
print(re.findall(r"\bcat\b", "bobcat cat scatter catalog"))
# -> ['cat'] only the standalone 'cat'
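The same three words can also be matched with a character class instead of alternation. The sketch below is one possible formulation; the second pattern combines a negated class with \W (an idiom for "any word character except r") to pick out the other words in the sample.

```python
import re

text = "Colors: red, reed, read; feed; lead."

# r, e, an optional second vowel (e or a), then d, as a whole word
r_words = re.findall(r"\bre[ea]?d\b", text)
print(r_words)  # ['red', 'reed', 'read']

# Negated class: [^r\W] is "a word character that is not r"
other_words = re.findall(r"\b[^r\W]e[ea]d\b", text)
print(other_words)  # ['feed', 'lead']
```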

3) Quantifiers and greedy vs lazy

text = 'He said "hello" and then "bye".'

# Greedy: grabs from first " to last "
print(re.findall(r'"(.*)"', text))  # ['hello" and then "bye']

# Lazy: stop at the first closing "
print(re.findall(r'"(.*?)"', text))  # ['hello', 'bye']
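A negated character class is a common alternative to the lazy quantifier, and counted quantifiers such as {m,n} follow the same rules. A short sketch; the number string is made up for illustration.

```python
import re

text = 'He said "hello" and then "bye".'

# Negated class: "anything except a quote" cannot run past the closing quote
quoted = re.findall(r'"([^"]*)"', text)
print(quoted)  # ['hello', 'bye']

# Counted quantifier: runs of 2 to 3 digits, as whole words
numbers = re.findall(r"\b\d{2,3}\b", "7 42 365 2025")
print(numbers)  # ['42', '365']
```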

4) Groups, capturing, and named groups

text = "Date: 2025-10-05. Backup: 05/10/2025."

# Named groups for readable access
iso = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)
print(iso.group('year'), iso.group('month'), iso.group('day'))  # 2025 10 05

# Multiple date formats using alternation and named groups
pattern = re.compile(r"""
    (?:
        (?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})  # YYYY-MM-DD
      |
        (?P<d2>\d{2})/(?P<m2>\d{2})/(?P<y2>\d{4})  # DD/MM/YYYY
    )
""", re.VERBOSE)

for m in pattern.finditer(text):
    print(m.group(), m.groupdict())
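In the loop above, the named groups belonging to the alternative that did not match come back as None. One way to collapse both formats into a single shape is sketched below; the helper name to_ymd is hypothetical, and the pattern is a copy of the one above.

```python
import re

text = "Date: 2025-10-05. Backup: 05/10/2025."

date_re = re.compile(r"""
    (?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})       # YYYY-MM-DD
  |
    (?P<d2>\d{2})/(?P<m2>\d{2})/(?P<y2>\d{4})    # DD/MM/YYYY
""", re.VERBOSE)

def to_ymd(match):
    """Return (year, month, day) as ints, whichever alternative matched."""
    g = match.groupdict()
    # Groups from the alternative that did not match are None,
    # so fall back to the second set of names.
    year = g["y"] or g["y2"]
    month = g["m"] or g["m2"]
    day = g["d"] or g["d2"]
    return int(year), int(month), int(day)

dates = [to_ymd(m) for m in date_re.finditer(text)]
print(dates)  # [(2025, 10, 5), (2025, 10, 5)]
```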

5) Replacement (sub) and split

text = "Call me at 555-123-4567 or 555.999.0000!"

# Replace every dot or whitespace character with a dash (this also hits the spaces between words)
print(re.sub(r"[.\s]", "-", text))  # Call-me-at-555-123-4567-or-555-999-0000!

# Extract all words by splitting on non-word
print(re.split(r"\W+", "This_is a test! Cool? Yes."))  # ['This_is', 'a', 'test', 'Cool', 'Yes', '']
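A few further sub and split variations, sketched under the assumption that only separators between digits should change. The first pattern uses lookarounds (including a lookbehind, (?<=...)); the date string and the comma-separated string are made up for illustration.

```python
import re

text = "Call me at 555-123-4567 or 555.999.0000!"

# Lookarounds limit the replacement to separators that sit between two digits
dashed = re.sub(r"(?<=\d)[.\s](?=\d)", "-", text)
print(dashed)  # Call me at 555-123-4567 or 555-999-0000!

# Backreferences (\g<name>) reuse captured text in the replacement
iso = re.sub(
    r"(?P<d>\d{2})/(?P<m>\d{2})/(?P<y>\d{4})",
    r"\g<y>-\g<m>-\g<d>",
    "Backup: 05/10/2025",
)
print(iso)  # Backup: 2025-10-05

# maxsplit caps the number of splits; the remainder is left untouched
parts = re.split(r"\s*,\s*", "a , b,c ,d", maxsplit=2)
print(parts)  # ['a', 'b', 'c ,d']
```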

6) Flags: IGNORECASE, MULTILINE, DOTALL, VERBOSE

text = "Apple\nbanana\nApricot"

# Case-insensitive find
print(re.findall(r"^a\w+", text, flags=re.IGNORECASE | re.MULTILINE))
# -> ['Apple', 'Apricot']

# DOTALL makes . match newlines
print(re.findall(r"A.*t", text, flags=re.DOTALL))  # ['Apple\nbanana\nApricot']
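To see what each flag contributes, it can help to run the same patterns without it. A small comparison on the same three-line string:

```python
import re

text = "Apple\nbanana\nApricot"

# Without MULTILINE, ^ only matches at the very start of the string
start_only = re.findall(r"^a\w+", text, flags=re.IGNORECASE)
print(start_only)  # ['Apple']

# Without DOTALL, . stops at each newline
single_line = re.findall(r"A.*t", text)
print(single_line)  # ['Apricot']

# MULTILINE also lets $ match at the end of every line
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)
print(line_ends)  # ['Apple', 'banana', 'Apricot']
```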

7) Lookarounds (quick peek checks)

text = "apple pie, apple tart, apple juice"

# 'apple' not followed by ' pie'
print(re.findall(r"apple(?! pie)", text))  # ['apple', 'apple'] for tart and juice

# letters followed by a digit (the digit itself is not part of the match)
print(re.findall(r"\b[A-Za-z]+(?=\d)", "x9 y10 z 123 abc"))  # ['x', 'y']
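Python's re module also supports lookbehind, written (?<=...) and (?<!...), and several lookaheads can be stacked at one position. The sketch below uses a made-up price string and a hypothetical password-style rule (at least 8 characters, one digit, one uppercase letter) purely as an illustration.

```python
import re

prices = "cost: $40, weight: 12kg, fee: $7"

# Positive lookbehind: digits preceded by a dollar sign (the sign is not captured)
dollars = re.findall(r"(?<=\$)\d+", prices)
print(dollars)  # ['40', '7']

# Negative lookbehind: digits NOT preceded by a dollar sign or another digit
non_dollars = re.findall(r"(?<![\$\d])\d+", prices)
print(non_dollars)  # ['12']

# Stacked lookaheads check several conditions from the same starting position
has_digit_and_upper = re.compile(r"(?=.*\d)(?=.*[A-Z]).{8,}")
print(bool(has_digit_and_upper.fullmatch("Secret123")))  # True
print(bool(has_digit_and_upper.fullmatch("secret123")))  # False
```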

8) Safe dynamic patterns with re.escape

If you insert user text into a regex, escape it first to avoid accidental special characters.

user = "file.name[1].txt"
safe = re.escape(user)
print(safe)  # file\.name\[1\]\.txt
print(bool(re.search(safe, "open file.name[1].txt now")))  # True
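The same idea extends to several user-supplied terms: escape each one, then join them with | to build a single alternation. A minimal sketch with a hypothetical list of search terms:

```python
import re

# Hypothetical user-supplied search terms, some containing regex metacharacters
terms = ["C++", "a.b", "[draft]"]

# Escape each term, then join with | to build one alternation
pattern = re.compile("|".join(re.escape(t) for t in terms))

text = "Notes on C++ and a.b; axb is different. Status: [draft]"
found = pattern.findall(text)
print(found)  # ['C++', 'a.b', '[draft]']
```

Because the dot in "a.b" is escaped, "axb" is not matched.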

Practical Project: Contact Extractor and Cleaner

Goal: From messy text, extract emails and US-style phone numbers, then normalize them. This simulates cleaning data from logs, chats, or copy-pasted documents.

What you'll practice:

Designing and compiling readable patterns
Using capture groups and finditer
Using sub with a function to redact
Deduplicating results

Step 1: Sample messy text

sample = """
Team:
- Alex: alex.miller@example.com, Phone: (555) 123-4567
- SAM: SAM99@Example.COM, Phone: 555.987.0000 ext 23
- Lina: lina+school@edu.co  | phone: +1 555 111 2222
Backup: Contact support@example.co.uk or 555-222-3333.
"""

Step 2: Build patterns

import re

EMAIL_RE = re.compile(r"""
    \b
    [A-Za-z0-9._%+-]+          # username
    @
    [A-Za-z0-9.-]+             # domain
    \.[A-Za-z]{2,}             # TLD
    \b
""", re.VERBOSE)

PHONE_RE = re.compile(r"""
    (?<!\w)                    # no word character before; unlike \b, this lets a match start at + or (
    (?:\+?1[\s.-]*)?           # optional country code +1
    (?:\(?(\d{3})\)?[\s.-]*)   # area code (captured)
    (\d{3})[\s.-]*             # prefix
    (\d{4})                    # line number
    (?:\s*(?:ext\.?|x)\s*(\d+))?  # optional extension
""", re.VERBOSE | re.IGNORECASE)

Step 3: Extraction and normalization

def normalize_email(e):
    return e.lower()

def normalize_phone(match):
    area, pref, line, ext = match.groups()
    # Assume +1 if 10-digit US pattern
    base = f"+1-{area}-{pref}-{line}"
    if ext:
        base += f" x{ext}"
    return base

def extract_contacts(text):
    emails = {normalize_email(m.group()) for m in EMAIL_RE.finditer(text)}
    # Collect normalized phones in a set so duplicates collapse
    normalized_phones = set()
    for m in PHONE_RE.finditer(text):
        normalized_phones.add(normalize_phone(m))
    return sorted(emails), sorted(normalized_phones)

emails, phones = extract_contacts(sample)
print("Emails:")
for e in emails:
    print(" -", e)
print("Phones:")
for p in phones:
    print(" -", p)

Expected output (both lists are sorted, so the order is stable):

Emails:
 - alex.miller@example.com
 - lina+school@edu.co
 - sam99@example.com
 - support@example.co.uk
Phones:
 - +1-555-111-2222
 - +1-555-123-4567
 - +1-555-222-3333
 - +1-555-987-0000 x23

Step 4 (optional): Redact sensitive info in text

def redact(text):
    text = EMAIL_RE.sub(lambda m: m.group()[0] + "***" + "@***", text)
    text = PHONE_RE.sub(lambda m: normalize_phone(m)[:7] + "***-****", text)
    return text

print(redact(sample))

Advanced polishing (optional)

Use re.VERBOSE to comment complex patterns.
Pre-compile patterns you reuse for speed.
If you accept user input to build patterns, always use re.escape.
Write unit tests for patterns using a list of "should match" and "should not match" cases.
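The last tip can be sketched as a small table-driven check. The pattern under test (USERNAME_RE), the case lists, and the helper check_pattern are all hypothetical names chosen for this illustration; a real project might use pytest or unittest instead.

```python
import re

# Hypothetical pattern under test: a simple 3-16 character username
USERNAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{2,15}")

SHOULD_MATCH = ["abc", "user_01", "A" * 16]
SHOULD_NOT_MATCH = ["ab", "1abc", "has space", "A" * 17, ""]

def check_pattern(pattern, good, bad):
    """Return a list of failure messages; an empty list means every case passed."""
    failures = []
    for s in good:
        if not pattern.fullmatch(s):
            failures.append(f"expected match: {s!r}")
    for s in bad:
        if pattern.fullmatch(s):
            failures.append(f"expected no match: {s!r}")
    return failures

failures = check_pattern(USERNAME_RE, SHOULD_MATCH, SHOULD_NOT_MATCH)
print(failures)  # []
```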

Common pitfalls and pro tips

Prefer raw strings: r"\d{3}" means the same as "\\d{3}" but is easier to read and harder to get wrong.
For validation, use re.fullmatch so extra characters don't sneak in.
Greedy vs lazy matters a lot with .* and .+; try .*? first when matching "inside" delimiters like quotes.
In multiline text, add re.MULTILINE if you want ^ and $ to match at each line's edges.
If . needs to match across line breaks, add re.DOTALL.
Prefer named groups (?P<name>...) for clarity.
Keep patterns readable with re.VERBOSE and comments.
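One concrete reason to prefer re.fullmatch for validation: $ also matches just before a trailing newline, so an anchored pattern can accept input that is not purely digits. A short demonstration:

```python
import re

with_newline = "123\n"

# $ matches just before a trailing newline, so ^...$ lets the newline through
anchored = bool(re.match(r"^\d+$", with_newline))
print(anchored)  # True

# fullmatch must consume the whole string, newline included
strict = bool(re.fullmatch(r"\d+", with_newline))
print(strict)  # False
```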

Mini practice exercises

Write a regex to validate a simple username: starts with a letter, then letters/digits/underscores, length 3–16.
- Hint: r"[A-Za-z]\w{2,15}" with re.fullmatch (with re.search you would need the anchors: r"^[A-Za-z]\w{2,15}$")
Extract all hashtags from a sentence like "I love #coding and #Python3!".
- Hint: r"#\w+"
Replace repeated spaces and tabs with a single space.
- Hint: r"[ \t]+"
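The hints can be tried out as follows. This is only a sketch for checking your own answers; the sample inputs other than the hashtag sentence are made up.

```python
import re

# Exercise 1: username validation with fullmatch
valid = bool(re.fullmatch(r"[A-Za-z]\w{2,15}", "py_fan42"))
print(valid)  # True

# Exercise 2: hashtags
tags = re.findall(r"#\w+", "I love #coding and #Python3!")
print(tags)  # ['#coding', '#Python3']

# Exercise 3: collapse runs of spaces and tabs
cleaned = re.sub(r"[ \t]+", " ", "too   many\t\tgaps")
print(cleaned)  # too many gaps
```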

Summary

Regex is a compact language to find, validate, and transform text.
In Python, use the re module with raw strings.
Master the basics: character classes, quantifiers, anchors, groups, and lookarounds.
Practice with findall/finditer for extraction, fullmatch for validation, and sub/split for cleaning.
The Contact Extractor project showed how to design real patterns, normalize results, and redact sensitive info.

Keep experimenting. Small, focused patterns are easier to test and combine. Build confidence with examples, then use VERBOSE and named groups for bigger patterns.

