Regular Expressions in Python (Regex)
Goal
- Understand what regular expressions (regex) are and why they're powerful.
- Learn the core regex building blocks and how to use Python's re module.
- Practice with clear examples.
- Build a small, practical project: a Contact Extractor and Cleaner.
What is a Regular Expression?
A regular expression (regex) is a tiny pattern language for finding, checking, or changing text. Think of it like a super-smart "find" tool. Instead of searching for exactly "cat," you can search for "any three-letter word," "an email," or "a phone number," and more.
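In Python, you use regex through the re module. Important: write patterns as raw strings (prefix with r) so backslashes reach the regex engine unchanged.
Example: r"\d+" means "one or more digits." Without the r, Python processes the backslashes first: "\b" becomes a backspace character instead of a word boundary, and unrecognized escapes such as "\d" trigger a warning in recent Python versions.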
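To make the raw-string point concrete, here is a minimal sketch comparing the same pattern written with and without the r prefix. The variable names are only for illustration.

```python
import re

plain = "\bcat\b"   # normal string: \b becomes the backspace character
raw = r"\bcat\b"    # raw string: backslash and letter are kept as written

print(len(plain), len(raw))  # 5 7

text = "a cat sat"
print(re.search(plain, text))        # None (searches for backspace + cat + backspace)
print(re.search(raw, text).group())  # cat
```

The two strings differ in length because the raw version keeps each backslash as its own character, which is what the regex engine needs to see.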
Regex cheat sheet (most-used)
- Literals:
  - cat matches "cat".
- Character classes:
  - [abc] any one of a, b, or c
  - [a-z] any lowercase letter a to z
  - [^a-z] any character NOT a to z
  - \d digit [0-9], \w word character [A-Za-z0-9_], \s whitespace; \D, \W, \S are the opposites
- Quantifiers:
  - ? 0 or 1 times
  - * 0 or more
  - + 1 or more
  - {m} exactly m times; {m,n} between m and n
- Anchors and boundaries:
  - ^ start of string/line
  - $ end of string/line
  - \b word boundary (between a word character \w and a non-word character, or at the edge of the string)
- Groups and choices:
  - (...) capture group
  - (?:...) non-capturing group
  - (?P<name>...) named group
  - | OR (alternation)
- Greedy vs lazy:
  - * and + are greedy (grab as much as possible)
  - *? and +? are lazy (grab as little as possible)
- Lookarounds (advanced, read-only checks):
  - (?=...) positive lookahead (must be followed by …)
  - (?!...) negative lookahead (must NOT be followed by …)
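A few cheat-sheet entries in action. This is a small sketch on an invented sample string, not an exhaustive tour; the variable names are illustrative.

```python
import re

text = "Order A12 shipped; order b7 pending; ref 2024-03."

# Character class + quantifier: one letter followed by 1 to 3 digits.
codes = re.findall(r"\b[A-Za-z]\d{1,3}\b", text)
print(codes)  # ['A12', 'b7']

# Anchor: does the text start with "Order"?
starts_with_order = bool(re.search(r"^Order", text))
print(starts_with_order)  # True

# Alternation inside a non-capturing group.
states = re.findall(r"\b(?:shipped|pending)\b", text)
print(states)  # ['shipped', 'pending']

# Negated class: runs of characters that are not letters, digits, or whitespace.
punct = re.findall(r"[^A-Za-z0-9\s]+", text)
print(punct)  # [';', ';', '-', '.']
```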
Python's re module essentials
- re.search(pat, text): first match anywhere
- re.match(pat, text): match at start only (use rarely; prefer fullmatch if validating)
- re.fullmatch(pat, text): match the whole string (best for validation)
- re.findall(pat, text): list of all matches (strings or tuples)
- re.finditer(pat, text): iterator of match objects (gives spans and groups)
- re.sub(pat, repl, text): replace matches
- re.split(pat, text): split on a pattern
- re.compile(pat, flags=...): pre-compile a pattern for speed/readability
- Common flags:
  - re.IGNORECASE (case-insensitive), re.MULTILINE (^/$ match per line), re.DOTALL (. matches newline), re.VERBOSE (allow comments and spaces in pattern)
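The differences between these functions are easiest to see side by side. The sketch below runs one compiled pattern through several of them; the sample text is made up.

```python
import re

text = "id=42; id=7"
digits = re.compile(r"\d+")  # compiled once, reused below

first = digits.search(text).group()
print(first)                      # 42 (first match anywhere)
print(digits.match(text))         # None (the text does not start with a digit)
print(bool(digits.fullmatch("42")))  # True (the whole string is digits)

all_numbers = digits.findall(text)
print(all_numbers)                # ['42', '7']

spans = [m.span() for m in digits.finditer(text)]
print(spans)                      # [(3, 5), (10, 11)]

masked = digits.sub("#", text)
print(masked)                     # id=#; id=#
```

A compiled pattern exposes the same operations as methods, so the pattern string is written only once.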
1) Getting started: basic matches
import re
text = "My cat has 9 lives."
# Search anywhere
m1 = re.search(r"cat", text)
print(m1.group(), m1.span()) # cat (3, 6)
# Full string validation: only digits?
print(bool(re.fullmatch(r"\d+", "12345"))) # True
print(bool(re.fullmatch(r"\d+", "123a5"))) # False
# Find all numbers
print(re.findall(r"\d+", text)) # ['9']
2) Character classes and boundaries
text = "Colors: red, reed, read; feed; lead."
# Match red, reed, or read as whole words by listing each alternative
print(re.findall(r"\bred\b|\breed\b|\bread\b", text)) # ['red', 'reed', 'read']
# Pitfall: when the pattern has a capture group, findall returns the group, not the whole match
print(re.findall(r"\br(e(?:e|a))d\b|\bred\b", text)) # ['', 'ee', 'ea'] ('red' matches the second branch, so the group is empty)
# Simpler: r followed by (ed|eed|ead) in a non-capturing group
print(re.findall(r"\br(?:ed|eed|ead)\b", text)) # ['red', 'reed', 'read']
# Boundary example: match 'cat' as a whole word only
print(re.findall(r"\bcat\b", "bobcat cat scatter catalog"))
# -> ['cat'] only the standalone 'cat'
3) Quantifiers and greedy vs lazy
text = 'He said "hello" and then "bye".'
# Greedy: grabs from first " to last "
print(re.findall(r'"(.*)"', text)) # ['hello" and then "bye']
# Lazy: stop at the first closing "
print(re.findall(r'"(.*?)"', text)) # ['hello', 'bye']
4) Groups, capturing, and named groups
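Before the date examples, one behavior worth seeing: whether findall returns strings or tuples depends on how many capture groups the pattern has. A small sketch on an invented sample string:

```python
import re

text = "width=640 height=480"

# No groups: findall returns whole matches.
whole = re.findall(r"\w+=\d+", text)
print(whole)   # ['width=640', 'height=480']

# One group: findall returns just that group.
values = re.findall(r"\w+=(\d+)", text)
print(values)  # ['640', '480']

# Two groups: findall returns one tuple per match.
pairs = re.findall(r"(\w+)=(\d+)", text)
print(pairs)   # [('width', '640'), ('height', '480')]

# finditer keeps the whole match and the named groups available together.
settings = {m.group("key"): int(m.group("value"))
            for m in re.finditer(r"(?P<key>\w+)=(?P<value>\d+)", text)}
print(settings)  # {'width': 640, 'height': 480}
```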
text = "Date: 2025-10-05. Backup: 05/10/2025."
# Named groups for readable access
iso = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)
print(iso.group('year'), iso.group('month'), iso.group('day')) # 2025 10 05
# Multiple date formats using alternation and named groups
pattern = re.compile(r"""
(?:
(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2}) # YYYY-MM-DD
|
(?P<d2>\d{2})/(?P<m2>\d{2})/(?P<y2>\d{4}) # DD/MM/YYYY
)
""", re.VERBOSE)
for m in pattern.finditer(text):
    print(m.group(), m.groupdict())
5) Replacement (sub) and split
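As a warm-up for this section, here is a sketch of sub using named-group references in the replacement string to reorder a date. The sample text and variable names are invented for illustration.

```python
import re

text = "Backup: 05/10/2025 and 17/01/2024."

# \g<name> in the replacement refers to a named group from the pattern.
ddmmyyyy = re.compile(r"(?P<d>\d{2})/(?P<m>\d{2})/(?P<y>\d{4})")
iso_text = ddmmyyyy.sub(r"\g<y>-\g<m>-\g<d>", text)
print(iso_text)  # Backup: 2025-10-05 and 2024-01-17.
```

The replacement is a raw string as well, so \g reaches the re module unchanged.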
text = "Call me at 555-123-4567 or 555.999.0000!"
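# Replace every dot or whitespace character with a dash (this also hits the spaces between words)
print(re.sub(r"[.\s]", "-", text)) # Call-me-at-555-123-4567-or-555-999-0000!
# Restrict the replacement to separators between two digits; (?<=\d) is a lookbehind: must be preceded by a digit
print(re.sub(r"(?<=\d)[.\s](?=\d)", "-", text)) # Call me at 555-123-4567 or 555-999-0000!
# Extract all words by splitting on runs of non-word characters (note the trailing empty string)
print(re.split(r"\W+", "This_is a test! Cool? Yes.")) # ['This_is', 'a', 'test', 'Cool', 'Yes', '']
6) Flags: IGNORECASE, MULTILINE, DOTALL, VERBOSE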
text = "Apple\nbanana\nApricot"
# Case-insensitive find
print(re.findall(r"^a\w+", text, flags=re.IGNORECASE | re.MULTILINE))
# -> ['Apple', 'Apricot']
# DOTALL makes . match newlines
print(re.findall(r"A.*t", text, flags=re.DOTALL)) # ['Apple\nbanana\nApricot']
7) Lookarounds (quick peek checks)
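One common use of lookaheads is stacking several independent checks at the start of a pattern. The sketch below assumes a made-up rule (at least 8 characters, with at least one digit and one letter); it is an illustration, not a recommended password policy.

```python
import re

# Each lookahead checks a condition from the start without consuming characters;
# the final .{8,} then consumes the string itself.
strong = re.compile(r"(?=.*\d)(?=.*[A-Za-z]).{8,}")

candidates = ["abc12345", "abcdefgh", "12345678", "a1"]
results = {c: bool(strong.fullmatch(c)) for c in candidates}
print(results)
# {'abc12345': True, 'abcdefgh': False, '12345678': False, 'a1': False}
```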
text = "apple pie, apple tart, apple juice"
# 'apple' not followed by ' pie'
print(re.findall(r"apple(?! pie)", text)) # ['apple', 'apple'] for tart and juice
# letters followed by a digit (\w+ would also swallow digits here and return ['x', 'y1', '12'])
print(re.findall(r"\b[A-Za-z]+(?=\d)", "x9 y10 z 123 abc")) # ['x', 'y']
8) Safe dynamic patterns with re.escape
If you insert user text into a regex, escape it first to avoid accidental special characters.
user = "file.name[1].txt"
safe = re.escape(user)
print(safe) # file\.name\[1\]\.txt
print(bool(re.search(safe, "open file.name[1].txt now"))) # True
Practical Project: Contact Extractor and Cleaner
Goal: From messy text, extract emails and US-style phone numbers, then normalize them. This simulates cleaning data from logs, chats, or copy-pasted documents.
What you'll practice:
- Designing and compiling readable patterns
- Using named groups and finditer
- Using sub with a function to normalize
- Deduplicating results
Step 1: Sample messy text
sample = """
Team:
- Alex: alex.miller@example.com, Phone: (555) 123-4567
- SAM: SAM99@Example.COM, Phone: 555.987.0000 ext 23
- Lina: lina+school@edu.co | phone: +1 555 111 2222
Backup: Contact support@example.co.uk or 555-222-3333.
"""
Step 2: Build patterns
import re
EMAIL_RE = re.compile(r"""
\b
[A-Za-z0-9._%+-]+ # username
@
[A-Za-z0-9.-]+ # domain
\.[A-Za-z]{2,} # TLD
\b
""", re.VERBOSE)
PHONE_RE = re.compile(r"""
(?<!\w) # no word character before; lets the match include a leading ( or +
(?:\+?1[\s.-]*)? # optional country code +1
(?:\(?(\d{3})\)?[\s.-]*) # area code (captured)
(\d{3})[\s.-]* # prefix
(\d{4}) # line number
(?!\d) # do not stop in the middle of a longer digit run
(?:\s*(?:ext\.?|x)\s*(\d+))? # optional extension
""", re.VERBOSE | re.IGNORECASE)
Step 3: Extraction and normalization
def normalize_email(e):
    return e.lower()
def normalize_phone(match):
    area, pref, line, ext = match.groups()
    # Assume +1 if 10-digit US pattern
    base = f"+1-{area}-{pref}-{line}"
    if ext:
        base += f" x{ext}"
    return base
def extract_contacts(text):
    emails = {normalize_email(m.group()) for m in EMAIL_RE.finditer(text)}
    # Walk every phone match and collect its normalized form; the set removes duplicates
    normalized_phones = set()
    for m in PHONE_RE.finditer(text):
        normalized_phones.add(normalize_phone(m))
    return sorted(emails), sorted(normalized_phones)
emails, phones = extract_contacts(sample)
print("Emails:")
for e in emails:
    print(" -", e)
print("Phones:")
for p in phones:
    print(" -", p)
Expected output (both lists are sorted, so the order is stable):
Emails:
 - alex.miller@example.com
 - lina+school@edu.co
 - sam99@example.com
 - support@example.co.uk
Phones:
 - +1-555-111-2222
 - +1-555-123-4567
 - +1-555-222-3333
 - +1-555-987-0000 x23
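The practice list mentions sub with a function, while Step 3 only collects matches with finditer. Here is a hedged sketch of normalizing phone numbers in place instead. The pattern is a simplified stand-in for PHONE_RE (no country code or extension) so the snippet runs on its own, and normalize_in_place is a hypothetical helper name.

```python
import re

# Simplified stand-in for PHONE_RE: area code, prefix, line number only.
phone_re = re.compile(r"(?<!\w)\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]*(\d{4})(?!\d)")

def normalize_in_place(text):
    # sub calls the function once per match and splices in its return value.
    return phone_re.sub(lambda m: "+1-{}-{}-{}".format(*m.groups()), text)

cleaned = normalize_in_place("Alex: (555) 123-4567, backup 555.222.3333.")
print(cleaned)  # Alex: +1-555-123-4567, backup +1-555-222-3333.
```

Everything outside the matches is left untouched, which is what makes this form useful for cleaning a document rather than just listing its contacts.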
Step 4 (optional): Redact sensitive info in text
def redact(text):
    text = EMAIL_RE.sub(lambda m: m.group()[0] + "***" + "@***", text)
    text = PHONE_RE.sub(lambda m: normalize_phone(m)[:7] + "***-****", text)
    return text
print(redact(sample))
Advanced polishing (optional)
- Use re.VERBOSE to comment complex patterns.
- Pre-compile patterns you reuse for speed.
- If you accept user input to build patterns, always use re.escape.
- Write unit tests for patterns using a list of "should match" and "should not match" cases.
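The last tip can be sketched in a few lines. The pattern under test (a US ZIP code with an optional 4-digit suffix) and the helper name check_pattern are assumptions chosen for illustration; the same shape works with any test framework.

```python
import re

ZIP_RE = re.compile(r"\d{5}(?:-\d{4})?")  # example pattern under test

SHOULD_MATCH = ["12345", "12345-6789"]
SHOULD_NOT_MATCH = ["1234", "123456", "12345-678", "12345 6789", "abcde", "12345\n"]

def check_pattern(pattern, good, bad):
    """Return the cases that behave unexpectedly (two empty lists mean all passed)."""
    missed = [s for s in good if not pattern.fullmatch(s)]
    leaked = [s for s in bad if pattern.fullmatch(s)]
    return missed, leaked

missed, leaked = check_pattern(ZIP_RE, SHOULD_MATCH, SHOULD_NOT_MATCH)
print("missed:", missed, "leaked:", leaked)  # missed: [] leaked: []
```

Keeping the two lists next to the pattern makes it cheap to add a new case whenever a bad match shows up in real data.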
Common pitfalls and pro tips
- Always use raw strings: r"\d{3}" not "\\d{3}".
- For validation, use re.fullmatch so extra characters don't sneak in.
- Greedy vs lazy matters a lot with .* and .+; try .*? first when matching "inside" delimiters like quotes.
- In multiline text, add re.MULTILINE if you want ^ and $ to match at each line's edges.
- If . needs to match across line breaks, use re.DOTALL.
- Prefer named groups (?P<name>...) for clarity.
- Keep patterns readable with re.VERBOSE and comments.
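Two of these pitfalls are easy to reproduce. The sketch below uses invented inputs to show how $ lets a trailing newline through, and how ^ behaves with and without re.MULTILINE.

```python
import re

# $ also matches just before a trailing newline, so this "validation" passes.
loose = bool(re.match(r"^\d+$", "12345\n"))
print(loose)   # True

# fullmatch requires the entire string, newline included, to fit the pattern.
strict = bool(re.fullmatch(r"\d+", "12345\n"))
print(strict)  # False

log = "ok: start\nerror: disk\nok: done"
# Without MULTILINE, ^ only matches at the very start of the string.
first_only = re.findall(r"^\w+", log)
print(first_only)  # ['ok']
per_line = re.findall(r"^\w+", log, flags=re.MULTILINE)
print(per_line)    # ['ok', 'error', 'ok']
```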
Mini practice exercises
- Write a regex to validate a simple username: starts with a letter, then letters/digits/underscores, length 3–16.
  - Hint: r"^[A-Za-z]\w{2,15}$"
- Extract all hashtags from a sentence like "I love #coding and #Python3!".
  - Hint: r"#\w+"
- Replace repeated spaces and tabs with a single space.
  - Hint: r"[ \t]+"
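One possible way to try the hints, as a sketch rather than the only valid answers. The test inputs are invented, and the username check uses fullmatch, which makes the ^ and $ from the hint unnecessary.

```python
import re

# 1) Username: a letter, then 2 to 15 word characters.
username_re = re.compile(r"[A-Za-z]\w{2,15}")
checks = [bool(username_re.fullmatch(u)) for u in ["al", "alex_99", "9lives"]]
print(checks)    # [False, True, False]

# 2) Hashtags
hashtags = re.findall(r"#\w+", "I love #coding and #Python3!")
print(hashtags)  # ['#coding', '#Python3']

# 3) Collapse runs of spaces and tabs into one space
collapsed = re.sub(r"[ \t]+", " ", "too   many \t\t gaps")
print(collapsed)  # too many gaps
```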
Summary
- Regex is a compact language to find, validate, and transform text.
- In Python, use the re module with raw strings.
- Master the basics: character classes, quantifiers, anchors, groups, and lookarounds.
- Practice with findall/finditer for extraction, fullmatch for validation, and sub/split for cleaning.
- The Contact Extractor project showed how to design real patterns, normalize results, and redact sensitive info.
Keep experimenting. Small, focused patterns are easier to test and combine. Build confidence with examples, then use VERBOSE and named groups for bigger patterns.