Regular expressions (regex) explained: patterns, groups and pitfalls
A regular expression is a compact pattern language for matching text. Need to validate an email field, extract dates from log lines, or rename a thousand files? Regex is often the fastest answer — until it becomes the hardest-to-debug part of your codebase. This guide explains the syntax that works across most engines, walks through patterns you will actually use, and covers the traps (greedy quantifiers, catastrophic backtracking, engine differences) that turn a one-liner into a production incident. If you are new to programming strings in general, start with our Python fundamentals primer or JavaScript event loop guide for language context; this page focuses on regex itself.
What regex is — and what it is not
Regex engines scan input left to right, trying to match your pattern against substrings. A match succeeds as soon as the whole pattern aligns with some part of the text; a global search then continues from that point to find further matches. Regex is excellent for structured-ish text: log parsing, input validation, search-and-replace in editors, and lightweight extraction.
Regex is not an HTML parser, JSON validator, or general programming
language. Patterns that try to match nested tags or arbitrary JSON quickly become
unreadable and brittle. For hierarchical data, use a DOM parser, JSON.parse,
or a proper grammar tool. Regex belongs in the "quick filter" layer — not the core data
model.
Core building blocks
Literals and escaping
Most characters match themselves: cat matches the substring "cat".
Metacharacters — . * + ? ^ $ ( ) [ ] { } | \ — need a backslash to match
literally: \. matches a period. In JavaScript string literals you often
write \\d because the string itself consumes one backslash.
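The escaping rules above can be sketched in Python, where a raw string (r"...") plays the role of the doubled backslash. The sample strings are made up for illustration.

```python
import re

# A raw string keeps the backslash intact for the regex engine.
version_pattern = re.compile(r"\d+\.\d+")   # \. is a literal period

print(version_pattern.fullmatch("3.12") is not None)   # True
print(version_pattern.fullmatch("3x12") is not None)   # False: "x" is not a period

# re.escape() backslash-escapes metacharacters in text that should match literally.
literal = "a+b (draft)"            # hypothetical search text containing metacharacters
escaped = re.escape(literal)
print(re.search(escaped, "see a+b (draft) v2") is not None)   # True
print(re.search(literal, "see a+b (draft) v2") is not None)   # False: + and ( ) act as operators
```

Escaping with re.escape is usually safer than hand-adding backslashes when the literal text comes from somewhere else.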
The dot and character classes
. matches any single character except newline (unless the s
flag is set). \d matches a digit; \w matches word characters
(letters, digits, underscore); \s matches whitespace. Uppercase versions
(\D, \W, \S) invert the set.
Square brackets define custom classes: [aeiou] matches one vowel;
[a-z0-9_-] matches a slug character. A leading caret inside brackets
negates: [^0-9] matches any non-digit. Ranges are inclusive; put
- first or last if you need a literal hyphen.
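A short Python sketch of the classes described above; the input strings are invented for the example.

```python
import re

text = "Order #4821 shipped to zone B-7"   # hypothetical input

digits = re.findall(r"\d", text)                   # every single digit
non_digit_runs = re.findall(r"[^0-9]+", "ab12cd")  # runs of anything except digits
vowels = re.findall(r"[aeiou]", "regex guide")     # one vowel per match

# The hyphen sits last inside the brackets, so it is a literal hyphen.
slug = re.compile(r"[a-z0-9_-]+")
slug_ok = slug.fullmatch("my-post_01") is not None
slug_bad = slug.fullmatch("My Post") is not None

print(digits)          # ['4', '8', '2', '1', '7']
print(non_digit_runs)  # ['ab', 'cd']
print(vowels)          # ['e', 'e', 'u', 'i', 'e']
print(slug_ok, slug_bad)  # True False
```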
Quantifiers: how many times
- * repeats zero or more times (greedy)
- + repeats one or more times (greedy)
- ? repeats zero or one time (greedy)
- {3} repeats exactly three times
- {2,5} repeats between two and five times
- {4,} repeats four or more times
Append ? after a quantifier for lazy (non-greedy) matching:
.*? matches as few characters as possible. This matters when extracting
content between delimiters — greedy .* often swallows too much.
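As a rough illustration of the counted quantifiers and the lazy suffix, here is a Python sketch; the sample strings are arbitrary.

```python
import re

samples = ["7", "42", "123", "12345", "1234567"]

exactly_three = [s for s in samples if re.fullmatch(r"\d{3}", s)]
two_to_five = [s for s in samples if re.fullmatch(r"\d{2,5}", s)]
four_or_more = [s for s in samples if re.fullmatch(r"\d{4,}", s)]

# Greedy takes as many repetitions as allowed; lazy takes as few as allowed.
greedy = re.match(r"\d{2,5}", "12345").group()
lazy = re.match(r"\d{2,5}?", "12345").group()

print(exactly_three)  # ['123']
print(two_to_five)    # ['42', '123', '12345']
print(four_or_more)   # ['12345', '1234567']
print(greedy, lazy)   # 12345 12
```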
Anchors and word boundaries
^ anchors to the start of a line (or string); $ to the end.
\b is a word boundary — the gap between a word character and a non-word
character. \bcat\b matches "cat" in "the cat sat" but not "category".
Multiline mode (m flag) makes ^ and $ also match at line
breaks inside the string.
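The anchor and boundary behavior above, sketched in Python. The log text is a made-up example; re.M is Python's spelling of the m flag.

```python
import re

sentence = "the cat sat on the category list"
whole_word = re.findall(r"\bcat\b", sentence)   # boundary on both sides
substring = re.findall(r"cat", sentence)        # also hits "category"

log = "INFO start\nERROR disk full\nINFO done"  # hypothetical log text
default_mode = re.findall(r"^ERROR.*$", log)        # ^ only matches at string start
multiline = re.findall(r"^ERROR.*$", log, re.M)     # ^ and $ match at each line

print(whole_word)    # ['cat']
print(substring)     # ['cat', 'cat']
print(default_mode)  # []
print(multiline)     # ['ERROR disk full']
```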
Groups, alternation and backreferences
Parentheses create capturing groups: (\d{4})-(\d{2})-(\d{2})
on "2026-06-07" captures year, month, and day in groups 1, 2, and 3. Most APIs expose
these as match[1], match[2], etc., or named groups
(?<year>\d{4}) in modern engines.
Non-capturing groups (?:...) group without storing a
capture — useful when you need precedence but not extraction. The pipe
| is alternation: cat|dog matches either word.
colou?r is quantifier shorthand for "color" or "colour".
Backreferences like \1 repeat whatever group 1 matched —
handy for finding doubled words (\b(\w+)\s+\1\b) but easy to misuse across
engine versions.
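A Python sketch of the grouping features above. Note one dialect difference: Python's re module spells named groups (?P<name>...) rather than (?<name>...). The sample sentences are invented.

```python
import re

# Numbered capture groups
m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", "2026-06-07")
numbered = (m.group(1), m.group(2), m.group(3))

# Named capture groups (Python syntax: (?P<name>...))
named = re.fullmatch(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2026-06-07"
).groupdict()

# Non-capturing group: groups the alternation without storing a capture,
# so findall returns whole matches.
animals = re.findall(r"(?:cat|dog)s?", "cats and dog")

# Backreference: \1 must repeat exactly what group 1 matched.
doubled = re.findall(r"\b(\w+)\s+\1\b", "this is is a test test")

print(numbered)  # ('2026', '06', '07')
print(named)     # {'year': '2026', 'month': '06', 'day': '07'}
print(animals)   # ['cats', 'dog']
print(doubled)   # ['is', 'test']
```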
Lookahead and lookbehind
Lookahead asserts what comes next without consuming it:
- (?=...) is positive lookahead: \d+(?=px) matches digits only when followed by "px".
- (?!...) is negative lookahead: \b\w+\b(?!\s+is\b) matches words not followed by " is".
Lookbehind asserts what came before: (?<=\$)\d+ matches
digits immediately after a dollar sign. Some engines require fixed-width lookbehind
(Python's re module, for example), and JavaScript had no lookbehind at all before
ES2018. These constructs help with password rules
("at least one digit") without splitting the string manually.
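The lookaround patterns above can be tried in Python as follows. The password rule (one digit, one lowercase letter, eight or more characters) is an assumed policy chosen for illustration, not a recommendation.

```python
import re

css = "width: 120px; height: 80em; margin: 4px"
px_values = re.findall(r"\d+(?=px)", css)             # digits followed by "px"

prices = re.findall(r"(?<=\$)\d+", "cost $45, qty 12, fee $7")  # digits after "$"

not_before_is = re.findall(r"\b\w+\b(?!\s+is\b)", "this is fine")

# Each lookahead checks one rule from the start without consuming text.
PASSWORD_RULE = re.compile(r"(?=.*\d)(?=.*[a-z]).{8,}")   # hypothetical policy

def password_ok(candidate: str) -> bool:
    return PASSWORD_RULE.fullmatch(candidate) is not None

print(px_values)      # ['120', '4']
print(prices)         # ['45', '7']
print(not_before_is)  # ['is', 'fine']
print(password_ok("hunter42xyz"), password_ok("password"), password_ok("a1"))
# True False False
```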
Flags and engine flavors
Flags modify matching behavior. Common ones:
- g (global): find all matches, not just the first
- i: case-insensitive
- m (multiline): ^ and $ match line boundaries
- s (dotall): . includes newlines
- u: Unicode-aware (important for emoji and non-Latin scripts)
JavaScript uses /pattern/flags or new RegExp('pattern', 'flags').
Python uses re.compile(r'pattern', re.I). PCRE (PHP, Perl,
many CLI tools) adds features like recursive patterns. Always check which dialect your
runtime supports before copying a Stack Overflow answer — subtle differences in
\b, lookbehind, and Unicode classes cause cross-language bugs.
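For comparison, here is how the same flags look in Python's re module. Python has no g flag; choosing re.search versus re.findall (or re.finditer) plays that role. The sample text is invented.

```python
import re

text = "Error: disk\nerror: net\nWARNING: cpu"   # hypothetical input

# i + m combined: case-insensitive, and ^ matches at each line start.
line_starts = re.findall(r"^error", text, re.I | re.M)

# Without dotall, "." stops at the newline; with re.S it crosses it.
no_dotall = re.search(r"disk.*net", text)
dotall = re.search(r"disk.*net", text, re.S)

# "First match" versus "all matches" is a choice of function, not a flag.
first_label = re.search(r"\w+:", text).group()
all_labels = re.findall(r"\w+:", text)

print(line_starts)        # ['Error', 'error']
print(no_dotall)          # None
print(repr(dotall.group()))  # 'disk\nerror: net'
print(first_label)        # Error:
print(all_labels)         # ['Error:', 'error:', 'WARNING:']
```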
Practical patterns (with honest caveats)
Email-ish validation
A pragmatic filter: ^[^\s@]+@[^\s@]+\.[^\s@]+$ — catches obvious typos,
not RFC 5322 compliance. Real email validation needs DNS MX checks and mailbox
confirmation; regex is a first gate, not proof of deliverability.
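A minimal Python sketch of this first-gate check. The helper name looks_like_email is hypothetical; fullmatch is used because Python's $ would otherwise accept a trailing newline.

```python
import re

EMAIL_SHAPE = re.compile(r"^[^\s@]+@[^\s@]+\.[^\s@]+$")

def looks_like_email(value: str) -> bool:
    """Shape check only: says nothing about whether the mailbox exists."""
    return EMAIL_SHAPE.fullmatch(value) is not None

print(looks_like_email("ada@example.com"))    # True
print(looks_like_email("ada@example"))        # False: no dot after the @
print(looks_like_email("ada @example.com"))   # False: whitespace
print(looks_like_email("ada@@example.com"))   # False: second @
print(looks_like_email("ada@example.com\n"))  # False: trailing newline rejected
```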
ISO dates in logs
\b(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2}) extracts timestamps
from structured logs. Pair with a datetime library for validation — regex will happily
match 2026-99-99.
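One way to pair the pattern with a datetime library, sketched in Python. The function name and the log line are assumptions for illustration.

```python
import re
from datetime import datetime

TIMESTAMP = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})")

def parse_timestamps(line: str) -> list[datetime]:
    """Return only the matches that are real calendar dates and times."""
    results = []
    for match in TIMESTAMP.finditer(line):
        try:
            results.append(datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M:%S"))
        except ValueError:
            # The shape matched, but the values are out of range (e.g. month 99).
            pass
    return results

line = "2026-06-07T12:30:00 INFO ok; 2026-99-99T00:00:00 bad"   # hypothetical log line
print(len(TIMESTAMP.findall(line)))   # 2 (regex accepts both)
print(parse_timestamps(line))         # [datetime.datetime(2026, 6, 7, 12, 30)]
```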
URL path segments
^/[a-z0-9]+(?:/[a-z0-9-]+)*$ validates simple slug paths. Full URL
parsing belongs in urllib or the URL constructor, not regex.
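A possible division of labor in Python: urllib.parse splits the URL, and the regex only checks the path. The helper name has_slug_path and the example URLs are hypothetical.

```python
import re
from urllib.parse import urlsplit

SLUG_PATH = re.compile(r"^/[a-z0-9]+(?:/[a-z0-9-]+)*$")

def has_slug_path(url: str) -> bool:
    path = urlsplit(url).path          # query string and fragment are dropped here
    return SLUG_PATH.fullmatch(path) is not None

print(has_slug_path("https://example.com/blog/regex-guide?ref=home"))  # True
print(has_slug_path("https://example.com/Blog/Post"))                  # False: uppercase
print(has_slug_path("https://example.com/blog/"))                      # False: trailing slash
# As written, the pattern allows hyphens only after the first segment.
print(has_slug_path("https://example.com/my-blog"))                    # False
```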
Hex colors and hashes
^#(?:[0-9a-fA-F]{3}){1,2}$ for CSS hex colors;
\b[0-9a-f]{64}\b for SHA-256 hex strings (see our hashing guide for what those digests mean).
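Both patterns can be exercised in Python like this. The digest is generated with hashlib so the example does not depend on a hand-typed 64-character string; the surrounding log text is made up.

```python
import hashlib
import re

HEX_COLOR = re.compile(r"^#(?:[0-9a-fA-F]{3}){1,2}$")
SHA256_HEX = re.compile(r"\b[0-9a-f]{64}\b")

def is_hex_color(value: str) -> bool:
    return HEX_COLOR.fullmatch(value) is not None

print(is_hex_color("#fff"), is_hex_color("#FFAA00"))   # True True
print(is_hex_color("#ffff"), is_hex_color("#ggg"))     # False False

digest = hashlib.sha256(b"hello").hexdigest()          # 64 lowercase hex characters
line = f"checksum={digest} ok"                         # hypothetical log line
found = SHA256_HEX.search(line)
print(found is not None and found.group() == digest)   # True

# One extra hex character breaks the word boundaries, so nothing matches.
print(SHA256_HEX.search(digest + "0"))                 # None
```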
Search and replace
Capture groups in replacement strings: JavaScript
'20260607'.replace(/(\d{4})(\d{2})(\d{2})/, '$1-$2-$3') turns
"20260607" into "2026-06-07". Python uses \1 in the replacement string
with re.sub. Named groups make replacements readable in long pipelines.
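The Python side of the same replacement, as a small sketch. The second call uses named groups, which re.sub references as \g<name>; the day/month/year output order is just an example.

```python
import re

# Numbered groups in the replacement string
dashed = re.sub(r"(\d{4})(\d{2})(\d{2})", r"\1-\2-\3", "20260607")

# Named groups: the replacement reads more clearly and survives reordering.
reordered = re.sub(
    r"(?P<y>\d{4})(?P<m>\d{2})(?P<d>\d{2})",
    r"\g<d>/\g<m>/\g<y>",
    "20260607",
)

print(dashed)     # 2026-06-07
print(reordered)  # 07/06/2026
```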
Greedy vs lazy: a classic bug
Consider extracting HTML-ish text between tags (already a smell — use a parser in production):
Greedy: <div>(.*)</div> on
<div>a</div><div>b</div> captures
a</div><div>b — the .* ate everything up to the
last closing tag.
Lazy: <div>(.*?)</div> stops at the first closing tag. For
nested structures, neither works reliably — that is the parser argument again.
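The difference is easy to reproduce in Python. The second half is a rough sketch of the parser route using the standard library's html.parser; the DivText class is a hypothetical helper, not a complete HTML extractor.

```python
import re
from html.parser import HTMLParser

flat = "<div>a</div><div>b</div>"
greedy = re.findall(r"<div>(.*)</div>", flat)
lazy = re.findall(r"<div>(.*?)</div>", flat)
print(greedy)  # ['a</div><div>b']
print(lazy)    # ['a', 'b']

nested = "<div>x<div>y</div>z</div>"
print(re.findall(r"<div>(.*?)</div>", nested))  # ['x<div>y'] (lazy stops too early)

class DivText(HTMLParser):
    """Collect text that appears inside any <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

parser = DivText()
parser.feed(nested)
parser.close()
print(parser.chunks)  # ['x', 'y', 'z']
```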
Performance: catastrophic backtracking
Nested quantifiers on overlapping character classes can make engines explore
exponentially many paths. The pattern (a+)+$ against a long string of
"a"s ending in "b" may hang your server. This is catastrophic backtracking
(ReDoS — regular expression denial of service).
Mitigations:
- Avoid nested quantifiers: prefer a+$ over (a+)+$.
- Use possessive quantifiers or atomic groups where supported (a++ in PCRE).
- Set match timeouts. JavaScript has none natively; run untrusted patterns in a worker with a deadline or use a safe subset library.
- Never compile user-supplied regex without review and length limits.
- Test edge cases in your test suite — include long inputs and near-miss strings.
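Two of these mitigations, the simplified pattern and an input length limit, might look like this in Python. MAX_INPUT_LEN and ends_with_a_run are hypothetical names, and the limit value is arbitrary. The risky pattern is only ever run on short strings here.

```python
import re

MAX_INPUT_LEN = 1_000            # hypothetical cap; tune for your own inputs

RISKY = re.compile(r"(a+)+$")    # nested quantifier: can backtrack exponentially
SAFER = re.compile(r"a+$")       # no nested repetition, same yes/no answer

def ends_with_a_run(text: str) -> bool:
    """True if the text ends in one or more 'a' characters."""
    if len(text) > MAX_INPUT_LEN:
        raise ValueError("input too long for regex check")
    return SAFER.search(text) is not None

# On short inputs the two patterns agree; only their worst-case cost differs.
for sample in ["a", "aaa", "baaa", "aab", ""]:
    assert (RISKY.search(sample) is None) == (SAFER.search(sample) is None)

print(ends_with_a_run("baaa"))           # True
print(ends_with_a_run("a" * 500 + "b"))  # False
```

Rejecting oversized input before matching bounds the worst case even if a risky pattern slips through review.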
Regex vs other tools
| Task | Better tool |
|---|---|
| Parse JSON or XML | JSON.parse, DOMParser, dedicated parser |
| Query database text | SQL LIKE / full-text index / ILIKE with indexes (see SQL fundamentals) |
| Validate complex forms | Schema validators (Zod, Pydantic) plus regex for simple fields |
| Tokenize code | Lexer or parser generator (for example, tree-sitter) |
| Extract fields from logs at scale | Structured logging (JSON lines) plus a log aggregator |
Regex shines when the pattern is stable, the input is flat text, and a false negative costs less than spinning up a parser. Reach for something heavier when nesting, context, or schema evolution enters the picture.
Production checklist
- Document what the pattern accepts and rejects with three positive and three negative examples.
- Anchor validation patterns (^...$) so partial matches do not slip through.
- Prefer character classes over . when you know the allowed alphabet.
- Use named capture groups for anything you will read six months later.
- Run the pattern against fuzzed or property-based inputs if it touches user data.
- Log match failures with the input snippet redacted — not the full regex on every request.
- Centralize patterns in one module; copy-pasted regex diverges silently.
- When a pattern exceeds ~80 characters, add a comment or split into named sub-patterns.
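Several checklist items can be combined in one place. The sketch below uses Python's re.VERBOSE flag for inline comments, named groups, and documented accept/reject examples. The order-ID format is invented purely for illustration; fullmatch provides the anchoring.

```python
import re

# Hypothetical format: "ORD-", a four-digit year, "-", a six-digit serial.
ORDER_ID = re.compile(
    r"""
    ORD-                # literal prefix
    (?P<year>\d{4})     # four-digit year
    -                   # literal separator
    (?P<serial>\d{6})   # six-digit serial number
    """,
    re.VERBOSE,
)

ACCEPTS = ["ORD-2026-000123", "ORD-1999-999999", "ORD-2030-000000"]
REJECTS = ["ord-2026-000123", "ORD-2026-123", "xORD-2026-000123"]

def is_order_id(value: str) -> bool:
    return ORDER_ID.fullmatch(value) is not None

# The documented examples double as a quick self-check.
assert all(is_order_id(v) for v in ACCEPTS)
assert not any(is_order_id(v) for v in REJECTS)

print(ORDER_ID.fullmatch("ORD-2026-000123").groupdict())
# {'year': '2026', 'serial': '000123'}
```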
Key takeaways
- Regex matches text via literals, classes, quantifiers, anchors, and groups — flags change global behavior.
- Lazy quantifiers and lookahead solve many "match the smallest slice" problems greedy defaults miss.
- Engine dialects differ; test in the runtime you ship, not just in an online tester.
- Catastrophic backtracking is a real DoS vector: simplify nested quantifiers and put timeouts on untrusted patterns.
- Regex validates shape, not semantics — pair patterns with libraries for dates, URLs, and email.
- When structure gets nested, stop regex and parse properly.
Related reading
- Python fundamentals explained: the re module, raw strings, and stdlib habits
- JavaScript event loop explained: string work, Promises, and keeping regex off the hot path
- Software testing fundamentals explained: unit tests for validators and edge-case inputs
- SQL fundamentals explained: when LIKE and indexes beat application-side filtering