
Regular expressions (regex) explained: patterns, groups and pitfalls

A regular expression is a compact pattern language for matching text. Need to validate an email field, extract dates from log lines, or rename a thousand files? Regex is often the fastest answer, until it becomes the hardest-to-debug part of your codebase. This guide explains the syntax that works across most engines, walks through patterns you will actually use, and covers the traps (greedy quantifiers, catastrophic backtracking, engine differences) that turn a one-liner into a production incident. If you are new to string handling in general, start with our Python fundamentals primer or JavaScript event loop guide for language context; this page focuses on regex itself.

What regex is — and what it is not

Regex engines scan input left to right, trying to match your pattern starting at each position. A search succeeds as soon as the whole pattern aligns with some part of the text; a global search keeps going and reports every such match. Regex is excellent for structured-ish text: log parsing, input validation, search-and-replace in editors, and lightweight extraction.

Regex is not an HTML parser, a JSON validator, or a general programming language. Patterns that try to match nested tags or arbitrary JSON quickly become unreadable and brittle. For hierarchical data, use a DOM parser, JSON.parse, or a proper grammar tool. Regex belongs in the "quick filter" layer, not the core data model.

Core building blocks

Literals and escaping

Most characters match themselves: cat matches the substring "cat". Metacharacters (. * + ? ^ $ ( ) [ ] { } | \) need a backslash to match literally: \. matches a period. When you build a pattern from a JavaScript string literal (for example with new RegExp), you write \\d because the string literal itself consumes one backslash.
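The escaping rules above can be sketched in a few lines. Python is used here purely as an illustration language, and the sample strings are made up. re.escape is the standard-library helper for escaping arbitrary literal text.

```python
import re

# An unescaped dot matches any character, so "3.14" also matches "3x14".
dot_matches_x = re.fullmatch(r"3.14", "3x14") is not None

# Escaping the dot restricts the match to a literal period.
escaped_matches_x = re.fullmatch(r"3\.14", "3x14") is not None
escaped_matches_dot = re.fullmatch(r"3\.14", "3.14") is not None

# re.escape() backslash-escapes metacharacters in arbitrary literal text,
# which is safer than escaping by hand.
literal = "price (USD) $5.00?"
pattern = re.compile(re.escape(literal))
found = pattern.search("Total price (USD) $5.00? yes")

print(dot_matches_x, escaped_matches_x, escaped_matches_dot)
print(found.group(0))
```

Raw strings (the r"..." prefix) keep Python from interpreting the backslashes before the regex engine sees them, which sidesteps the double-backslash problem described above.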

The dot and character classes

. matches any single character except newline (unless the s flag is set). \d matches a digit; \w matches word characters (letters, digits, underscore); \s matches whitespace. Uppercase versions (\D, \W, \S) invert the set.

Square brackets define custom classes: [aeiou] matches one vowel; [a-z0-9_-] matches a slug character. A leading caret inside brackets negates: [^0-9] matches any non-digit. Ranges are inclusive; put - first or last if you need a literal hyphen.
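A short sketch of these classes in Python, assuming invented sample strings. It shows shorthand classes, a custom class with a trailing literal hyphen, and a negated class.

```python
import re

text = "Order 66 shipped to my-slug_01 at 09:30"

# \d+ collects runs of digits.
digits = re.findall(r"\d+", text)

# [aeiou] matches exactly one vowel per match.
vowels = re.findall(r"[aeiou]", "regex")

# The hyphen is placed last in the class, so it is a literal hyphen.
is_slug = re.fullmatch(r"[a-z0-9_-]+", "my-slug_01") is not None

# A negated class removes everything that is not a digit.
only_digits = re.sub(r"[^0-9]", "", "tel: 555-0199")

print(digits, vowels, is_slug, only_digits)
```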

Quantifiers: how many times

  • * — zero or more (greedy)
  • + — one or more (greedy)
  • ? — zero or one (greedy)
  • {3} — exactly three
  • {2,5} — between two and five
  • {4,} — four or more

Append ? after a quantifier for lazy (non-greedy) matching: .*? matches as few characters as possible. This matters when extracting content between delimiters — greedy .* often swallows too much.
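As an illustration of the quantifiers listed above, here is a minimal Python sketch. The inputs are arbitrary examples chosen to hit each bound.

```python
import re

# Counted quantifiers: exact, bounded range, and open-ended minimum.
exactly_three = re.fullmatch(r"\d{3}", "123") is not None
too_many = re.fullmatch(r"\d{3}", "1234") is not None
in_range = re.fullmatch(r"\d{2,5}", "12345") is not None
below_range = re.fullmatch(r"\d{2,5}", "1") is not None
four_or_more = re.fullmatch(r"\d{4,}", "1234567") is not None

# * can match zero characters, + needs at least one.
star = re.match(r"a*", "bbb").group(0)
plus = re.match(r"a+", "bbb")

# Greedy takes as much as it can; lazy takes as little as it can.
greedy = re.match(r"\d+", "12345").group(0)
lazy = re.match(r"\d+?", "12345").group(0)

print(repr(star), plus, greedy, lazy)
```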

Anchors and word boundaries

^ anchors to the start of the string and $ to the end. \b is a word boundary: the position between a word character and a non-word character. \bcat\b matches "cat" in "the cat sat" but not in "category". Multiline mode (m flag) makes ^ and $ also match at the start and end of each line inside the string.
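The anchor behavior described above can be checked with a small Python sketch. The log text is a made-up example.

```python
import re

# \b keeps "cat" from matching inside a longer word.
cat_in_sentence = re.search(r"\bcat\b", "the cat sat") is not None
cat_in_category = re.search(r"\bcat\b", "category") is not None

log = "error: disk full\ninfo: retrying\nerror: timeout"

# Without re.MULTILINE, ^ matches only at the very start of the string.
single = re.findall(r"^error: (\w+)", log)

# With re.MULTILINE, ^ also matches right after each line break.
multi = re.findall(r"^error: (\w+)", log, re.MULTILINE)

print(cat_in_sentence, cat_in_category, single, multi)
```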

Groups, alternation and backreferences

Parentheses create capturing groups: (\d{4})-(\d{2})-(\d{2}) on "2026-06-07" captures year, month, and day in groups 1, 2, and 3. Most APIs expose these as match[1], match[2], etc., or named groups (?<year>\d{4}) in modern engines.

Non-capturing groups (?:...) group without storing a capture — useful when you need precedence but not extraction. The pipe | is alternation: cat|dog matches either word. colou?r is quantifier shorthand for "color" or "colour".

Backreferences like \1 repeat whatever group 1 matched — handy for finding doubled words (\b(\w+)\s+\1\b) but easy to misuse across engine versions.
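A minimal Python sketch of groups, alternation, and backreferences, using the date and doubled-word patterns from the text. One engine difference shows up immediately: Python's re module spells named groups (?P<name>...) rather than (?<name>...).

```python
import re

# Numbered capturing groups.
m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", "2026-06-07")
year, month, day = m.group(1), m.group(2), m.group(3)

# Named groups, in Python's (?P<name>...) spelling.
named = re.fullmatch(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2026-06-07"
)
parts = named.groupdict()

# A non-capturing group gives alternation precedence without a capture,
# so findall returns whole matches instead of group contents.
pets = re.findall(r"(?:cat|dog)s?", "cats and dog")

# \1 must repeat exactly what group 1 matched: a doubled word.
doubled = re.findall(r"\b(\w+)\s+\1\b", "it was was the the best")

print(year, month, day, parts, pets, doubled)
```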

Lookahead and lookbehind

Lookahead asserts what comes next without consuming it:

  • (?=...) — positive lookahead: \d+(?=px) matches digits only when followed by "px".
  • (?!...) — negative lookahead: \b\w+\b(?!\s+is\b) matches words not followed by " is".

Lookbehind asserts what came before: (?<=\$)\d+ matches digits immediately after a dollar sign. Support varies by engine: some (Python's re module, for example) require the lookbehind to be fixed-width, and JavaScript had no lookbehind at all until ES2018. Lookahead in particular helps with password rules ("at least one digit") without splitting the string manually.
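The lookaround constructs above, sketched in Python with invented sample text. The password rule (at least eight characters, one digit, one lowercase letter) is a hypothetical policy chosen only to show stacked lookaheads.

```python
import re

# Positive lookahead: digits only when followed by "px".
px_values = re.findall(r"\d+(?=px)", "width: 100px; height: 20em; margin: 8px")

# Negative lookahead: whole words not followed by " is".
not_before_is = re.findall(r"\b\w+\b(?!\s+is\b)", "this is fine")

# Lookbehind: digits immediately after a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", "cost $45, qty 3, fee $7")

# Hypothetical password rule: each lookahead checks one requirement
# from the start of the string without consuming any characters.
STRONG_ENOUGH = re.compile(r"(?=.*\d)(?=.*[a-z]).{8,}")
ok = STRONG_ENOUGH.fullmatch("abcdefg1") is not None
no_digit = STRONG_ENOUGH.fullmatch("abcdefgh") is not None
too_short = STRONG_ENOUGH.fullmatch("abc1") is not None

print(px_values, not_before_is, dollars, ok, no_digit, too_short)
```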

Flags and engine flavors

Flags modify matching behavior. Common ones:

  • g — global: find all matches, not just the first
  • i — case-insensitive
  • m — multiline: ^ and $ match line boundaries
  • s — dotall: . includes newlines
  • u — Unicode-aware (important for emoji and non-Latin scripts)

JavaScript uses /pattern/flags or new RegExp('pattern', 'flags'). Python uses re.compile(r'pattern', re.I). PCRE (PHP, Perl, many CLI tools) adds features like recursive patterns. Always check which dialect your runtime supports before copying a Stack Overflow answer — subtle differences in \b, lookbehind, and Unicode classes cause cross-language bugs.
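For comparison, here is how the flags above look in Python, as a small sketch with made-up input. Python has no g flag: re.search returns the first match, while re.findall and re.finditer return all of them.

```python
import re

text = "Error: one\nok\nERROR: two"

# Flags combine with the | operator.
pattern = re.compile(r"^error.*$", re.IGNORECASE | re.MULTILINE)

first = pattern.search(text).group(0)   # first match only
hits = pattern.findall(text)            # every match, like a global search

# By default the dot stops at a newline; re.DOTALL lets it cross lines.
without_dotall = re.search(r"<b>.*</b>", "<b>a\nb</b>")
with_dotall = re.search(r"<b>.*</b>", "<b>a\nb</b>", re.DOTALL)

print(first, hits, without_dotall is None, with_dotall is not None)
```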

Practical patterns (with honest caveats)

Email-ish validation

A pragmatic filter: ^[^\s@]+@[^\s@]+\.[^\s@]+$ catches obvious typos but does not attempt RFC 5322 compliance. Real email validation needs DNS MX checks and mailbox confirmation; regex is a first gate, not proof of deliverability.
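A minimal sketch of that first gate in Python. The function name looks_like_email is a hypothetical helper, not part of any library. One Python-specific detail: $ also matches just before a trailing newline, so this sketch uses fullmatch instead of relying on the anchors.

```python
import re

# Same shape check as the pattern above; fullmatch supplies the anchoring.
EMAIL_SHAPE = re.compile(r"[^\s@]+@[^\s@]+\.[^\s@]+")


def looks_like_email(value: str) -> bool:
    """Return True if value has a plausible local@domain.tld shape."""
    return EMAIL_SHAPE.fullmatch(value) is not None


print(looks_like_email("user@example.com"))    # plausible shape
print(looks_like_email("user@example"))        # no dot in the domain part
print(looks_like_email("user @example.com"))   # whitespace is rejected
print(looks_like_email("user@example.com\n"))  # trailing newline is rejected
```

A True result here only means the shape is plausible; deliverability still needs the checks mentioned above.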

ISO dates in logs

\b(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2}) extracts timestamps from structured logs. Pair with a datetime library for validation — regex will happily match 2026-99-99.
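One way to pair the pattern with a datetime library, sketched in Python. The helper name parse_timestamps and the sample log line are assumptions for illustration. The regex finds candidates, and datetime.strptime rejects impossible values such as month 99.

```python
import re
from datetime import datetime

TIMESTAMP = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})")


def parse_timestamps(line: str) -> list:
    """Return datetimes for every well-formed, valid timestamp in line."""
    results = []
    for match in TIMESTAMP.finditer(line):
        try:
            results.append(datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M:%S"))
        except ValueError:
            # Shape matched, but the values are not a real date or time.
            continue
    return results


line = "2026-06-07T12:30:00 started; 2026-99-99T00:00:00 corrupted"
parsed = parse_timestamps(line)
print(parsed)
```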

URL path segments

^/[a-z0-9]+(?:/[a-z0-9-]+)*$ validates simple slug paths. Full URL parsing belongs in a URL library (Python's urllib.parse or the JavaScript URL constructor), not regex.
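A sketch of that division of labor in Python: urllib.parse.urlsplit takes the URL apart, and the regex only checks the path shape. The helper name has_slug_path and the example.com URLs are illustrative assumptions.

```python
import re
from urllib.parse import urlsplit

# Same slug-path pattern as above; fullmatch supplies the anchoring.
SLUG_PATH = re.compile(r"/[a-z0-9]+(?:/[a-z0-9-]+)*")


def has_slug_path(url: str) -> bool:
    """Parse the URL with the standard library, then check only the path."""
    path = urlsplit(url).path
    return SLUG_PATH.fullmatch(path) is not None


print(has_slug_path("https://example.com/blog/my-post?x=1"))  # query is ignored
print(has_slug_path("https://example.com/Blog"))              # uppercase rejected
print(has_slug_path("https://example.com/blog/"))             # trailing slash rejected
```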

Hex colors and hashes

^#(?:[0-9a-fA-F]{3}){1,2}$ for CSS hex colors; \b[0-9a-f]{64}\b for SHA-256 hex strings (see our hashing guide for what those digests mean).
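Both patterns can be exercised with a short Python sketch. The digest is computed with hashlib so the sample input is a real SHA-256 hex string; the surrounding text is made up.

```python
import hashlib
import re

HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{3}){1,2}")
SHA256_HEX = re.compile(r"\b[0-9a-f]{64}\b")

# {3} repeated once or twice means exactly 3 or exactly 6 hex digits.
short_form = HEX_COLOR.fullmatch("#fff") is not None
long_form = HEX_COLOR.fullmatch("#ffcc00") is not None
four_digits = HEX_COLOR.fullmatch("#ffcc") is not None
bad_chars = HEX_COLOR.fullmatch("#ggg") is not None

# Find a 64-character lowercase hex digest embedded in other text.
digest = hashlib.sha256(b"hello").hexdigest()
found = SHA256_HEX.search(f"checksum={digest} ok")

print(short_form, long_form, four_digits, bad_chars)
print(found.group(0) == digest)
```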

Search and replace

Capture groups can be referenced in replacement strings. In JavaScript, '20260607'.replace(/(\d{4})(\d{2})(\d{2})/, '$1-$2-$3') turns "20260607" into "2026-06-07". Python uses \1 in the replacement string with re.sub. Named groups make replacements readable in long pipelines.
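The Python side of that replacement, as a minimal sketch. It shows numbered references, named references with \g<name>, and a function as the replacement; the group names are arbitrary.

```python
import re

# Numbered backreferences in the replacement string.
numbered = re.sub(r"(\d{4})(\d{2})(\d{2})", r"\1-\2-\3", "20260607")

# Named groups are referenced as \g<name> in the replacement.
named = re.sub(
    r"(?P<y>\d{4})(?P<m>\d{2})(?P<d>\d{2})",
    r"\g<d>/\g<m>/\g<y>",
    "20260607",
)

# A function can compute the replacement from each match object.
shouted = re.sub(r"\b(\w)(\w*)", lambda m: m.group(1).upper() + m.group(2), "hello big world")

print(numbered, named, shouted)
```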

Greedy vs lazy: a classic bug

Consider extracting HTML-ish text between tags (already a smell — use a parser in production):

Greedy: <div>(.*)</div> on <div>a</div><div>b</div> captures a</div><div>b — the .* ate everything up to the last closing tag.

Lazy: <div>(.*?)</div> stops at the first closing tag. For nested structures, neither works reliably — that is the parser argument again.
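The difference is easy to reproduce. This Python sketch runs both patterns on the input from the example, plus a nested case (an invented string) where the lazy version also returns the wrong slice.

```python
import re

html = "<div>a</div><div>b</div>"

# Greedy: .* runs to the last closing tag.
greedy = re.findall(r"<div>(.*)</div>", html)

# Lazy: .*? stops at the first closing tag each time.
lazy = re.findall(r"<div>(.*?)</div>", html)

# Nested input: lazy stops at the inner closing tag, which is still wrong.
nested = "<div>outer <div>inner</div> tail</div>"
lazy_nested = re.findall(r"<div>(.*?)</div>", nested)

print(greedy)
print(lazy)
print(lazy_nested)
```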

Performance: catastrophic backtracking

Nested quantifiers on overlapping character classes can make engines explore exponentially many paths. The pattern (a+)+$ against a long string of "a"s ending in "b" may hang your server. This is catastrophic backtracking (ReDoS — regular expression denial of service).

Mitigations:

  • Avoid nested quantifiers: prefer a+$ over (a+)+$.
  • Use possessive quantifiers or atomic groups where supported (a++ in PCRE).
  • Set match timeouts — JavaScript has none natively; run untrusted patterns in a worker with a deadline or use a safe subset library.
  • Never compile user-supplied regex without review and length limits.
  • Test edge cases in your test suite — include long inputs and near-miss strings.
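A rough Python sketch of two of these mitigations: preferring the flat pattern, and capping input length before matching. The names timed_search, guarded_search, and MAX_INPUT_LENGTH, and the limit of 1000, are assumptions for illustration. The input is kept short on purpose, because each extra "a" roughly doubles the work for the nested pattern; actual timings depend on the machine and engine.

```python
import re
import time


def timed_search(pattern: str, text: str):
    """Run re.search and return (match, elapsed_seconds)."""
    start = time.perf_counter()
    result = re.search(pattern, text)
    return result, time.perf_counter() - start


# Near-miss input: many "a"s, then a character that makes the match fail.
text = "a" * 18 + "b"
nested_result, nested_secs = timed_search(r"(a+)+$", text)
flat_result, flat_secs = timed_search(r"a+$", text)
print(f"nested: {nested_secs:.4f}s  flat: {flat_secs:.6f}s")

MAX_INPUT_LENGTH = 1000  # hypothetical limit; tune for your data


def guarded_search(pattern: re.Pattern, text: str):
    """Refuse oversized input before handing it to the regex engine."""
    if len(text) > MAX_INPUT_LENGTH:
        raise ValueError("input too long for regex matching")
    return pattern.search(text)
```

Both patterns give the same answer (no match) on this input; only the amount of backtracking differs. A length cap bounds the damage but does not make a pathological pattern safe, so it complements simplifying the pattern rather than replacing it.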

Regex vs other tools

Task                                 Better tool
Parse JSON or XML                    JSON.parse, DOMParser, dedicated parser
Query database text                  SQL LIKE / full-text index / ILIKE with indexes (see SQL fundamentals)
Validate complex forms               Schema validators (Zod, Pydantic) plus regex for simple fields
Tokenize code                        Lexer generator (lexer, tree-sitter)
Extract fields from logs at scale    Structured logging (JSON lines) plus a log aggregator

Regex shines when the pattern is stable, the input is flat text, and a false negative costs less than spinning up a parser. Reach for something heavier when nesting, context, or schema evolution enters the picture.

Production checklist

  • Document what the pattern accepts and rejects with three positive and three negative examples.
  • Anchor validation patterns (^...$) so partial matches do not slip through.
  • Prefer character classes over . when you know the allowed alphabet.
  • Use named capture groups for anything you will read six months later.
  • Run the pattern against fuzzed or property-based inputs if it touches user data.
  • Log match failures with the input snippet redacted — not the full regex on every request.
  • Centralize patterns in one module; copy-pasted regex diverges silently.
  • When a pattern exceeds ~80 characters, add a comment or split into named sub-patterns.
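Several checklist items can live together in one place. The following Python sketch assumes a simple date-shape pattern and made-up example lists: it uses named groups, splits the pattern across commented lines with re.VERBOSE, and keeps three accepted and three rejected examples next to the pattern.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments.
DATE_SHAPE = re.compile(
    r"""
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
    """,
    re.VERBOSE,
)

# Documented examples: what the pattern accepts and what it rejects.
ACCEPT = ["2026-06-07", "1999-12-31", "0001-01-01"]
REJECT = ["2026-6-7", "20260607", " 2026-06-07"]

accepted = [s for s in ACCEPT if DATE_SHAPE.fullmatch(s)]
rejected = [s for s in REJECT if DATE_SHAPE.fullmatch(s) is None]

print(accepted == ACCEPT, rejected == REJECT)
print(DATE_SHAPE.fullmatch("2026-06-07").group("month"))
```

Keeping the example lists beside the pattern makes them easy to turn into unit tests, which covers the documentation and edge-case items at the same time.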

Key takeaways

  • Regex matches text via literals, classes, quantifiers, anchors, and groups — flags change global behavior.
  • Lazy quantifiers and lookahead solve many "match the smallest slice" problems greedy defaults miss.
  • Engine dialects differ; test in the runtime you ship, not just in an online tester.
  • Catastrophic backtracking is a real DoS vector — simplify nested quantifiers and timeout untrusted patterns.
  • Regex validates shape, not semantics — pair patterns with libraries for dates, URLs, and email.
  • When structure gets nested, stop regex and parse properly.
