
Regular Expressions

Regular expressions are a domain-specific language for describing patterns in text. Java's regex support is built into the java.util.regex package (Pattern and Matcher) and exposed through convenience methods on String (matches(), replaceAll(), split()). Regex enables validation, extraction, transformation, and splitting operations that would require many lines of procedural code to implement manually. This entry covers the complete Java regex syntax — character classes, quantifiers, anchors, groups, backreferences, lookaheads — and the practical knowledge needed to write correct, readable, and performant expressions.

Regex Syntax — Building Blocks

A regular expression describes a set of strings. Literal characters match themselves. The dot (.) matches any character except a line terminator by default. Character classes enclosed in square brackets match any one character from the set: [aeiou] matches any lowercase vowel, [a-z] matches any lowercase ASCII letter, [^0-9] matches any character that is not an ASCII digit. Predefined character classes provide shorthand: \d matches [0-9], \w matches [a-zA-Z0-9_], \s matches whitespace, and their uppercase counterparts (\D, \W, \S) match the complement.

Quantifiers specify how many times the preceding element can match. The star (*) means zero or more. The plus (+) means one or more. The question mark (?) means zero or one (optional). Braces specify exact counts: {n} matches exactly n times, {n,} matches n or more times, {n,m} matches between n and m times inclusive. By default all quantifiers are greedy: they match as many characters as possible while still allowing the overall pattern to succeed. Adding ? after a quantifier makes it reluctant (lazy), matching as few characters as possible.

Anchors do not match characters; they assert positions. The caret (^) asserts the start of the input (or the start of a line in MULTILINE mode). The dollar ($) asserts the end of the input, or in Java the position just before a final line terminator (or the end of a line in MULTILINE mode). \b asserts a word boundary: a position with a word character on one side and a non-word character or the edge of the input on the other. \B asserts a position that is not a word boundary. Correct use of anchors is critical for validation with Matcher.find(): an email pattern without ^ and $ would match any string containing a valid email, not only strings that are entirely valid emails. String.matches() and Matcher.matches() always require the whole input to match, so they behave as if the pattern were anchored.

Java

// ── Character classes ────────────────────────────────────────────────
"cat".matches("[abc]at")          // true  — [abc] matches 'c'
"bat".matches("[abc]at")          // true  — [abc] matches 'b'
"hat".matches("[abc]at")          // false — 'h' not in [abc]

"Hello".matches("[A-Za-z]+")      // true  — all letters
"Hello2".matches("[A-Za-z]+")     // false — contains digit

// ── Predefined classes ────────────────────────────────────────────────
"hello123".matches("\\w+")        // true  — word chars only
"hello 123".matches("\\w+")       // false — space is not \w

"  \t\n".matches("\\s+")          // true  — all whitespace
"42".matches("\\d+")              // true  — all digits
"42abc".matches("\\d+")           // false — contains non-digits

// ── Quantifiers ───────────────────────────────────────────────────────
"colour".matches("colou?r")       // true  — 'u' is optional (?)
"color".matches("colou?r")        // true  — 'u' absent, ? allows zero

"aaa".matches("a{3}")             // true  — exactly 3
"aa".matches("a{3}")              // false — only 2
"aaaa".matches("a{2,4}")          // true  — between 2 and 4
"a".matches("a{2,4}")             // false — less than 2

// ── Anchors ───────────────────────────────────────────────────────────
// matches() implicitly anchors to full string:
"hello world".matches("hello")    // false — not the full string
"hello world".matches(".*hello.*")// true  — .* allows any prefix/suffix

// find() searches anywhere in the input, so use ^ and $ explicitly:
Pattern p = Pattern.compile("^\\d{5}$");  // exactly 5 digits, full string
p.matcher("12345").find()         // true
p.matcher("1234").find()          // false
p.matcher("123456").find()        // false (without ^ and $, find() would return true)
p.matcher("12 345").find()        // false

// ── Greedy vs reluctant quantifiers ──────────────────────────────────
// Input: "<a>hello</a><b>world</b>"
Pattern greedy    = Pattern.compile("<.*>");    // greedy — matches as much as possible
Pattern reluctant = Pattern.compile("<.*?>");   // reluctant — matches as little as possible

Matcher mg = greedy.matcher("<a>hello</a><b>world</b>");
mg.find();
System.out.println(mg.group());  // <a>hello</a><b>world</b> — whole string

Matcher mr = reluctant.matcher("<a>hello</a><b>world</b>");
mr.find();
System.out.println(mr.group());  // <a> — just the first tag
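
The word-boundary and anchoring rules described above can be checked with a small sketch. The class name BoundaryDemo and its helper methods are hypothetical, introduced here only for illustration; the sketch assumes find() is used for searching, which is where explicit anchors matter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class BoundaryDemo {

    // Counts occurrences of `word` as a whole word, bounded by \b on both sides.
    // Pattern.quote() keeps any regex metacharacters in `word` literal.
    static int countWholeWord(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Unanchored: true if five consecutive digits appear anywhere in the input.
    static boolean containsFiveDigits(String s) {
        return Pattern.compile("\\d{5}").matcher(s).find();
    }

    // Anchored: with ^ and $, find() succeeds only on a five-digit input.
    static boolean isExactlyFiveDigits(String s) {
        return Pattern.compile("^\\d{5}$").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(countWholeWord("cat concat cat, category", "cat")); // 2
        System.out.println(containsFiveDigits("id: 123456"));   // true
        System.out.println(isExactlyFiveDigits("id: 123456"));  // false
        System.out.println(isExactlyFiveDigits("12345"));       // true
    }
}
```

Only the standalone "cat" and the "cat" before the comma count: in "concat" and "category" a word character sits on one side of the candidate match, so \b fails there.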

Groups, Backreferences, and Alternation

Parentheses create capturing groups that serve two purposes: they group part of the pattern so quantifiers can apply to the whole group, and they capture the matched text for extraction. Groups are numbered left-to-right starting at 1 based on the position of their opening parenthesis. Group 0 always refers to the entire match. Captured groups are available through Matcher.group(n) after a successful match.

Named capturing groups, written (?<name>pattern), allow extracting groups by name rather than number. Named groups make patterns self-documenting: group("year") is more readable than group(1) in a date extraction pattern, and the code is robust against reordering groups.

Non-capturing groups, written (?:pattern), group without capturing. They are useful when parentheses are needed for grouping or quantification but the captured text is not needed. Using non-capturing groups avoids the overhead of capturing and keeps group numbers aligned with the semantically meaningful captures.

Backreferences (\1, \2, etc.) match the same text captured by a group earlier in the pattern. They are used to detect repeated content: (\w+)\s+\1 matches a run of word characters followed by whitespace and the same text again (like "the the"); adding word boundaries, as in \b(\w+)\s+\1\b, restricts the match to whole words. The alternation operator | acts like logical OR: cat|dog matches either "cat" or "dog". Alternation has the lowest precedence of any regex operator, so parentheses are needed to limit its scope: gr(a|e)y matches "gray" or "grey", whereas gra|ey matches "gra" or "ey".

Java

// ── Capturing groups ─────────────────────────────────────────────────
Pattern datePattern = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
Matcher m = datePattern.matcher("Born on 1990-03-15, graduated 2012-06-20");

while (m.find()) {
    System.out.printf("Full: %s  Year: %s  Month: %s  Day: %s%n",
        m.group(0),   // full match: "1990-03-15"
        m.group(1),   // group 1: "1990" (year)
        m.group(2),   // group 2: "03"   (month)
        m.group(3));  // group 3: "15"   (day)
}

// ── Named capturing groups ────────────────────────────────────────────
Pattern namedDate = Pattern.compile(
    "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");
Matcher mn = namedDate.matcher("2024-03-15");
if (mn.matches()) {
    System.out.println(mn.group("year"));   // 2024
    System.out.println(mn.group("month"));  // 03
    System.out.println(mn.group("day"));    // 15
}

// ── Non-capturing groups ──────────────────────────────────────────────
Pattern p = Pattern.compile("(?:Mr|Ms|Dr)\\.\\s+(\\w+)");
Matcher mp = p.matcher("Hello Dr. Smith and Ms. Jones");
while (mp.find()) {
    System.out.println(mp.group(1));   // Smith, Jones
    // group(1) is the name; the title inside (?:...) is grouped but not captured
}

// ── Backreferences — detect repeated words ────────────────────────────
Pattern repeated = Pattern.compile("\\b(\\w+)\\s+\\1\\b");
Matcher mr = repeated.matcher("the the quick brown brown fox");
while (mr.find()) {
    System.out.println("Repeated: " + mr.group(1));
}
// Repeated: the
// Repeated: brown

// ── Alternation ───────────────────────────────────────────────────────
Pattern colours = Pattern.compile("red|green|blue");
Matcher mc = colours.matcher("I like red and blue");
while (mc.find()) {
    System.out.println("Found: " + mc.group());
}
// Found: red
// Found: blue

// Alternation with groups
Pattern protocol = Pattern.compile("^(https?|ftp)://(.+)$");
Matcher mp2 = protocol.matcher("https://example.com/path");
if (mp2.matches()) {
    System.out.println(mp2.group(1));  // https
    System.out.println(mp2.group(2));  // example.com/path
}
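
Named groups can also be referenced in replacement strings and inside the pattern itself. The following is a minimal sketch, assuming the ${name} replacement syntax and the \k<name> backreference syntax supported by java.util.regex; the class NamedGroupDemo and its method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class NamedGroupDemo {

    private static final Pattern ISO = Pattern.compile(
        "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    // \k<word> is a backreference to the named group "word".
    private static final Pattern REPEATED = Pattern.compile(
        "\\b(?<word>\\w+)\\s+\\k<word>\\b");

    // Rewrites every YYYY-MM-DD in the text as DD/MM/YYYY.
    // ${name} in the replacement refers to a named group.
    static String toDayFirst(String text) {
        return ISO.matcher(text).replaceAll("${day}/${month}/${year}");
    }

    // Returns the year of the first date found, or null when there is none.
    static String firstYear(String text) {
        Matcher m = ISO.matcher(text);
        return m.find() ? m.group("year") : null;
    }

    // Returns the first immediately repeated word, or null when there is none.
    static String firstRepeatedWord(String text) {
        Matcher m = REPEATED.matcher(text);
        return m.find() ? m.group("word") : null;
    }

    public static void main(String[] args) {
        System.out.println(toDayFirst("Report date: 2024-03-15")); // Report date: 15/03/2024
        System.out.println(firstYear("Born on 1990-03-15"));       // 1990
        System.out.println(firstRepeatedWord("it was was fine"));  // was
    }
}
```

Because every reference uses a name, adding or reordering groups in ISO would not silently change which text each reference picks up.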

Lookaheads, Lookbehinds, and Flags

Lookaheads and lookbehinds are zero-width assertions that test what comes before or after the current position without consuming characters. Positive lookahead (?=pattern) asserts that pattern matches immediately after the current position. Negative lookahead (?!pattern) asserts that pattern does not match immediately after. Positive lookbehind (?<=pattern) asserts that pattern matches immediately before the current position. Negative lookbehind (?<!pattern) asserts that pattern does not match immediately before. Lookaheads and lookbehinds are powerful for context-sensitive matching: extracting a number followed by a currency symbol without including the symbol in the match, splitting on commas only outside of quoted strings, or finding words not preceded by certain prefixes.

Regex flags modify the matching behaviour. CASE_INSENSITIVE makes the pattern case-insensitive; by default this applies to US-ASCII characters only. UNICODE_CASE extends CASE_INSENSITIVE to Unicode case folding. MULTILINE makes ^ and $ match at the start and end of each line as well as at the start and end of the input. DOTALL makes . match line terminators as well as other characters (useful for multi-line patterns). Flags are passed as a second argument to Pattern.compile(), combined with the bitwise OR operator when more than one is needed, or written inline in the pattern with (?flags) syntax.

Java

// ── Lookaheads ───────────────────────────────────────────────────────
// Positive lookahead: match digits followed by "px"
Pattern pixelValue = Pattern.compile("\\d+(?=px)");
Matcher m = pixelValue.matcher("margin: 20px padding: 10px");
while (m.find()) {
    System.out.println(m.group());  // 20, 10 (without "px")
}

// Negative lookahead: match "colour" not followed by "ful"
Pattern colour = Pattern.compile("colour(?!ful)");
System.out.println(colour.matcher("colour scheme").find()); // true
System.out.println(colour.matcher("colourful").find());     // false: "colour" is followed by "ful"

// ── Lookbehinds ──────────────────────────────────────────────────────
// Positive lookbehind: match digits preceded by "$"
Pattern price = Pattern.compile("(?<=\\$)\\d+\\.?\\d*");
Matcher mp = price.matcher("Cost: $42.99 and $15.00");
while (mp.find()) {
    System.out.println(mp.group());  // 42.99, 15.00 (without "$")
}

// Negative lookbehind: match "java" not preceded by "not "
Pattern javaRef = Pattern.compile("(?<!not )java");
Matcher mj = javaRef.matcher("I like java but not javascript");
while (mj.find()) {
    System.out.println(mj.group() + " at " + mj.start());  // java at 7
}

// ── Regex flags ───────────────────────────────────────────────────────
// Case insensitive
Pattern ci = Pattern.compile("hello", Pattern.CASE_INSENSITIVE);
System.out.println(ci.matcher("HELLO").matches());  // true
System.out.println(ci.matcher("Hello").matches());  // true

// Multiline — ^ and $ match per line
Pattern ml = Pattern.compile("^\\d+", Pattern.MULTILINE);
Matcher mm = ml.matcher("123 abc\n456 def\n789 ghi");
while (mm.find()) {
    System.out.println(mm.group());   // 123, 456, 789
}

// DOTALL — . matches newline too
Pattern ds = Pattern.compile("<p>.*?</p>", Pattern.DOTALL);
Matcher md = ds.matcher("<p>line1\nline2\nline3</p>");
System.out.println(md.find());   // true — . crossed the newline

// Inline flags in pattern string
Pattern inline = Pattern.compile("(?i)hello");  // case insensitive
Pattern combined = Pattern.compile("(?im)^start"); // multiline + case insensitive
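
One of the use cases mentioned above, splitting on commas only outside quoted strings, can be sketched with a positive lookahead. This is an illustration under stated assumptions rather than a full CSV parser: it assumes double quotes are balanced and never escaped inside a field. The class QuotedSplitDemo and its method name are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class QuotedSplitDemo {

    // A comma is a separator only if an even number of double quotes
    // follows it up to the end of the input, meaning it is outside quotes.
    // Assumption: quotes are balanced and not escaped.
    private static final Pattern COMMA_OUTSIDE_QUOTES =
        Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    static List<String> splitFields(String line) {
        return Arrays.asList(COMMA_OUTSIDE_QUOTES.split(line));
    }

    public static void main(String[] args) {
        // The comma inside "b,c" has one quote after it, so the lookahead fails there.
        System.out.println(splitFields("a,\"b,c\",d"));  // [a, "b,c", d]
        System.out.println(splitFields("x,y"));          // [x, y]
    }
}
```

The lookahead inspects the rest of the line without consuming it, so only the comma itself is removed by split(). For real CSV input with escaped quotes or embedded newlines, a dedicated parser is likely the safer choice.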

Common Validation and Extraction Patterns

Certain regex patterns appear repeatedly in Java applications. Email validation, URL parsing, IP address validation, phone number matching, date extraction, and password policy enforcement all have well-established patterns. However, regex-based email validation in particular is notoriously difficult to get perfectly right per the RFC specification. The practical approach is to use a reasonable approximation and validate further by sending a confirmation email.

The most important performance consideration for regex is compile-once-reuse-many. Pattern.compile() is comparatively expensive: it parses the regex and compiles it to an internal NFA representation. Calling String.matches() or String.replaceAll() with a regex string compiles the pattern on every call. For patterns used in loops or request handlers, compile the pattern once as a static final field and reuse the compiled Pattern object. Pattern instances are immutable and safe to share between threads; Matcher instances are not, so create a new Matcher per use. Reusing the compiled pattern avoids repeated compilation, which can dominate the cost of matching short inputs at high frequency.

Java

// ── Compile once — reuse many ─────────────────────────────────────────
// WRONG — compiles on every call:
public boolean isValidEmail(String email) {
    return email.matches("[\\w.+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}");
}

// CORRECT — compile once as static final:
private static final Pattern EMAIL_PATTERN = Pattern.compile(
    "^[\\w.+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}$");

public boolean isValidEmail(String email) {
    return EMAIL_PATTERN.matcher(email).matches();
}

// ── Common validation patterns ────────────────────────────────────────
// Email (reasonable approximation)
Pattern EMAIL = Pattern.compile(
    "^[\\w.+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}$");

// URL
Pattern URL = Pattern.compile(
    "^(https?|ftp)://[^\\s/$.?#].[^\\s]*$",
    Pattern.CASE_INSENSITIVE);

// IPv4 address
Pattern IPV4 = Pattern.compile(
    "^((25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\.){3}" +
    "(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)$");

// Postal code (US ZIP)
Pattern ZIP = Pattern.compile("^\\d{5}(-\\d{4})?$");

// Password: 8+ chars, at least one upper, lower, digit, special
Pattern PASSWORD = Pattern.compile(
    "^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d)(?=.*[@#$!%*?&])[A-Za-z\\d@#$!%*?&]{8,}$");

// ISO date: YYYY-MM-DD
Pattern ISO_DATE = Pattern.compile(
    "^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$");

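// ── Extraction pattern ─────────────────────────────────────────────────
// Needs java.util.ArrayList, java.util.List, java.util.regex.Matcher, java.util.regex.Pattern.
// EMAIL above is anchored with ^ and $, so find() would succeed only when the
// whole text is a single address. Extraction needs an unanchored variant:
private static final Pattern EMAIL_FINDER = Pattern.compile(
    "[\\w.+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}");

public static List<String> extractAllEmails(String text) {
    List<String> emails = new ArrayList<>();
    Matcher m = EMAIL_FINDER.matcher(text);   // reuse compiled pattern
    while (m.find()) emails.add(m.group());
    return emails;
}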

// ── Replacing with groups ─────────────────────────────────────────────
// Reformat date: YYYY-MM-DD → DD/MM/YYYY
String input    = "Report date: 2024-03-15, Filed: 2024-06-01";
String reformatted = input.replaceAll(
    "(\\d{4})-(\\d{2})-(\\d{2})",
    "$3/$2/$1");   // $n references captured groups in replacement
System.out.println(reformatted);
// Report date: 15/03/2024, Filed: 01/06/2024
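
The compile-once advice and a few of the patterns above can be combined into a small utility class. This is a minimal sketch: the class name Validators and its method names are hypothetical, and the patterns are copied from the list above, so they check the shape of the input only.

```java
import java.util.regex.Pattern;

// Hypothetical utility class; the names are illustrative only.
class Validators {

    // Each pattern is compiled once when the class is loaded and then reused.
    private static final Pattern ZIP = Pattern.compile("^\\d{5}(-\\d{4})?$");

    private static final Pattern IPV4 = Pattern.compile(
        "^((25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\.){3}" +
        "(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)$");

    private static final Pattern ISO_DATE = Pattern.compile(
        "^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$");

    // A null input is treated as invalid rather than throwing.
    static boolean isZip(String s) {
        return s != null && ZIP.matcher(s).matches();
    }

    static boolean isIpv4(String s) {
        return s != null && IPV4.matcher(s).matches();
    }

    static boolean isIsoDateShape(String s) {
        return s != null && ISO_DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isZip("12345-6789"));          // true
        System.out.println(isZip("12345-678"));           // false
        System.out.println(isIpv4("192.168.1.1"));        // true
        System.out.println(isIpv4("256.1.1.1"));          // false
        System.out.println(isIsoDateShape("2024-13-01")); // false
        System.out.println(isIsoDateShape("2024-02-30")); // true: shape only, not a real date
    }
}
```

The last line shows the limit of shape checking: the pattern accepts 30 February because it validates digit ranges, not calendar rules.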
