Pattern Class

The Pattern class in java.util.regex is the compiled representation of a regular expression. Compiling a regex string into a Pattern object performs the expensive parsing and analysis once; the resulting Pattern can then be used repeatedly to create Matcher objects that perform the actual matching against specific input strings. Pattern is immutable and thread-safe — a single Pattern instance can be shared across threads and reused across thousands of match operations. This entry covers the full Pattern and Matcher API, the performance implications of compile-once-reuse, all match operations (matches, find, lookingAt), group extraction, find-and-replace, splitting, and the Java 8+ stream integration of Pattern.

Pattern.compile() and the Matcher Lifecycle

Pattern.compile(regex) parses the regular expression string and builds an internal finite automaton representation. This is a non-trivial computation involving lexing, parsing, optimisation, and NFA construction. For a complex pattern this may take microseconds: negligible in isolation, but significant when repeated in a tight loop or on every HTTP request.

The compiled Pattern is then used to create Matcher objects via pattern.matcher(input). A Matcher is stateful: it maintains a position within the input string and advances as find() is called repeatedly. A Matcher is not thread-safe, so multiple threads must not share one. Multiple threads can, however, safely share a single Pattern and each create their own Matcher.

The Matcher's three primary match operations have distinct semantics. matches() attempts to match the entire input against the pattern; it behaves as if the pattern were anchored at both ends (strictly \A and \z rather than ^ and $, because $ also matches just before a final line terminator). lookingAt() matches from the beginning of the input but does not require the match to extend to the end. find() scans through the input looking for the next region that matches the pattern, starting from where the previous find() left off (or from the beginning on the first call). These three operations cover the three fundamental use cases: validation (matches), prefix matching (lookingAt), and searching (find).
Java
// ── Pattern.compile() — parse once ───────────────────────────────────
// CORRECT: compiled once, reused for every validation call
private static final Pattern PHONE_PATTERN = Pattern.compile(
    "^(\\+\\d{1,3}[- ]?)?" +       // optional country code
    "(\\(?\\d{3}\\)?[- ]?)" +       // area code
    "\\d{3}[- ]?\\d{4}$");          // number

public boolean isValidPhone(String phone) {
    return PHONE_PATTERN.matcher(phone).matches(); // fast — no recompile
}

// ── Matcher lifecycle ─────────────────────────────────────────────────
Pattern p = Pattern.compile("\\d+");
String input = "abc 123 def 456 ghi 789";

// Create a new Matcher for this input
Matcher m = p.matcher(input);

// ── matches() — full string must match the pattern ────────────────────
System.out.println(p.matcher("12345").matches());      // true  — all digits
System.out.println(p.matcher("123 45").matches());     // false — space inside

// ── lookingAt() — match from start, not required to reach end ─────────
System.out.println(p.matcher("123abc").lookingAt());   // true  — starts with digits
System.out.println(p.matcher("abc123").lookingAt());   // false — does not start with digit

// ── find() — search through input for next match ─────────────────────
while (m.find()) {
    System.out.printf("Found '%s' at [%d, %d)%n",
        m.group(), m.start(), m.end());
}
// Found '123' at [4, 7)
// Found '456' at [12, 15)
// Found '789' at [20, 23)

// ── reset() — restart the scan from the beginning ────────────────────
m.reset();                   // resets position to start of input
m.find();                    // finds '123' again
System.out.println(m.group()); // 123

// ── reset(newInput) — reuse Matcher with different input ──────────────
m.reset("999 and 888");
while (m.find()) System.out.print(m.group() + " ");
// 999 888
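
The thread-safety rule described above (share the Pattern, never the Matcher) can be sketched as follows. This is a minimal illustration rather than a prescribed design: the class name SharedPatternDemo, its method names, and the pool size of 4 are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SharedPatternDemo {
    // One immutable Pattern, shared by every thread.
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Each call creates its own Matcher, so no Matcher state is ever shared.
    static int countNumbers(String input) {
        Matcher m = NUMBER.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Counts the numbers in each input on a small thread pool (size is arbitrary).
    static List<Integer> countAll(List<String> inputs) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String input : inputs) {
                tasks.add(() -> countNumbers(input));
            }
            List<Integer> counts = new ArrayList<>();
            // invokeAll returns the futures in the same order as the tasks.
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                counts.add(f.get());
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Counting failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countAll(List.of("a1 b22 c333", "no digits", "7 8 9 10")));
        // [3, 0, 4]
    }
}
```

Creating a Matcher is cheap compared with compiling the Pattern, so a fresh Matcher per call is usually the simplest way to stay thread-safe.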

Group Extraction and Match Metadata

After a successful match (matches(), lookingAt(), or find() returning true), the Matcher provides rich metadata about what was matched. group() or group(0) returns the entire match. group(n) returns the text captured by group n. group(name) returns the text captured by a named group. start() and end() return the start (inclusive) and end (exclusive) indices of the last match within the input string. start(n) and end(n) give the same indices for group n.

When a group is part of an alternation or optional section that did not participate in the match, group(n) returns null and start(n) and end(n) return -1. Code extracting optional groups must null-check before using the result. Unmatched text is not exposed by a dedicated method, but it can be recovered from the indices: for example, input.substring(m.end()) is everything after the current match.

appendReplacement() and appendTail() are used together to build a replacement string, giving fine control over how each match is replaced. appendReplacement() copies the text between the previous match and the current one, followed by the replacement; appendTail() copies whatever remains after the last match. Both accept a StringBuffer, and since Java 9 also a StringBuilder. This is more flexible than replaceAll() with a fixed replacement string when the replacement is computed dynamically from the match content.
Java
// ── Group metadata after find() ──────────────────────────────────────
Pattern logLine = Pattern.compile(
    "(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})" +
    "\\s+(?<level>\\w+)" +
    "\\s+\\[(?<thread>[^\\]]+)\\]" +
    "\\s+(?<logger>[\\w.]+)" +
    "\\s+:\\s+(?<message>.+)");

String line = "2024-03-15 10:30:45 INFO [main] com.myapp.OrderService : Order 42 created";
Matcher m = logLine.matcher(line);

if (m.matches()) {
    System.out.println("Timestamp: " + m.group("timestamp")); // 2024-03-15 10:30:45
    System.out.println("Level:     " + m.group("level"));     // INFO
    System.out.println("Thread:    " + m.group("thread"));    // main
    System.out.println("Logger:    " + m.group("logger"));    // com.myapp.OrderService
    System.out.println("Message:   " + m.group("message"));   // Order 42 created

    // Position metadata
    System.out.println("Message starts at: " + m.start("message"));
    System.out.println("Message ends at:   " + m.end("message"));
}

// ── Optional groups return null when not matched ──────────────────────
Pattern optional = Pattern.compile("(\\+\\d+)? (\\d{10})");
Matcher mo = optional.matcher(" 5551234567");   // no country code
if (mo.matches()) {
    String countryCode = mo.group(1);  // null — optional group not matched
    String number      = mo.group(2);  // "5551234567"
    System.out.println(countryCode != null ? countryCode : "(no code)");
}

// ── appendReplacement / appendTail — dynamic replacement ─────────────
Pattern censor = Pattern.compile("\\b(password|secret|token)\\b",
    Pattern.CASE_INSENSITIVE);
StringBuffer result = new StringBuffer();
Matcher mc = censor.matcher("My password is secret123, token=abc");

while (mc.find()) {
    String replacement = "*".repeat(mc.group().length());
    mc.appendReplacement(result, replacement);
}
mc.appendTail(result);
System.out.println(result);  // My ******** is secret123, *****=abc
// "secret123" is left alone: \b needs a word boundary after "secret", but '1' is a word character
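
The null and -1 behaviour for groups that did not participate is easiest to see with an alternation, where only one branch can capture per match. The sketch below is illustrative only; the class AlternationGroupsDemo, its TOKEN pattern, and the output format are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationGroupsDemo {
    // Group 1 captures a number, group 2 captures a lowercase word.
    // Only one of the two groups participates in any given match.
    private static final Pattern TOKEN = Pattern.compile("(\\d+)|([a-z]+)");

    static List<String> classify(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                out.add("number:" + m.group(1) + "@" + m.start(1));
            } else {
                // Group 1 did not participate: group(1) is null and start(1) is -1.
                out.add("word:" + m.group(2) + "@" + m.start(2)
                        + " (start(1)=" + m.start(1) + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(classify("abc 42"));
        // [word:abc@0 (start(1)=-1), number:42@4]
    }
}
```

Checking group(n) against null before use avoids a NullPointerException when the other branch of the alternation matched.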

Replacement and Splitting

Matcher provides replaceAll() and replaceFirst() methods that mirror the String convenience methods but use the already-compiled pattern instead of recompiling on every call: pattern.matcher(input).replaceAll(replacement). The replacement string can reference captured groups with $1, $2, or ${name} for named groups. The $ and \ characters in the replacement are special; use Matcher.quoteReplacement() to escape a literal replacement string that may contain these characters.

Pattern.split() divides an input string around matches of the pattern and returns a String array. This is equivalent to String.split() but with a compiled pattern, which is more efficient for repeated use. The optional limit parameter controls the number of splits: a positive limit caps the array size; a negative limit keeps trailing empty strings; zero (the default) discards trailing empty strings.

Java 8 added two stream-oriented methods to Pattern. splitAsStream() splits the input as a lazy stream of strings, avoiding the array allocation of split(). asPredicate() returns a Predicate<String> that tests whether the pattern matches any substring of the input, equivalent to matcher.find(). Java 11 added asMatchPredicate(), which returns a Predicate<String> that tests whether the pattern matches the entire input, equivalent to matcher.matches().
Java
// ── replaceAll() and replaceFirst() on a Matcher from a compiled Pattern ──
private static final Pattern WHITESPACE = Pattern.compile("\\s+");
private static final Pattern DIGITS     = Pattern.compile("\\d+");

// More efficient than String.replaceAll() in a loop:
public String normalise(String input) {
    return WHITESPACE.matcher(input).replaceAll(" ").strip();
}

// Computed replacement using find/appendReplacement (StringBuilder overloads need Java 9+):
public String redactNumbers(String input) {
    Matcher m = DIGITS.matcher(input);
    StringBuilder sb = new StringBuilder();
    while (m.find()) {
        m.appendReplacement(sb,
            Matcher.quoteReplacement("X".repeat(m.group().length())));
    }
    m.appendTail(sb);
    return sb.toString();
}

System.out.println(redactNumbers("Order 42 total $99.95"));
// Order XX total $XX.XX

// ── Group reference in replacement ───────────────────────────────────
Pattern dateReformat = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
String  input        = "Event: 2024-03-15";
String  reformatted  = dateReformat.matcher(input).replaceAll("$3/$2/$1");
System.out.println(reformatted);  // Event: 15/03/2024

// Named group reference in replacement
Pattern namedDate = Pattern.compile(
    "(?<y>\\d{4})-(?<m>\\d{2})-(?<d>\\d{2})");
String  result    = namedDate.matcher("2024-03-15")
    .replaceAll("${d}/${m}/${y}");
System.out.println(result);  // 15/03/2024

// ── split() on Pattern ────────────────────────────────────────────────
private static final Pattern CSV_SPLIT   = Pattern.compile(",");
private static final Pattern MULTI_SPACE = Pattern.compile("\\s+");

String csv  = "alice,bob,,carol,";
String[] parts = CSV_SPLIT.split(csv);          // ["alice","bob","","carol"]  trailing empty dropped
String[] all   = CSV_SPLIT.split(csv, -1);      // ["alice","bob","","carol",""]  keep trailing

String sentence = "  hello   world   java  ";
String[] words  = MULTI_SPACE.split(sentence.strip());  // split() is a Pattern method, not a Matcher method
// ["hello", "world", "java"]

// ── Java 8+ stream methods ────────────────────────────────────────────
// splitAsStream() — lazy, no intermediate array
Pattern.compile(",").splitAsStream("alice,bob,carol")
    .filter(s -> s.startsWith("a"))
    .forEach(System.out::println);   // alice

// asPredicate() — matches any substring
Predicate<String> containsDigit = Pattern.compile("\\d").asPredicate();
List.of("hello", "hello1", "world", "world2")
    .stream()
    .filter(containsDigit)
    .forEach(System.out::println);   // hello1, world2

// asMatchPredicate() (Java 11+) — matches entire string
Predicate<String> isAllDigits = Pattern.compile("\\d+").asMatchPredicate();
System.out.println(isAllDigits.test("12345"));  // true
System.out.println(isAllDigits.test("123ab"));  // false
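
Two points from the text above have no example yet: escaping a literal replacement that contains $, and a positive split limit. The following sketch covers both under assumed names (ReplacementAndSplitDemo, the {price} placeholder, and the helper methods are hypothetical).

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementAndSplitDemo {
    // Matches the literal placeholder text {price}.
    private static final Pattern PRICE_TAG = Pattern.compile("\\{price\\}");
    private static final Pattern COMMA = Pattern.compile(",");

    // Safe: quoteReplacement() escapes '$' and '\' so the price is inserted literally.
    static String fillPrice(String template, String price) {
        return PRICE_TAG.matcher(template).replaceAll(Matcher.quoteReplacement(price));
    }

    // Unsafe: a price such as "$5.00" is parsed as a reference to group 5.
    static String fillPriceUnescaped(String template, String price) {
        return PRICE_TAG.matcher(template).replaceAll(price);
    }

    // A positive limit caps the array length; the last element keeps the rest.
    static String[] splitAtMost(String csv, int limit) {
        return COMMA.split(csv, limit);
    }

    public static void main(String[] args) {
        System.out.println(fillPrice("Total: {price}", "$5.00"));
        // Total: $5.00

        try {
            fillPriceUnescaped("Total: {price}", "$5.00");
        } catch (IndexOutOfBoundsException e) {
            // The pattern has no capturing group 5.
            System.out.println("Rejected: " + e.getMessage());
        }

        System.out.println(Arrays.toString(splitAtMost("a,b,c,d", 2)));
        // [a, b,c,d]
    }
}
```

Whenever the replacement text comes from user input or data rather than from a hand-written template, passing it through Matcher.quoteReplacement() is the safer default.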

Performance and Common Pitfalls

Regex performance pitfalls can turn O(n) matching into catastrophic exponential backtracking. Catastrophic backtracking occurs when a poorly written pattern with nested quantifiers forces the regex engine to explore an exponential number of paths before concluding that no match exists. The classic example is a pattern like (a+)+ that must match the whole of an input like "aaaaaaaaab": the engine can try every way of splitting the a's among the iterations of the nested group before failing, taking time exponential in the length of the a sequence.

Preventing catastrophic backtracking requires understanding the input and writing specific patterns rather than general ones. Replace nested quantifiers with possessive quantifiers (a++) or atomic groups (?>a+), which, once matched, never give characters back to the engine. Alternatively, express the same constraint without nesting: instead of (a+)+ use a+, which accepts exactly the same strings.

Several other performance guidelines apply to production regex code:

Prefer specific character classes over the dot. A class such as [a-z]+ stops at the first character outside the class, whereas .+ runs to the end of the line and then has to backtrack to find where the rest of the pattern fits.

Use anchors to fail early. With find(), ^pattern fails immediately on strings that don't start correctly, without scanning the rest of the input.

Use non-capturing groups (?:pattern) instead of capturing groups when capture is not needed; this eliminates the capture bookkeeping overhead.

Pre-compile patterns as static final fields, never inside methods that are called frequently.
Java
// ── Catastrophic backtracking — dangerous pattern ───────────────────
// DANGEROUS: nested quantifiers on same character class
Pattern dangerous = Pattern.compile("(a+)+b");
// On "aaaaaaaac" (no 'b' at end): exponential backtracking
// Each 'a' can be in group 1 independently or together
// Engine tries all 2^n distributions before failing

// SAFE: accepts the same strings, but without nested quantifiers
Pattern safe = Pattern.compile("a+b");  // a single quantifier, so no exponential blow-up
// Or possessive: Pattern.compile("a++b");  // possessive quantifier

// ── Possessive quantifiers — prevent backtracking ─────────────────────
Pattern greedy    = Pattern.compile("\\w+:");    // greedy
Pattern possessive = Pattern.compile("\\w++:"); // possessive (never gives back)

// Possessive fails faster on non-matching input
// greedy might try shorter matches; possessive commits and fails immediately

// ── Atomic groups — equivalent to possessive quantifiers ─────────────
Pattern atomic = Pattern.compile("(?>\\w+):");  // atomic group

// ── Performance rules ─────────────────────────────────────────────────

// Rule 1: Compile once, reuse (avoids re-parsing the regex on every call)
// BAD:
for (String item : items) {
    if (item.matches("\\d{5}(-\\d{4})?")) { ... }  // recompiles each time
}
// GOOD:
Pattern ZIP = Pattern.compile("^\\d{5}(-\\d{4})?$");
for (String item : items) {
    if (ZIP.matcher(item).matches()) { ... }
}

// Rule 2: Use anchors to fail early
Pattern anchored    = Pattern.compile("^http://");   // fails immediately if no 'h'
Pattern unanchored  = Pattern.compile("http://");    // must scan whole string

// Rule 3: Prefer specific classes over dot
Pattern specific    = Pattern.compile("[a-z0-9]+@[a-z0-9.]+");  // precise
Pattern vague       = Pattern.compile(".+@.+");                  // over-general

// Rule 4: Non-capturing groups for grouping without capture
Pattern ncg = Pattern.compile("(?:Mr|Ms|Dr)\\.\\s+(\\w+)");
// group(1) = name; title group has no number

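// ── Timeout for untrusted input ───────────────────────────────────────
// Java does not have a built-in regex timeout.
// For untrusted input, run matching on a separate thread with Future.get(timeout).
// Note: get() also throws InterruptedException and ExecutionException,
// which the enclosing method must handle or declare.
ExecutorService exec = Executors.newSingleThreadExecutor();
Future<Boolean> future = exec.submit(() ->
    RISKY_PATTERN.matcher(untrustedInput).matches());
try {
    boolean result = future.get(100, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
    // cancel(true) only interrupts the worker thread; the regex engine does not
    // check the interrupt flag, so the match itself may keep running.
    future.cancel(true);
    throw new InputRejectedException("Input caused regex timeout");
} finally {
    exec.shutdown();  // stop accepting tasks; the worker exits once it finishes
}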
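
The timeout approach above relies on future.cancel(true), which only sets the worker thread's interrupt flag, and java.util.regex does not check that flag while matching. One common workaround is to hand the matcher a CharSequence wrapper that checks for interruption on every character read. The sketch below assumes that approach; InterruptibleCharSequence and InterruptibleMatching are hypothetical helper classes, not part of the JDK.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a CharSequence view that aborts matching
// once the current thread has been interrupted.
final class InterruptibleCharSequence implements CharSequence {
    private final CharSequence inner;

    InterruptibleCharSequence(CharSequence inner) {
        this.inner = inner;
    }

    @Override
    public char charAt(int index) {
        // The regex engine reads its input through charAt(), so this check
        // runs repeatedly while a match is in progress.
        if (Thread.currentThread().isInterrupted()) {
            throw new IllegalStateException("Regex matching interrupted");
        }
        return inner.charAt(index);
    }

    @Override
    public int length() {
        return inner.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new InterruptibleCharSequence(inner.subSequence(start, end));
    }

    @Override
    public String toString() {
        return inner.toString();
    }
}

final class InterruptibleMatching {
    // Same result as pattern.matcher(input).matches(), but the match stops
    // with an IllegalStateException if the calling thread is interrupted.
    static boolean matches(Pattern pattern, CharSequence input) {
        return pattern.matcher(new InterruptibleCharSequence(input)).matches();
    }
}
```

In the timeout snippet, the submitted task would then call RISKY_PATTERN.matcher(new InterruptibleCharSequence(untrustedInput)).matches(), so that cancel(true) actually ends a runaway match. The extra check on every character read adds some overhead, which is usually acceptable for untrusted input.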
