
Regular Expressions

Regular expressions are a domain-specific language for describing patterns in text. Java's regex support lives in the java.util.regex package (Pattern and Matcher) and is exposed through convenience methods on String (matches(), replaceAll(), split()). Regex enables validation, extraction, transformation, and splitting operations that would otherwise take many lines of procedural code. This entry covers the core Java regex syntax (character classes, quantifiers, anchors, groups, backreferences, lookaheads and lookbehinds, flags) and the practical knowledge needed to write correct, readable, and performant expressions.
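
As a quick orientation, the two entry points named above can be sketched side by side. The class name RegexEntryPoints, the helper method names, and the sample strings are illustrative assumptions, not part of any library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexEntryPoints {

    // Compiled once and reused for every call to firstNumber().
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Pattern/Matcher route: return the first run of digits, or null if none.
    static String firstNumber(String text) {
        Matcher m = DIGITS.matcher(text);
        return m.find() ? m.group() : null;
    }

    // String convenience route: collapse each run of whitespace to one space.
    static String collapseSpaces(String text) {
        return text.replaceAll("\\s+", " ");
    }

    // String convenience route: split on commas with optional surrounding spaces.
    static List<String> splitCsv(String line) {
        return Arrays.asList(line.split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("order 66 shipped"));  // 66
        System.out.println(collapseSpaces("a   b \t c"));     // a b c
        System.out.println(splitCsv("x , y,z"));              // [x, y, z]
        System.out.println("2024".matches("\\d{4}"));         // true
    }
}
```

The String methods are convenient for one-off calls; the Pattern and Matcher route gives access to find(), groups, and match positions.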

Regex Syntax — Building Blocks

A regular expression describes a set of strings. Literal characters match themselves. The dot (.) matches any character except a line terminator by default. Character classes enclosed in square brackets match any one character from the set: [aeiou] matches any lowercase vowel, [a-z] matches any lowercase ASCII letter, [^0-9] matches any character that is not a digit. Predefined character classes provide shorthand: \d matches [0-9], \w matches [a-zA-Z0-9_], \s matches whitespace, and their uppercase counterparts (\D, \W, \S) match the complement.

Quantifiers specify how many times the preceding element can match. The star (*) means zero or more. The plus (+) means one or more. The question mark (?) means zero or one (optional). Braces specify counts: {n} matches exactly n times, {n,} matches n or more times, {n,m} matches between n and m times inclusive. By default all quantifiers are greedy: they match as many characters as possible while still allowing the overall pattern to succeed. Adding ? after a quantifier makes it reluctant (lazy), matching as few characters as possible.

Anchors do not match characters; they assert positions. The caret (^) asserts the start of the string (or the start of a line in MULTILINE mode). The dollar ($) asserts the end of the string (or the end of a line in MULTILINE mode); by default it also matches just before a final line terminator. \b asserts a word boundary, the position between a word character and a non-word character. \B asserts a non-word boundary. Correct use of anchors is critical for validation: when searching with find(), an email pattern without ^ and $ anchors would match any string containing a valid email, not only strings that are entirely valid emails.
Java
// ── Character classes ────────────────────────────────────────────────
"cat".matches("[abc]at")          // true  — [abc] matches 'c'
"bat".matches("[abc]at")          // true  — [abc] matches 'b'
"hat".matches("[abc]at")          // false — 'h' not in [abc]

"Hello".matches("[A-Za-z]+")      // true  — all letters
"Hello2".matches("[A-Za-z]+")     // false — contains digit

// ── Predefined classes ────────────────────────────────────────────────
"hello123".matches("\\w+")        // true  — word chars only
"hello 123".matches("\\w+")       // false — space is not \w

"  \t\n".matches("\\s+")          // true  — all whitespace
"42".matches("\\d+")              // true  — all digits
"42abc".matches("\\d+")           // false — contains non-digits

// ── Quantifiers ───────────────────────────────────────────────────────
"colour".matches("colou?r")       // true  — 'u' is optional (?)
"color".matches("colou?r")        // true  — 'u' absent, ? allows zero

"aaa".matches("a{3}")             // true  — exactly 3
"aa".matches("a{3}")              // false — only 2
"aaaa".matches("a{2,4}")          // true  — between 2 and 4
"a".matches("a{2,4}")             // false — less than 2

// ── Anchors ───────────────────────────────────────────────────────────
// matches() implicitly anchors to full string:
"hello world".matches("hello")    // false — not the full string
"hello world".matches(".*hello.*")// true  — .* allows any prefix/suffix

// Matcher.matches() also anchors implicitly; with find(), use ^ and $ explicitly:
Pattern p = Pattern.compile("^\\d{5}$");  // exactly 5 digits, full string
p.matcher("12345").find()         // true
p.matcher("1234").find()          // false
p.matcher("123456").find()        // false (without ^ and $, find() would match "12345")
p.matcher("12 345").find()        // false

// ── Greedy vs reluctant quantifiers ──────────────────────────────────
// Input: "<a>hello</a><b>world</b>"
Pattern greedy    = Pattern.compile("<.*>");    // greedy — matches as much as possible
Pattern reluctant = Pattern.compile("<.*?>");   // reluctant — matches as little as possible

Matcher mg = greedy.matcher("<a>hello</a><b>world</b>");
mg.find();
System.out.println(mg.group());  // <a>hello</a><b>world</b> — whole string

Matcher mr = reluctant.matcher("<a>hello</a><b>world</b>");
mr.find();
System.out.println(mr.group());  // <a> — just the first tag
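
The examples above do not exercise word boundaries, negated classes, or the open-ended {n,} quantifier, so here is a minimal sketch of those three. The class name BuildingBlocksDemo and its helper methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BuildingBlocksDemo {

    // \b on both sides: "cat" only as a whole word.
    private static final Pattern WHOLE_WORD_CAT = Pattern.compile("\\bcat\\b");

    // Negated class: one or more characters that are not ASCII digits.
    private static final Pattern NON_DIGITS = Pattern.compile("[^0-9]+");

    // Open-ended count: a run of three or more 'a' characters.
    private static final Pattern THREE_OR_MORE_A = Pattern.compile("a{3,}");

    static boolean containsWordCat(String text) {
        return WHOLE_WORD_CAT.matcher(text).find();
    }

    static List<String> nonDigitRuns(String text) {
        return findAll(NON_DIGITS, text);
    }

    static List<String> runsOfThreeOrMoreA(String text) {
        return findAll(THREE_OR_MORE_A, text);
    }

    // Collects every non-overlapping match of the pattern, left to right.
    private static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(containsWordCat("the cat sat"));       // true
        System.out.println(containsWordCat("concatenate"));       // false
        System.out.println(nonDigitRuns("ab12cd345e"));           // [ab, cd, e]
        System.out.println(runsOfThreeOrMoreA("aa aaa aaaaa"));   // [aaa, aaaaa]
    }
}
```

In "concatenate" the letters c-a-t are surrounded by other word characters, so neither \b assertion holds and find() returns false.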

Groups, Backreferences, and Alternation

Parentheses create capturing groups that serve two purposes: they group part of the pattern so quantifiers can apply to the whole group, and they capture the matched text for extraction. Groups are numbered left-to-right starting at 1 based on the position of their opening parenthesis. Group 0 always refers to the entire match. Captured groups are available through Matcher.group(n) after a successful match.

Named capturing groups, written (?<name>pattern), allow extracting groups by name rather than number. Named groups make patterns self-documenting: group("year") is more readable than group(1) in a date extraction pattern, and the code is robust against reordering groups.

Non-capturing groups, written (?:pattern), group without capturing. They are useful when parentheses are needed for grouping or quantification but the captured text is not needed. Using non-capturing groups avoids the overhead of capturing and keeps group numbers aligned with the semantically meaningful captures.

Backreferences (\1, \2, etc.) match the same text captured by a group earlier in the pattern. They are used to detect repeated content: (\w+)\s+\1 matches a word that is immediately repeated (like "the the" in text). The alternation operator | acts like logical OR: cat|dog matches either "cat" or "dog".
Java
// ── Capturing groups ─────────────────────────────────────────────────
Pattern datePattern = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
Matcher m = datePattern.matcher("Born on 1990-03-15, graduated 2012-06-20");

while (m.find()) {
    System.out.printf("Full: %s  Year: %s  Month: %s  Day: %s%n",
        m.group(0),   // full match: "1990-03-15"
        m.group(1),   // group 1: "1990" (year)
        m.group(2),   // group 2: "03"   (month)
        m.group(3));  // group 3: "15"   (day)
}

// ── Named capturing groups ────────────────────────────────────────────
Pattern namedDate = Pattern.compile(
    "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");
Matcher mn = namedDate.matcher("2024-03-15");
if (mn.matches()) {
    System.out.println(mn.group("year"));   // 2024
    System.out.println(mn.group("month"));  // 03
    System.out.println(mn.group("day"));    // 15
}

// ── Non-capturing groups ──────────────────────────────────────────────
Pattern p = Pattern.compile("(?:Mr|Ms|Dr)\\.\\s+(\\w+)");
Matcher mp = p.matcher("Hello Dr. Smith and Ms. Jones");
while (mp.find()) {
    System.out.println(mp.group(1));   // Smith, Jones
    // group(1) is the name; the title matched by (?:...) is grouped but not captured
}

// ── Backreferences — detect repeated words ────────────────────────────
Pattern repeated = Pattern.compile("\\b(\\w+)\\s+\\1\\b");
Matcher mr = repeated.matcher("the the quick brown brown fox");
while (mr.find()) {
    System.out.println("Repeated: " + mr.group(1));
}
// Repeated: the
// Repeated: brown

// ── Alternation ───────────────────────────────────────────────────────
Pattern colours = Pattern.compile("red|green|blue");
Matcher mc = colours.matcher("I like red and blue");
while (mc.find()) {
    System.out.println("Found: " + mc.group());
}
// Found: red
// Found: blue

// Alternation with groups
Pattern protocol = Pattern.compile("^(https?|ftp)://(.+)$");
Matcher mp2 = protocol.matcher("https://example.com/path");
if (mp2.matches()) {
    System.out.println(mp2.group(1));  // https
    System.out.println(mp2.group(2));  // example.com/path
}
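
To make the numbering effect of non-capturing groups and the reordering benefit of named groups concrete, the following sketch compares group counts and reformats a date by group name. GroupingDemo and its method names are assumptions made for this illustration; groupCount() and the ${name} replacement syntax are part of java.util.regex.Matcher.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupingDemo {

    // Capturing version: the title is group 1, the surname is group 2.
    private static final Pattern CAPTURING =
        Pattern.compile("(Mr|Ms|Dr)\\.\\s+(\\w+)");

    // Non-capturing version: the title is grouped but not numbered,
    // so the surname becomes group 1.
    private static final Pattern NON_CAPTURING =
        Pattern.compile("(?:Mr|Ms|Dr)\\.\\s+(\\w+)");

    private static final Pattern NAMED_DATE =
        Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    static int capturingGroupCount() {
        return CAPTURING.matcher("").groupCount();
    }

    static int nonCapturingGroupCount() {
        return NON_CAPTURING.matcher("").groupCount();
    }

    // The + quantifier applies to the whole group "ab", not only to 'b'.
    static boolean isRepeatedAb(String s) {
        return s.matches("(?:ab)+");
    }

    static String surname(String text) {
        Matcher m = NON_CAPTURING.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // ${name} in the replacement refers to a named group, so the
    // replacement keeps working if the groups are reordered in the pattern.
    static String toDayFirst(String text) {
        return NAMED_DATE.matcher(text).replaceAll("${day}/${month}/${year}");
    }

    public static void main(String[] args) {
        System.out.println(capturingGroupCount());           // 2
        System.out.println(nonCapturingGroupCount());        // 1
        System.out.println(isRepeatedAb("ababab"));          // true
        System.out.println(isRepeatedAb("abb"));             // false
        System.out.println(surname("Hello Dr. Smith"));      // Smith
        System.out.println(toDayFirst("Filed 2024-06-01"));  // Filed 01/06/2024
    }
}
```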

Lookaheads, Lookbehinds, and Flags

Lookaheads and lookbehinds are zero-width assertions that test what comes before or after the current position without consuming characters. Positive lookahead (?=pattern) asserts that pattern matches immediately after the current position. Negative lookahead (?!pattern) asserts that pattern does not match immediately after. Positive lookbehind (?<=pattern) asserts that pattern matches immediately before the current position. Negative lookbehind (?<!pattern) asserts that pattern does not match immediately before.

Lookaheads and lookbehinds are powerful for context-sensitive matching: extracting a number followed by a currency symbol without including the symbol in the match, splitting on commas only outside of quoted strings, or finding words not preceded by certain prefixes.

Regex flags modify the matching behaviour. CASE_INSENSITIVE makes the pattern case-insensitive; on its own it covers only US-ASCII letters. UNICODE_CASE, combined with CASE_INSENSITIVE, extends case-insensitive matching to Unicode characters. MULTILINE makes ^ and $ match at the start and end of each line rather than only at the start and end of the whole string. DOTALL makes . match line terminators as well as other characters (useful for multi-line patterns). Flags are passed as a second argument to Pattern.compile() or written inline in the pattern with (?flags) syntax.
Java
// ── Lookaheads ───────────────────────────────────────────────────────
// Positive lookahead: match digits followed by "px"
Pattern pixelValue = Pattern.compile("\\d+(?=px)");
Matcher m = pixelValue.matcher("margin: 20px padding: 10px");
while (m.find()) {
    System.out.println(m.group());  // 20, 10 (without "px")
}

// Negative lookahead: match "colour" not followed by "ful"
Pattern colour = Pattern.compile("colour(?!ful)");
System.out.println(colour.matcher("colour scheme").find()); // true
System.out.println(colour.matcher("colourful").find());     // false ("ful" follows)

// ── Lookbehinds ──────────────────────────────────────────────────────
// Positive lookbehind: match digits preceded by "$"
Pattern price = Pattern.compile("(?<=\\$)\\d+\\.?\\d*");
Matcher mp = price.matcher("Cost: $42.99 and $15.00");
while (mp.find()) {
    System.out.println(mp.group());  // 42.99, 15.00 (without "$")
}

// Negative lookbehind: match "java" not preceded by "not "
Pattern javaRef = Pattern.compile("(?<!not )java");
Matcher mj = javaRef.matcher("I like java but not javascript");
while (mj.find()) {
    System.out.println(mj.group() + " at " + mj.start());  // java at 7
}

// ── Regex flags ───────────────────────────────────────────────────────
// Case insensitive
Pattern ci = Pattern.compile("hello", Pattern.CASE_INSENSITIVE);
System.out.println(ci.matcher("HELLO").matches());  // true
System.out.println(ci.matcher("Hello").matches());  // true

// Multiline — ^ and $ match per line
Pattern ml = Pattern.compile("^\\d+", Pattern.MULTILINE);
Matcher mm = ml.matcher("123 abc\n456 def\n789 ghi");
while (mm.find()) {
    System.out.println(mm.group());   // 123, 456, 789
}

// DOTALL — . matches newline too
Pattern ds = Pattern.compile("<p>.*?</p>", Pattern.DOTALL);
Matcher md = ds.matcher("<p>line1\nline2\nline3</p>");
System.out.println(md.find());   // true — . crossed the newline

// Inline flags in pattern string
Pattern inline = Pattern.compile("(?i)hello");  // case insensitive
Pattern combined = Pattern.compile("(?im)^start"); // multiline + case insensitive
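
Two points from the text above have no example yet: splitting on commas only outside quoted strings, and the effect of UNICODE_CASE. The sketch below is one possible way to show both. It assumes quotes in the input are balanced and never escaped, and the class and method names (LookaroundFlagsDemo and so on) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LookaroundFlagsDemo {

    // A comma counts as a separator only if an even number of double quotes
    // follows it up to the end of the input (assumes balanced, unescaped quotes).
    private static final Pattern COMMA_OUTSIDE_QUOTES =
        Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    // "caf\u00e9" is "cafe" with an acute accent on the final letter.
    private static final Pattern ASCII_ONLY_CI =
        Pattern.compile("caf\u00e9", Pattern.CASE_INSENSITIVE);

    private static final Pattern UNICODE_CI =
        Pattern.compile("caf\u00e9", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    static List<String> splitOutsideQuotes(String line) {
        return Arrays.asList(COMMA_OUTSIDE_QUOTES.split(line));
    }

    static boolean matchesAsciiCaseInsensitive(String s) {
        return ASCII_ONLY_CI.matcher(s).matches();
    }

    static boolean matchesUnicodeCaseInsensitive(String s) {
        return UNICODE_CI.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The comma inside the quoted field is not treated as a separator.
        System.out.println(splitOutsideQuotes("a,\"b,c\",d"));   // [a, "b,c", d]

        String upper = "CAF\u00c9";  // upper-case form, with upper-case E-acute
        System.out.println(matchesAsciiCaseInsensitive(upper));    // false
        System.out.println(matchesUnicodeCaseInsensitive(upper));  // true
    }
}
```

For real CSV data with escaped quotes or embedded newlines, a dedicated parser is likely a better fit than this lookahead.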

Common Validation and Extraction Patterns

Certain regex patterns appear repeatedly in Java applications. Email validation, URL parsing, IP address validation, phone number matching, date extraction, and password policy enforcement all have well-established patterns. However, regex-based email validation in particular is notoriously difficult to get perfectly right per the RFC specification. The practical approach is to use a reasonable approximation and validate further by sending a confirmation email.

The most important performance consideration for regex is compile-once-reuse-many. Pattern.compile() is expensive: it parses the regex and compiles it to an internal NFA representation. Calling String.matches() or String.replaceAll() with a regex string compiles the pattern on every call. For patterns used in loops or request handlers, compile the pattern once as a static final field and reuse the compiled Pattern object. This avoids the repeated compilation cost and can be markedly faster for high-frequency usage.
Java
// ── Compile once — reuse many ─────────────────────────────────────────
// WRONG — compiles on every call:
public boolean isValidEmail(String email) {
    return email.matches("[\\w.+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}");
}

// CORRECT — compile once as static final:
private static final Pattern EMAIL_PATTERN = Pattern.compile(
    "^[\\w.+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}$");

public boolean isValidEmail(String email) {
    return EMAIL_PATTERN.matcher(email).matches();
}

// ── Common validation patterns ────────────────────────────────────────
// Email (reasonable approximation)
Pattern EMAIL = Pattern.compile(
    "^[\\w.+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}$");

// URL
Pattern URL = Pattern.compile(
    "^(https?|ftp)://[^\\s/$.?#].[^\\s]*$",
    Pattern.CASE_INSENSITIVE);

// IPv4 address
Pattern IPV4 = Pattern.compile(
    "^((25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\.){3}" +
    "(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)$");

// Postal code (US ZIP)
Pattern ZIP = Pattern.compile("^\\d{5}(-\\d{4})?$");

// Password: 8+ chars, at least one upper, lower, digit, special
Pattern PASSWORD = Pattern.compile(
    "^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d)(?=.*[@#$!%*?&])[A-Za-z\\d@#$!%*?&]{8,}$");

// ISO date: YYYY-MM-DD
Pattern ISO_DATE = Pattern.compile(
    "^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$");

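// ── Extraction pattern ─────────────────────────────────────────────────
// The anchored EMAIL pattern only matches when the whole input is one address,
// so extraction with find() needs an unanchored variant:
private static final Pattern EMAIL_IN_TEXT = Pattern.compile(
    "[\\w.+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}");

public static List<String> extractAllEmails(String text) {
    List<String> emails = new ArrayList<>();
    Matcher m = EMAIL_IN_TEXT.matcher(text);   // reuse compiled pattern
    while (m.find()) emails.add(m.group());
    return emails;
}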

// ── Replacing with groups ─────────────────────────────────────────────
// Reformat date: YYYY-MM-DD → DD/MM/YYYY
String input    = "Report date: 2024-03-15, Filed: 2024-06-01";
String reformatted = input.replaceAll(
    "(\\d{4})-(\\d{2})-(\\d{2})",
    "$3/$2/$1");   // $n references captured groups in replacement
System.out.println(reformatted);
// Report date: 15/03/2024, Filed: 01/06/2024
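
As a rough check of how the validation patterns above behave, the sketch below wraps four of them in a small class and runs a few inputs through each. The class name ValidationDemo and the sample inputs are illustrative assumptions; the regex strings are copied from the listing above.

```java
import java.util.regex.Pattern;

public class ValidationDemo {

    private static final Pattern IPV4 = Pattern.compile(
        "^((25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\.){3}" +
        "(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)$");

    private static final Pattern ZIP = Pattern.compile("^\\d{5}(-\\d{4})?$");

    private static final Pattern PASSWORD = Pattern.compile(
        "^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d)(?=.*[@#$!%*?&])[A-Za-z\\d@#$!%*?&]{8,}$");

    private static final Pattern ISO_DATE = Pattern.compile(
        "^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$");

    static boolean isIpv4(String s)     { return IPV4.matcher(s).matches(); }
    static boolean isZip(String s)      { return ZIP.matcher(s).matches(); }
    static boolean isPassword(String s) { return PASSWORD.matcher(s).matches(); }
    static boolean isIsoDate(String s)  { return ISO_DATE.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isIpv4("192.168.0.1"));   // true
        System.out.println(isIpv4("256.1.1.1"));     // false: 256 is out of range
        System.out.println(isZip("12345-6789"));     // true
        System.out.println(isZip("1234"));           // false: too short
        System.out.println(isPassword("Passw0rd!")); // true
        System.out.println(isPassword("Passw0rd"));  // false: no special character
        System.out.println(isIsoDate("2024-13-01")); // false: month 13
        // The pattern checks shape and ranges only, not the calendar:
        System.out.println(isIsoDate("2024-02-31")); // true
    }
}
```

The last line shows the limit of shape-only validation: the date pattern accepts 31 February, so calendar correctness needs a separate check.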
