
Unicode Handling

Java was designed from the ground up for Unicode. The char type holds a UTF-16 code unit, String stores characters in UTF-16 encoding, and the entire I/O system supports character encoding conversion. However, Unicode's growth beyond the original 65,536-character Basic Multilingual Plane means that char is no longer sufficient to represent every Unicode character — supplementary characters require two chars (a surrogate pair). Understanding this distinction, knowing how to correctly process Unicode text, handling character encodings, and working with Unicode-aware string operations are essential for any application that handles international text. This entry covers the Unicode standard essentials, Java's char vs code point model, encoding handling, normalisation, and correct Unicode-aware string operations.

Unicode Fundamentals — Code Points and Planes

Unicode assigns a unique integer, called a code point, to every character in every writing system, as well as to emoji, symbols, and control codes. Code points are written in hexadecimal with a U+ prefix, using four digits within the Basic Multilingual Plane (U+0041) and five or six digits above it (U+1F600, U+10FFFF). The full range, U+0000 to U+10FFFF, spans 1,114,112 code points across 17 planes of 65,536 code points each. The Basic Multilingual Plane (BMP) is Plane 0, containing code points U+0000 to U+FFFF. It includes the Latin alphabet, Greek, Cyrillic, Arabic, Hebrew, CJK Unified Ideographs, and thousands more characters used by most of the world's writing systems. Every BMP character fits in a single Java char (a 16-bit UTF-16 code unit). Planes 1 through 16 are the supplementary planes, containing code points U+10000 to U+10FFFF. They include historic scripts, musical notation, mathematical symbols, emoji, and CJK extension characters. Java represents each supplementary character with two char values called a surrogate pair. The first char is a high surrogate (U+D800 to U+DBFF) and the second is a low surrogate (U+DC00 to U+DFFF). This means a Java String containing one emoji may have length() == 2 even though it displays as one character. The distinction between a char (a UTF-16 code unit) and a code point (a Unicode character) is the foundation of correct Unicode processing in Java. Code that treats each char as a character breaks on supplementary characters; code that works with code points handles every Unicode character.
Java
// ── Code point vs char ───────────────────────────────────────────────
// BMP character — one code point, one char
char latinA = 'A';              // U+0041 — fits in char
System.out.println((int) 'A'); // 65 = 0x0041

// Supplementary character — one code point, TWO chars (surrogate pair)
String emoji = "😀";            // U+1F600 GRINNING FACE
System.out.println(emoji.length());               // 2 — two chars!
System.out.println(emoji.codePointCount(0, emoji.length())); // 1 — one character

// ── The surrogate pair that represents U+1F600 ────────────────────────
int codePoint = 0x1F600;                              // 128512 decimal
char high = Character.highSurrogate(codePoint);       // '\uD83D' (0xD83D)
char low  = Character.lowSurrogate(codePoint);         // '\uDE00' (0xDE00)

System.out.printf("High surrogate: U+%04X%n", (int) high);  // D83D
System.out.printf("Low surrogate:  U+%04X%n", (int) low);   // DE00

// Reconstruct code point from surrogates
int reconstructed = Character.toCodePoint(high, low);
System.out.println(reconstructed == codePoint);   // true

// ── String with mixed BMP and supplementary chars ─────────────────────
String mixed = "Hi 🌍";   // 3 BMP chars + 1 supplementary (Earth emoji)
System.out.println(mixed.length());               // 5: 3 BMP chars + 2 surrogates
System.out.println(mixed.codePointCount(0, mixed.length())); // 4: 4 characters

// ── Unicode planes summary ────────────────────────────────────────────
// Plane 0  (BMP)       U+0000   – U+FFFF    Latin, Greek, CJK, etc.
// Plane 1  (SMP)       U+10000  – U+1FFFF   Emoji, historic scripts, music
// Plane 2  (SIP)       U+20000  – U+2FFFF   CJK extensions
// Planes 3-13          (mostly unassigned)
// Plane 14 (SSP)       U+E0000  – U+EFFFF   Tags
// Planes 15-16         (private use areas)
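
The surrogate values printed above follow from fixed UTF-16 arithmetic: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The following is a minimal sketch of that calculation, checked against the Character methods. The class name SurrogateMath and its helpers highOf and lowOf are illustrative names, not part of the JDK, and the helpers assume a valid supplementary code point.

```java
class SurrogateMath {
    // Illustrative helper: high (leading) surrogate for a supplementary code point.
    // Assumes 0x10000 <= codePoint <= 0x10FFFF; no validation is done here.
    public static char highOf(int codePoint) {
        int offset = codePoint - 0x10000;          // leaves a 20-bit value
        return (char) (0xD800 + (offset >> 10));   // top 10 bits
    }

    // Illustrative helper: low (trailing) surrogate for the same code point.
    public static char lowOf(int codePoint) {
        int offset = codePoint - 0x10000;
        return (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // GRINNING FACE
        System.out.printf("%04X %04X%n", (int) highOf(cp), (int) lowOf(cp)); // D83D DE00

        // The hand-written arithmetic agrees with the JDK methods:
        System.out.println(highOf(cp) == Character.highSurrogate(cp)); // true
        System.out.println(lowOf(cp) == Character.lowSurrogate(cp));   // true

        // Only supplementary code points need a pair:
        System.out.println(Character.charCount(0x41)); // 1
        System.out.println(Character.charCount(cp));   // 2
    }
}
```

In application code the Character methods are the better choice; the manual version is only meant to show where the D800 and DC00 ranges come from.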

Code Point API — Processing Unicode Correctly

Java provides code point-aware methods alongside the older char-based methods. The code point API treats supplementary characters as single units, which is the correct level of abstraction for most string processing tasks. Using the char-based API on strings that may contain emoji or supplementary characters produces subtle bugs: counting length() overcounts, charAt() may return half a surrogate pair, and string manipulation may split surrogate pairs. The codePoints() stream method returns an IntStream of Unicode code points — each supplementary character appears as a single int value (its code point), not as two separate values. This is the correct way to iterate a string's characters. The codePointAt(index) and codePointBefore(index) methods return the full code point at or before the given index, correctly handling surrogates. The offsetByCodePoints(index, count) method advances count code points from the given index, correctly skipping over surrogate pairs. The Character class provides static utility methods for code point properties: Character.isLetter(codePoint), Character.isDigit(codePoint), Character.isWhitespace(codePoint), Character.toUpperCase(codePoint), Character.toLowerCase(codePoint), and many more. These methods accept int code points and work correctly for all Unicode characters including supplementary ones. The char-based overloads (Character.isLetter(char)) are limited to BMP characters.
Java
// ── Code point iteration — correct for all Unicode ───────────────────
String text = "Hello 😀 World 🌍";

// WRONG — char-based iteration splits emoji:
for (int i = 0; i < text.length(); i++) {
    char c = text.charAt(i);
    // c may be half a surrogate pair for emoji
    System.out.print(c + " ");  // garbage for emoji positions
}

// CORRECT — code point iteration:
text.codePoints().forEach(cp -> {
    System.out.printf("U+%04X (%s)  ",
        cp, new String(Character.toChars(cp)));
});

// ── codePointCount vs length ──────────────────────────────────────────
String withEmoji = "Hello 😀!";
System.out.println(withEmoji.length());                               // 9
System.out.println(withEmoji.codePointCount(0, withEmoji.length())); // 8

// ── Character class code point methods ───────────────────────────────
int cpA     = 'A';
int cpAlpha = 0x03B1;    // α (Greek small letter alpha)
int cpEmoji = 0x1F600;   // 😀
int cpCJK   = 0x4E2D;    // 中 (Chinese character for "middle")

System.out.println(Character.isLetter(cpA));       // true
System.out.println(Character.isLetter(cpAlpha));   // true
System.out.println(Character.isLetter(cpCJK));     // true
System.out.println(Character.isLetter(cpEmoji));   // false

System.out.println(Character.isEmoji(cpEmoji));    // true (Java 21+)

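// Case conversion by code point works for all of Unicode.
// toLowerCase(int) returns an int, so convert it back to text before printing:
System.out.println(new String(Character.toChars(
    Character.toLowerCase(0x03A3))));  // σ (Σ → σ)
// String-level case conversion handles one-to-many mappings such as 'ß' → "SS":
System.out.println("straße".toUpperCase(Locale.GERMANY)); // STRASSE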

// ── Reverse a string correctly ────────────────────────────────────────
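// StringBuilder.reverse() is surrogate-aware: it treats each surrogate pair
// as one unit, so supplementary characters survive the reversal.
public static String reverseChars(String s) {
    return new StringBuilder(s).reverse().toString();  // pairs stay intact
}

// Equivalent manual reversal via code points. Neither version keeps a base
// character and its combining marks together (e + U+0301 comes out reversed).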
public static String reverseByCodePoints(String s) {
    int[] codePoints = s.codePoints().toArray();
    // Reverse the code point array
    for (int i = 0, j = codePoints.length - 1; i < j; i++, j--) {
        int tmp = codePoints[i];
        codePoints[i] = codePoints[j];
        codePoints[j] = tmp;
    }
    return new String(codePoints, 0, codePoints.length);
}

System.out.println(reverseByCodePoints("Hello 😀"));  // 😀 olleH
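
The codePointAt, codePointBefore, and offsetByCodePoints methods described above are most useful when cutting a string at a position counted in characters rather than chars. The sketch below shows one possible use, truncating to a maximum number of code points without splitting a surrogate pair. CodePointTruncate and truncate are hypothetical names chosen for this example, and the method assumes a non-negative limit.

```java
class CodePointTruncate {
    // Returns at most maxCodePoints code points from the start of s.
    // Assumes maxCodePoints >= 0; a negative value makes offsetByCodePoints throw.
    public static String truncate(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s; // already short enough
        }
        // char index reached after stepping over maxCodePoints code points
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // "Hi " + EARTH GLOBE (U+1F30D) + "!": 6 chars, 5 code points
        String s = "Hi " + new String(Character.toChars(0x1F30D)) + "!";

        // A naive cut at char index 4 leaves a lone high surrogate at the end:
        String naive = s.substring(0, 4);
        System.out.println(Character.isHighSurrogate(naive.charAt(3))); // true

        // codePointAt / codePointBefore read the whole pair:
        System.out.printf("U+%04X%n", s.codePointAt(3));     // U+1F30D
        System.out.printf("U+%04X%n", s.codePointBefore(5)); // U+1F30D

        // Code point-aware truncation keeps the pair together:
        String safe = truncate(s, 4);
        System.out.println(safe.length());                        // 5
        System.out.println(safe.codePointCount(0, safe.length())); // 4
    }
}
```

This keeps surrogate pairs intact but, like the reversal example, it does not treat a base character plus combining marks as one unit.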

Character Encodings — Charset and I/O

A character encoding maps between Unicode code points and bytes. UTF-8 is the dominant encoding: it uses 1 byte for ASCII characters, 2 or 3 bytes for the rest of the BMP, and 4 bytes for supplementary characters. UTF-16 uses 2 bytes for BMP characters and 4 bytes for supplementary characters, and it is the encoding that Java's String and char APIs expose. UTF-32 uses 4 bytes for every character. Legacy encodings like Latin-1 (ISO-8859-1), Windows-1252, and Shift-JIS can only represent a subset of Unicode and must be handled carefully. Java's java.nio.charset.StandardCharsets class provides constants for the six charsets that every Java platform implementation must support: UTF_8, UTF_16, UTF_16BE, UTF_16LE, US_ASCII, and ISO_8859_1. Prefer these constants over string names like "UTF-8": the Charset overloads cannot throw the checked UnsupportedEncodingException, and a misspelled name becomes a compile error rather than a runtime failure. The most common source of garbled text in Java applications is encoding mismatch: reading a file or HTTP response in the wrong encoding. The rule is to specify the encoding explicitly wherever bytes are converted to or from characters, including every InputStreamReader, OutputStreamWriter, getBytes() call, and String constructor that takes bytes. Before Java 18 the platform default (Charset.defaultCharset()) varied with the operating system and locale, so code relying on it could work on the developer's machine and fail on a server; Java 18 and later default to UTF-8, but an explicit charset keeps behaviour identical on every runtime.
Java
// ── StandardCharsets — always use these constants ────────────────────
import java.nio.charset.StandardCharsets;

byte[] utf8Bytes  = "Hello 世界".getBytes(StandardCharsets.UTF_8);
byte[] latin1Bytes = "Hello".getBytes(StandardCharsets.ISO_8859_1);

String fromUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(fromUtf8);   // Hello 世界

// ── Encoding mismatch — common source of garbled text ─────────────────
String original  = "Héllo Wörld";
byte[] wrongBytes = original.getBytes(StandardCharsets.UTF_8);
String garbled    = new String(wrongBytes, StandardCharsets.ISO_8859_1);
System.out.println(garbled);   // HÃ©llo WÃ¶rld: garbled!

String correct    = new String(wrongBytes, StandardCharsets.UTF_8);
System.out.println(correct);   // Héllo Wörld — correct

// ── Always specify encoding for I/O ──────────────────────────────────
// WRONG — platform default encoding (varies by OS and JVM):
BufferedReader badReader = new BufferedReader(
    new FileReader("data.txt"));

// CORRECT — explicit UTF-8:
BufferedReader goodReader = new BufferedReader(
    new InputStreamReader(
        new FileInputStream("data.txt"),
        StandardCharsets.UTF_8));

// Or Java 11+ Files API (always specify charset):
String content = Files.readString(Path.of("data.txt"),
    StandardCharsets.UTF_8);
Files.writeString(Path.of("out.txt"), content,
    StandardCharsets.UTF_8);

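// ── Using a declared charset: Charset.forName() with fallback ─────────
// (response is assumed to be an HTTP response object whose getContentType()
//  returns a non-null value such as "text/html; charset=utf-8")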
String charsetName = response.getContentType()
    .replaceAll(".*;\\s*charset\\s*=\\s*", "")
    .trim();
Charset charset;
try {
    charset = Charset.forName(charsetName);
} catch (IllegalArgumentException e) {
    charset = StandardCharsets.UTF_8;  // default to UTF-8 on unknown
}

// ── UTF-8 BOM handling ────────────────────────────────────────────────
// Some UTF-8 files start with a BOM (U+FEFF, bytes EF BB BF)
// Java does not strip BOM automatically
byte[] withBom = Files.readAllBytes(Path.of("with-bom.txt"));
String text = new String(withBom, StandardCharsets.UTF_8);
if (text.startsWith("\uFEFF")) {
    text = text.substring(1);   // strip BOM manually
}
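
The byte counts quoted for UTF-8 and UTF-16 can be checked directly by encoding one sample character from each range. This is a small sketch; EncodedLengths and byteLength are illustrative names, and UTF_16BE is used instead of UTF_16 because encoding with UTF_16 prepends a 2-byte byte-order mark that would inflate every count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    // Number of bytes needed to encode a single code point in the given charset.
    public static int byteLength(int codePoint, Charset charset) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // ASCII letter, Latin-1 letter, CJK ideograph, emoji
        int[] samples = {0x0041, 0x00E9, 0x4E2D, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X  UTF-8: %d  UTF-16: %d%n",
                cp,
                byteLength(cp, StandardCharsets.UTF_8),
                byteLength(cp, StandardCharsets.UTF_16BE));
        }
        // Prints:
        // U+0041  UTF-8: 1  UTF-16: 2
        // U+00E9  UTF-8: 2  UTF-16: 2
        // U+4E2D  UTF-8: 3  UTF-16: 2
        // U+1F600  UTF-8: 4  UTF-16: 4
    }
}
```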

Unicode Normalisation

Unicode normalisation addresses the fact that some characters can be represented by more than one code point sequence. The letter é can be a single precomposed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE) or a decomposed sequence of two (U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT). Both render identically, but they are different code point sequences and compare as unequal without normalisation. Java provides java.text.Normalizer.normalize() with the four normalisation forms from the Unicode standard. NFC (Canonical Decomposition followed by Canonical Composition) is the most common: it decomposes characters and then recomposes them into precomposed form wherever one exists. NFD (Canonical Decomposition) decomposes but does not recompose. NFKC and NFKD are the compatibility forms: in addition to canonical normalisation, they replace compatibility characters (such as ligatures and fullwidth forms) with their plain equivalents, which can discard formatting distinctions. The practical implication: any application that compares user-supplied strings against stored strings must normalise both to the same form before comparing. Search functionality, username deduplication, and password hashing are all affected. Failing to normalise can produce security vulnerabilities (two usernames that look identical but are stored as distinct accounts) and data quality problems.
Java
// ── The normalisation problem ─────────────────────────────────────────
import java.text.Normalizer;

// é as precomposed single character (U+00E9)
String precomposed  = "\u00E9";  // é — NFC form

// é as base e + combining acute (U+0065 + U+0301)
String decomposed   = "\u0065\u0301";  // e + ́  — NFD form

System.out.println(precomposed);           // é — looks the same
System.out.println(decomposed);            // é — looks the same
System.out.println(precomposed.length());  // 1 — one code unit
System.out.println(decomposed.length());   // 2 — base + combining

System.out.println(precomposed.equals(decomposed));  // false: different code point sequences

// ── Normalise before comparing ────────────────────────────────────────
String nfc1 = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
String nfc2 = Normalizer.normalize(decomposed,  Normalizer.Form.NFC);

System.out.println(nfc1.equals(nfc2));   // true: both are now NFC

// ── The four normalisation forms ──────────────────────────────────────
String original = "\u00e9\ufb01";   // é + ﬁ (U+FB01, the fi ligature)

String nfc  = Normalizer.normalize(original, Normalizer.Form.NFC);
String nfd  = Normalizer.normalize(original, Normalizer.Form.NFD);
String nfkc = Normalizer.normalize(original, Normalizer.Form.NFKC);
String nfkd = Normalizer.normalize(original, Normalizer.Form.NFKD);

System.out.println(nfc.length());   // 2: é (precomposed) + fi (ligature preserved)
System.out.println(nfd.length());   // 3: e + ́  + fi (e decomposed, ligature preserved)
System.out.println(nfkc.length());  // 3: é (composed) + f + i (ligature decomposed)
System.out.println(nfkd.length());  // 4: e + ́  + f + i (both decomposed)

// ── Username normalisation for deduplication ──────────────────────────
public static String normaliseUsername(String username) {
    // NFC: canonical composition (handles accented chars)
    // toLowerCase(Locale.ROOT): locale-independent lowercasing (approximates case folding)
    // strip: remove leading/trailing whitespace
    return Normalizer
        .normalize(username, Normalizer.Form.NFC)
        .toLowerCase(Locale.ROOT)
        .strip();
}

System.out.println(
    normaliseUsername("Al\u0069\u0301ce").equals(
    normaliseUsername("Al\u00EDce")));   // true — same user
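
The compatibility forms matter for input such as fullwidth letters, which NFC leaves untouched. The sketch below contrasts NFC and NFKC on fullwidth and ligature input, and uses Normalizer.isNormalized to skip work for strings that are already in the target form. CompatibilityForms and toNfkc are illustrative names, not JDK APIs.

```java
import java.text.Normalizer;

class CompatibilityForms {
    // Returns s in NFKC, reusing the same instance when it is already normalised.
    public static String toNfkc(String s) {
        if (Normalizer.isNormalized(s, Normalizer.Form.NFKC)) {
            return s;
        }
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // FULLWIDTH LATIN CAPITAL LETTERS A, B, C (U+FF21 to U+FF23)
        String fullwidth = "\uFF21\uFF22\uFF23";

        // NFC only applies canonical mappings, so fullwidth letters are kept:
        String nfc = Normalizer.normalize(fullwidth, Normalizer.Form.NFC);
        System.out.println(nfc.equals("ABC"));               // false

        // NFKC also applies compatibility mappings:
        System.out.println(toNfkc(fullwidth).equals("ABC")); // true

        // The fi ligature (U+FB01) expands to two letters under NFKC:
        System.out.println(toNfkc("\uFB01le"));              // file
    }
}
```

Whether NFKC is appropriate depends on the application: it suits identifiers and search keys, but it changes the text and is usually not wanted for content that should be stored as the user typed it.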
