Unicode Handling

Java was designed from the ground up for Unicode. The char type holds a UTF-16 code unit, the String API exposes text as a sequence of UTF-16 code units, and the I/O libraries convert between characters and bytes in any supported encoding. However, Unicode has grown beyond the original 65,536-character Basic Multilingual Plane, so a single char can no longer represent every Unicode character: supplementary characters require two chars (a surrogate pair). Any application that handles international text needs to understand this distinction, process text by code point where it matters, declare character encodings explicitly, and normalise strings before comparing them. This entry covers the essentials of the Unicode standard, Java's char versus code point model, encoding handling, normalisation, and correct Unicode-aware string operations.

Unicode Fundamentals — Code Points and Planes

Unicode assigns a unique integer, called a code point, to every character in every writing system, as well as to emoji, symbols, and control codes. Code points are written as U+ followed by four to six hexadecimal digits, from U+0000 to U+10FFFF. The full range spans 1,114,112 code points across 17 planes of 65,536 code points each.

The Basic Multilingual Plane (BMP) is Plane 0, containing code points U+0000 to U+FFFF. It includes the Latin alphabet, Greek, Cyrillic, Arabic, Hebrew, CJK Unified Ideographs, and thousands more characters used by most of the world's writing systems. Every BMP character fits in a single Java char (one 16-bit UTF-16 code unit). Planes 1 through 16 are the supplementary planes, containing code points U+10000 to U+10FFFF. They include historic scripts, musical notation, mathematical alphanumeric symbols, emoji, and CJK extension characters.

Java represents each supplementary character with two char values called a surrogate pair. The first char is a high surrogate (U+D800 to U+DBFF) and the second is a low surrogate (U+DC00 to U+DFFF). As a result, a Java String containing one emoji may have length() == 2 even though it displays as one character. The distinction between a char (a UTF-16 code unit) and a code point (a Unicode character) is the foundation of correct Unicode processing in Java. Code that assumes one char is one character breaks on supplementary characters; code that works with code points handles every Unicode character.

Java

// ── Code point vs char ───────────────────────────────────────────────
// BMP character — one code point, one char
char latinA = 'A';              // U+0041 — fits in char
System.out.println((int) 'A'); // 65 = 0x0041

// Supplementary character — one code point, TWO chars (surrogate pair)
String emoji = "😀";            // U+1F600 GRINNING FACE
System.out.println(emoji.length());               // 2 — two chars!
System.out.println(emoji.codePointCount(0, emoji.length())); // 1 — one character

// ── The surrogate pair that represents U+1F600 ────────────────────────
int codePoint = 0x1F600;                              // 128512 decimal
char high = Character.highSurrogate(codePoint);       // '\uD83D' (0xD83D)
char low  = Character.lowSurrogate(codePoint);         // '\uDE00' (0xDE00)

System.out.printf("High surrogate: U+%04X%n", (int) high);  // D83D
System.out.printf("Low surrogate:  U+%04X%n", (int) low);   // DE00

// Reconstruct code point from surrogates
int reconstructed = Character.toCodePoint(high, low);
System.out.println(reconstructed == codePoint);   // true

// ── String with mixed BMP and supplementary chars ─────────────────────
String mixed = "Hi 🌍";   // 3 BMP chars + 1 supplementary (Earth emoji)
System.out.println(mixed.length());               // 5 — 3 chars + 2 surrogates
System.out.println(mixed.codePointCount(0, mixed.length())); // 4 — 4 characters

// ── Unicode planes summary ────────────────────────────────────────────
// Plane 0  (BMP)       U+0000   – U+FFFF    Latin, Greek, CJK, etc.
// Plane 1  (SMP)       U+10000  – U+1FFFF   Emoji, historic scripts, music
// Plane 2  (SIP)       U+20000  – U+2FFFF   CJK extensions
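// Plane 3  (TIP)       U+30000  – U+3FFFF   Further CJK extensions
// Planes 4-13          U+40000  – U+DFFFF   (unassigned)
// Plane 14 (SSP)       U+E0000  – U+EFFFF   Tags, variation selectors
// Planes 15-16         U+F0000  – U+10FFFF  Private use areas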
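
The surrogate ranges described above follow from simple arithmetic defined by UTF-16: subtract 0x10000 from the code point, put the top 10 bits of the result in the high surrogate and the bottom 10 bits in the low surrogate. The sketch below reproduces that arithmetic by hand and compares it with the JDK's own methods. The class and method names (SurrogateMath, high, low, plane) are illustrative only, and the helpers assume the input is a supplementary code point.

```java
final class SurrogateMath {
    // Illustrative helper: high (leading) surrogate of a supplementary code point.
    static char high(int codePoint) {
        int offset = codePoint - 0x10000;           // 20-bit value, 0x00000..0xFFFFF
        return (char) (0xD800 + (offset >>> 10));   // top 10 bits
    }

    // Illustrative helper: low (trailing) surrogate of a supplementary code point.
    static char low(int codePoint) {
        int offset = codePoint - 0x10000;
        return (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
    }

    // Each plane holds 0x10000 code points, so the plane is the code point >>> 16.
    static int plane(int codePoint) {
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // GRINNING FACE
        System.out.printf("%04X %04X%n", (int) high(cp), (int) low(cp)); // D83D DE00
        System.out.println(high(cp) == Character.highSurrogate(cp));     // true
        System.out.println(low(cp) == Character.lowSurrogate(cp));       // true
        System.out.println(plane(cp));                                   // 1
        System.out.println(plane('A'));                                  // 0
    }
}
```

In application code the Character methods are the better choice; the manual version is only meant to show where the D800 and DC00 ranges come from.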

Code Point API — Processing Unicode Correctly

Java provides code point-aware methods alongside the older char-based methods. The code point API treats each supplementary character as a single unit, which is the right level of abstraction for most string processing. Applying the char-based API to strings that may contain emoji or other supplementary characters produces subtle bugs: length() overcounts, charAt() may return half a surrogate pair, and index-based manipulation such as substring() may split a pair.

The codePoints() method returns an IntStream of Unicode code points in which each supplementary character appears as a single int value rather than as two surrogates. This is the simplest correct way to iterate over a string's characters. codePointAt(index) returns the full code point that starts at the given index, and codePointBefore(index) returns the one that ends just before it; both combine surrogates when a complete pair is present. offsetByCodePoints(index, count) returns the char index that lies count code points away from index, stepping over each surrogate pair as one unit.

The Character class provides static utility methods for code point properties: Character.isLetter(codePoint), Character.isDigit(codePoint), Character.isWhitespace(codePoint), Character.toUpperCase(codePoint), Character.toLowerCase(codePoint), and many more. These methods accept int code points and work for all Unicode characters, including supplementary ones. The char-based overloads, such as Character.isLetter(char), can only classify BMP characters.

Java

// ── Code point iteration — correct for all Unicode ───────────────────
String text = "Hello 😀 World 🌍";

// WRONG — char-based iteration splits emoji:
for (int i = 0; i < text.length(); i++) {
    char c = text.charAt(i);
    // c may be half a surrogate pair for emoji
    System.out.print(c + " ");  // garbage for emoji positions
}

// CORRECT — code point iteration:
text.codePoints().forEach(cp -> {
    System.out.printf("U+%04X (%s)  ",
        cp, new String(Character.toChars(cp)));
});

// ── codePointCount vs length ──────────────────────────────────────────
String withEmoji = "Hello 😀!";
System.out.println(withEmoji.length());                               // 9
System.out.println(withEmoji.codePointCount(0, withEmoji.length())); // 8

// ── Character class code point methods ───────────────────────────────
int cpA     = 'A';
int cpAlpha = 0x03B1;    // α (Greek small letter alpha)
int cpEmoji = 0x1F600;   // 😀
int cpCJK   = 0x4E2D;    // 中 (Chinese character for "middle")

System.out.println(Character.isLetter(cpA));       // true
System.out.println(Character.isLetter(cpAlpha));   // true
System.out.println(Character.isLetter(cpCJK));     // true
System.out.println(Character.isLetter(cpEmoji));   // false

System.out.println(Character.isEmoji(cpEmoji));    // true (Java 21+)

// Case conversion for ALL Unicode. Character.toLowerCase(int) returns an int,
// so convert it back to a String before printing (Character.toString(int) is Java 11+):
System.out.println(Character.toString(Character.toLowerCase(0x03A3)));  // σ (Σ → σ)
// String-level case conversion handles multi-char cases like 'ß' → "SS":
System.out.println("straße".toUpperCase(Locale.GERMANY)); // STRASSE

// ── Reverse a string correctly ────────────────────────────────────────
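// StringBuilder.reverse() has treated surrogate pairs as single characters
// since Java 1.5, so this is safe for supplementary characters:
public static String reverseChars(String s) {
    return new StringBuilder(s).reverse().toString();
}

// Equivalent manual reversal via code points. Neither version keeps a
// combining mark attached to its base character (e + U+0301 is split up).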
public static String reverseByCodePoints(String s) {
    int[] codePoints = s.codePoints().toArray();
    // Reverse the code point array
    for (int i = 0, j = codePoints.length - 1; i < j; i++, j--) {
        int tmp = codePoints[i];
        codePoints[i] = codePoints[j];
        codePoints[j] = tmp;
    }
    return new String(codePoints, 0, codePoints.length);
}

System.out.println(reverseByCodePoints("Hello 😀"));  // 😀 olleH
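
The navigation methods mentioned above (codePointAt, codePointBefore, offsetByCodePoints) do not appear in the examples so far. The following sketch shows them on a three-character string and uses offsetByCodePoints to truncate text without cutting a surrogate pair in half. CodePointNav and truncate are hypothetical names chosen for this illustration, and truncate assumes a non-negative limit.

```java
final class CodePointNav {
    // Illustrative helper: keep at most maxCodePoints characters,
    // never ending in the middle of a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        // char index reached after stepping over maxCodePoints code points
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (as the surrogate pair D83D DE00), 'b'
        String s = "a\uD83D\uDE00b";

        System.out.println(s.codePointAt(1) == 0x1F600);     // true: pair starting at index 1
        System.out.println(s.codePointBefore(3) == 0x1F600); // true: pair ending before index 3
        System.out.println(s.offsetByCodePoints(0, 2));      // 3: two code points span three chars

        String safe = truncate(s, 2);                        // 'a' plus the whole emoji
        String unsafe = s.substring(0, 2);                   // 'a' plus a lone high surrogate
        System.out.println(safe.length());                   // 3
        System.out.println(Character.isHighSurrogate(
            unsafe.charAt(unsafe.length() - 1)));            // true: the pair was split
    }
}
```

Truncating by code point still can split a base letter from a following combining mark; it only guarantees that surrogate pairs stay whole.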

Character Encodings — Charset and I/O

A character encoding maps between Unicode code points and bytes. UTF-8 is the dominant encoding: it uses 1 byte for ASCII characters, 2 or 3 bytes for the rest of the BMP, and 4 bytes for supplementary characters. UTF-16 uses 2 bytes for BMP characters and 4 bytes for supplementary characters; it is the encoding that String and char expose through their APIs. UTF-32 uses 4 bytes for every character. Legacy encodings such as Latin-1 (ISO-8859-1), Windows-1252, and Shift-JIS can represent only a subset of Unicode, so converting to them can lose characters.

Java's java.nio.charset.StandardCharsets class provides constants for the six charsets that every Java platform is required to support: UTF_8, UTF_16, UTF_16BE, UTF_16LE, US_ASCII, and ISO_8859_1. Prefer these constants over string names like "UTF-8": the Charset overloads cannot throw UnsupportedEncodingException, and a misspelled constant fails at compile time rather than at run time.

The most common source of garbled text in Java applications is an encoding mismatch: bytes written in one encoding are read back in another, for example when reading a file or an HTTP response. The rule is to specify the encoding explicitly on every Reader, Writer, InputStreamReader, and OutputStreamWriter. Before Java 18 the default charset (Charset.defaultCharset()) was taken from the operating system, so code that relied on it could work on the developer's machine and fail on a server with a different default. Since Java 18 the default is UTF-8 unless overridden with the file.encoding system property, but stating the charset keeps the code correct on older runtimes and makes the intent visible.
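
The byte counts quoted above can be checked directly by encoding one sample character from each range. This is a small sketch; EncodedWidths and its two helper methods are names invented for the example. It uses UTF_16BE rather than UTF_16 because Java's UTF_16 encoder prepends a 2-byte byte-order mark, which would distort the counts.

```java
import java.nio.charset.StandardCharsets;

final class EncodedWidths {
    // Illustrative helper: number of bytes the string occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Illustrative helper: number of bytes in UTF-16 (big-endian, no BOM).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // U+0041 A, U+00E9 e-acute, U+4E2D CJK ideograph, U+1F600 emoji
        String[] samples = { "A", "\u00E9", "\u4E2D", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %d  UTF-16: %d%n",
                s.codePointAt(0), utf8Length(s), utf16Length(s));
        }
        // UTF-8 widths:  1, 2, 3, 4 bytes
        // UTF-16 widths: 2, 2, 2, 4 bytes
    }
}
```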

Java

// ── StandardCharsets — always use these constants ────────────────────
import java.nio.charset.StandardCharsets;

byte[] utf8Bytes  = "Hello 世界".getBytes(StandardCharsets.UTF_8);
byte[] latin1Bytes = "Hello".getBytes(StandardCharsets.ISO_8859_1);

String fromUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(fromUtf8);   // Hello 世界

// ── Encoding mismatch — common source of garbled text ─────────────────
String original  = "Héllo Wörld";
byte[] wrongBytes = original.getBytes(StandardCharsets.UTF_8);
String garbled    = new String(wrongBytes, StandardCharsets.ISO_8859_1);
System.out.println(garbled);   // HÃ©llo WÃ¶rld — garbled!

String correct    = new String(wrongBytes, StandardCharsets.UTF_8);
System.out.println(correct);   // Héllo Wörld — correct

// ── Always specify encoding for I/O ──────────────────────────────────
// RISKY: FileReader(String) uses the default charset, which depended on the OS before Java 18:
BufferedReader badReader = new BufferedReader(
    new FileReader("data.txt"));

// CORRECT — explicit UTF-8:
BufferedReader goodReader = new BufferedReader(
    new InputStreamReader(
        new FileInputStream("data.txt"),
        StandardCharsets.UTF_8));

// Or Java 11+ Files API (always specify charset):
String content = Files.readString(Path.of("data.txt"),
    StandardCharsets.UTF_8);
Files.writeString(Path.of("out.txt"), content,
    StandardCharsets.UTF_8);

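// ── Declared charset from a Content-Type header, with fallback ────────
// `response` is a placeholder for any HTTP response object that exposes
// getContentType(), e.g. "text/html; charset=ISO-8859-1". This reads the
// charset the sender declared; it does not detect the encoding from the bytes.
String contentType = response.getContentType();
String charsetName = contentType == null ? "" : contentType
    .replaceAll(".*;\\s*charset\\s*=\\s*", "")
    .replace("\"", "")   // the value may be quoted: charset="utf-8"
    .trim();
Charset charset;
try {
    charset = Charset.forName(charsetName);
} catch (IllegalArgumentException e) {
    // Covers IllegalCharsetNameException and UnsupportedCharsetException
    charset = StandardCharsets.UTF_8;  // default to UTF-8 when missing or unknown
}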

// ── UTF-8 BOM handling ────────────────────────────────────────────────
// Some UTF-8 files start with a BOM (U+FEFF, bytes EF BB BF)
// Java's UTF-8 decoder keeps the BOM as a leading U+FEFF character
byte[] withBom = Files.readAllBytes(Path.of("with-bom.txt"));
String text = new String(withBom, StandardCharsets.UTF_8);
if (text.startsWith("\uFEFF")) {
    text = text.substring(1);   // strip BOM manually
}
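
The mismatch example above goes wrong silently: new String(bytes, charset) replaces undecodable input with U+FFFD instead of failing. When it matters to find out that bytes are not in the expected encoding, a CharsetDecoder can be asked to report errors instead. The sketch below is one way to do that; StrictDecode, decodeUtf8Strict, and isValidUtf8 are hypothetical names used only for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecode {
    // Illustrative helper: decode as UTF-8 and throw on invalid input
    // rather than substituting the replacement character U+FFFD.
    static String decodeUtf8Strict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Illustrative helper: true if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeUtf8Strict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String word = "H\u00E9llo";  // "Hello" with e-acute
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(utf8));    // true
        System.out.println(isValidUtf8(latin1));  // false: byte E9 is not followed by continuation bytes

        // The lenient constructor hides the problem behind U+FFFD:
        String lenient = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(lenient.contains("\uFFFD"));  // true
    }
}
```

A strict decode only proves the bytes are well-formed UTF-8. Text in a single-byte legacy encoding that happens to be pure ASCII passes the check as well, so this complements a declared charset rather than replacing it.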

Unicode Normalisation

Unicode normalisation addresses the fact that some characters can be represented by more than one code point sequence. The letter é can be a single precomposed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE) or a decomposed sequence of two characters (U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT). Both render identically, but String.equals() treats them as different unless they are normalised first.

Java provides java.text.Normalizer.normalize() with the four normalisation forms defined by the Unicode standard. NFC (canonical decomposition followed by canonical composition) is the most common: it decomposes characters and then recomposes them into precomposed form where one exists. NFD (canonical decomposition) decomposes but does not recompose. NFKC and NFKD are the compatibility forms: they additionally replace compatibility characters (such as ligatures and fullwidth forms) with their compatibility equivalents, which discards formatting distinctions.

The practical implication is that any application comparing user-supplied strings against stored strings must normalise both to the same form first. Search, username deduplication, and password hashing are all affected. Failing to normalise can cause security problems (two usernames that look identical but compare as unequal can both be registered) and data quality problems.

Java

// ── The normalisation problem ─────────────────────────────────────────
import java.text.Normalizer;

// é as precomposed single character (U+00E9)
String precomposed  = "\u00E9";  // é — NFC form

// é as base e + combining acute (U+0065 + U+0301)
String decomposed   = "\u0065\u0301";  // e + ́  — NFD form

System.out.println(precomposed);           // é — looks the same
System.out.println(decomposed);            // é — looks the same
System.out.println(precomposed.length());  // 1 — one code unit
System.out.println(decomposed.length());   // 2 — base + combining

System.out.println(precomposed.equals(decomposed));  // false: different code unit sequences

// ── Normalise before comparing ────────────────────────────────────────
String nfc1 = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
String nfc2 = Normalizer.normalize(decomposed,  Normalizer.Form.NFC);

System.out.println(nfc1.equals(nfc2));   // TRUE — both now NFC

// ── The four normalisation forms ──────────────────────────────────────
String original = "\u00e9\ufb01";   // é + ﬁ (fi ligature)

String nfc  = Normalizer.normalize(original, Normalizer.Form.NFC);
String nfd  = Normalizer.normalize(original, Normalizer.Form.NFD);
String nfkc = Normalizer.normalize(original, Normalizer.Form.NFKC);
String nfkd = Normalizer.normalize(original, Normalizer.Form.NFKD);

System.out.println(nfc.length());   // 2: é (precomposed) + ﬁ (ligature preserved)
System.out.println(nfd.length());   // 3: e + ́  + ﬁ (e decomposed, ligature preserved)
System.out.println(nfkc.length());  // 3: é (composed) + f + i (ligature decomposed)
System.out.println(nfkd.length());  // 4: e + ́  + f + i (both decomposed)

// ── Username normalisation for deduplication ──────────────────────────
public static String normaliseUsername(String username) {
    // NFC: canonical composition (handles accented chars)
    // toLowerCase(Locale.ROOT): locale-independent lowercasing, used here as an
    //   approximation of case folding
    // strip: remove leading/trailing whitespace
    return Normalizer
        .normalize(username, Normalizer.Form.NFC)
        .toLowerCase(Locale.ROOT)
        .strip();
}

System.out.println(
    normaliseUsername("Al\u0069\u0301ce").equals(
    normaliseUsername("Al\u00EDce")));   // true — same user
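
The examples above cover ligatures but not the fullwidth forms that the text also mentions. As a further sketch, the code below shows NFKC folding fullwidth Latin letters to their ordinary counterparts while NFC leaves them untouched, and uses Normalizer.isNormalized to test a string without building a normalised copy. CompatibilityForms and foldCompat are illustrative names, not JDK APIs.

```java
import java.text.Normalizer;

final class CompatibilityForms {
    // Illustrative helper: fold compatibility variants (fullwidth forms,
    // ligatures) into their ordinary equivalents using NFKC.
    static String foldCompat(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String fullwidth = "\uFF21\uFF22\uFF23";  // fullwidth Latin capitals A, B, C

        // NFKC maps each fullwidth letter to its ASCII counterpart
        System.out.println(foldCompat(fullwidth).equals("ABC"));  // true

        // NFC is canonical only, so the fullwidth letters survive unchanged
        String nfc = Normalizer.normalize(fullwidth, Normalizer.Form.NFC);
        System.out.println(nfc.equals(fullwidth));                // true

        // isNormalized answers "is this already in form X?" directly
        System.out.println(Normalizer.isNormalized("e\u0301", Normalizer.Form.NFC)); // false
        System.out.println(Normalizer.isNormalized("\u00E9", Normalizer.Form.NFC));  // true
    }
}
```

Whether identifiers such as usernames should use NFC or NFKC is a policy decision: NFKC merges more look-alike spellings but also erases distinctions that some text needs to keep.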
