
Java Unicode System

Java was built with internationalization in mind from day one. Every string, every character, every source file is Unicode. Understanding how Java handles Unicode — from char and String internals to escape sequences and encoding — is essential for building software that works correctly in every language.

Why Java Chose Unicode

When Java was released in 1995, most languages assumed ASCII, a 7-bit encoding covering only 128 characters, barely enough for English. Building global software meant layering character encoding conversions on top, with frequent bugs when systems disagreed on encodings. Java made a different choice: Unicode as the native character system, from the ground up. Every Java char holds a Unicode (UTF-16) code unit. Every String is a Unicode string. Every .java source file is translated to Unicode before compilation. This means Java programs can represent English, Chinese, Arabic, Hindi, Japanese, emoji, and mathematical symbols out of the box, without external libraries.

char — A 16-bit Unicode Character

Java's char primitive is a 16-bit unsigned integer that holds one UTF-16 code unit. A single char can represent any character in the Basic Multilingual Plane (U+0000 to U+FFFF); characters beyond that range need two char values.

Java

// char holds a single Unicode character:
char letter = 'A';          // U+0041 — Latin capital A
char digit  = '5';          // U+0035
char space  = ' ';          // U+0020
char newline = '\n';        // U+000A, written with an escape sequence

// Unicode escape sequences: a backslash, 'u', then 4 hex digits:
char omega = '\u03A9';      // Ω, Greek capital Omega
char rupee = '\u20B9';      // ₹, Indian Rupee sign
char snowman = '\u2603';    // ☃, snowman

// char is numerically a 16-bit integer:
char c = 'A';
int code = c;               // code = 65 (Unicode code point for 'A')
char next = (char)(c + 1);  // next = 'B'

// Printing Unicode:
System.out.println('\u03A9');   // prints: Ω
System.out.println('\u20B9');   // prints: ₹

// char arithmetic:
for (char ch = 'A'; ch <= 'Z'; ch++) {
    System.out.print(ch);       // prints: ABCDEFGHIJKLMNOPQRSTUVWXYZ
}
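
As a small sketch of the "char is a number" idea above, the class below prints a char in the familiar U+XXXX notation. The class name CharCodes and the helper toUPlus are hypothetical names chosen for illustration, not JDK APIs.

```java
class CharCodes {
    // Hypothetical helper (not part of the JDK): formats a char in U+XXXX notation.
    static String toUPlus(char c) {
        // Widening the char to int exposes its numeric value for hex formatting.
        return String.format("U+%04X", (int) c);
    }

    public static void main(String[] args) {
        System.out.println(toUPlus('A'));               // U+0041
        System.out.println(toUPlus('\u20B9'));          // U+20B9
        System.out.println((int) Character.MAX_VALUE);  // 65535, the largest char value
    }
}
```

Because a char tops out at 65535 (U+FFFF), any code point above that cannot fit in a single char.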

Unicode Escape Sequences

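Java supports Unicode escapes using the \uXXXX syntax: a backslash, the letter u, and four hexadecimal digits naming one UTF-16 code unit. The compiler translates these escapes before any other lexical processing, so they are recognized in string literals, character literals, identifiers, and even comments. The same rule means an escape for a line terminator or a quote character takes effect as that character, wherever it appears.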

Java

// In string literals:
String greeting = "\u4E2D\u6587";    // 中文, Chinese characters
String hello = "\u0048\u0065\u006C\u006C\u006F";  // "Hello"

// In char literals:
char euro = '\u20AC';     // €
char pi = '\u03C0';       // π

// In identifiers (unusual but valid):
int \u0041ge = 25;        // same as: int Age = 25;

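// Unicode escapes are translated BEFORE the compiler tokenizes the source.
// The escape for U+000A (line feed) therefore becomes a real line break,
// even inside a string literal or a comment:
// System.out.println("line1<escape for U+000A>line2");  // compile error: unterminated string literal
// Use the ordinary escape sequence in literals instead:
System.out.println("line1\nline2");

// Code points that are safer to write with ordinary escape sequences:
// U+0009 horizontal tab   (use \t)
// U+000A line feed        (use \n)
// U+000D carriage return  (use \r)
// U+0022 double quote     (use \")
// U+0027 single quote     (use \')
// U+005C backslash        (use \\)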
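
To check the claim that escapes are translated before the compiler sees anything else, here is a minimal sketch. The class name EscapeDemo and its members are hypothetical. The field is declared with an escape for its first letter and then read under its plain name.

```java
class EscapeDemo {
    // The identifier is written with an escape for its first letter;
    // after translation the compiler sees the plain name Age.
    static int \u0041ge = 25;

    static boolean sameString() {
        // Five escapes that spell the ASCII word Hello.
        String escaped = "\u0048\u0065\u006C\u006C\u006F";
        return escaped.equals("Hello");
    }

    public static void main(String[] args) {
        System.out.println(sameString()); // true
        System.out.println(Age);          // 25
    }
}
```

The escaped and unescaped spellings are interchangeable because the difference disappears before tokenization.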

String and Unicode — Internal Representation

A Java String is a sequence of char values, that is, UTF-16 code units. For characters in the Basic Multilingual Plane (the vast majority of characters in everyday text), one char is one character. Supplementary characters outside the BMP (emoji, rare CJK characters, ancient scripts) are stored as surrogate pairs: two char values that together represent one code point.

Java

// String length returns the number of char values, not visual characters:
String s = "Hello";
System.out.println(s.length());   // 5 — as expected

// Emoji are supplementary characters — stored as surrogate pairs:
String emoji = "😀";
System.out.println(emoji.length());         // 2 — two char values!
System.out.println(emoji.codePointCount(0, emoji.length()));  // 1 — one character

// Safe iteration over code points (handles surrogates correctly):
String text = "Hello 😀";
text.codePoints().forEach(cp -> {
    System.out.println(Character.toString(cp));  // correctly handles emoji
});

// String.chars() returns char values — may split surrogate pairs:
// String.codePoints() returns code points — handles supplementary characters
String mixed = "A😀B";   // A + 😀 (surrogate pair) + B
System.out.println(mixed.length());              // 4 (A + 2 surrogates + B)
System.out.println(mixed.codePointCount(0, 4));  // 3 (A + 😀 + B)
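
One practical consequence of surrogate pairs is that cutting a string at an arbitrary char index can split an emoji in half. The sketch below shows one possible way to truncate by code points instead, using String.offsetByCodePoints. The class CodePointDemo and the helper firstCodePoints are hypothetical names, and the emoji is written as its surrogate pair so the example does not depend on the source file's encoding.

```java
class CodePointDemo {
    // Counts code points rather than UTF-16 code units.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical helper: returns the first n code points of s
    // without splitting a surrogate pair.
    static String firstCodePoints(String s, int n) {
        int wanted = Math.min(n, countCodePoints(s));
        int end = s.offsetByCodePoints(0, wanted); // char index after 'wanted' code points
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // 'A', U+1F600 (as a surrogate pair), 'B'
        String mixed = "A\uD83D\uDE00B";
        System.out.println(mixed.length());                             // 4
        System.out.println(countCodePoints(mixed));                     // 3
        System.out.println(firstCodePoints(mixed, 2).length());         // 3: 'A' plus both surrogates
        System.out.println(Character.isHighSurrogate(mixed.charAt(1))); // true
    }
}
```

Note that a code point is still not always what a reader perceives as one character; combining marks and emoji sequences are outside the scope of this sketch.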

File Encoding — Reading and Writing Unicode

When reading or writing files, encoding matters. Java's default file encoding can vary by platform — always specify UTF-8 explicitly to guarantee correct Unicode handling across all operating systems.

Java

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// ALWAYS specify charset explicitly — never rely on the default:

// Writing UTF-8 file:
Path path = Path.of("output.txt");
Files.writeString(path, "Hello 世界 🌍", StandardCharsets.UTF_8);

// Reading UTF-8 file:
String content = Files.readString(path, StandardCharsets.UTF_8);

// PrintWriter with explicit encoding (try-with-resources closes the stream):
try (PrintWriter writer = new PrintWriter(
        new OutputStreamWriter(
            new FileOutputStream("file.txt"), StandardCharsets.UTF_8))) {
    writer.println("Hello 世界");
}

// BufferedReader with explicit encoding:
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(
            new FileInputStream("file.txt"), StandardCharsets.UTF_8))) {
    String firstLine = reader.readLine();
}

// Java 18+ — UTF-8 is the default charset (finally!)
// Before Java 18: the default was platform-dependent (typically Cp1252 on Western-locale Windows, UTF-8 on most Linux systems)
// Best practice: still specify StandardCharsets.UTF_8 explicitly for clarity
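
The following is a minimal, self-contained sketch of a UTF-8 round trip, assuming Java 11 or later for Files.writeString and Files.readString. The class name Utf8RoundTrip, the helper roundTrip, and the temp-file prefix are hypothetical choices for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8RoundTrip {
    // Writes text to a temporary file as UTF-8, reads it back, and deletes the file.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("unicode-demo", ".txt"); // hypothetical prefix
            try {
                Files.writeString(tmp, text, StandardCharsets.UTF_8);
                return Files.readString(tmp, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Hello \u4E16\u754C"; // "Hello " followed by two CJK characters
        System.out.println(roundTrip(text).equals(text));                 // true
        System.out.println(text.length());                                // 8
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length); // 12
    }
}
```

The last two lines show why char counts and byte counts differ: each of the two CJK characters is one char in memory but three bytes in UTF-8.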

The Character Class — Unicode Utilities

Java's Character class provides a rich set of static methods for working with Unicode characters — testing categories, converting case, and handling supplementary characters.

Java

// Testing character categories:
Character.isLetter('A')          // true
Character.isLetter('5')          // false
Character.isDigit('7')           // true
Character.isLetterOrDigit('_')   // false
Character.isWhitespace(' ')      // true
Character.isUpperCase('A')       // true
Character.isLowerCase('a')       // true

// Unicode-aware:
Character.isLetter('α')          // true — Greek alpha
Character.isLetter('中')          // true — Chinese character

// Case conversion:
Character.toUpperCase('a')       // 'A'
Character.toLowerCase('Ω')      // 'ω'

// Supplementary character support (code point versions):
int codePoint = "😀".codePointAt(0);
Character.isLetterOrDigit(codePoint)     // false — emoji is not a letter
Character.getType(codePoint)             // 28 = OTHER_SYMBOL

// Get Unicode block and category:
Character.UnicodeBlock block = Character.UnicodeBlock.of('α');
// block = Character.UnicodeBlock.GREEK
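
The expressions above can be combined into a small runnable sketch. CharacterStats and countLetters are hypothetical names. The helper tests each code point with the int overload of Character.isLetter, so a supplementary character is classified as a whole rather than as two separate surrogates.

```java
class CharacterStats {
    // Hypothetical helper: counts letters by code point, so characters
    // stored as surrogate pairs are tested as whole characters.
    static long countLetters(String s) {
        return s.codePoints().filter(Character::isLetter).count();
    }

    public static void main(String[] args) {
        // 'a', Greek alpha (U+03B1), digit 7, and U+1F600 as a surrogate pair.
        String sample = "a\u03B17\uD83D\uDE00";
        System.out.println(countLetters(sample));                                 // 2
        System.out.println(Character.getType(0x1F600) == Character.OTHER_SYMBOL); // true
        System.out.println(
            Character.UnicodeBlock.of('\u03B1') == Character.UnicodeBlock.GREEK); // true
    }
}
```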
