☕ JavaCollections Framework

HashSet

HashSet is the most commonly used Set implementation in Java. It stores elements in a hash table, providing O(1) average-case performance for add, remove, and contains operations regardless of the set's size. HashSet makes no guarantee about the order in which elements are returned during iteration — the order depends on hash codes and internal bucket structure, which can change as the set is modified. Understanding HashSet means understanding the hash table beneath it, the critical equals/hashCode contract that determines correctness, load factor and resizing behaviour, and when the unordered O(1) semantics are the right trade-off. This entry covers the hash table internals, the equals/hashCode contract and its consequences, capacity and load factor tuning, null element handling, and practical patterns for correct HashSet usage.

Hash Table Internals — How HashSet Works

HashSet is backed by a HashMap internally — it stores elements as keys in a HashMap with a constant dummy value. Every operation on HashSet delegates to the underlying HashMap. This design means that understanding HashMap is identical to understanding HashSet. The hash table divides its storage into an array of buckets. When an element is added, the JVM computes its hash code, applies a supplemental hash function to improve distribution, and takes the result modulo the number of buckets to determine which bucket the element belongs to. If the bucket is empty, the element is placed there. If the bucket already contains elements (a hash collision), the new element is appended to a chain (a linked list before Java 8, a balanced red-black tree for chains exceeding 8 elements in Java 8+). Lookup follows the same process: hash, find bucket, scan the chain for an element equal to the target. The treeification threshold of 8 means that for well-distributed hash codes, chains rarely become long enough to trigger tree conversion. A chain of 8 elements has the same expected frequency as flipping 8 heads in a row with a fair coin — rare. When hash codes cluster badly (poorly implemented hashCode or adversarial inputs), chains can become long and treeification converts them to balanced trees, degrading O(1) expected operations to O(log n) worst case rather than O(n). The table is resized when the number of elements exceeds capacity × loadFactor. Resizing doubles the table, rehashes all existing elements into the new table, and resets the threshold. The default load factor of 0.75 balances memory efficiency (the table is not too sparse) against performance (chains are short). Higher load factors reduce memory at the cost of longer chains and more collisions; lower load factors use more memory for shorter chains.

Java

// ── HashSet construction ─────────────────────────────────────────────
HashSet<String> set1 = new HashSet<>();                // default: capacity=16, LF=0.75
HashSet<String> set2 = new HashSet<>(100);             // initial capacity hint
HashSet<String> set3 = new HashSet<>(100, 0.5f);       // capacity + load factor
HashSet<String> set4 = new HashSet<>(existingCollection); // copy constructor

// ── Core operations — all O(1) average ───────────────────────────────
Set<String> set = new HashSet<>();

boolean added   = set.add("Alice");     // true — new element
boolean dup     = set.add("Alice");     // false — already present (no duplicate)
boolean removed = set.remove("Alice");  // true — was present
boolean missing = set.remove("Dave");   // false — was not present
boolean has     = set.contains("Bob");  // true/false — O(1) average

// ── Internal structure visualisation ─────────────────────────────────
//
// HashSet<String> with "Alice", "Bob", "Carol", "Dave"
// (simplified — actual bucket count is a power of 2)
//
// Bucket 0:  ["Alice"]
// Bucket 1:  []
// Bucket 2:  ["Bob", "Dave"]   ← collision: Bob and Dave have same bucket
// Bucket 3:  ["Carol"]
// ...
//
// contains("Bob"):
//   hash("Bob") → 2 → scan bucket 2 → Bob.equals("Bob") → true   O(1)
//
// contains("Eve"):
//   hash("Eve") → 5 → bucket 5 is empty → false   O(1)

// ── No ordering guarantee ─────────────────────────────────────────────
Set<Integer> numbers = new HashSet<>();
for (int i = 1; i <= 10; i++) numbers.add(i);

// Iteration order is NOT 1 2 3 4 5 6 7 8 9 10:
for (int n : numbers) System.out.print(n + " ");
// Might print: 1 2 3 4 5 6 7 8 9 10
// Or might print: 3 7 1 8 4 9 2 5 6 10
// Order depends on hashCode and internal bucket structure
// NEVER rely on HashSet iteration order

// ── Null element — HashSet allows exactly one null ────────────────────
Set<String> withNull = new HashSet<>();
withNull.add(null);     // allowed — null maps to bucket 0
withNull.add(null);     // false — only one null allowed
System.out.println(withNull.contains(null));  // true
System.out.println(withNull.size());          // 1

The equals/hashCode Contract — Correctness Foundation

The correctness of HashSet depends entirely on the equals/hashCode contract. The contract has two parts. First, if two objects are equal according to equals(), they must have the same hash code — equal objects must hash to the same bucket, or the set would fail to detect duplicates. Second, if two objects have the same hash code, they may or may not be equal — hash collisions are permitted. Violating the contract causes silent, hard-to-debug failures. If two logically equal objects have different hash codes, a HashSet treats them as different — you can add the same logical value multiple times and contains() returns false for an element that is clearly in the set. This failure mode is particularly dangerous because it can occur transiently when objects are mutable: if an object in a HashSet has its hash-relevant fields mutated after insertion, its hash code changes, but the set still stores it in the bucket corresponding to the original hash code. The set will then fail to find it, fail to remove it, and appear to contain a duplicate when the original is re-added. The general recommendation is: never store mutable objects in a HashSet when the mutable fields affect equals or hashCode. Use immutable objects as set elements wherever possible. If mutable objects must be used, ensure that mutations do not affect the fields used in equals/hashCode while the objects are in the set. Objects that do not override equals and hashCode use the defaults from Object: identity equality (==) and identity hash code (based on memory address). For such objects, only the exact same instance is considered equal to itself, which means a HashSet of them cannot detect duplicates at the logical level, only at the reference level.

Java

// ── equals/hashCode contract — correct implementation ────────────────
public class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point p)) return false;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);  // combines x and y into one hash code
    }
}

// ── Correct behaviour with proper contract ────────────────────────────
Set<Point> points = new HashSet<>();
points.add(new Point(3, 4));
points.add(new Point(3, 4));  // same logical value → not added (duplicate)
System.out.println(points.size());                   // 1
System.out.println(points.contains(new Point(3, 4))); // true — correct!

// ── BROKEN: equal objects with different hashCodes ────────────────────
public class BrokenPoint {
    int x, y;
    public BrokenPoint(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof BrokenPoint p)) return false;
        return x == p.x && y == p.y;
    }
    // MISSING hashCode() — uses Object's identity hash code!
    // Two equal BrokenPoints will have DIFFERENT hash codes (different objects)
}

Set<BrokenPoint> broken = new HashSet<>();
broken.add(new BrokenPoint(3, 4));
broken.add(new BrokenPoint(3, 4));  // treated as different — WRONG
System.out.println(broken.size());                       // 2 — BUG!
System.out.println(broken.contains(new BrokenPoint(3, 4))); // false — BUG!

// ── Mutable object in HashSet — extremely dangerous ───────────────────
class MutableKey {
    int value;
    MutableKey(int v) { this.value = v; }
    @Override public boolean equals(Object o) {
        return o instanceof MutableKey m && value == m.value;
    }
    @Override public int hashCode() { return value; }
}

Set<MutableKey> mutableSet = new HashSet<>();
MutableKey key = new MutableKey(5);
mutableSet.add(key);
System.out.println(mutableSet.contains(key));  // true

key.value = 10;   // MUTATE THE KEY — hash code changes!
System.out.println(mutableSet.contains(key));  // false — SET IS BROKEN
System.out.println(mutableSet.size());          // 1 — it's "in" the set but unfindable
mutableSet.add(key);                            // re-adds as if new — now size=2!

Capacity, Load Factor, and Performance Tuning

HashSet performance degrades when many elements hash to the same bucket, creating long chains that must be traversed linearly. Two factors control this: the initial capacity (the initial number of buckets) and the load factor (the ratio of elements to buckets above which the table is resized). The product capacity × loadFactor is the threshold at which resizing occurs. The default initial capacity of 16 and load factor of 0.75 are well-chosen defaults for general use. When the final size of the set is known in advance, providing a suitable initial capacity avoids multiple resize operations during population. The formula for the right initial capacity is ceil(expectedSize / loadFactor) — for 1000 expected elements with load factor 0.75, use new HashSet<>(1334) to avoid any resize. Load factor 0.75 is a careful balance. A lower load factor (0.5) means fewer collisions and faster lookups but uses twice as much memory for the bucket array. A higher load factor (0.9) uses less memory but increases average chain length and lookup time. The sweet spot is application-dependent: read-heavy sets benefit from lower load factors; memory-constrained applications might use higher ones. HashSet performance can degrade to O(n) per operation when hashCode() has a pathological implementation that returns the same value for many distinct elements, or when an adversary crafts input to maximise collisions. Java 8's treeification (converting long chains to balanced trees) mitigates this to O(log n) worst case. For security-sensitive applications processing untrusted input, randomising hash codes with a seed (as Java's String hash randomisation option -Dhashcode.random.seed provides for some uses) prevents collision attacks.

Java

// ── Capacity tuning for known sizes ──────────────────────────────────
// WRONG: default capacity causes 6 resize operations for 1000 elements
// 16 → 32 → 64 → 128 → 256 → 512 → 1024
Set<String> slow = new HashSet<>();   // will resize multiple times

// CORRECT: provide capacity hint — ceil(1000 / 0.75) = 1334
// No resizing needed during population
Set<String> fast = new HashSet<>(1334);

// Or use the common formula:
int expectedSize = 1000;
int initialCapacity = (int) Math.ceil(expectedSize / 0.75) + 1;
Set<String> tuned = new HashSet<>(initialCapacity);

// ── Load factor trade-offs ────────────────────────────────────────────
// Low load factor (0.25): fewer collisions, more memory
// Use when: read-heavy set, memory not a constraint
Set<String> fastLookup = new HashSet<>(100, 0.25f);

// High load factor (0.9): more collisions, less memory
// Use when: write-once-read-rarely, memory is tight
Set<String> compact = new HashSet<>(100, 0.9f);

// ── HashSet with poor hashCode — performance collapse ─────────────────
// If all elements return the same hashCode:
class BadHash {
    String value;
    BadHash(String v) { this.value = v; }
    @Override public int hashCode() { return 42; }  // TERRIBLE
    @Override public boolean equals(Object o) {
        return o instanceof BadHash b && value.equals(b.value);
    }
}

Set<BadHash> bad = new HashSet<>();
for (int i = 0; i < 1000; i++) bad.add(new BadHash("item" + i));
// All 1000 elements in ONE bucket → Java 8+ converts to tree → O(log n)
// Still far worse than O(1) with good hash codes

// ── Bulk operations ───────────────────────────────────────────────────
Set<String> a = new HashSet<>(Set.of("A", "B", "C", "D"));
Set<String> b = new HashSet<>(Set.of("C", "D", "E", "F"));

// Union: a ∪ b
Set<String> union = new HashSet<>(a);
union.addAll(b);     // {A, B, C, D, E, F}

// Intersection: a ∩ b
Set<String> intersection = new HashSet<>(a);
intersection.retainAll(b);   // {C, D}

// Difference: a  b
Set<String> difference = new HashSet<>(a);
difference.removeAll(b);     // {A, B}

// Is a subset of b?
boolean isSubset = b.containsAll(intersection);  // true

HashSet in Practice — Patterns and Pitfalls

HashSet is the right choice when membership testing is the primary operation and ordering does not matter. Common patterns include: deduplication (converting a List to a HashSet removes duplicates), visited tracking in graph traversal (O(1) contains avoids revisiting nodes), and set operations (union, intersection, difference) on collections. The deduplication pattern is one of the most common: new HashSet<>(list) converts a list to a set, removing duplicates. The element order in the resulting set is unspecified. If order matters (preserve insertion order, natural order), LinkedHashSet or TreeSet should be used instead. The visited set in BFS and DFS is the most performance-critical use case for HashSet. Graph algorithms visit each node at most once by checking if the node is in a visited set before processing. With a HashSet, this check is O(1). Using a List for the visited set would make each check O(n), degrading the entire BFS/DFS from O(V+E) to O(V²). HashSet is not thread-safe. Concurrent reads are safe (reads do not modify the structure), but concurrent writes or concurrent reads with writes require external synchronisation. Collections.synchronizedSet(new HashSet<>()) provides a synchronised wrapper but serialises all access. ConcurrentHashMap.newKeySet() provides a fully concurrent set with better throughput under contention.

Java

// ── Deduplication ────────────────────────────────────────────────────
List<String> withDuplicates = List.of("apple", "banana", "apple",
                                       "cherry", "banana", "date");
Set<String> unique = new HashSet<>(withDuplicates);
System.out.println(unique.size());   // 4 — duplicates removed
// Order not guaranteed: might be {date, cherry, apple, banana}

// ── Visited set in graph traversal ───────────────────────────────────
Set<Integer> visited = new HashSet<>();

void bfs(Graph graph, int start) {
    Queue<Integer> queue = new ArrayDeque<>();
    queue.offer(start);
    visited.add(start);

    while (!queue.isEmpty()) {
        int node = queue.poll();
        process(node);
        for (int neighbour : graph.neighbours(node)) {
            if (visited.add(neighbour)) {  // add returns false if already present
                queue.offer(neighbour);     // only enqueue if newly visited
            }
        }
    }
}

// ── Set operations for data analysis ──────────────────────────────────
Set<Long> activeUsers    = getActiveUsersLastWeek();
Set<Long> purchasedUsers = getUsersWhoMadePurchase();

// Users who were active but did not purchase:
Set<Long> nonPurchasers = new HashSet<>(activeUsers);
nonPurchasers.removeAll(purchasedUsers);   // active  purchased

// Are there users who purchased without being in the active set?
boolean anomaly = !activeUsers.containsAll(purchasedUsers);

// ── Thread-safe HashSet options ────────────────────────────────────────
// Option 1: synchronised wrapper (simple, low concurrency)
Set<String> syncSet =
    Collections.synchronizedSet(new HashSet<>());
// All methods synchronised — but iterator still needs external sync:
synchronized (syncSet) {
    for (String s : syncSet) { ... }  // must hold lock during iteration
}

// Option 2: ConcurrentHashMap.newKeySet() (better concurrency)
Set<String> concurrentSet = ConcurrentHashMap.newKeySet();
concurrentSet.add("value");           // lock-free in many cases
concurrentSet.contains("value");      // lock-free read

// Option 3: CopyOnWriteArraySet (read-heavy, writes are expensive)
Set<String> cowSet = new CopyOnWriteArraySet<>();
// All reads are lock-free; writes copy the entire array
// Suitable only when reads vastly outnumber writes

HashSet

Hash Table Internals — How HashSet Works

The equals/hashCode Contract — Correctness Foundation

Capacity, Load Factor, and Performance Tuning

HashSet in Practice — Patterns and Pitfalls

Related Topics in Collections Framework