☕ Java

Parallel Streams

Parallel streams execute stream operations concurrently using the common ForkJoinPool, transparently splitting the data source into chunks and processing each chunk on a separate thread. A sequential stream becomes parallel by calling parallelStream() on a collection or parallel() on an existing stream; calling sequential() converts it back. The Fork/Join framework divides the work recursively — a Spliterator splits the data into halves, each half is processed independently, and results are combined — enabling efficient parallelism for data-parallel computations on multi-core hardware. Parallel streams are most effective when the computation is CPU-bound, the data set is large, the operation is independently executable on each element (no shared mutable state, no ordering dependencies), and the overhead of splitting and combining is small relative to the per-element computation. This entry covers the splitting and combining model in depth, the common ForkJoinPool and how to use a custom pool, the effect of encounter order and ordered vs unordered operations, thread safety requirements for reduction and collection, performance pitfalls (autoboxing, small data sets, stateful operations, sequential sources), and measurement-driven guidelines for when parallel streams help vs hurt.

How Parallel Streams Work — Splitting, Processing, and Combining

When a parallel stream processes a terminal operation, the Stream framework uses the source's Spliterator to recursively split the data into sub-problems. The split continues until sub-problems are small enough to process directly (determined by the ForkJoinPool's work-stealing threshold and the Spliterator's size estimate). Each sub-problem is submitted as a ForkJoinTask to the common ForkJoinPool (by default) or a custom pool. Worker threads process their assigned chunks and produce partial results. Finally, the partial results are combined according to the combining function of the terminal operation — a reduction accumulates them with the BinaryOperator combiner, collect() uses the Collector's combiner, and forEach() has no result to combine. The Spliterator's characteristics drive the efficiency of parallel decomposition. A Spliterator with SIZED and SUBSIZED characteristics allows the framework to compute exact split sizes and pre-allocate result containers. A Spliterator with ORDERED requires the framework to maintain encounter order during splitting and combining, which constrains parallelism and adds overhead. Removing the ORDERED constraint (via stream.unordered()) can significantly improve parallel performance for operations like distinct() and limit() that are otherwise expensive to parallelize with ordering. The common ForkJoinPool has parallelism equal to Runtime.getRuntime().availableProcessors() - 1 (the minus-one reserves the submitting thread as a worker). This means parallel stream tasks compete for threads with all other parallel streams and ForkJoinPool.commonPool() tasks running in the same JVM — including those from other components, libraries, and frameworks. Blocking operations inside parallel stream lambdas starve the common pool: if 8 out of 8 workers are blocked on I/O, no other parallel work can proceed. For I/O-bound tasks, either use a custom ForkJoinPool or use CompletableFuture with a separate thread pool. The merge/combine step is a hidden cost. A reduction with a BinaryOperator must combine N partial results from N threads, which may require O(N) combine operations even if each is cheap. A collect() with a Collector that has a combining cost proportional to the partial result size (e.g., joining a list of strings into one string) can be more expensive in parallel than sequential because of this combining overhead.
Java
// ── Creating parallel streams ─────────────────────────────────────────
List<Integer> numbers = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

// From a collection directly:
numbers.parallelStream()
    .map(n -> n * n)
    .forEach(System.out::println);  // order NOT guaranteed — parallel

// Convert existing sequential stream to parallel:
numbers.stream()       // sequential
    .parallel()        // switch to parallel
    .map(n -> n * n)
    .forEach(System.out::println);

// Convert parallel back to sequential:
numbers.parallelStream()
    .sequential()      // back to sequential
    .map(n -> n * n)
    .forEach(System.out::println);  // order guaranteed

// Check if a stream is parallel:
Stream<Integer> s = numbers.parallelStream();
System.out.println(s.isParallel());   // true

// ── Splitting: ArrayList vs LinkedList performance ────────────────────
// ArrayList: SIZED + SUBSIZED + ORDERED → perfect parallel splits
List<Integer> arrayList = new ArrayList<>(Collections.nCopies(1_000_000, 1));
long arraySum = arrayList.parallelStream().mapToLong(Integer::longValue).sum();
// Splits perfectly: [0,500k), [500k,1M) → [0,250k), [250k,500k) etc.

// LinkedList: NOT SIZED, NOT SUBSIZED → poor parallel performance
List<Integer> linkedList = new LinkedList<>(Collections.nCopies(1_000_000, 1));
long linkedSum = linkedList.parallelStream().mapToLong(Integer::longValue).sum();
// Cannot split efficiently — degrades to sequential behavior

// ── Ordered vs unordered: ORDERED constraint limits parallelism ────────
List<String> words = List.of("banana","apple","cherry","date","elderberry");

// With ordering (ORDERED source): limit() must preserve first 3 in encounter order
List<String> orderedLimit = words.parallelStream()
    .limit(3)       // must preserve encounter order — SLOW in parallel
    .collect(Collectors.toList());
System.out.println(orderedLimit);   // always [banana, apple, cherry]

// Without ordering: any 3 elements, much faster in parallel
List<String> unorderedLimit = words.parallelStream()
    .unordered()    // remove ORDERED constraint
    .limit(3)       // can return any 3 — FAST in parallel
    .collect(Collectors.toList());
System.out.println(unorderedLimit); // 3 elements, order unpredictable

Thread Safety, Custom ForkJoinPool, and Reduction Requirements

Parallel streams execute lambda expressions on multiple threads simultaneously. Any state accessed or mutated inside a lambda must be thread-safe or must not be shared. Stateful lambda operations that accumulate into a shared mutable container — adding to a shared ArrayList, incrementing a shared counter, writing to a shared map — produce race conditions and incorrect results. The Stream API specifically prohibits stateful behavioral parameters in its contract. The correct way to accumulate results in parallel is through the Collector mechanism. collect(Collectors.toList()) is thread-safe in parallel because the Stream framework uses separate containers for each thread's partial results and combines them at the end. Collectors.groupingByConcurrent() uses a ConcurrentHashMap, allowing parallel puts without combining overhead. AtomicInteger and LongAdder can be safely incremented from parallel stream lambdas as side effects, though this is an antipattern compared to using reduce() or collect(). A custom ForkJoinPool allows parallel stream tasks to run in a pool separate from the common pool, with a configurable parallelism level. This is achieved by submitting the parallel stream terminal operation as a task to the custom pool. The parallel stream detects that it is being called from within a ForkJoinWorkerThread and uses that pool rather than the common pool. The pattern: ForkJoinPool pool = new ForkJoinPool(parallelism); pool.submit(() -> stream.parallel().collect(...)).get(). Important: the custom pool must be shut down when no longer needed to release its threads. Reduction requirements: a parallel reduce() requires an identity value, an accumulator, and a combiner. The identity must be a true identity for the combining operation — combine(identity, v) must equal v. The accumulator and combiner must be associative — (a op b) op c must equal a op (b op c) — so that parallel partial results can be combined in any order. Violating these constraints produces incorrect results that may vary between runs depending on thread scheduling.
Java
// ── Thread safety: shared mutable state causes race conditions ────────
// WRONG: parallel stream mutating shared list:
List<Integer> shared = new ArrayList<>();
IntStream.range(0, 1000).parallel().forEach(shared::add);  // RACE CONDITION
System.out.println(shared.size());  // likely NOT 1000 — lost updates, corruption

// CORRECT: use collect() instead:
List<Integer> correct = IntStream.range(0, 1000)
    .parallel()
    .boxed()
    .collect(Collectors.toList());  // thread-safe: internal partial lists combined
System.out.println(correct.size());  // always 1000

// CORRECT: concurrent collector for parallel grouping:
Map<Integer, List<Integer>> grouped = IntStream.range(0, 1000)
    .parallel()
    .boxed()
    .collect(Collectors.groupingByConcurrent(n -> n % 10));  // ConcurrentHashMap
System.out.println(grouped.size());  // 10 groups

// ── AtomicLong for parallel side-effect counting (antipattern but safe) ─
AtomicLong count = new AtomicLong();
IntStream.range(0, 1_000_000).parallel().forEach(n -> {
    if (n % 2 == 0) count.incrementAndGet();  // thread-safe but suboptimal
});
// Better: use stream operations directly:
long countBetter = IntStream.range(0, 1_000_000).parallel().filter(n -> n % 2 == 0).count();

// ── Custom ForkJoinPool: separate from common pool ────────────────────
ForkJoinPool customPool = new ForkJoinPool(4);  // 4 worker threads
try {
    long result = customPool.submit(() ->
        LongStream.range(0, 1_000_000)
            .parallel()      // uses customPool, not common pool
            .sum()
    ).get();
    System.out.println("Sum: " + result);
} catch (InterruptedException | ExecutionException e) {
    Thread.currentThread().interrupt();
} finally {
    customPool.shutdown();  // must shut down to release threads
}

// ── Reduction identity and associativity requirements ─────────────────
// CORRECT: identity=0, accumulator/combiner are associative:
int sum = IntStream.range(1, 11).parallel().reduce(0, Integer::sum);
System.out.println(sum);   // always 55

// WRONG: identity is not a true identity (violates: combine(identity, v) = v):
int wrongResult = Stream.of(1, 2, 3, 4, 5).parallel().reduce(10, Integer::sum);
// Each parallel sub-stream uses identity=10, so 5 sub-streams add 10 each → 10+10+... added
System.out.println(wrongResult);   // NOT 25, NOT deterministic — depends on split count

// WRONG: non-associative accumulation:
// String concatenation with ordering dependency fails in parallel:
String s = Stream.of("a","b","c","d").parallel()
    .reduce("", (a, b) -> a + "[" + b + "]");  // order-dependent — wrong in parallel
System.out.println(s);  // unpredictable

// CORRECT: use Collectors.joining() which handles parallel combining:
String joined = Stream.of("a","b","c","d").parallel()
    .collect(Collectors.joining(", "));  // always "a, b, c, d"

Performance Pitfalls and When to Use Parallel Streams

Parallel streams have overhead: thread coordination, data splitting, and result combining all take time. For small data sets, this overhead dominates the computation cost and parallel is slower than sequential. The typical threshold varies by operation: for simple operations like addition, parallel wins at around 10,000 elements on modern hardware; for more expensive operations (string processing, regex), the threshold is lower. Always measure both sequential and parallel performance on realistic data before choosing. Autoboxing is a major pitfall. Stream<Integer>.parallel().mapToLong(Integer::longValue).sum() creates one Integer object per element and then immediately unboxes it, adding heap pressure and GC overhead that can swamp the parallel speedup. The fix is to use primitive streams: IntStream.range().parallel() or LongStream.of().parallel() avoid boxing entirely and achieve substantially better throughput. Sources with poor Spliterator characteristics degrade parallel performance. Range-based sources (IntStream.range(), Arrays.stream()) split perfectly. ArrayList, arrays, and most java.util.concurrent collections split well. LinkedList, most I/O-based streams (Files.lines()), and streams from Iterator sources split poorly — often not at all, degrading to sequential execution. For Files.lines(), prefer Files.readAllLines() into a List for parallel processing. Stateful intermediate operations (sorted(), distinct(), limit(), skip()) either require global coordination across all threads or impose ordering constraints. sorted() in a parallel stream requires collecting all elements and sorting, eliminating any parallel benefit for that operation. distinct() with ORDERED requires maintaining encounter order across threads, which is expensive. limit() and skip() with ORDERED are O(N) even in parallel. These operations should either be applied after sequential(), placed early in the pipeline to reduce data size before parallel processing, or the stream should be made unordered() when ordering is not required. The NQ model provides a rough guide: N is the number of elements, Q is the cost per element. If N * Q is large — many elements, or expensive per-element computation — parallelism is beneficial. If either N or Q is small, sequential is likely faster. This is a starting heuristic, not a substitute for measurement.
Java
// ── Small data: parallel is SLOWER ───────────────────────────────────
List<Integer> small = List.of(1, 2, 3, 4, 5);

// Sequential: fast (no thread coordination overhead)
long seqResult = small.stream().mapToLong(n -> n * n).sum();

// Parallel: slow (thread overhead > computation cost for 5 elements)
long parResult = small.parallelStream().mapToLong(n -> n * n).sum();

// Typical timing: sequential 0.001ms, parallel 0.1ms — 100x SLOWER in parallel

// ── Autoboxing pitfall ────────────────────────────────────────────────
List<Integer> boxedNumbers = new ArrayList<>();
for (int i = 0; i < 1_000_000; i++) boxedNumbers.add(i);

// SLOW: boxing/unboxing + parallel overhead cancels benefit:
long boxedSum = boxedNumbers.parallelStream()
    .mapToLong(Integer::longValue)  // unboxing 1M Integer objects
    .sum();

// FAST: no boxing at all:
long primitiveSum = LongStream.range(0, 1_000_000).parallel().sum();

// ── Poor Spliterator: LinkedList doesn't split well ────────────────────
List<Integer> linked = new LinkedList<>();
for (int i = 0; i < 100_000; i++) linked.add(i);
// parallelStream on LinkedList: splits poorly, often sequential in practice
// Fix: collect to ArrayList first if parallelism is needed:
List<Integer> arrayList2 = new ArrayList<>(linked);
long fastSum = arrayList2.parallelStream().mapToLong(Integer::longValue).sum();

// ── sorted() in parallel: requires full materialization ───────────────
// All parallel speedup is lost because sorted() must see all elements:
List<Integer> toSort = new ArrayList<>(Collections.nCopies(100_000, 0));
Collections.shuffle(toSort);

// This is NOT faster in parallel — sorted() forces a sequential step:
List<Integer> sortedParallel = toSort.parallelStream()
    .sorted()             // forces full materialization — no parallel benefit
    .collect(Collectors.toList());

// ── NQ model: when parallel helps ────────────────────────────────────
// N=10, Q=1ms per element: N*Q = 10ms → probably worth parallelizing
// N=10_000, Q=0.001ms:    N*Q = 10ms → borderline — measure
// N=10, Q=0.001ms:        N*Q = 0.01ms → sequential wins easily

// CPU-intensive computation: good parallel candidate
long[] primes = LongStream.range(2, 1_000_000).parallel()
    .filter(n -> {
        for (long i = 2; i * i <= n; i++) {
            if (n % i == 0) return false;
        }
        return true;   // expensive check — Q is high → parallel beneficial
    }).toArray();
System.out.println("Primes found: " + primes.length);

// I/O operation inside parallel: BAD — blocks common ForkJoinPool
// DON'T DO THIS:
urls.parallelStream().forEach(url -> {
    String content = downloadContent(url);   // blocks ForkJoinWorkerThread!
    process(content);                        // pool starved while waiting for I/O
});

// DO THIS INSTEAD: CompletableFuture with dedicated thread pool
ExecutorService ioPool = Executors.newFixedThreadPool(20);
List<CompletableFuture<Void>> futures = urls.stream()
    .map(url -> CompletableFuture.runAsync(() -> {
        String content = downloadContent(url);   // I/O in ioPool, not ForkJoinPool
        process(content);
    }, ioPool))
    .collect(Collectors.toList());
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
ioPool.shutdown();

Related Topics in Multithreading

Process vs Thread
A process is an independent program in execution with its own isolated memory space, file handles, and system resources, managed by the operating system and separated from all other processes by strict boundaries. A thread is a unit of execution that lives inside a process, sharing that process's memory, heap, and resources with every other thread in the same process. Java programs run inside a JVM process; the JVM itself creates and manages threads, and every Java application starts with at least one thread — the main thread — with additional threads created by the JVM for garbage collection, JIT compilation, signal handling, and other runtime tasks. Understanding the distinction between processes and threads is the foundation for all concurrent programming in Java: it determines what is shared and what is isolated, what is fast and what is expensive, what fails independently and what fails together. This entry covers the OS-level and JVM-level model of processes and threads, the memory model that follows from the shared-versus-isolated distinction, the cost model for creation and context switching, failure isolation and its consequences, inter-process and inter-thread communication mechanisms, and the practical decision of when to use multiple processes versus multiple threads.
Thread Basics
A Java thread is an instance of java.lang.Thread that represents an independent path of execution within the JVM process. Every thread has a lifecycle — from creation through runnable, running, blocked, waiting, timed-waiting, and terminated states — and a set of properties including its name, priority, daemon status, thread group, and uncaught exception handler. The Java memory model specifies what visibility guarantees exist between threads and when writes by one thread are guaranteed to be visible to another. Thread scheduling is controlled by the OS scheduler subject to hints from the JVM via thread priority; the JVM does not provide real-time scheduling guarantees. This entry covers the complete thread lifecycle and its state machine, thread properties and how they affect scheduling and JVM shutdown, the happens-before relationship and why it matters for visibility, daemon threads and their relationship to JVM shutdown, thread interruption as a cooperative cancellation mechanism, and the methods on Thread that every Java developer must understand.
Creating Threads
Java provides three primary abstractions for defining the work a thread will execute: the Thread class itself (subclassed to override run()), the Runnable interface (a task with no return value and no checked exception), and the Callable interface (a task with a return value and a declared checked exception). Each represents a different contract between the task and the infrastructure that runs it. Thread subclassing couples the task definition to the execution mechanism and is the oldest and least flexible approach. Runnable decouples the task from the thread, allowing the same Runnable to be submitted to thread pools, scheduled executors, or wrapped in Thread objects. Callable extends that decoupling to include a return value and exception propagation, returning a Future that allows the caller to retrieve the result or handle exceptions asynchronously. Understanding all three — their contracts, their limitations, and when to use each — is the foundation of concurrent programming in Java before reaching for higher-level constructs.
Thread Lifecycle
The Java thread lifecycle is the complete sequence of states a thread passes through from the moment a Thread object is constructed to the moment its execution ends. Java defines six states in the Thread.State enum — NEW, RUNNABLE, BLOCKED, WAITING, TIMED_WAITING, and TERMINATED — and the JVM transitions threads between these states in response to specific method calls, lock acquisitions, monitor notifications, timeouts, and exceptions. Each state has a precise meaning, a defined set of entry conditions, and a defined set of exit conditions. Understanding the lifecycle in full is prerequisite knowledge for diagnosing deadlocks, thread leaks, performance bottlenecks in thread dumps, and incorrect synchronization — all of which manifest as threads stuck in specific states. This entry covers every state in the lifecycle with its entry and exit conditions, all legal and illegal state transitions, how thread dumps represent each state, the interaction between lifecycle states and interruption, the effect of uncaught exceptions on lifecycle, and how to observe lifecycle transitions programmatically.