Circuit Breaker

The Circuit Breaker pattern prevents cascading failures in a distributed system by wrapping remote calls with a state machine that monitors failure rates. When failures exceed a threshold the circuit opens and subsequent calls immediately return a fallback response — without attempting the failing call — giving the downstream service time to recover. The pattern was popularised by Michael Nygard in Release It! and is named after the electrical circuit breaker.

Why Circuit Breaker Is Needed

In a microservices system, services call each other over the network. If a downstream service slows down or becomes unavailable, the calling service's threads block waiting for a response. With enough slow calls the caller's thread pool exhausts, it stops serving its own clients, and the failure spreads up the call chain — a cascading failure. The circuit breaker breaks this chain by failing fast once a failure threshold is crossed.

Java

// ── Cascading failure without a circuit breaker: ─────────────────────
//
//  Client → OrderService → PaymentService (responding in 30s / timing out)
//
//  Timeline:
//    t=0s   Request 1  → OrderService allocates thread → waits for Payment
//    t=1s   Request 2  → OrderService allocates thread → waits for Payment
//    ...
//    t=20s  Request 20 → OrderService thread pool FULL
//    t=20s  All new requests to OrderService → rejected with 503
//    t=20s  UserService calling OrderService → starts failing too
//    t=25s  Gateway calling UserService → starts failing
//
//  One slow PaymentService has taken down the entire system.

// ── With a circuit breaker: ───────────────────────────────────────────
//
//  t=0s   Requests 1–10 fail (PaymentService is down)
//  t=10s  Circuit OPENS — failure rate > threshold
//  t=10s  Requests 11–100 → immediate fallback, no thread blocked
//         "Payment service unavailable, will retry later"
//  t=20s  Circuit moves to HALF-OPEN: the next 3 real requests pass through as probes
//  t=20s  PaymentService recovered → probe requests succeed
//  t=20s  Circuit CLOSES — normal operation resumes
//
//  OrderService threads are never exhausted.
//  Failure is contained — UserService and Gateway unaffected.

// ── What fail-fast looks like to the caller: ─────────────────────────
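import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    // PaymentClient is assumed to be the application's own client bean for PaymentService.
    private final PaymentClient paymentClient;

    public OrderService(PaymentClient paymentClient) {
        this.paymentClient = paymentClient;
    }

    @CircuitBreaker(name = "payment", fallbackMethod = "paymentFallback")
    public PaymentResponse processPayment(PaymentRequest request) {
        return paymentClient.charge(request);  // may fail
    }

    // Invoked whenever processPayment throws. While the circuit is open the cause
    // is CallNotPermittedException and no network call is made:
    private PaymentResponse paymentFallback(PaymentRequest request, Throwable ex) {
        boolean circuitOpen = ex instanceof CallNotPermittedException;
        log.warn("Payment fallback (circuit open: {}). Cause: {}", circuitOpen, ex.getMessage());
        return PaymentResponse.pending(request.getOrderId());
    }
}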

Circuit Breaker State Machine

A circuit breaker is a state machine with three states. CLOSED is normal operation — calls pass through and outcomes are recorded. OPEN means the failure threshold has been breached — calls immediately return the fallback without touching the network. HALF-OPEN is a recovery probe — a limited number of calls are allowed through to test whether the downstream service has recovered.
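The three states and their transitions can be sketched in plain Java. This is a minimal, single-threaded illustration and not how Resilience4j implements it: the class name `CircuitBreakerSketch`, its constructor parameters, and the injected clock are all assumptions made for the example, and the window here is a simple count-based one.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Illustrative three-state breaker. Single-threaded; not Resilience4j's implementation. */
class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;           // calls evaluated while CLOSED
    private final double failureThreshold;  // 0.5 means 50 %
    private final long waitMillis;          // time to stay OPEN before probing
    private final int probeCalls;           // calls evaluated while HALF_OPEN
    private final LongSupplier clock;       // injected so time can be controlled

    private final Deque<Boolean> outcomes = new ArrayDeque<>();  // true = failed call
    private State state = State.CLOSED;
    private long openedAt;

    CircuitBreakerSketch(int windowSize, double failureThreshold,
                         long waitMillis, int probeCalls, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureThreshold = failureThreshold;
        this.waitMillis = waitMillis;
        this.probeCalls = probeCalls;
        this.clock = clock;
    }

    /** Current state; OPEN becomes HALF_OPEN once the wait duration has elapsed. */
    State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= waitMillis) {
            transitionTo(State.HALF_OPEN);
        }
        return state;
    }

    <T> T call(Supplier<T> remote, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();          // fail fast: remote is never invoked
        }
        try {
            T result = remote.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    private void record(boolean failed) {
        int limit = (state == State.HALF_OPEN) ? probeCalls : windowSize;
        outcomes.addLast(failed);
        if (outcomes.size() > limit) outcomes.removeFirst();
        if (outcomes.size() < limit) return;            // not enough calls recorded yet
        long failures = outcomes.stream().filter(f -> f).count();
        if ((double) failures / limit >= failureThreshold) {
            transitionTo(State.OPEN);
        } else if (state == State.HALF_OPEN) {
            transitionTo(State.CLOSED);
        }
    }

    private void transitionTo(State next) {
        state = next;
        outcomes.clear();
        if (next == State.OPEN) openedAt = clock.getAsLong();
    }

    public static void main(String[] args) {
        long[] now = {0};
        CircuitBreakerSketch breaker = new CircuitBreakerSketch(4, 0.5, 1_000, 2, () -> now[0]);
        Supplier<String> failing = () -> { throw new IllegalStateException("payment down"); };
        for (int i = 0; i < 4; i++) breaker.call(failing, () -> "pending");
        System.out.println(breaker.state());   // OPEN
        now[0] = 1_000;                         // wait duration elapses
        System.out.println(breaker.state());   // HALF_OPEN
        breaker.call(() -> "paid", () -> "pending");
        breaker.call(() -> "paid", () -> "pending");
        System.out.println(breaker.state());   // CLOSED
    }
}
```

Injecting the clock keeps the example deterministic. A production breaker would also need thread safety and a limit on concurrent probe calls, both of which are left out here.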

Java

// ── State machine diagram: ────────────────────────────────────────────
//
//                    failure rate >= threshold
//          CLOSED ─────────────────────────────────▶ OPEN
//            ▲                                         │
//            │                                         │ wait duration expires
//            │                                         ▼
//            │                                     HALF-OPEN
//            │          probe calls succeed            │
//            └─────────────────────────────────────────┘
//                                                      │
//            OPEN ◀────────────────────────────────────┘
//                       probe calls fail

// ── CLOSED state: ────────────────────────────────────────────────────
// • All calls pass through to the downstream service.
// • Each call outcome (success / failure / timeout) is recorded
//   in a sliding window (count-based or time-based).
// • When failure rate >= threshold → transition to OPEN.
// • When slow call rate >= threshold → transition to OPEN.
// • Both rates are evaluated only after minimumNumberOfCalls have been recorded.

// ── OPEN state: ──────────────────────────────────────────────────────
// • All calls are short-circuited immediately.
// • CallNotPermittedException is thrown (triggers fallback).
// • No network call is attempted — fail-fast.
// • After waitDurationInOpenState expires → transition to HALF-OPEN.

// ── HALF-OPEN state: ─────────────────────────────────────────────────
// • A limited number of probe calls are allowed through.
//   (permittedNumberOfCallsInHalfOpenState — default: 10)
// • Once the probe calls complete, their failure rate is compared with the threshold:
//   below the threshold     → transition to CLOSED (recovered).
//   at or above the threshold → transition back to OPEN (still broken).

// ── Sliding window types: ─────────────────────────────────────────────
//
// COUNT_BASED (default):
//   Records the last N calls.
//   sliding-window-size: 10 → evaluate the last 10 calls.
//   Failure rate = failures / 10.
//
// TIME_BASED:
//   Records all calls in the last N seconds.
//   sliding-window-size: 10 → evaluate all calls in the last 10 seconds.
//   Failure rate = failures / total calls in window.
//
// COUNT_BASED is simpler and more predictable.
// TIME_BASED is better for variable traffic (avoids stale windows).

Circuit Breaker vs Retry vs Timeout

Circuit breaker, retry, and timeout are complementary resilience patterns — not alternatives. They operate at different layers of failure handling. Applying them in the wrong order or combining them incorrectly creates problems: retrying inside an open circuit defeats the purpose of failing fast; retrying without a timeout can block indefinitely.
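The effect of decorator order can be shown with two hand-written wrappers. This is a sketch under simplifying assumptions: `withBreaker` and `withRetry` are hypothetical stand-ins rather than Resilience4j APIs, the breaker is a plain flag, and the retry has no backoff. The counter records how many times the breaker is consulted per incoming call.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Illustrates how nesting order changes behaviour when the circuit is open. */
class DecoratorOrderSketch {
    static class CircuitOpenException extends RuntimeException {
        CircuitOpenException() { super("circuit is open"); }
    }

    /** Stand-in breaker: counts each check and rejects the call while 'open' is true. */
    static <T> Supplier<T> withBreaker(boolean open, AtomicInteger checks, Supplier<T> inner) {
        return () -> {
            checks.incrementAndGet();
            if (open) throw new CircuitOpenException();
            return inner.get();
        };
    }

    /** Stand-in retry: re-invokes 'inner' on any RuntimeException, without backoff. */
    static <T> Supplier<T> withRetry(int maxAttempts, Supplier<T> inner) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        return () -> {
            RuntimeException last = null;
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last;
        };
    }

    public static void main(String[] args) {
        Supplier<String> remote = () -> "paid";

        // Breaker outermost: one check, retry is never reached.
        AtomicInteger breakerOutside = new AtomicInteger();
        try {
            withBreaker(true, breakerOutside, withRetry(3, remote)).get();
        } catch (CircuitOpenException expected) {
            System.out.println(breakerOutside.get());  // 1
        }

        // Retry outermost: every attempt hits the open breaker.
        AtomicInteger retryOutside = new AtomicInteger();
        try {
            withRetry(3, withBreaker(true, retryOutside, remote)).get();
        } catch (CircuitOpenException expected) {
            System.out.println(retryOutside.get());    // 3
        }
    }
}
```

With a real retry the second arrangement would also sit through the backoff between each rejected attempt, so the caller reaches its fallback later even though no network call is made.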

Java

// ── Each pattern's responsibility: ───────────────────────────────────
//
// TIMEOUT   → "Don't wait forever for a response."
//             Fail a call that takes longer than N milliseconds.
//             Prevents thread exhaustion from slow services.
//
// RETRY     → "Try again — this might be a transient blip."
//             Useful for network hiccups, brief unavailability.
//             Should only retry idempotent operations (GET, PUT).
//             Must have a backoff strategy to avoid thundering herd.
//
// CIRCUIT BREAKER → "Stop trying — the service is down."
//             Useful when failures are sustained, not transient.
//             Protects the caller's thread pool from exhaustion.
//             Gives the downstream service breathing room to recover.

// ── Correct combination order: ────────────────────────────────────────
//
//  Incoming call
//       │
//       ▼
//  [CircuitBreaker] — is circuit open? → fail fast with fallback
//       │ (circuit closed)
//       ▼
//  [Retry] — attempt the call, retry up to N times on transient error
//       │
//       ▼
//  [Timeout] — fail if a single attempt exceeds N milliseconds
//       │
//       ▼
//  Remote service call
//
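//  Resilience4j annotations: nesting is decided by aspect order, not by the
//  order in which the annotations are written. The Spring Boot default is
//  Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( call ) ) ) ) ),
//  so Retry is outermost by default. To get the order shown above, change the
//  circuitBreakerAspectOrder and retryAspectOrder properties.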

// ── Anti-patterns: ────────────────────────────────────────────────────
//
// BAD — retrying inside open circuit:
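//   If Retry wraps the breaker and the circuit is open, every attempt is
//   rejected with CallNotPermittedException. No network call is made, but
//   the caller waits through the backoff before it reaches the fallback.
//   Fix: make CircuitBreaker the outermost decorator, or configure Retry
//   to ignore CallNotPermittedException.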
//
// BAD — retry without timeout:
//   Each attempt can hang for as long as the underlying client allows.
//   3 attempts × 30s client timeout = up to 90s of blocking.
//   Fix: set a short per-attempt timeout and budget for
//   attempts × timeout + backoff waits in total.
//
// BAD — retry on non-idempotent calls (POST):
//   Retrying a payment charge creates duplicate charges.
//   Fix: only retry idempotent operations (GET, HEAD, PUT) or use idempotency keys.
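The blocking-time arithmetic from the retry anti-pattern above can be worked through in a few lines. This is a back-of-the-envelope sketch, assuming exponential backoff with a whole-number multiplier and no jitter; `RetryBudgetSketch` and its method names are made up for the example.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Worst-case blocking time for a retry with a per-attempt timeout and exponential backoff. */
class RetryBudgetSketch {

    /** Waits between attempts: initial, initial * multiplier, ... (maxAttempts - 1 entries). */
    static List<Duration> backoffSchedule(int maxAttempts, Duration initial, long multiplier) {
        List<Duration> waits = new ArrayList<>();
        Duration wait = initial;
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            waits.add(wait);
            wait = wait.multipliedBy(multiplier);
        }
        return waits;
    }

    /** Every attempt runs into its timeout, and every backoff wait is served in full. */
    static Duration worstCase(int maxAttempts, Duration perAttemptTimeout,
                              Duration initialBackoff, long multiplier) {
        Duration total = perAttemptTimeout.multipliedBy(maxAttempts);
        for (Duration wait : backoffSchedule(maxAttempts, initialBackoff, multiplier)) {
            total = total.plus(wait);
        }
        return total;
    }

    public static void main(String[] args) {
        // 3 attempts at 30s each, no backoff: the 90s case from the anti-pattern.
        System.out.println(worstCase(3, Duration.ofSeconds(30), Duration.ZERO, 1).toMillis());      // 90000

        // 3 attempts at 2s each, 500ms initial backoff doubling: 6000 + 500 + 1000.
        System.out.println(worstCase(3, Duration.ofSeconds(2), Duration.ofMillis(500), 2).toMillis()); // 7500
    }
}
```

Comparing the result with the caller's own request timeout is a quick way to check whether a retry configuration can finish before the caller gives up.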

Bulkhead Pattern

A bulkhead (named after the watertight compartments of a ship's hull) isolates failures by capping the number of concurrent calls to each downstream service. Without a bulkhead, a single slow dependency can consume every thread in the application. With a bulkhead, each dependency gets its own limit on threads or concurrent calls, so a slow dependency can exhaust only its own allowance and calls to other services keep flowing.

Java

// ── Without bulkhead: ────────────────────────────────────────────────
//
//  Application thread pool: 200 threads
//
//  PaymentService is slow (30s response time):
//    100 threads waiting for PaymentService
//    100 threads left for OrderService, UserService, etc.
//
//    More payment calls arrive:
//    200 threads waiting for PaymentService → ALL threads exhausted
//    OrderService, UserService, Gateway → all start failing
//
// ── With bulkhead (SemaphoreBulkhead): ───────────────────────────────
//
//  PaymentService bulkhead: max 20 concurrent calls
//  OrderService  bulkhead: max 30 concurrent calls
//
//  100 payment requests arrive simultaneously:
//    20 pass through (bulkhead limit)
//    80 immediately receive BulkheadFullException → fallback
//    OrderService threads are completely unaffected.

// ── Two bulkhead types in Resilience4j: ───────────────────────────────
//
// SemaphoreBulkhead:
//   Uses a semaphore to limit concurrent calls.
//   Runs in the same thread — lightweight.
//   Suitable for most use cases.
//
// ThreadPoolBulkhead:
//   Executes calls in a separate, dedicated thread pool.
//   The calling thread is freed immediately.
//   Provides true isolation at the cost of thread overhead.
//   The decorated method must return a CompletableFuture.
//   Reactive (Reactor/RxJava) pipelines use the semaphore bulkhead instead.

// ── Bulkhead configuration (application.yml): ────────────────────────
// resilience4j:
//   bulkhead:
//     instances:
//       paymentService:
//         max-concurrent-calls: 20     # max simultaneous calls
//         max-wait-duration: 100ms     # wait before rejecting (0 = fail fast)
//
//   thread-pool-bulkhead:
//     instances:
//       inventoryService:
//         max-thread-pool-size: 10
//         core-thread-pool-size: 5
//         queue-capacity: 20
//         keep-alive-duration: 20ms
