Fault Tolerance

Fault tolerance is the ability of a distributed system to continue operating correctly in the presence of partial failures. In a microservices architecture, any service, network link, or database can fail at any time. Fault-tolerant systems anticipate these failures and respond with degraded-but-functional behaviour rather than complete outages. The core techniques are timeouts, retries, circuit breakers, bulkheads, fallbacks, and graceful degradation.

Failure Modes in Microservices

Failures in a distributed system take many forms. A service may crash entirely, respond slowly, return corrupt data, or become intermittently unreachable. Each failure mode requires a different defensive strategy. Understanding the failure modes is the first step toward building a fault-tolerant system.

Java

// ── Common failure modes and their symptoms: ─────────────────────────
//
// 1. CRASH FAILURE
//    The service process dies — no response at all.
//    Symptom: java.net.ConnectException ("Connection refused") immediately.
//    Defence:  Retry (once), Circuit Breaker, Fallback.
//
// 2. OMISSION FAILURE (packet loss / network drop)
//    Request or response is silently lost.
//    Symptom: SocketTimeoutException after timeout expires.
//    Defence:  Timeout + Retry with idempotency key.
//
// 3. TIMING FAILURE (slow service)
//    Service responds eventually, but too slowly.
//    Symptom: Threads pile up waiting → caller exhausts thread pool.
//    Defence:  Timeout (fail fast), Bulkhead (limit concurrent calls).
//
// 4. RESPONSE FAILURE (wrong data)
//    Service responds with HTTP 5xx or malformed payload.
//    Symptom: FeignException, JSON parse error.
//    Defence:  Retry on 5xx, ErrorDecoder, Fallback.
//
// 5. BYZANTINE FAILURE (inconsistent behaviour)
//    Service returns different answers to different callers.
//    Symptom: Intermittent data corruption, hard to reproduce.
//    Defence:  Idempotency keys, distributed tracing, validation.

// ── Cascading failure anatomy: ────────────────────────────────────────
//
//  PaymentService: responding in 30s (timing failure)
//
//  t=0   OrderService calls PaymentService → thread blocked for 30s
//  t=10  More orders arrive → more threads blocked
//  t=20  OrderService thread pool exhausted (200/200 threads waiting)
//  t=20  New requests to OrderService → rejected (cascading to caller)
//  t=20  API Gateway calling OrderService → gateway threads start blocking
//  t=30  Entire system unresponsive — one slow service took down everything
//
// ── Defence in depth: ────────────────────────────────────────────────
//
//  Every remote call should be wrapped with ALL of the following
//  (Resilience4j annotations):
//
//  @RateLimiter    → reject if caller sends too many requests
//  @Bulkhead       → reject if too many concurrent calls in flight
//  @CircuitBreaker → reject if downstream is consistently failing
//  @Retry          → retry transient failures with backoff
//  @TimeLimiter    → fail fast if call takes too long
//                    (needs an async return type such as CompletableFuture)
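
The layering above can also be sketched in plain Java, which makes the order of the checks explicit. The following is a minimal illustration under stated assumptions, not how Resilience4j is implemented: `GuardedCall` and its parameters are hypothetical names, the backoff values are arbitrary, and the circuit breaker and rate limiter are left out for brevity. Retrying is only safe when the wrapped operation is idempotent.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helper: bulkhead + time limit + retry + fallback around one remote call. */
final class GuardedCall<T> {
    private final Semaphore bulkhead;   // caps the number of concurrent in-flight calls
    private final Duration timeLimit;   // deadline for each individual attempt
    private final int maxAttempts;      // total attempts, including the first one
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "guarded-call");
        t.setDaemon(true);              // worker threads must not keep the JVM alive
        return t;
    });

    GuardedCall(int maxConcurrent, Duration timeLimit, int maxAttempts) {
        this.bulkhead = new Semaphore(maxConcurrent);
        this.timeLimit = timeLimit;
        this.maxAttempts = maxAttempts;
    }

    T execute(Callable<T> remoteCall, Supplier<T> fallback) {
        // Bulkhead: reject immediately instead of queueing behind a slow dependency.
        if (!bulkhead.tryAcquire()) {
            return fallback.get();
        }
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> future = pool.submit(remoteCall);
                try {
                    // Time limit: fail fast rather than block the caller's thread.
                    return future.get(timeLimit.toMillis(), TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException ex) {
                    future.cancel(true);                     // interrupt the abandoned attempt
                    if (attempt < maxAttempts) {
                        sleep(50L * (1L << (attempt - 1)));  // backoff: 50 ms, 100 ms, 200 ms, ...
                    }
                } catch (InterruptedException ex) {
                    future.cancel(true);
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return fallback.get();      // every attempt failed or timed out
        } finally {
            bulkhead.release();
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A caller would create one `GuardedCall` per downstream dependency, so that a slow PaymentService can exhaust only its own permits and not the whole thread pool of OrderService.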

Timeouts

Every remote call must have a timeout. Without one, a slow downstream service blocks the caller's thread indefinitely. Set timeouts at two levels: the connection timeout (how long to wait for a TCP connection to be established) and the read timeout (how long to wait for response data once the connection is open). Keep each call's timeout shorter than the timeout of the caller waiting on it; otherwise the caller gives up first and the downstream work completes for nobody.

Java

// ── RestTemplate timeouts: ───────────────────────────────────────────
@Configuration
public class RestTemplateConfig {

    @Bean
    @LoadBalanced
    public RestTemplate restTemplate() {
        HttpComponentsClientHttpRequestFactory factory =
            new HttpComponentsClientHttpRequestFactory();
        factory.setConnectTimeout(1_000);   // 1s to establish connection
        factory.setReadTimeout(3_000);      // 3s to receive response
        return new RestTemplate(factory);
    }
}

// ── WebClient timeouts: ───────────────────────────────────────────────
@Configuration
public class WebClientConfig {

    @Bean
    @LoadBalanced
    public WebClient.Builder webClientBuilder() {
        HttpClient httpClient = HttpClient.create()
            .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 1_000)
            .responseTimeout(Duration.ofSeconds(3))
            .doOnConnected(conn -> conn
                .addHandlerLast(new ReadTimeoutHandler(3))      // seconds
                .addHandlerLast(new WriteTimeoutHandler(1)));   // seconds

        return WebClient.builder()
            .clientConnector(new ReactorClientHttpConnector(httpClient));
    }
}

// ── OpenFeign timeouts (application.yml): ────────────────────────────
// spring:
//   cloud:
//     openfeign:
//       client:
//         config:
//           default:
//             connect-timeout: 1000    # ms
//             read-timeout: 3000       # ms
//           payment-service:           # per-client override
//             connect-timeout: 500
//             read-timeout: 5000       # payments need more time

// ── Nested timeout rule: ──────────────────────────────────────────────
//
//  Client        Gateway        OrderService     PaymentService
//  timeout: 10s  timeout: 8s   timeout: 6s      timeout: 4s
//
//  Each level's timeout < its caller's timeout.
//  If PaymentService times out at 4s:
//    OrderService gets the error at 4s (within its 6s budget)
//    OrderService returns fallback to Gateway within 6s budget
//    Gateway returns to Client within 8s budget
//
//  If PaymentService timeout > OrderService timeout:
//    OrderService times out at 6s waiting for PaymentService
//    PaymentService call is still running (wasted resources)
//    PaymentService finally responds at 8s — nobody is listening

Fallback Strategies

A fallback is the response returned when a primary call fails or the circuit is open. A good fallback provides the best possible user experience given the failure — it should degrade gracefully rather than simply returning an error. There are several fallback strategies depending on the use case.

Java

// ── Fallback strategy 1: Default / empty value ───────────────────────
// Use when: the data is supplementary and absence is acceptable.
@CircuitBreaker(name = "recommendationService",
                fallbackMethod = "emptyRecommendations")
public List<ProductResponse> getRecommendations(Long userId) {
    return recommendationClient.forUser(userId);
}

private List<ProductResponse> emptyRecommendations(
        Long userId, Throwable ex) {
    return Collections.emptyList();   // page renders without recommendations
}

// ── Fallback strategy 2: Cached data ─────────────────────────────────
// Use when: slightly stale data is acceptable (prices, catalogues).
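@Slf4j   // Lombok: provides the `log` field used in the fallback
@Service
@RequiredArgsConstructor
public class ProductService {

    private final ProductClient productClient;
    private final CacheManager cacheManager;

    // @CachePut always calls the remote service and stores each successful
    // result, so the fallback has a last known value to serve. With
    // @Cacheable, a cache hit would skip the call entirely and a miss would
    // leave nothing for the fallback to find.
    @CircuitBreaker(name = "productService",
                    fallbackMethod = "cachedProduct")
    @CachePut("products")
    public ProductResponse findById(Long id) {
        return productClient.findById(id);
    }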

    private ProductResponse cachedProduct(Long id, Throwable ex) {
        Cache cache = cacheManager.getCache("products");
        ProductResponse cached = cache != null
            ? cache.get(id, ProductResponse.class) : null;
        if (cached != null) {
            log.warn("Returning cached product {} due to: {}",
                id, ex.getMessage());
            return cached;
        }
        throw new ServiceUnavailableException(
            "Product service unavailable and no cache available");
    }
}

// ── Fallback strategy 3: Deferred / async processing ─────────────────
// Use when: the operation can be queued and completed later.
@CircuitBreaker(name = "paymentService",
                fallbackMethod = "deferPayment")
public PaymentResponse processPayment(PaymentRequest request) {
    return paymentClient.charge(request);
}

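private PaymentResponse deferPayment(
        PaymentRequest request, Throwable ex) {
    // Caution: if the call timed out, the charge may already have gone
    // through, so the queue consumer must deduplicate (e.g. by order ID).
    // Queue for async processing when service recovers:
    paymentQueue.enqueue(request);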
    return PaymentResponse.builder()
        .orderId(request.getOrderId())
        .status(PaymentStatus.PENDING)
        .message("Payment queued — you will be notified when processed")
        .build();
}

// ── Fallback strategy 4: Static / hardcoded response ─────────────────
// Use when: the service provides non-critical enrichment data.
@CircuitBreaker(name = "weatherService",
                fallbackMethod = "defaultWeather")
public WeatherResponse getWeather(String city) {
    return weatherClient.current(city);
}

private WeatherResponse defaultWeather(String city, Throwable ex) {
    return WeatherResponse.builder()
        .city(city)
        .description("Weather data temporarily unavailable")
        .build();
}
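
The four strategies can also be combined into a chain: try the live service, then the cache, then a static default. The sketch below shows the idea in plain Java under stated assumptions; `FallbackChain` and `firstSuccessful` are hypothetical names and are not part of Resilience4j, which expresses the same thing through `fallbackMethod`.

```java
import java.util.function.Supplier;

/** Hypothetical helper: returns the result of the first source that does not throw. */
final class FallbackChain {
    private FallbackChain() {}

    /**
     * Tries each source in the order given. Sources should be ordered from
     * freshest (live call) to most degraded (static default).
     */
    @SafeVarargs
    static <T> T firstSuccessful(Supplier<T>... sources) {
        RuntimeException last = null;
        for (Supplier<T> source : sources) {
            try {
                return source.get();
            } catch (RuntimeException ex) {
                last = ex;   // remember the failure and move on to the next source
            }
        }
        // Nothing worked: surface the most recent failure as the cause.
        throw new IllegalStateException("All sources failed", last);
    }
}
```

Ending the chain with a source that cannot fail, such as a constant, guarantees the caller always receives a value; leaving it out lets the failure propagate for features that must not silently degrade.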

Graceful Degradation

Graceful degradation means the system continues to serve users with reduced functionality when a dependency fails, rather than failing completely. It requires identifying which features are critical (must work) and which are optional (can be hidden or replaced with a placeholder when their backing service is down).

Java

// ── Classify features by criticality: ────────────────────────────────
//
// CRITICAL (system unusable without these):
//   User login / authentication
//   Product catalogue browsing
//   Add to cart
//   Checkout / order placement
//
// NON-CRITICAL (can degrade gracefully):
//   Personalised recommendations
//   Product reviews and ratings
//   Live stock count ("Only 3 left!")
//   Loyalty points balance
//   Real-time shipping tracking
//   Weather widget

// ── E-commerce page with graceful degradation: ───────────────────────
@Slf4j   // Lombok: provides the `log` field used below
@Service
@RequiredArgsConstructor
public class ProductPageService {

    private final ProductClient       productClient;
    private final ReviewClient        reviewClient;
    private final RecommendationClient recommendationClient;
    private final InventoryClient     inventoryClient;

    public ProductPageResponse buildPage(Long productId, Long userId) {
        // CRITICAL — must succeed; exception propagates if it fails:
        ProductResponse product = productClient.findById(productId);

        // NON-CRITICAL — each wrapped independently; failures return defaults:
        List<ReviewResponse> reviews    = fetchReviewsSafely(productId);
        List<ProductResponse> recs      = fetchRecommendationsSafely(userId);
        Integer stockCount              = fetchStockSafely(productId);

        return ProductPageResponse.builder()
            .product(product)           // always present
            .reviews(reviews)           // empty list if service down
            .recommendations(recs)      // empty list if service down
            .stockCount(stockCount)     // null = "Check availability"
            .build();
    }

    private List<ReviewResponse> fetchReviewsSafely(Long productId) {
        try {
            return reviewClient.forProduct(productId);
        } catch (Exception ex) {
            log.warn("ReviewService unavailable for product {}: {}",
                productId, ex.getMessage());
            return Collections.emptyList();
        }
    }

    private List<ProductResponse> fetchRecommendationsSafely(Long userId) {
        try {
            return recommendationClient.forUser(userId);
        } catch (Exception ex) {
            log.warn("RecommendationService unavailable: {}", ex.getMessage());
            return Collections.emptyList();
        }
    }

    private Integer fetchStockSafely(Long productId) {
        try {
            return inventoryClient.stockCount(productId);
        } catch (Exception ex) {
            log.warn("InventoryService unavailable: {}", ex.getMessage());
            return null;   // UI shows "Check availability" instead
        }
    }
}
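
In the service above, the non-critical calls run one after another, so a slow dependency still delays the whole page by its full timeout. A possible refinement is to start the optional calls in parallel, each with its own time limit and default value. The sketch below uses only `CompletableFuture` from the JDK (Java 9 or later for `completeOnTimeout`); `Degrading`, `PageSketch`, and the 300 ms limit are hypothetical choices for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper for optional (non-critical) calls. */
final class Degrading {
    private Degrading() {}

    /** Runs the call asynchronously; yields defaultValue if it fails or exceeds timeoutMillis. */
    static <T> CompletableFuture<T> optional(Supplier<T> call, T defaultValue, long timeoutMillis) {
        return CompletableFuture.supplyAsync(call)
            .completeOnTimeout(defaultValue, timeoutMillis, TimeUnit.MILLISECONDS)
            .exceptionally(ex -> defaultValue);
    }
}

/** Hypothetical page assembly that uses the helper for two optional sections. */
final class PageSketch {
    private PageSketch() {}

    static String render(Supplier<List<String>> reviewCall, Supplier<Integer> stockCall) {
        // Both calls start immediately and run in parallel.
        CompletableFuture<List<String>> reviews = Degrading.optional(reviewCall, List.of(), 300);
        CompletableFuture<Integer> stock = Degrading.optional(stockCall, null, 300);

        // join() cannot throw here: every failure path already maps to a default.
        Integer count = stock.join();
        return "reviews=" + reviews.join().size()
            + ", stock=" + (count == null ? "Check availability" : count);
    }
}
```

The total wait is then bounded by roughly the longest single time limit rather than the sum of all of them. The critical product lookup would stay outside this helper so that its failure still propagates.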

Health Checks and Readiness Probes

Fault tolerance also requires infrastructure-level health checking. Spring Boot Actuator exposes /actuator/health, with separate liveness and readiness probes at /actuator/health/liveness and /actuator/health/readiness. Kubernetes uses these to decide whether to route traffic to a pod (readiness) and whether to restart it (liveness). A service that is alive but not yet ready (for example, still warming up caches) should pass the liveness probe but fail the readiness probe, so that it receives no traffic and is not restarted.

Java

// ── application.yml — expose liveness and readiness separately: ───────
// management:
//   endpoint:
//     health:
//       probes:
//         enabled: true      # enables /actuator/health/liveness
//                            #         /actuator/health/readiness
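//       show-details: always
//       group:
//         readiness:         # custom indicators must be listed to affect readiness
//           include: readinessState,cacheReadiness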
//   health:
//     livenessstate:
//       enabled: true
//     readinessstate:
//       enabled: true

// ── Custom readiness indicator (not ready until cache is warm): ───────
// It only affects /actuator/health/readiness when it is listed in the
// readiness group, e.g.
//   management.endpoint.health.group.readiness.include: readinessState,cacheReadiness
@Slf4j   // Lombok: provides the `log` field
@Component("cacheReadiness")
@RequiredArgsConstructor
public class CacheWarmupReadinessIndicator
        implements HealthIndicator, ApplicationListener<ApplicationReadyEvent> {

    private volatile boolean ready = false;
    private final ProductCacheService cacheService;

    @Override
    public void onApplicationEvent(ApplicationReadyEvent event) {
        try {
            cacheService.warmUp();    // load critical data into cache
            ready = true;
            log.info("Cache warm-up complete, service is READY");
        } catch (Exception ex) {
            log.error("Cache warm-up failed, service NOT READY", ex);
        }
    }

    @Override
    public Health health() {
        return ready
            ? Health.up().withDetail("cache", "warmed up").build()
            : Health.down().withDetail("cache", "warming up").build();
    }
}

// ── Kubernetes deployment probes: ────────────────────────────────────
// apiVersion: apps/v1
// kind: Deployment
// spec:
//   template:
//     spec:
//       containers:
//         - name: order-service
//           livenessProbe:
//             httpGet:
//               path: /actuator/health/liveness
//               port: 8080
//             initialDelaySeconds: 30
//             periodSeconds: 10
//             failureThreshold: 3    # restart after 3 consecutive failures
//
//           readinessProbe:
//             httpGet:
//               path: /actuator/health/readiness
//               port: 8080
//             initialDelaySeconds: 15
//             periodSeconds: 5
//             failureThreshold: 3    # remove from load balancer after 3 failures
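
The contract between the two probes can be reduced to a small state holder. The sketch below is an illustration only, with hypothetical names (`ProbeState`, `markReady`); Spring Boot Actuator performs this mapping itself, so nothing like it needs to be written in a real service.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical probe state: live from startup, ready only once warm-up has finished. */
final class ProbeState {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    void markReady()    { ready.set(true); }
    void markNotReady() { ready.set(false); }

    /** Liveness: 200 for as long as the process is able to answer at all. */
    int livenessStatus() {
        return 200;
    }

    /** Readiness: 503 asks the orchestrator to withhold traffic without restarting the pod. */
    int readinessStatus() {
        return ready.get() ? 200 : 503;
    }
}
```

Keeping the two answers independent is the point: a pod that is still warming up, or that has temporarily lost a dependency, fails readiness while liveness stays green, so Kubernetes stops routing to it but does not kill it.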
