Bulkhead Pattern for Fault Isolation
The bulkhead pattern divides a system into isolated compartments, each with its own dedicated resources. The name comes from the watertight compartments in a ship’s hull: if one section floods, the sealed bulkheads prevent water from reaching other sections. Applied to software, this means a failure in one component cannot consume the resources that other components depend on.
How It Works
Without isolation, all components share a single resource pool. One misbehaving component can starve the rest:
    Shared pool (10 threads):
      Service A (healthy): needs 3 threads
      Service B (failing): consumes all 10 threads waiting on timeouts
      Service C (healthy): 0 threads available → also fails
With bulkheads, each component gets its own allocation:
    Bulkhead A (4 threads): Service A uses 3, 1 idle
    Bulkhead B (3 threads): Service B saturates its 3, blocks
    Bulkhead C (3 threads): Service C uses 2, 1 idle → unaffected
Types of Bulkheads
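Thread pool isolation assigns separate thread pools to different operations. A slow external call blocks only its own pool. In an event-loop runtime such as Node.js there are no per-operation thread pools to size, so the same effect comes from a semaphore that caps the number of concurrent calls per operation.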
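The snippet that follows uses a `Semaphore`, which JavaScript does not provide out of the box. A minimal sketch of one, assuming only the interface that snippet relies on (`acquire()` resolves to a permit object with a `release()` method), might look like this:

```javascript
// Minimal counting semaphore (illustrative sketch, not a library API).
class Semaphore {
  constructor(limit) {
    this.available = limit; // permits not currently held
    this.waiters = [];      // callbacks for callers waiting on a permit, FIFO
  }

  // Resolves with a permit as soon as one is free.
  acquire() {
    return new Promise((resolve) => {
      const grant = () => resolve(this.createPermit());
      if (this.available > 0) {
        this.available -= 1;
        grant();
      } else {
        this.waiters.push(grant);
      }
    });
  }

  createPermit() {
    let released = false;
    return {
      release: () => {
        if (released) return; // releasing twice is a no-op
        released = true;
        const next = this.waiters.shift();
        if (next) {
          next(); // hand the permit directly to the next waiter
        } else {
          this.available += 1;
        }
      },
    };
  }
}
```

Waiters are served in arrival order, and a released permit goes straight to the next waiter, so the number of permits held never exceeds the configured limit.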
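    // Semaphore is not a JavaScript built-in; this assumes a counting semaphore
    // whose acquire() resolves to a permit object exposing release().
    const bulkheads = {
      payments: new Semaphore(5),  // Max 5 concurrent payment calls
      emails: new Semaphore(10),   // Max 10 concurrent email sends
      search: new Semaphore(8),    // Max 8 concurrent search queries
    };

    async function withBulkhead(name, fn) {
      // Wait for a permit from this operation's own pool only
      const permit = await bulkheads[name].acquire();
      try {
        return await fn();
      } finally {
        // Always return the permit, even if fn() throws
        permit.release();
      }
    }

    // Payment service failure cannot starve email or search
    await withBulkhead('payments', () => chargeCard(order));
Process isolation runs different workloads in separate processes or containers. A memory leak in one process is confined to that process (and, where container memory limits are set, to that container's quota) instead of exhausting memory for the others.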
Queue isolation assigns different task types to separate queues with dedicated workers. A spike in one task type cannot delay processing of another.
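Queue isolation can be sketched in-process as follows. `IsolatedQueue`, the queue names, and the worker counts are hypothetical; a production system would more likely use a message broker with separate queues and consumers, which this sketch does not model.

```javascript
// Each task type gets its own backlog and its own fixed number of worker slots.
class IsolatedQueue {
  constructor(workers) {
    this.workers = workers; // dedicated worker slots for this queue
    this.active = 0;        // tasks currently running
    this.backlog = [];      // tasks waiting for a free slot
  }

  // Enqueue a task (a function returning a value or a promise).
  push(task) {
    return new Promise((resolve, reject) => {
      this.backlog.push({ task, resolve, reject });
      this.drain();
    });
  }

  // Start backlog tasks while this queue has free worker slots.
  drain() {
    while (this.active < this.workers && this.backlog.length > 0) {
      const { task, resolve, reject } = this.backlog.shift();
      this.active += 1;
      Promise.resolve()
        .then(task)
        .then(resolve, reject)
        .finally(() => {
          this.active -= 1;
          this.drain();
        });
    }
  }
}

// Hypothetical task types: a flood of report jobs only lengthens
// queues.reports.backlog; thumbnail jobs keep their two workers.
const queues = {
  thumbnails: new IsolatedQueue(2),
  reports: new IsolatedQueue(1),
};
```

Because each queue tracks its own `active` count, a backlog in `reports` never takes a worker slot away from `thumbnails`.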
Sizing Bulkheads
Setting the right limits requires balancing two risks:
- Too large: The bulkhead allows one component to consume too many resources before hitting its cap
- Too small: Normal traffic gets throttled because the allocation is insufficient
Start by measuring each component’s typical and peak resource usage. Set the bulkhead limit above peak but below the level that would harm other components. Revisit these numbers as traffic patterns evolve.
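One way to turn those measurements into a starting number is Little's law, which estimates concurrency as arrival rate multiplied by time per request. The helper below is a sketch under that assumption; `suggestBulkheadLimit`, its parameters, and the 1.25 headroom factor are illustrative choices, not a prescribed formula.

```javascript
// Estimate a bulkhead limit from measured peak traffic.
// sharedBudget is the most this component may take without harming the others.
function suggestBulkheadLimit({ peakRequestsPerSecond, latencyMs, headroom = 1.25, sharedBudget }) {
  // Little's law: concurrent calls ≈ arrival rate × time each call takes
  const peakConcurrency = (peakRequestsPerSecond * latencyMs) / 1000;
  const wanted = Math.ceil(peakConcurrency * headroom); // sit above measured peak
  return {
    limit: Math.min(wanted, sharedBudget), // but never above the shared budget
    clamped: wanted > sharedBudget,        // true means peak traffic will be throttled
  };
}

// 40 requests/s at 100 ms each is about 4 concurrent calls; with headroom, 5.
console.log(suggestBulkheadLimit({ peakRequestsPerSecond: 40, latencyMs: 100, sharedBudget: 8 }));
// → { limit: 5, clamped: false }
```

A `clamped: true` result signals that the two risks conflict: the component needs more than it can safely be given, so either its budget or its latency has to change.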
When to Use the Bulkhead Pattern
- Services calling multiple external APIs with varying reliability
- Multi-tenant systems where one tenant’s activity should not affect others
- Worker pools processing mixed task types with different resource profiles
- Any architecture where a single slow dependency has caused system-wide outages before
Bulkheads and Circuit Breakers
Bulkheads and circuit breakers complement each other. Bulkheads contain the blast radius of a failure by capping how many resources a failing component can hold. Circuit breakers detect repeated failures and stop sending requests to the failing dependency until it has had time to recover. Together, they form a layered defense: the bulkhead prevents resource starvation, while the circuit breaker stops calls that are likely to fail from being attempted at all.
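How the two layers might compose can be sketched as follows. Everything here is hypothetical and simplified: `Bulkhead`, `CircuitBreaker`, `protect`, and the thresholds are illustrative, and this bulkhead rejects immediately when full instead of making callers wait.

```javascript
class BulkheadFullError extends Error {}
class CircuitOpenError extends Error {}

// Fail-fast bulkhead: at most `limit` calls in flight, the rest are rejected.
class Bulkhead {
  constructor(limit) {
    this.limit = limit;
    this.inFlight = 0;
  }
  async run(fn) {
    if (this.inFlight >= this.limit) throw new BulkheadFullError('bulkhead full');
    this.inFlight += 1;
    try {
      return await fn();
    } finally {
      this.inFlight -= 1;
    }
  }
}

// Opens after `failureThreshold` consecutive failures, then skips calls
// until `resetAfterMs` has passed. `now` is injectable for testing.
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetAfterMs = 10000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }
  async run(fn) {
    if (this.openedAt !== null && this.now() - this.openedAt < this.resetAfterMs) {
      throw new CircuitOpenError('circuit open: call skipped');
    }
    // Closed, or cool-down elapsed: let the call through as a trial.
    // (Simplification: several trial calls may run at once.)
    try {
      const result = await fn();
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      throw err;
    }
  }
}

// Bulkhead outside, breaker inside: a full bulkhead rejects the call
// before the breaker sees it, so it is not counted as a dependency failure.
function protect(fn, { bulkhead, breaker }) {
  return (...args) => bulkhead.run(() => breaker.run(() => fn(...args)));
}
```

With this ordering, a call skipped by the open circuit holds its bulkhead slot only momentarily, so a failing dependency stops tying up capacity once the breaker opens.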