
A practical implementation of the circuit breaker pattern for AWS Lambda — with pluggable state storage (DynamoDB, Redis, or in-memory), configurable thresholds, concurrency control, and automatic recovery.
When your platform calls external services — payment providers, email APIs, geolocation lookups, AI translation endpoints — those services will eventually fail. Networks drop. APIs go down. Rate limits get hit. The question is not whether external calls will fail, but what happens to your platform when they do. Without protection, a failing external service can cascade through your system — Lambda functions waiting on timeouts, DynamoDB connections piling up, users seeing errors across unrelated features. The circuit breaker pattern prevents this cascade. It detects when an external service is failing, stops sending requests to it, and automatically recovers when the service comes back. This article explains how we built a circuit breaker for serverless that works across Lambda invocations, supports pluggable storage backends, and protects every external call in the platform.
In a traditional server application, a circuit breaker lives in memory. It counts failures, opens the circuit, and recovers — all within the same process. The state persists because the server persists.
In serverless, the process is ephemeral. A Lambda instance handles a request and may be frozen or recycled before it ever sees another one. The next request might hit a completely different Lambda instance. If the circuit breaker state lives in memory, every new instance starts with a clean slate and does not know that the previous 100 instances all failed calling the same external service.
This means a serverless circuit breaker must store its state externally — in a database or cache that every Lambda instance can access. The state must be read and updated atomically to prevent race conditions when multiple instances check the circuit simultaneously.
At TCTF, every external service call goes through the circuit breaker. The geolocation service uses it to protect ipinfo.io lookups. The email service uses it to protect SES, Resend, and SendGrid calls. The billing service uses it to protect Stripe API calls. The AI translation service uses it to protect the translation API. One pattern, consistent protection, across all 34 microservices.
🛡️ Every external service call at TCTF goes through the circuit breaker. Geolocation, email, billing, AI translation — one pattern, consistent protection across all 34 services.
The circuit breaker has three states, modeled as a finite state machine.
CLOSED is the normal state. Requests flow through to the external service. The circuit breaker counts failures. If the failure count reaches the threshold (default: 3), the circuit transitions to OPEN.
OPEN means the external service is considered down. Requests are immediately rejected without calling the external service — this is the fail-fast behavior that prevents cascade failures. The circuit stays open for a configurable reset timeout (default: 10 seconds). When the timeout expires, the circuit transitions to HALF_OPEN.
HALF_OPEN is the recovery probe state. A limited number of test requests are allowed through to the external service. If enough succeed (default: 1 success), the circuit transitions back to CLOSED and normal traffic resumes. If a test request fails, the circuit transitions back to OPEN and the reset timeout starts again.
This three-state model means the circuit breaker is self-healing. It detects failures, stops the bleeding, and automatically recovers when the external service comes back — all without human intervention.
🔄 CLOSED → requests flow. OPEN → requests blocked (fail fast). HALF_OPEN → test requests probe recovery. The circuit heals itself automatically.
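The three states and their transitions can be sketched as pure functions over a small snapshot object. This is an illustration under assumptions, not the tctf-utils code: the type names, field names, and function names are hypothetical, and only the default values come from the description above.

```typescript
// Hypothetical names throughout; only the defaults (3, 10s, 1) come from the article.
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface CircuitSnapshot {
  state: CircuitState;
  failureCount: number;    // breakable failures seen while CLOSED
  successCount: number;    // successful probes seen while HALF_OPEN
  lastFailureTime: number; // epoch milliseconds of the latest failure
}

interface Thresholds {
  failureThreshold: number;         // article default: 3
  resetTimeoutMs: number;           // article default: 10000
  halfOpenSuccessThreshold: number; // article default: 1
}

const initialSnapshot: CircuitSnapshot = {
  state: "CLOSED",
  failureCount: 0,
  successCount: 0,
  lastFailureTime: 0,
};

// Run before each request: OPEN becomes HALF_OPEN once the reset timeout has elapsed.
function beforeRequest(s: CircuitSnapshot, t: Thresholds, now: number): CircuitSnapshot {
  if (s.state === "OPEN" && now - s.lastFailureTime >= t.resetTimeoutMs) {
    return { ...s, state: "HALF_OPEN", successCount: 0 };
  }
  return s;
}

// Run after a successful call: enough successful probes close the circuit.
function onSuccess(s: CircuitSnapshot, t: Thresholds): CircuitSnapshot {
  if (s.state !== "HALF_OPEN") {
    // The article does not say whether a success resets the count while CLOSED,
    // so this sketch leaves the snapshot unchanged.
    return s;
  }
  const successCount = s.successCount + 1;
  return successCount >= t.halfOpenSuccessThreshold
    ? { ...initialSnapshot }
    : { ...s, successCount };
}

// Run after a breakable failure: a failed probe reopens the circuit,
// and reaching the threshold while CLOSED opens it.
function onFailure(s: CircuitSnapshot, t: Thresholds, now: number): CircuitSnapshot {
  if (s.state !== "CLOSED") {
    return { ...s, state: "OPEN", successCount: 0, lastFailureTime: now };
  }
  const failureCount = s.failureCount + 1;
  return {
    ...s,
    failureCount,
    lastFailureTime: now,
    state: failureCount >= t.failureThreshold ? "OPEN" : "CLOSED",
  };
}
```

Keeping the transitions pure makes them easy to unit test with a fake clock, and leaves persistence to whichever storage backend is configured.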

The circuit breaker state — failure count, current state, last failure timestamp — must persist across Lambda invocations. We support three storage backends, all implementing the same ICircuitStorage interface.
DynamoDB storage is the default for production. State is stored in the service's existing DynamoDB table using a PK/SK pattern (CIRCUIT#{serviceKey}). This means no additional infrastructure: the circuit breaker piggybacks on the table the service already has. State writes use conditional expressions, so two instances updating the same circuit at the same time cannot silently overwrite each other.
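One common way to make such a write safe is optimistic locking on a version attribute. The sketch below only builds the request parameters, in the shape accepted by `PutCommand` from `@aws-sdk/lib-dynamodb`. The `version` attribute, the `STATE` sort key value, and the function name are assumptions made for illustration; the article does not describe the actual item layout or condition.

```typescript
// Assumed record shape; the article only names these three fields.
interface CircuitRecord {
  state: "CLOSED" | "OPEN" | "HALF_OPEN";
  failureCount: number;
  lastFailureTime: number; // epoch milliseconds
}

// Builds the input for a conditional put. The write succeeds only if the item
// does not exist yet or its stored version still equals expectedVersion.
// "version" and the SK value "STATE" are hypothetical choices for this sketch.
function buildConditionalPut(
  tableName: string,
  serviceKey: string,
  record: CircuitRecord,
  expectedVersion: number,
) {
  return {
    TableName: tableName,
    Item: {
      PK: `CIRCUIT#${serviceKey}`,
      SK: "STATE",
      ...record,
      version: expectedVersion + 1,
    },
    ConditionExpression: "attribute_not_exists(PK) OR #version = :expected",
    ExpressionAttributeNames: { "#version": "version" },
    ExpressionAttributeValues: { ":expected": expectedVersion },
  };
}
```

If another instance wrote first, DynamoDB rejects the put with a `ConditionalCheckFailedException`, and the caller can re-read the state and decide again.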
Redis storage is available for services that already use ElastiCache. Redis provides faster state lookups (sub-millisecond vs single-digit millisecond for DynamoDB) and atomic operations via Redis commands. For services with high request rates where circuit state is checked on every call, Redis reduces the overhead.
In-memory storage is used for testing and for services where cross-instance state sharing is not critical. The state lives in the Lambda instance's memory and resets on cold start. This is useful for protecting calls where each instance should make its own circuit-breaking decisions independently.
The CircuitBreakerStorageFactory creates the appropriate storage backend based on configuration. Switching from DynamoDB to Redis is a configuration change — the circuit breaker code does not change.
📦 Three storage backends: DynamoDB (default, no extra infra), Redis (faster, for high-rate services), In-Memory (testing, per-instance decisions). Swappable via configuration.
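To make the pluggable-storage idea concrete, here is a minimal sketch of a storage interface, an in-memory backend, and a factory function. `ICircuitStorage` is named above, but its method signatures are not, so `getState`, `setState`, `clearState`, the `CircuitRecord` shape, and `createStorage` are all assumptions.

```typescript
// Assumed record shape; the article only names these three fields.
interface CircuitRecord {
  state: "CLOSED" | "OPEN" | "HALF_OPEN";
  failureCount: number;
  lastFailureTime: number; // epoch milliseconds
}

// ICircuitStorage is named in the article; these method signatures are guesses.
interface ICircuitStorage {
  getState(serviceKey: string): Promise<CircuitRecord | undefined>;
  setState(serviceKey: string, record: CircuitRecord): Promise<void>;
  clearState(serviceKey: string): Promise<void>;
}

// Per-instance storage: state is lost whenever the Lambda instance is replaced.
class InMemoryCircuitStorage implements ICircuitStorage {
  private readonly records = new Map<string, CircuitRecord>();

  async getState(serviceKey: string): Promise<CircuitRecord | undefined> {
    const record = this.records.get(serviceKey);
    return record ? { ...record } : undefined; // copy so callers cannot mutate stored state
  }

  async setState(serviceKey: string, record: CircuitRecord): Promise<void> {
    this.records.set(serviceKey, { ...record });
  }

  async clearState(serviceKey: string): Promise<void> {
    this.records.delete(serviceKey);
  }
}

type StorageKind = "dynamodb" | "redis" | "memory";

// Hypothetical registry: DynamoDB and Redis backends would be registered the same way.
const storageRegistry: Record<StorageKind, (() => ICircuitStorage) | undefined> = {
  memory: () => new InMemoryCircuitStorage(),
  dynamodb: undefined,
  redis: undefined,
};

// Picks a backend from configuration; calling code only ever sees ICircuitStorage.
function createStorage(kind: StorageKind): ICircuitStorage {
  const make = storageRegistry[kind];
  if (!make) {
    throw new Error(`No storage backend registered for "${kind}"`);
  }
  return make();
}
```

Because callers depend only on the interface, swapping `"memory"` for another kind changes which backend is constructed and nothing else.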
A standard Lambda execution environment processes one invocation at a time, but a single invocation can still issue many external calls in parallel, for example when it fans out over a batch of records. The circuit breaker uses a semaphore to limit how many concurrent requests pass through to the external service. The default is 10 concurrent requests, enough to keep throughput high while preventing a thundering herd when the circuit transitions from OPEN to HALF_OPEN.
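A counting semaphore for promise-based code takes only a few lines. The following is a generic sketch, not the tctf-utils implementation; the class and method names are assumptions.

```typescript
// Generic promise-based counting semaphore (illustrative names).
class Semaphore {
  private readonly maxConcurrent: number;
  private active = 0;
  private readonly waiters: Array<() => void> = [];

  constructor(maxConcurrent: number) {
    if (!Number.isInteger(maxConcurrent) || maxConcurrent < 1) {
      throw new Error("maxConcurrent must be a positive integer");
    }
    this.maxConcurrent = maxConcurrent;
  }

  // Runs the operation once a slot is free and always releases the slot afterwards.
  async run<T>(operation: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await operation();
    } finally {
      this.release();
    }
  }

  private acquire(): Promise<void> {
    if (this.active < this.maxConcurrent) {
      this.active++;
      return Promise.resolve();
    }
    // No free slot: wait until a finishing operation hands its slot over.
    return new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // the slot passes directly to the next waiter, so active is unchanged
    } else {
      this.active--;
    }
  }
}
```

A caller would wrap each outbound request, for example `semaphore.run(() => callProvider(payload))`, where `callProvider` stands in for whatever external call is being protected. This limit applies within one Lambda instance; it does not cap concurrency across instances.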
Not every error should trip the circuit. A 400 Bad Request from an API means the request was wrong, not that the service is down. A timeout or 503 Service Unavailable means the service is struggling. The isBreakableError method classifies errors — only errors that indicate service-level problems (timeouts, connection failures, 5xx responses) count toward the failure threshold. Client errors (4xx) are passed through without affecting the circuit state.
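A classifier along these lines might look like the sketch below. The method name `isBreakableError` comes from the text, but the error shape it inspects (`statusCode`, `code`, `name`) and the list of network error codes are assumptions; real HTTP clients expose these details under different property names.

```typescript
// Assumed error shape; adapt the property names to the HTTP client in use.
interface ErrorLike {
  statusCode?: number;
  code?: string;
  name?: string;
}

// Node.js system error codes that indicate connection or DNS trouble.
const NETWORK_ERROR_CODES = new Set([
  "ECONNRESET",
  "ECONNREFUSED",
  "ETIMEDOUT",
  "ENOTFOUND",
  "EAI_AGAIN",
]);

// Returns true only for errors that suggest the remote service itself is unhealthy.
function isBreakableError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) {
    return false;
  }
  const e = error as ErrorLike;
  if (typeof e.statusCode === "number") {
    return e.statusCode >= 500; // 5xx counts, 4xx and below do not
  }
  if (typeof e.code === "string" && NETWORK_ERROR_CODES.has(e.code)) {
    return true;
  }
  return e.name === "TimeoutError";
}
```

Anything the classifier does not recognize is treated as non-breakable here, which errs on the side of keeping the circuit closed; a stricter policy could default the other way.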
The executeWithErrorClassification method wraps an external call with both circuit breaking and error classification. It runs the operation, catches errors, classifies them, updates the circuit state for breakable errors, and re-throws the original error. The caller gets the same error they would have gotten without the circuit breaker — but the circuit breaker has learned from it.
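The flow described above can be sketched as a standalone function. The real method lives on the circuit breaker; here the circuit is reduced to a small hypothetical `CircuitHandle` interface, and `CircuitOpenError` is an assumed name for the fail-fast rejection.

```typescript
// Hypothetical minimal view of a circuit breaker, enough to show the control flow.
interface CircuitHandle {
  isOpen(): Promise<boolean>;
  recordSuccess(): Promise<void>;
  recordFailure(): Promise<void>;
}

// Assumed error type for requests rejected while the circuit is open.
class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "CircuitOpenError";
  }
}

async function executeWithErrorClassification<T>(
  circuit: CircuitHandle,
  operation: () => Promise<T>,
  isBreakable: (error: unknown) => boolean,
): Promise<T> {
  // Fail fast: do not touch the external service while the circuit is open.
  if (await circuit.isOpen()) {
    throw new CircuitOpenError("Circuit is open; request rejected without calling the service");
  }

  let result: T;
  try {
    result = await operation();
  } catch (error) {
    // Only service-level problems count toward the failure threshold.
    if (isBreakable(error)) {
      await circuit.recordFailure();
    }
    throw error; // the caller sees the original error, unchanged
  }

  await circuit.recordSuccess();
  return result;
}
```

Recording the success outside the `try` block is deliberate: a storage error while saving state should not be misread as a failure of the external service.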
Every circuit breaker parameter is configurable via environment variables with sensible defaults.
failureThreshold (default: 3) — how many failures before the circuit opens. Lower values are more aggressive (faster protection, more false positives). Higher values are more tolerant (slower protection, fewer false positives). For critical services like payment processing, we use 2. For less critical services like geolocation, we use 5.
resetTimeout (default: 10000ms) — how long the circuit stays open before probing recovery. Shorter timeouts mean faster recovery but more probe requests to a potentially still-failing service. Longer timeouts mean slower recovery but less load on the failing service.
halfOpenSuccessThreshold (default: 1) — how many successful probe requests are needed to close the circuit. For services that fail intermittently, a higher threshold (2-3) prevents premature closure.
maxConcurrent (default: 10) — maximum concurrent requests through the circuit. This prevents thundering herd problems when the circuit closes and queued requests all fire simultaneously.
cacheTtl (default: 5000ms) — how long circuit state is cached in memory before re-reading from storage. This reduces storage reads on hot paths while keeping state reasonably fresh.
All values are validated on initialization. Invalid configurations (negative thresholds, empty service keys) throw immediately rather than causing subtle runtime bugs.
⚙️ Every parameter is configurable: failure threshold, reset timeout, success threshold, max concurrency, cache TTL. Defaults are sensible. Invalid configs fail fast.
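As an illustration of loading and validating these settings, the sketch below reads them from an environment-like object. The parameter names and defaults are the ones listed above; the environment variable names (`CIRCUIT_FAILURE_THRESHOLD` and so on) and the function names are invented for this example.

```typescript
interface CircuitBreakerConfig {
  serviceKey: string;
  failureThreshold: number;
  resetTimeout: number; // milliseconds
  halfOpenSuccessThreshold: number;
  maxConcurrent: number;
  cacheTtl: number; // milliseconds
}

type Env = Record<string, string | undefined>;

// Parses one integer setting, using the default when the variable is unset or blank.
function readInt(env: Env, name: string, fallback: number, min: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const value = Number(raw);
  if (!Number.isInteger(value) || value < min) {
    throw new Error(`${name} must be an integer >= ${min}, got "${raw}"`);
  }
  return value;
}

// Environment variable names below are hypothetical; defaults match the article.
function loadCircuitBreakerConfig(serviceKey: string, env: Env): CircuitBreakerConfig {
  if (serviceKey.trim() === "") {
    throw new Error("serviceKey must not be empty");
  }
  return {
    serviceKey,
    failureThreshold: readInt(env, "CIRCUIT_FAILURE_THRESHOLD", 3, 1),
    resetTimeout: readInt(env, "CIRCUIT_RESET_TIMEOUT_MS", 10000, 1),
    halfOpenSuccessThreshold: readInt(env, "CIRCUIT_HALF_OPEN_SUCCESS_THRESHOLD", 1, 1),
    maxConcurrent: readInt(env, "CIRCUIT_MAX_CONCURRENT", 10, 1),
    cacheTtl: readInt(env, "CIRCUIT_CACHE_TTL_MS", 5000, 0),
  };
}
```

In a Lambda handler this would typically be called once at module load with `process.env`, so a bad value fails the cold start instead of surfacing later as odd circuit behavior.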
Some services need to manage multiple circuit breakers — one per external endpoint or one per downstream service. The batch operations (batchGetState, batchSetState, batchClearState) allow reading and writing multiple circuit states in a single call, reducing the number of storage round-trips.
Observability is built in through state change listeners. The onStateChange method registers a callback that fires whenever the circuit transitions between states. Services use this to emit CloudWatch metrics (circuit opened, circuit closed, circuit half-open), trigger alerts (circuit opened for payment service), and log state transitions with correlation IDs.
The state change listener returns an unsubscribe function, so listeners can be cleaned up when they are no longer needed. This prevents memory leaks in long-running Lambda instances with provisioned concurrency.
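The subscribe-and-unsubscribe mechanics can be sketched as follows. `onStateChange` is the method named above; the event shape, the class name, and the `emit` method are assumptions for illustration.

```typescript
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

// Assumed event payload; the article does not specify what listeners receive.
interface StateChangeEvent {
  serviceKey: string;
  from: CircuitState;
  to: CircuitState;
  at: number; // epoch milliseconds
}

type StateChangeListener = (event: StateChangeEvent) => void;

class StateChangeNotifier {
  private readonly listeners = new Set<StateChangeListener>();

  // Registers a listener and returns a function that removes it again.
  onStateChange(listener: StateChangeListener): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Called by the circuit breaker whenever the state actually changes.
  emit(event: StateChangeEvent): void {
    this.listeners.forEach((listener) => {
      try {
        listener(event);
      } catch (error) {
        // A failing metrics or logging callback must not break request handling.
      }
    });
  }
}
```

A service might register a listener that publishes a metric when `event.to === "OPEN"` and keep the returned function to call during cleanup.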
The executeWithTimeout method adds a timeout wrapper around external calls. If the external service does not respond within the timeout, the call is aborted and counted as a failure. This prevents Lambda functions from hanging on slow external services — a common cause of Lambda timeout errors and increased costs.
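A timeout wrapper of this kind is usually built from `Promise.race` and an `AbortController`. The sketch below shows one way to do it; the signature (passing an `AbortSignal` to the operation) and the `TimeoutError` class are assumptions, not the documented tctf-utils API.

```typescript
// Assumed error type raised when the deadline passes.
class TimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TimeoutError";
  }
}

// Rejects with TimeoutError if the operation has not settled within timeoutMs.
// The operation receives an AbortSignal so it can cancel in-flight work.
async function executeWithTimeout<T>(
  operation: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // ask the operation to stop
      reject(new TimeoutError(`Operation timed out after ${timeoutMs}ms`));
    }, timeoutMs);
  });

  try {
    return await Promise.race([operation(controller.signal), deadline]);
  } finally {
    // Always clear the timer so it cannot keep the event loop busy.
    if (timer !== undefined) {
      clearTimeout(timer);
    }
  }
}
```

With an HTTP client that accepts a signal, the call would look like `executeWithTimeout((signal) => fetch(url, { signal }), 2000)`. An operation that ignores the signal keeps running in the background even though the caller has already received the timeout, so passing the signal through matters for cost as well as correctness.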
The circuit breaker is used across the platform wherever an external service call could fail.
The GeoLocation service wraps ipinfo.io API calls in a circuit breaker. When ipinfo.io is down, the circuit opens and the service returns cached geolocation data instead of failing. The circuit breaker state is stored in DynamoDB alongside the geolocation cache.
The email service wraps SES, Resend, and SendGrid calls in separate circuit breakers — one per provider. When one provider fails, the circuit opens for that provider and the failover logic routes to the next provider. The circuit breaker enables the multi-provider failover strategy.
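The failover loop can be sketched in a few lines. This is a simplified illustration: the `EmailProvider` interface is hypothetical, and each provider's `send` is assumed to already run through that provider's own circuit breaker so that failures are recorded.

```typescript
interface EmailMessage {
  to: string;
  subject: string;
  body: string;
}

// Hypothetical provider wrapper: one per provider, each backed by its own circuit.
interface EmailProvider {
  name: string;
  isCircuitOpen(): Promise<boolean>;
  send(message: EmailMessage): Promise<void>;
}

// Tries providers in priority order, skipping any whose circuit is open.
// Returns the name of the provider that accepted the message.
async function sendWithFailover(
  providers: EmailProvider[],
  message: EmailMessage,
): Promise<string> {
  const problems: string[] = [];
  for (const provider of providers) {
    if (await provider.isCircuitOpen()) {
      problems.push(`${provider.name}: circuit open`);
      continue; // fail fast for this provider and move on
    }
    try {
      await provider.send(message);
      return provider.name;
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      problems.push(`${provider.name}: ${reason}`);
    }
  }
  throw new Error(`All email providers failed (${problems.join("; ")})`);
}
```

Skipping providers with an open circuit is what keeps failover fast: a provider that is known to be down costs one state lookup instead of a full request timeout.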
The billing service wraps Stripe API calls in a circuit breaker. Payment processing is critical, so the failure threshold is set to 2 (more aggressive protection) and the reset timeout is 30 seconds (longer recovery window to avoid hammering a struggling Stripe API).
The content moderation service wraps AWS Comprehend calls in a circuit breaker. When Comprehend is unavailable, the circuit opens and the service falls back to keyword-based filtering.
In every case, the pattern is the same: wrap the external call, configure the thresholds, handle the fallback. The circuit breaker is a shared utility in tctf-utils — one implementation, tested once, used everywhere.
🔌 Geolocation, email (3 providers), billing (Stripe), content moderation (Comprehend): all protected by the same circuit breaker. One implementation, tested once, used everywhere.
The circuit breaker is one of those patterns that you hope never activates — but when it does, it saves the platform. A failing geolocation API does not take down authentication. A struggling payment provider does not block social network posts. A Comprehend outage does not prevent users from posting content. Each failure is contained, each recovery is automatic, and the user experience degrades gracefully instead of catastrophically. That is the point of resilience engineering — not preventing failures, but surviving them.