
When DynamoDB TTL caches are not fast enough, Redis steps in. How we integrated ElastiCache Redis for sub-millisecond session lookups, API response caching, and rate limit counters — all from Lambda inside a VPC, with pluggable storage backends and circuit breaker protection.
Part 1 of this series covered DynamoDB-backed TTL caches — reliable, durable, and good enough for most use cases. But some operations need more speed. Session validation happens on every authenticated request. Rate limit checks happen on every API call. Geolocation lookups happen on every login. When these operations hit DynamoDB, you get single-digit millisecond latency. When they hit Redis, you get sub-millisecond latency. That difference matters at scale — not for one request, but for millions of requests where every millisecond compounds. This article covers how we integrated Amazon ElastiCache Redis into the TCTF caching architecture, how it works alongside DynamoDB caching, and why the pluggable storage interface makes the choice a configuration decision rather than a code change.
DynamoDB is an excellent cache backend. It is durable, scalable, and requires zero operational overhead. For most caching use cases at TCTF — geolocation lookups, circuit breaker state, configuration data — DynamoDB caching is the right choice.
But DynamoDB has a latency floor. Even with on-demand capacity and optimized clients, a DynamoDB read takes 1-5 milliseconds. For operations that happen on every single request — session validation, rate limit checks, API response caching — those milliseconds add up.
Redis (via Amazon ElastiCache) operates in sub-millisecond territory. A Redis GET takes 0.1-0.5 milliseconds. For a session validation that happens on every authenticated request, switching from DynamoDB to Redis saves 1-4 milliseconds per request. At 1000 requests per second, that is 1-4 seconds of cumulative latency saved every second.
The trade-off: Redis is volatile. If the Redis node restarts, the cache is empty. If the cluster fails over, there is a brief period of cache misses. Redis is not a source of truth — it is a speed layer in front of a source of truth. At TCTF, DynamoDB remains the durable store. Redis is the hot cache that absorbs the read load.
⚡DynamoDB: 1-5ms reads, durable, zero ops. Redis: < 1ms reads, volatile, requires VPC. Use DynamoDB for durability. Use Redis for speed. The pluggable interface lets you choose per service.
ElastiCache Redis runs inside a VPC. Lambda functions that need to access Redis must also run inside the VPC. This is a significant architectural decision with trade-offs.
The upside: Lambda functions inside the VPC can access ElastiCache, RDS, and other VPC-only resources directly. Network traffic stays within AWS — no internet traversal, no NAT gateway for cache calls.
The downside: Lambda functions inside a VPC historically had slower cold starts because each execution environment had to attach an Elastic Network Interface (ENI) at invocation time. AWS largely solved this in 2019 by creating shared network interfaces when the function is configured rather than when it is invoked, but cold starts can still be slightly longer than for non-VPC functions.
The other downside: Lambda functions inside a VPC cannot access the internet by default. To call external APIs (ipinfo.io for geolocation, Stripe for payments), they need a route through a NAT Gateway. VPC endpoints cover AWS services such as DynamoDB and Secrets Manager, but not third-party APIs. Both add cost and configuration complexity.
At TCTF, services that need Redis run inside the VPC. Services that do not need Redis run outside the VPC. The pluggable cache interface means the same CacheService code works in both environments — VPC services use Redis storage, non-VPC services use DynamoDB storage. The switch is an environment variable.
The CacheService is the single entry point for all caching operations. It exposes a clean API: get, set, delete, getMany, setMany, has, expire, getTtl, getStats, and clear. Every method works identically regardless of which storage backend is active.
Three storage backends implement the ICacheStorage interface:
RedisCacheStorage connects to ElastiCache Redis. It supports both standalone and cluster mode. Keys are prefixed to prevent collisions between services sharing the same Redis cluster. TTL is set per key. Batch operations (getMany, setMany, deleteMany) use Redis pipelines for efficiency. The clear operation uses SCAN-based iteration instead of KEYS to avoid blocking the Redis server.
DynamoDBCacheStorage uses a DynamoDB table with TTL-based expiration. Items are stored with a PK (the cache key), a value attribute (JSON-serialized), and a TTL attribute (Unix timestamp in seconds). DynamoDB removes expired items automatically, but deletion runs as a background process and can lag behind the expiry time, so a read should treat an item whose TTL has already passed as a miss. This backend requires no VPC and no additional infrastructure: it piggybacks on the service's existing DynamoDB table.
MemoryCacheStorage uses an in-memory Map with TTL tracking. It is fast but volatile — the cache resets on every Lambda cold start. It is used for testing and for caching data that is expensive to compute but acceptable to lose (configuration parsing, schema validation results).
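To make the shared contract concrete, here is a minimal sketch of what an in-memory backend with TTL tracking could look like. The trimmed-down `ICacheStorage` shape (three methods only), the `MemoryEntry` type, and the injectable clock are assumptions for illustration, not the actual tctf-utils definitions.

```typescript
// Hypothetical, trimmed-down storage contract (the real interface has more methods).
interface ICacheStorage {
  get<T>(key: string): Promise<T | undefined>;
  set<T>(key: string, value: T, ttlSeconds: number): Promise<void>;
  delete(key: string): Promise<boolean>;
}

interface MemoryEntry {
  value: unknown;
  expiresAt: number; // epoch milliseconds
}

class MemoryCacheStorage implements ICacheStorage {
  private readonly entries = new Map<string, MemoryEntry>();
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  async get<T>(key: string): Promise<T | undefined> {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazy eviction: expired entries are dropped on read
      return undefined;
    }
    return entry.value as T;
  }

  async set<T>(key: string, value: T, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }

  async delete(key: string): Promise<boolean> {
    return this.entries.delete(key);
  }
}
```

Because the Map lives in the Lambda execution environment, each warm container has its own copy and nothing is shared between containers, which is why this backend only suits data that is cheap to lose.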
The storage backend is selected by the CACHE_STORAGE_TYPE environment variable: redis, dynamodb, or memory. The CacheStorageFactory creates the appropriate backend. Switching from DynamoDB to Redis is a configuration change — no code changes, no redeployment of the application logic.
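The selection logic can be sketched as a small factory keyed on the environment value. The helper names (`parseCacheStorageType`, `createCacheStorage`), the placeholder `StorageBackend` type, and the default of `dynamodb` when the variable is unset are assumptions; the real CacheStorageFactory may behave differently.

```typescript
type CacheStorageType = "redis" | "dynamodb" | "memory";

// Hypothetical helper: validates the raw environment value.
// Defaulting to "dynamodb" when unset is an assumption, not documented behavior.
function parseCacheStorageType(
  raw: string | undefined,
  fallback: CacheStorageType = "dynamodb",
): CacheStorageType {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = raw.trim().toLowerCase();
  if (value === "redis" || value === "dynamodb" || value === "memory") return value;
  throw new Error(`Unsupported CACHE_STORAGE_TYPE: "${raw}"`);
}

// Placeholder for a real backend; it only records which kind was built.
interface StorageBackend {
  readonly kind: CacheStorageType;
}

// One constructor per backend; real constructors would take connection config.
const backendFactories: Record<CacheStorageType, () => StorageBackend> = {
  redis: () => ({ kind: "redis" }),
  dynamodb: () => ({ kind: "dynamodb" }),
  memory: () => ({ kind: "memory" }),
};

// The environment is passed in explicitly so the factory is easy to test.
function createCacheStorage(env: Record<string, string | undefined>): StorageBackend {
  const type = parseCacheStorageType(env["CACHE_STORAGE_TYPE"]);
  return backendFactories[type]();
}
```

In a Lambda handler the argument would typically be the process environment. Failing fast on an unknown value keeps a typo in configuration from silently selecting the wrong backend.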
🔌Set CACHE_STORAGE_TYPE=redis for sub-millisecond caching, dynamodb for durable caching, memory for testing. Same code, same API, different performance characteristics.
The RedisCacheStorage class handles the complexity of talking to ElastiCache Redis from Lambda.
Connection management uses lazy initialization. The Redis client is created on first access and reused across invocations (Lambda warm starts). The connection supports both URL-based configuration (redis://host:port) and explicit host/port/password configuration. Passwords are resolved from environment variables or AWS Secrets Manager.
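Lazy initialization with reuse across warm invocations can be sketched as a memoized connection promise. `CacheClient`, `ClientFactory`, and `createLazyClient` are hypothetical names; a real implementation would build the client with a Redis library and the resolved credentials.

```typescript
// Hypothetical minimal client shape; a real client would come from a Redis library.
interface CacheClient {
  readonly id: number;
}

type ClientFactory = () => Promise<CacheClient>;

// Returns a getter that connects on first use and reuses the same
// connection promise afterwards, including for concurrent callers.
function createLazyClient(factory: ClientFactory): () => Promise<CacheClient> {
  let pending: Promise<CacheClient> | undefined;
  return () => {
    if (pending === undefined) {
      pending = factory().catch((err) => {
        pending = undefined; // a failed connect must not be cached forever
        throw err;
      });
    }
    return pending;
  };
}
```

Declaring the getter at module scope, outside the handler, is what lets warm invocations share one connection: module state survives between invocations of the same execution environment.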
Key prefixing prevents collisions. Every key is prefixed with a service-specific string (e.g., auth:, session:, geo:). This means multiple services can share the same Redis cluster without key conflicts. The prefix is configurable per CacheService instance.
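A minimal sketch of the prefixing step, assuming a colon separator as in the examples above. The `buildKey` helper and its empty-key check are illustrative, not the library's actual function.

```typescript
// Hypothetical helper: joins a service prefix and a cache key with exactly one colon.
function buildKey(prefix: string, key: string): string {
  if (key.length === 0) {
    throw new Error("Cache key must not be empty");
  }
  const normalized = prefix.endsWith(":") ? prefix : `${prefix}:`;
  return `${normalized}${key}`;
}

const sessionKey = buildKey("session", "abc123"); // "session:abc123"
const authKey = buildKey("auth:", "user-42");     // "auth:user-42"
```

Applying the prefix inside the storage layer, rather than at every call site, means callers cannot forget it and two services using the same logical key never touch the same Redis entry.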
Serialization uses JSON. Values are serialized to JSON strings before storage and deserialized on retrieval. This handles objects, arrays, numbers, and strings transparently. Serialization errors are caught and wrapped in RedisCacheSerializationError with the key and operation context.
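The wrapping described above might look like the following sketch. The error class name comes from the article, but its fields (`key`, `operation`, `underlying`) and the `serialize` and `deserialize` helpers are assumptions about its shape.

```typescript
// Assumed shape for the error named in the text; the real class may differ.
class RedisCacheSerializationError extends Error {
  readonly key: string;
  readonly operation: "serialize" | "deserialize";
  readonly underlying: unknown;

  constructor(key: string, operation: "serialize" | "deserialize", underlying: unknown) {
    super(`Failed to ${operation} cache value for key "${key}"`);
    this.name = "RedisCacheSerializationError";
    this.key = key;
    this.operation = operation;
    this.underlying = underlying;
    // Keeps instanceof working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, RedisCacheSerializationError.prototype);
  }
}

function serialize(key: string, value: unknown): string {
  try {
    const json = JSON.stringify(value);
    // JSON.stringify returns undefined for undefined, functions, and symbols.
    if (typeof json !== "string") {
      throw new TypeError("value is not JSON-serializable");
    }
    return json;
  } catch (err) {
    throw new RedisCacheSerializationError(key, "serialize", err);
  }
}

function deserialize<T>(key: string, raw: string): T {
  try {
    return JSON.parse(raw) as T;
  } catch (err) {
    throw new RedisCacheSerializationError(key, "deserialize", err);
  }
}
```

One caveat of plain JSON: values such as `Date` come back as strings, and `Map` or `Set` do not round-trip, so cached types should be limited to plain data.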
Error handling is comprehensive. Connection errors throw RedisCacheConnectionError. Invalid keys throw RedisCacheKeyError. Configuration problems throw RedisCacheConfigurationError. Every error includes context — the operation that failed, the key involved, and the underlying cause. This makes debugging Redis issues straightforward even in a distributed Lambda environment.
The shutdown method closes the Redis connection cleanly. This is called during Lambda graceful shutdown to prevent connection leaks.
Not everything belongs in Redis. We use Redis for data that is read frequently, changes infrequently, and is acceptable to lose on cache miss (because the source of truth is elsewhere).
Session lookups: Every authenticated request validates the session. The session is stored in DynamoDB (source of truth) and cached in Redis. A cache hit serves the session in under 1ms. A cache miss falls back to DynamoDB (1-5ms) and repopulates the Redis cache.
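The hit, miss, and repopulate flow for sessions is a cache-aside read. Here is a minimal sketch, assuming hypothetical `SessionCache` and `SessionStore` interfaces standing in for the Redis cache and the DynamoDB table, and an assumed 300-second TTL.

```typescript
interface Session {
  sessionId: string;
  userId: string;
}

// Hypothetical stand-in for the Redis-backed cache.
interface SessionCache {
  get(key: string): Promise<Session | undefined>;
  set(key: string, value: Session, ttlSeconds: number): Promise<void>;
}

// Hypothetical stand-in for the DynamoDB source of truth.
interface SessionStore {
  load(sessionId: string): Promise<Session | undefined>;
}

async function getSession(
  sessionId: string,
  cache: SessionCache,
  store: SessionStore,
  ttlSeconds: number = 300, // assumed TTL; tune per workload
): Promise<Session | undefined> {
  const key = `session:${sessionId}`;
  try {
    const cached = await cache.get(key);
    if (cached !== undefined) return cached; // cache hit
  } catch {
    // A cache failure is treated as a miss; the source of truth still answers.
  }
  const session = await store.load(sessionId);
  if (session !== undefined) {
    try {
      await cache.set(key, session, ttlSeconds); // repopulate for the next request
    } catch {
      // Failing to repopulate must not fail the request.
    }
  }
  return session;
}
```

Swallowing cache errors is a deliberate choice in this sketch: since Redis is only a speed layer, an outage should degrade latency, not availability. Session revocation would also need to delete the cached key, which is omitted here.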
Rate limit counters: Rate limiting needs atomic increment operations with TTL. Redis INCR with EXPIRE is purpose-built for this. The rate limit service uses Redis to track per-user, per-endpoint, and per-IP request counts with automatic expiration at the end of each time window.
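A fixed-window counter built on INCR and EXPIRE can be sketched as follows. The `CounterStore` interface is a hypothetical wrapper over those two Redis commands, and the key layout and `checkRateLimit` signature are assumptions rather than the rate limit service's real API.

```typescript
// Hypothetical wrapper over the two Redis commands the text mentions.
interface CounterStore {
  incr(key: string): Promise<number>; // returns the value after incrementing
  expire(key: string, seconds: number): Promise<void>;
}

interface RateLimitResult {
  allowed: boolean;
  count: number;
  remaining: number;
}

async function checkRateLimit(
  store: CounterStore,
  subject: string, // e.g. a user id, endpoint, or IP address
  limit: number,
  windowSeconds: number,
  nowMs: number = Date.now(),
): Promise<RateLimitResult> {
  // All requests in the same window share one counter key.
  const windowId = Math.floor(nowMs / 1000 / windowSeconds);
  const key = `ratelimit:${subject}:${windowId}`;

  const count = await store.incr(key);
  if (count === 1) {
    // First request in this window: let the counter clean itself up.
    await store.expire(key, windowSeconds);
  }
  return {
    allowed: count <= limit,
    count,
    remaining: Math.max(0, limit - count),
  };
}
```

INCR itself is atomic, but INCR followed by EXPIRE is two commands. If the process dies between them, the key is left without a TTL, so a production version would typically run both inside a MULTI transaction or a Lua script.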
API response caching: Expensive API responses — leaderboard rankings, search results, feed computations — are cached in Redis with short TTLs (30 seconds to 5 minutes). This absorbs read spikes without hitting the backend services.
Geolocation data: IP-to-location lookups are cached in Redis for hot IPs (the same IP making multiple requests) and in DynamoDB for long-term caching. The two-layer approach gives sub-millisecond lookups for active users and durable caching for returning users.
Circuit breaker state: Some services store circuit breaker state in Redis instead of DynamoDB for faster state checks. This is configurable per circuit breaker instance.
📊Sessions, rate limits, API responses, geolocation, circuit breaker state: all served from Redis for sub-millisecond access. Anything that must survive a Redis restart, such as sessions, keeps DynamoDB as its source of truth.
Redis is an external dependency. It can fail. The CacheMonitor provides production-grade resilience.
Health checks run on a configurable interval. The monitor writes a test key, reads it back, and deletes it. If the health check fails after the configured number of retries, it logs an error and emits a CloudWatch metric. This gives the operations team visibility into Redis health without waiting for user-facing errors.
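The write, read back, delete probe with retries can be sketched as below. `HealthCheckTarget`, `runHealthCheck`, and the probe key are hypothetical; logging and the CloudWatch metric are left to the caller, who would act on the returned result.

```typescript
// Hypothetical minimal view of the cache used by the probe.
interface HealthCheckTarget {
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  get(key: string): Promise<string | undefined>;
  delete(key: string): Promise<void>;
}

interface HealthCheckResult {
  healthy: boolean;
  attempts: number;
  lastError?: string;
}

// One probe: write a token, read it back, delete it, and compare.
async function probeOnce(target: HealthCheckTarget, key: string): Promise<void> {
  const token = String(Date.now());
  await target.set(key, token, 30); // short TTL in case the delete never runs
  const readBack = await target.get(key);
  await target.delete(key);
  if (readBack !== token) {
    throw new Error("health check value mismatch");
  }
}

async function runHealthCheck(
  target: HealthCheckTarget,
  maxRetries: number = 2,
  key: string = "healthcheck:probe",
): Promise<HealthCheckResult> {
  const maxAttempts = maxRetries + 1;
  let lastError = "unknown error";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await probeOnce(target, key);
      return { healthy: true, attempts: attempt };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  // Retries exhausted: the caller logs the error and emits the metric.
  return { healthy: false, attempts: maxAttempts, lastError };
}
```

Returning a result object instead of throwing keeps the monitor's interval loop simple. This sketch retries immediately; a real monitor might wait between attempts.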
Circuit breaker integration protects against Redis outages. If Redis fails repeatedly, the circuit breaker opens and cache operations fall back to the DynamoDB backend or return cache misses. The circuit breaker uses the same CircuitBreaker class from tctf-utils — the same three-state model (Closed, Open, Half-Open) with configurable thresholds.
The executeWithCircuitBreaker method wraps any cache operation with circuit breaker protection. If the circuit is open, the operation is skipped and the caller gets a cache miss. The caller's code does not change — the circuit breaker is transparent.
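To illustrate the "open circuit means cache miss" behavior, here is a minimal three-state breaker. `SimpleCircuitBreaker` is a hypothetical stand-in, not the CircuitBreaker class from tctf-utils, and its thresholds and method names are assumptions.

```typescript
type CircuitState = "closed" | "open" | "half-open";

// Hypothetical stand-in for the real CircuitBreaker class.
class SimpleCircuitBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    resetTimeoutMs: number,
    now: () => number = () => Date.now(), // injectable clock for testing
  ) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  getState(): CircuitState {
    // After the reset timeout, an open circuit lets one trial call through.
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  // Runs the operation unless the circuit is open; failures yield the fallback.
  async execute<T>(operation: () => Promise<T>, fallback: T): Promise<T> {
    if (this.getState() === "open") {
      return fallback; // skipped entirely: the caller sees a cache miss
    }
    try {
      const result = await operation();
      this.state = "closed";
      this.failures = 0;
      return result;
    } catch {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      return fallback;
    }
  }
}
```

For a cache read, the fallback would be `undefined`, so the calling code follows its normal miss path and never needs to know whether Redis was down or the key was simply absent.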
Graceful shutdown cleans up health check intervals, shuts down circuit breakers, and closes Redis connections. This prevents resource leaks in Lambda functions with provisioned concurrency that run for extended periods.
The decision framework is straightforward.
Use Redis when: the data is read on every request (sessions, rate limits), sub-millisecond latency matters, you need atomic operations (INCR, EXPIRE), and the service already runs inside a VPC.
Use DynamoDB when: the data is read occasionally (geolocation, configuration), single-digit millisecond latency is acceptable, you need durability (cache survives restarts), and the service runs outside a VPC.
Use in-memory when: the data is computed per invocation (parsed configs, validated schemas), you are running tests, or the data is so small and fast to compute that external caching adds more overhead than it saves.
The pluggable interface means you do not have to decide upfront. Start with DynamoDB caching (zero additional infrastructure). If profiling shows that cache latency is a bottleneck, switch to Redis by changing an environment variable. If Redis adds too much operational complexity for a particular service, switch back. The code does not change.
This flexibility is the point of the pluggable architecture. The right caching strategy depends on the service, the access pattern, and the performance requirements. The interface lets you optimize per service without rewriting anything.
🎯Start with DynamoDB caching (zero infra). Switch to Redis when sub-millisecond latency matters. Switch back if Redis adds too much complexity. The code never changes.
Redis is not a replacement for DynamoDB caching. It is a complement — a speed layer for the operations that need sub-millisecond response times. The pluggable cache architecture means every service chooses the right backend for its needs, and the choice is a configuration decision, not a code decision. Sessions use Redis. Geolocation uses DynamoDB. Tests use in-memory. The CacheService API is the same everywhere. That consistency — same interface, different backends, per-service optimization — is what makes the caching architecture work at scale across 34 microservices.