
Managing user sessions, device trust, and real-time WebSocket connections when your compute layer is stateless — our approach with DynamoDB, Redis, API Gateway WebSockets, and the SessionCoordinator that ties it all together.
Serverless functions are stateless. They spin up, handle a request, and disappear. There is no memory between invocations, no persistent process, no in-memory session store. And yet, a platform like TCTF needs to track who is logged in, which devices they trust, when their sessions expire, and how to push real-time messages to their browser. This article covers two challenges that every serverless platform faces: managing user sessions without a server, and delivering real-time updates without a persistent connection. Our solutions: a pluggable session system with swappable cache providers — DynamoDB-backed caching or Redis (ElastiCache) caching, seamlessly interchangeable behind the same interface — and WebSocket connections managed through API Gateway.
In a traditional server application, sessions live in memory. A user logs in, the server creates a session object, stores it in a hash map, and attaches a session ID to a cookie. Every subsequent request includes the cookie, the server looks up the session, and the user is authenticated. Simple.
In serverless, there is no server to hold that hash map. Lambda functions are ephemeral — they may handle one request and never be invoked again, or they may handle thousands of requests across different instances. There is no shared memory between invocations. There is no guarantee that the same instance handles consecutive requests from the same user.
This means sessions must be stored externally — in a database or cache that every Lambda instance can access. The session lookup must be fast (every authenticated request needs it), reliable (a missed session means a logged-out user), and consistent (two Lambda instances must see the same session state).
At TCTF, we solve this with a pluggable cache architecture. The session system is built against a cache interface, not a specific provider. You can run it with DynamoDB as the cache backend — using TTL-based expiration and single-table design — or swap to Redis (ElastiCache) for sub-millisecond lookups. The switch is a configuration change, not a code change. Both providers implement the same interface, so the session logic does not know or care which one is behind it. In production, we use both: DynamoDB as the durable session store and Redis as the hot cache layer. But a service that does not need Redis can run entirely on DynamoDB caching with no code changes.
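Here is a minimal TypeScript sketch of what "a cache interface, not a specific provider" can look like. The names SessionCache, CacheProvider, InMemoryTtlCache, and createSessionCache are hypothetical. The in-memory class stands in for both real adapters: a DynamoDB adapter would write an epoch-seconds TTL attribute through the AWS SDK, and a Redis adapter would use SET with an EX expiry.

```typescript
// Hypothetical provider contract: session logic depends only on this interface.
export interface SessionCache {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  delete(key: string): Promise<void>;
}

export type CacheProvider = "dynamodb" | "redis";

// In-memory stand-in with TTL semantics, used here in place of both real adapters.
export class InMemoryTtlCache implements SessionCache {
  private readonly entries = new Map<string, { value: string; expiresAtMs: number }>();

  // The clock is injectable so expiry can be tested without waiting.
  constructor(private readonly now: () => number = Date.now) {}

  async get(key: string): Promise<string | undefined> {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAtMs <= this.now()) {
      this.entries.delete(key); // expired entries behave like misses
      return undefined;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAtMs: this.now() + ttlSeconds * 1000 });
  }

  async delete(key: string): Promise<void> {
    this.entries.delete(key);
  }
}

// The provider is chosen from configuration; callers only ever see SessionCache.
export function createSessionCache(
  provider: CacheProvider,
  adapters: Record<CacheProvider, () => SessionCache>,
): SessionCache {
  return adapters[provider]();
}
```

Switching providers then means changing the provider value in configuration; nothing that calls get, set, or delete has to change.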
⚡The session cache is pluggable — DynamoDB or Redis, swappable via configuration. Both implement the same interface. In production, we use both: DynamoDB for durability, Redis for speed.
A session in TCTF is more than a token. The SessionStorage interface captures everything we need to know about an active session: the user's email, the refresh token for session renewal, a unique session ID, the expiration timestamp, device information (browser, OS, IP), whether the user chose to stay signed in, and whether the device is trusted.
Device trust is a key feature. When a user marks a device as trusted, we store that flag and timestamp on the session. Trusted devices skip certain security checks on subsequent logins — reducing friction for users on their own machines while maintaining full security on unknown devices.
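A sketch of the session record and the trust flag follows. The field names are assumptions modeled on the description above (sessionId, refreshToken, staySignedIn, isTrusted, trustedAt, and the DeviceInfo fields are illustrative, not TCTF's actual schema). needsFullSecurityChecks is a hypothetical helper showing where the trust flag would be consulted.

```typescript
// Field names are hypothetical; they follow the article's description of SessionStorage.
export interface DeviceInfo {
  browser: string;
  os: string;
  ipAddress: string;
}

export interface SessionStorage {
  sessionId: string;
  email: string;
  refreshToken: string;
  expiresAt: number; // epoch seconds
  deviceInfo: DeviceInfo;
  staySignedIn: boolean;
  isTrusted: boolean;
  trustedAt?: number; // epoch seconds at which the user trusted this device
}

// Returns an updated copy; persisting it (DynamoDB plus cache) is the caller's job.
export function markDeviceTrusted(session: SessionStorage, nowSeconds: number): SessionStorage {
  return { ...session, isTrusted: true, trustedAt: nowSeconds };
}

// Assumption: trusted devices skip some extra login checks.
// An unknown session or an untrusted device always gets the full flow.
export function needsFullSecurityChecks(session: SessionStorage | undefined): boolean {
  return session === undefined || !session.isTrusted;
}
```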
Each user is limited to 5 concurrent sessions. This prevents session accumulation, a common problem where users log in from multiple devices and never explicitly log out. When a user creates a sixth session, the least recently used session (the one that has been idle the longest) is evicted automatically. The eviction happens atomically within the same DynamoDB transaction that creates the new session.
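Victim selection for the five-session cap can be written as a pure function. This sketch assumes each session carries a lastActiveAt timestamp (a hypothetical field) and picks the least recently used sessions. The returned IDs would become delete operations in the same DynamoDB transaction as the new session.

```typescript
export const MAX_SESSIONS_PER_USER = 5;

export interface SessionActivity {
  sessionId: string;
  lastActiveAt: number; // hypothetical: epoch ms of the session's most recent use
}

// IDs to evict so that, once one new session is added, the user is back at the cap.
export function sessionsToEvict(existing: SessionActivity[], max = MAX_SESSIONS_PER_USER): string[] {
  const overflow = existing.length + 1 - max;
  if (overflow <= 0) return [];
  return [...existing] // copy so the caller's array is not reordered
    .sort((a, b) => a.lastActiveAt - b.lastActiveAt) // least recently used first
    .slice(0, overflow)
    .map((s) => s.sessionId);
}
```

Returning a list rather than a single ID also repairs drift: if a user somehow ended up over the limit, the next login trims them back to five.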
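Session expiration uses DynamoDB TTL. Every session has an expiresAt timestamp, which must be a Number holding Unix epoch seconds for TTL to act on it. DynamoDB deletes expired items in the background: no cleanup jobs, no cron tasks, no Lambda functions scanning for stale sessions. Deletion is not instantaneous, though (AWS documents it as typically happening within a few days of expiry), so every read must still treat an item whose expiresAt is in the past as expired.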
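A small sketch of both TTL rules. The lifetimes are placeholder values (the article does not state them), and computeExpiresAt and isLive are hypothetical helpers. The first produces the epoch-seconds Number that DynamoDB TTL expects; the second is the read-side check that hides items DynamoDB has not deleted yet.

```typescript
// Placeholder lifetimes; the article does not state the real values.
export const DEFAULT_SESSION_TTL_SECONDS = 24 * 60 * 60; // 1 day
export const STAY_SIGNED_IN_TTL_SECONDS = 30 * 24 * 60 * 60; // 30 days

// DynamoDB TTL only acts on a Number attribute holding Unix epoch *seconds*.
export function computeExpiresAt(nowMs: number, staySignedIn: boolean): number {
  const lifetime = staySignedIn ? STAY_SIGNED_IN_TTL_SECONDS : DEFAULT_SESSION_TTL_SECONDS;
  return Math.floor(nowMs / 1000) + lifetime;
}

// TTL deletion runs in the background, so an expired item can still come back
// from a read for a while; treat it as gone.
export function isLive(item: { expiresAt: number }, nowMs: number): boolean {
  return item.expiresAt > Math.floor(nowMs / 1000);
}
```

On queries, the same check can be pushed into DynamoDB as a FilterExpression such as expiresAt > :now. Filtered-out items are dropped from the results but still count toward consumed read capacity.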
🔒Max 5 sessions per user. When a 6th session is created, the oldest is automatically evicted. DynamoDB TTL handles expiration — no cleanup jobs needed.
Authentication is not a single step. A user enters their email and password. If MFA is enabled, they need to complete a second factor. If MFA succeeds, the login session is cleaned up and a user session is created. If anything fails, all temporary sessions need to be cleaned up.
The SessionCoordinator orchestrates this flow using an event-driven architecture. It extends Node.js EventEmitter and emits lifecycle events at each stage: LOGIN_STARTED when a login session is created, MFA_REQUIRED when the user needs to complete MFA, MFA_COMPLETED when authentication succeeds, and AUTH_FAILED when something goes wrong.
The coordinator manages parent-child relationships between sessions. A login session is the parent. An MFA session is the child, linked to the parent by loginSessionId. When MFA completes, both sessions are cleaned up atomically. When authentication fails, both sessions are cleaned up in the failure handler.
Login sessions have a 5-10 minute TTL — just long enough for the user to complete the auth flow. MFA sessions have a 5-minute TTL — MFA must complete quickly or the session expires. Both are stored in Redis with TTL, so expired sessions are automatically removed without any cleanup logic.
The event-driven design makes the coordinator extensible. Need to log every login attempt? Subscribe to LOGIN_STARTED. Need to trigger a security alert on failed auth? Subscribe to AUTH_FAILED. Need to update a last-login timestamp? Subscribe to MFA_COMPLETED. New behaviors are added by subscribing to events, not by modifying the coordinator.
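The coordinator's shape can be sketched with Node's EventEmitter. The event names come from the article; the TtlStore contract, key names, payloads, and method names are assumptions. Both keys share a Redis Cluster hash tag ({loginSessionId}) so that parent and child land in the same slot and can be removed with one multi-key DEL, which Redis executes atomically.

```typescript
import { EventEmitter } from "events";

// Event names come from the article; payload shapes are assumptions.
export const AuthEvents = {
  LOGIN_STARTED: "LOGIN_STARTED",
  MFA_REQUIRED: "MFA_REQUIRED",
  MFA_COMPLETED: "MFA_COMPLETED",
  AUTH_FAILED: "AUTH_FAILED",
} as const;

// Minimal Redis-like contract (hypothetical): per-key TTL on write, multi-key delete.
export interface TtlStore {
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  del(...keys: string[]): Promise<void>;
}

const LOGIN_TTL_SECONDS = 10 * 60; // upper end of the 5-10 minute window
const MFA_TTL_SECONDS = 5 * 60;

// The {...} hash tag keeps parent and child in one Redis Cluster slot.
export const loginKey = (id: string) => `auth:{${id}}:login`;
export const mfaKey = (id: string) => `auth:{${id}}:mfa`;

export class SessionCoordinator extends EventEmitter {
  constructor(private readonly store: TtlStore) {
    super();
  }

  // Creates the parent login session and, if MFA is enabled, the child MFA session.
  async startLogin(loginSessionId: string, email: string, mfaEnabled: boolean): Promise<boolean> {
    await this.store.set(loginKey(loginSessionId), email, LOGIN_TTL_SECONDS);
    this.emit(AuthEvents.LOGIN_STARTED, { loginSessionId, email });
    if (!mfaEnabled) return false;
    // The child is keyed by its parent's loginSessionId.
    await this.store.set(mfaKey(loginSessionId), "pending", MFA_TTL_SECONDS);
    this.emit(AuthEvents.MFA_REQUIRED, { loginSessionId });
    return true;
  }

  // Success: parent and child go away together. Creating the long-lived
  // user session is left to the session service.
  async completeMfa(loginSessionId: string): Promise<void> {
    await this.store.del(loginKey(loginSessionId), mfaKey(loginSessionId));
    this.emit(AuthEvents.MFA_COMPLETED, { loginSessionId });
  }

  // Failure: remove whatever temporary sessions exist, then notify listeners.
  async fail(loginSessionId: string, reason: string): Promise<void> {
    await this.store.del(loginKey(loginSessionId), mfaKey(loginSessionId));
    this.emit(AuthEvents.AUTH_FAILED, { loginSessionId, reason });
  }
}
```

Extending the flow is then a matter of subscribing, for example coordinator.on(AuthEvents.AUTH_FAILED, ({ loginSessionId, reason }) => alertSecurity(loginSessionId, reason)), where alertSecurity is a placeholder for whatever alerting hook a service uses.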
📨The SessionCoordinator emits lifecycle events: LOGIN_STARTED, MFA_REQUIRED, MFA_COMPLETED, AUTH_FAILED. New behaviors are added by subscribing to events, not modifying the coordinator.
The UserSessionService is the main interface for session operations. Under the hood, it delegates to four specialized classes: SessionCreator, SessionRetriever, SessionUpdater, and SessionDeleter. Each handles one concern.
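SessionCreator validates the session data, calculates the expiration time (longer for stay-signed-in sessions), and builds a DynamoDB transaction that creates the new session and, if the user is already at the 5-session limit, evicts the least recently used one. Once the transaction commits, it caches the new session in Redis and logs success metrics. The DynamoDB write is all-or-nothing: if any item in the transaction fails, nothing is written. The Redis write happens only after the commit, so a cache failure can never leave a half-created session in the durable store.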
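Here is a sketch of the request such a transaction might send, built as a plain object so it runs without AWS credentials. The key layout (USER#{userId} / SESSION#{sessionId}), the SESSION_META item, and its sessionVersion counter are assumptions. The version check is what makes concurrent logins safe. Each login reads the version alongside the current sessions, and if another login commits first, the condition fails and DynamoDB cancels the whole transaction with a TransactionCanceledException. In a real service the object would be passed to TransactWriteCommand from @aws-sdk/lib-dynamodb.

```typescript
// Document-client style shapes for TransactWriteItems (request object only; no AWS call).
export type TransactItem =
  | { Put: { TableName: string; Item: Record<string, unknown>; ConditionExpression?: string } }
  | { Delete: { TableName: string; Key: Record<string, string> } }
  | {
      Update: {
        TableName: string;
        Key: Record<string, string>;
        UpdateExpression: string;
        ConditionExpression: string;
        ExpressionAttributeValues: Record<string, unknown>;
      };
    };

export interface NewSession {
  sessionId: string;
  expiresAt: number; // epoch seconds (TTL attribute)
  [attribute: string]: unknown;
}

export function buildCreateSessionTransaction(args: {
  table: string;
  userId: string;
  session: NewSession;
  evictSessionIds: string[]; // chosen by the LRU rule before building the transaction
  expectedVersion: number; // sessionVersion read together with the user's sessions
}): { TransactItems: TransactItem[] } {
  const { table, userId, session, evictSessionIds, expectedVersion } = args;
  const pk = `USER#${userId}`; // hypothetical key layout
  return {
    TransactItems: [
      {
        // Create the new session; fail if the ID somehow already exists.
        Put: {
          TableName: table,
          Item: { PK: pk, SK: `SESSION#${session.sessionId}`, ...session },
          ConditionExpression: "attribute_not_exists(PK)",
        },
      },
      // Evict in the same transaction, so the cap and the insert commit together.
      ...evictSessionIds.map(
        (id): TransactItem => ({ Delete: { TableName: table, Key: { PK: pk, SK: `SESSION#${id}` } } }),
      ),
      {
        // Concurrency guard: if another login bumped the version after we read it,
        // this condition fails and DynamoDB cancels every item above as well.
        Update: {
          TableName: table,
          Key: { PK: pk, SK: "SESSION_META" },
          UpdateExpression: "SET sessionVersion = :next",
          ConditionExpression: "attribute_not_exists(sessionVersion) OR sessionVersion = :expected",
          ExpressionAttributeValues: { ":expected": expectedVersion, ":next": expectedVersion + 1 },
        },
      },
    ],
  };
}
```

On cancellation the caller re-reads the sessions and retries; only after a successful commit does it write the new session to Redis.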
SessionRetriever handles lookups by session ID, by refresh token, and by user (for listing all active sessions). It checks Redis first and falls back to DynamoDB on cache miss. Cache hits are logged as metrics so we can track the cache hit rate.
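A read-through lookup in the spirit of SessionRetriever, with hypothetical names throughout. It treats a cache error like a miss (one way to get the Redis-outage fallback to DynamoDB), repopulates the cache on a miss, and counts hits and misses where a production version would emit metrics.

```typescript
export type SessionRecord = Record<string, unknown>;

export interface KeyValueCache {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

export interface DurableSessionStore {
  getSession(sessionId: string): Promise<SessionRecord | undefined>; // DynamoDB GetItem in practice
}

export class SessionRetrieverSketch {
  cacheHits = 0; // would be emitted as metrics in production
  cacheMisses = 0;

  constructor(
    private readonly cache: KeyValueCache,
    private readonly store: DurableSessionStore,
    private readonly cacheTtlSeconds = 300, // placeholder value
  ) {}

  async getById(sessionId: string): Promise<SessionRecord | undefined> {
    const key = `session:${sessionId}`;
    let cached: string | undefined;
    try {
      cached = await this.cache.get(key);
    } catch {
      cached = undefined; // a cache outage degrades to a miss, not an error
    }
    if (cached !== undefined) {
      this.cacheHits++;
      return JSON.parse(cached) as SessionRecord;
    }
    this.cacheMisses++;
    const session = await this.store.getSession(sessionId);
    if (session !== undefined) {
      try {
        // Best-effort repopulation; a failed cache write must not fail the read.
        await this.cache.set(key, JSON.stringify(session), this.cacheTtlSeconds);
      } catch {
        /* ignored */
      }
    }
    return session;
  }
}
```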
SessionUpdater handles session renewal (extending expiration), device trust changes, MFA state updates, and optimistic locking via lastUpdatedAt timestamps. Conditional updates prevent race conditions when two Lambda instances try to update the same session simultaneously.
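A sketch of the conditional update for marking a device trusted, assuming a hypothetical USER#{userId} / SESSION#{sessionId} key layout. The parameters use real DynamoDB expression syntax; when the condition fails, DynamoDB rejects the write with a ConditionalCheckFailedException. applyConditionally is an in-memory model of the same compare-and-set rule, included only to show how the race resolves.

```typescript
export interface ConditionalUpdateInput {
  TableName: string;
  Key: Record<string, string>;
  UpdateExpression: string;
  ConditionExpression: string;
  ExpressionAttributeValues: Record<string, unknown>;
}

// Marks a device trusted only if the session is unchanged since it was read.
export function buildTrustDeviceUpdate(
  table: string,
  userId: string,
  sessionId: string,
  readLastUpdatedAt: number,
  nowMs: number,
): ConditionalUpdateInput {
  return {
    TableName: table,
    Key: { PK: `USER#${userId}`, SK: `SESSION#${sessionId}` }, // hypothetical key layout
    UpdateExpression: "SET isTrusted = :trusted, trustedAt = :now, lastUpdatedAt = :now",
    ConditionExpression: "lastUpdatedAt = :read",
    ExpressionAttributeValues: { ":trusted": true, ":now": nowMs, ":read": readLastUpdatedAt },
  };
}

// In-memory model of the same compare-and-set rule, for illustration only.
export function applyConditionally(
  row: { lastUpdatedAt: number; isTrusted: boolean },
  readLastUpdatedAt: number,
  nowMs: number,
): boolean {
  if (row.lastUpdatedAt !== readLastUpdatedAt) return false; // DynamoDB: ConditionalCheckFailedException
  row.isTrusted = true;
  row.lastUpdatedAt = nowMs;
  return true;
}
```

One design caveat: two writes in the same millisecond carry the same lastUpdatedAt, so a monotonically increasing version number is a stricter lock token than a timestamp.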
SessionDeleter handles single session deletion (user logs out of one device), bulk deletion (user logs out of all devices), and the LRU eviction during session creation. Deletions are propagated to both DynamoDB and Redis to keep the cache consistent.
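A sketch of the "log out of all devices" path, with hypothetical interfaces standing in for DynamoDB and Redis. It deletes from the source of truth first and then invalidates cached copies. A failed invalidation is reported rather than hidden, because a stale cache entry could keep authenticating a logged-out session until its TTL expires. The chunk helper reflects DynamoDB's BatchWriteItem limit of 25 put or delete requests per call.

```typescript
export interface DurableSessions {
  listSessionIds(userId: string): Promise<string[]>; // a query on the user's sessions in practice
  deleteSessions(userId: string, sessionIds: string[]): Promise<void>;
}

export interface SessionCacheInvalidator {
  del(keys: string[]): Promise<void>; // one or more DEL calls in Redis
}

export interface LogoutResult {
  deleted: number;
  cacheInvalidated: boolean;
}

// "Log out of all devices": source of truth first, then the cache.
export async function logoutEverywhere(
  userId: string,
  store: DurableSessions,
  cache: SessionCacheInvalidator,
): Promise<LogoutResult> {
  const sessionIds = await store.listSessionIds(userId);
  if (sessionIds.length === 0) return { deleted: 0, cacheInvalidated: true };
  await store.deleteSessions(userId, sessionIds);
  try {
    await cache.del(sessionIds.map((id) => `session:${id}`));
    return { deleted: sessionIds.length, cacheInvalidated: true };
  } catch {
    // Stale entries would survive until their cache TTL; report it so the caller can retry.
    return { deleted: sessionIds.length, cacheInvalidated: false };
  }
}

// BatchWriteItem accepts at most 25 put/delete requests per call,
// so a DynamoDB-backed deleteSessions would send the IDs in chunks like these.
export function chunk<T>(items: T[], size = 25): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) chunks.push(items.slice(i, i + size));
  return chunks;
}
```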
The UserSessionService also provides health checks, login-attempt tracking for rate limiting, and MFA setup transactions that complete MFA setup atomically with the related session updates.
Sessions handle authentication. But how do you push a real-time notification to a user's browser when there is no persistent server process?
The answer is API Gateway WebSocket APIs. When a user opens Cometbid Social, the frontend establishes a WebSocket connection to API Gateway. API Gateway assigns a connectionId and invokes a Lambda function through the $connect route. That function stores the connection metadata in DynamoDB: connectionId, userId, userEmail, a connection timestamp, and a 24-hour TTL.
The WebSocketConnectionService manages this connection store. It uses single-table design with PK/SK keys: PK is CONNECTION#{connectionId}, SK is METADATA. A GSI maps users to their connections: GSI1PK is USER#{userId}, GSI1SK is CONNECTION#{connectionId}. This lets us quickly find all active connections for a user.
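The item and the GSI query can be sketched as plain objects. The attribute names connectedAt and ttl, the index name GSI1, and the helper names are assumptions; the key patterns are the ones described above. The item would be written by the $connect handler, with userId and userEmail taken from whatever authorizer the API uses.

```typescript
export interface ConnectionItem {
  PK: string; // CONNECTION#{connectionId}
  SK: "METADATA";
  GSI1PK: string; // USER#{userId}
  GSI1SK: string; // CONNECTION#{connectionId}
  connectionId: string;
  userId: string;
  userEmail: string;
  connectedAt: string; // ISO-8601 (attribute name assumed)
  ttl: number; // epoch seconds, used as the DynamoDB TTL attribute (name assumed)
}

const CONNECTION_TTL_SECONDS = 24 * 60 * 60;

// Built by the $connect handler once the caller has been authenticated.
export function buildConnectionItem(
  connectionId: string,
  userId: string,
  userEmail: string,
  nowMs: number,
): ConnectionItem {
  return {
    PK: `CONNECTION#${connectionId}`,
    SK: "METADATA",
    GSI1PK: `USER#${userId}`,
    GSI1SK: `CONNECTION#${connectionId}`,
    connectionId,
    userId,
    userEmail,
    connectedAt: new Date(nowMs).toISOString(),
    ttl: Math.floor(nowMs / 1000) + CONNECTION_TTL_SECONDS,
  };
}

// Query parameters for "every open connection of this user" (index name assumed).
export function userConnectionsQuery(table: string, userId: string) {
  return {
    TableName: table,
    IndexName: "GSI1",
    KeyConditionExpression: "GSI1PK = :pk AND begins_with(GSI1SK, :prefix)",
    ExpressionAttributeValues: { ":pk": `USER#${userId}`, ":prefix": "CONNECTION#" },
  };
}
```

The begins_with condition keeps the query correct even if other item types later share the USER#{userId} partition in the index.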
When a service needs to push a message (a new notification, a chat message, a project update), it queries the GSI to find the user's active connections, then uses the API Gateway Management API to post the message to each connectionId. If a connection is stale (for example, the user closed their browser), the post fails with an HTTP 410 Gone error and the connection record is deleted.
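A sketch of per-user push with stale-connection pruning. ConnectionRegistry, ConnectionPoster, and pushToUser are hypothetical; in a real deployment post would wrap PostToConnectionCommand from @aws-sdk/client-apigatewaymanagementapi, which fails with a GoneException (HTTP 410) when the connection no longer exists. Only that error prunes the record; other failures are counted so the caller can retry.

```typescript
// Hypothetical ports; the real poster would wrap PostToConnectionCommand.
export interface ConnectionPoster {
  post(connectionId: string, payload: string): Promise<void>;
}

export interface ConnectionRegistry {
  connectionsForUser(userId: string): Promise<string[]>; // GSI query in practice
  remove(connectionId: string): Promise<void>;
}

export interface PushResult {
  delivered: number;
  pruned: number;
  failed: number;
}

// API Gateway reports a closed connection as GoneException / HTTP 410.
function isGone(err: unknown): boolean {
  const e = err as { name?: string; $metadata?: { httpStatusCode?: number } } | undefined;
  return e?.name === "GoneException" || e?.$metadata?.httpStatusCode === 410;
}

export async function pushToUser(
  userId: string,
  message: unknown,
  registry: ConnectionRegistry,
  poster: ConnectionPoster,
): Promise<PushResult> {
  const payload = JSON.stringify(message);
  const connectionIds = await registry.connectionsForUser(userId);
  // Post to every connection concurrently; each post resolves to an outcome label.
  const outcomes = await Promise.all(
    connectionIds.map(async (connectionId) => {
      try {
        await poster.post(connectionId, payload);
        return "delivered" as const;
      } catch (err) {
        if (isGone(err)) {
          await registry.remove(connectionId); // stale: tab closed, network dropped, etc.
          return "pruned" as const;
        }
        return "failed" as const; // transient (e.g. throttling): leave for a retry
      }
    }),
  );
  const result: PushResult = { delivered: 0, pruned: 0, failed: 0 };
  for (const outcome of outcomes) result[outcome]++;
  return result;
}
```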
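The 24-hour TTL on connection records ensures stale connections are eventually cleaned up even if the disconnect event is missed. This matters because WebSocket disconnects are not always reported reliably: network drops, browser crashes, and mobile app backgrounding can all end a connection silently, and API Gateway treats the $disconnect invocation as best-effort. Since API Gateway closes every WebSocket connection after at most two hours (or after 10 minutes of inactivity), a 24-hour TTL comfortably outlasts any live connection.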
🔌WebSocket connections are stored in DynamoDB with a GSI mapping users to connections. When a service needs to push a message, it queries the GSI and posts to each active connectionId via API Gateway.
The session system handles scale through the pluggable cache architecture. When Redis is configured, it absorbs the read load — session validation happens on every authenticated request, and Redis serves these at sub-millisecond latency. DynamoDB handles the write load and provides durability. If Redis goes down, the system falls back to DynamoDB caching with slightly higher latency but no data loss. Services that do not need Redis-level speed can run entirely on DynamoDB caching — same interface, same code, different configuration.
WebSocket connections scale differently. API Gateway WebSocket APIs manage connections automatically: there is no connection pool to size and no load balancer to tune, although account-level quotas such as the rate of new connections per second still apply. The bottlenecks are the DynamoDB connection store and the API Gateway Management API calls used for message delivery.
For high-fanout scenarios — sending a notification to thousands of users simultaneously — we use SQS to queue the delivery. A Lambda function reads from the queue, looks up connections, and delivers messages in batches. This prevents the sending service from being blocked by slow deliveries and provides automatic retry for failed sends.
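A sketch of both sides of the queue, with minimal local types standing in for the aws-lambda and @aws-sdk/client-sqs shapes. The producer splits an audience into groups of at most 10 entries, the SendMessageBatch limit. The consumer reports per-message failures through batchItemFailures, which, with ReportBatchItemFailures enabled on the event source mapping, makes SQS retry only the failed messages. The DeliveryJob format and the function names are assumptions.

```typescript
// Minimal subsets of the shapes Lambda and SQS use.
export interface SqsRecord { messageId: string; body: string; }
export interface SqsEvent { Records: SqsRecord[]; }
export interface BatchItemFailure { itemIdentifier: string; }
export interface SqsBatchResponse { batchItemFailures: BatchItemFailure[]; }
export interface SendBatchEntry { Id: string; MessageBody: string; }

// Hypothetical job format: one user, one message.
export interface DeliveryJob { userId: string; message: unknown; }

const SQS_MAX_BATCH_ENTRIES = 10; // SendMessageBatch limit

// Producer side: turn a large audience into SendMessageBatch-sized groups.
export function toSendBatches(userIds: string[], message: unknown): SendBatchEntry[][] {
  const batches: SendBatchEntry[][] = [];
  for (let i = 0; i < userIds.length; i += SQS_MAX_BATCH_ENTRIES) {
    batches.push(
      userIds.slice(i, i + SQS_MAX_BATCH_ENTRIES).map((userId, j) => {
        const job: DeliveryJob = { userId, message };
        return { Id: `job-${i + j}`, MessageBody: JSON.stringify(job) }; // Id unique within a batch
      }),
    );
  }
  return batches;
}

// Consumer side: deliver each job; failures are reported per message, not per batch.
export function makeDeliveryHandler(deliver: (job: DeliveryJob) => Promise<void>) {
  return async (event: SqsEvent): Promise<SqsBatchResponse> => {
    const batchItemFailures: BatchItemFailure[] = [];
    for (const record of event.Records) {
      try {
        await deliver(JSON.parse(record.body) as DeliveryJob);
      } catch {
        // Only this record returns to the queue; the rest of the batch is deleted.
        batchItemFailures.push({ itemIdentifier: record.messageId });
      }
    }
    return { batchItemFailures };
  };
}
```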
The combination of DynamoDB for state, Redis for speed, API Gateway for connections, and SQS for delivery gives us a real-time system that scales without any servers to manage. Every component is fully managed, and everything except the ElastiCache cluster is pay-per-use and scales to zero when idle; the Redis cache carries a baseline cost even with no traffic.
Building session management and real-time push for serverless taught us several lessons.
First, Redis TTL is your friend. Login sessions and MFA sessions expire automatically. No cleanup jobs, no scheduled Lambda functions, no DynamoDB scans. Set the TTL when you create the session and forget about it.
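Second, transactional session creation prevents race conditions, but only if the transaction also detects concurrent changes. Creating a session and evicting the oldest session must happen atomically, and the transaction needs a condition (for example, a per-user version check) that fails if another login committed in the meantime. Without it, two concurrent logins could each see four sessions, both succeed, and leave the user with 6 sessions.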
Third, WebSocket disconnect events are unreliable. Always use TTL as a backstop. A 24-hour TTL on connections means stale connections are cleaned up within a day, even if the disconnect Lambda never fires.
Fourth, the event-driven coordinator pattern is worth the complexity. It adds a layer of abstraction, but it makes the auth flow testable, extensible, and observable. Every state transition emits a metric. Every failure triggers cleanup. New behaviors are added without touching existing code.
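Fifth, cache consistency matters. When a session is created, updated, or deleted, both DynamoDB and Redis must be updated. We update DynamoDB first (the source of truth) and Redis second (the cache). If the Redis update fails, the cached copy can be stale until its Redis TTL expires, after which the next cache miss repopulates it from DynamoDB. The result is eventually consistent, with the staleness window bounded by the cache TTL, which is why cache TTLs should stay short.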
📊Key lessons: Redis TTL for automatic expiration, transactional session creation, TTL as a backstop for unreliable WebSocket disconnects, and DynamoDB-first writes for cache consistency.
Stateless does not mean sessionless. It means the state lives somewhere else — in DynamoDB for durability, in Redis for speed, in API Gateway for connections. The challenge is coordinating these pieces into a system that feels seamless to the user. They log in, they see their sessions, they trust their devices, they receive real-time notifications — and they never think about the Lambda functions, DynamoDB transactions, and WebSocket connections making it all work. That invisibility is the goal.