
How we built a multi-tier rate limiting system with three algorithms (fixed window, sliding window, token bucket), geographic rules, whitelist/blacklist support, distributed coordination across Lambda instances, and configurable failure modes.
Every public API needs rate limiting. Without it, a single user can consume all your capacity — whether intentionally (an attack) or accidentally (a buggy client in a retry loop). In serverless, rate limiting is harder than it sounds. Lambda functions are stateless. There is no shared counter in memory. Multiple instances handle requests simultaneously. And the rate limiter itself must not become a bottleneck. At TCTF, we built a rate limiting service that handles per-user, per-endpoint, and per-IP throttling with three algorithms, geographic rules that adjust limits by country, whitelist and blacklist support, distributed coordination across Lambda instances, and configurable failure modes. This article explains how it works.
In a traditional server, rate limiting is a counter in memory. Request comes in, increment the counter, check the limit, allow or deny. The counter lives in the process. Every request hits the same counter. Simple.
In serverless, there is no shared process. A user's requests might hit 10 different Lambda instances in 10 seconds. Each instance has its own memory. If each instance keeps its own counter, the user can get up to 10x the intended limit, because each instance sees only its own share of the traffic.
The counter must be external — stored in DynamoDB or Redis where every instance can read and write it atomically. But external counters add latency. A DynamoDB read-increment-write cycle takes 5-10ms. If rate limiting adds 10ms to every request, it becomes a significant overhead.
TCTF's rate limiting service solves this with a layered approach: DynamoDB for durable counter storage, configurable algorithms that trade accuracy for performance, and a distributed mode that coordinates across instances without requiring every instance to hit the database on every request.
⚡ In serverless, rate limit counters must be external. The challenge: make them fast enough that rate limiting does not become the bottleneck it is supposed to prevent.

The rate limiting service supports three algorithms, each with different trade-offs.
Fixed window is the simplest. Divide time into fixed intervals (e.g., 60-second windows). Count requests in the current window. If the count exceeds the limit, deny. When the window expires, the counter resets. The trade-off: around a window boundary, a user can briefly get up to 2x the limit by sending the maximum at the end of one window and the maximum again at the start of the next. For most use cases, this is acceptable.
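A minimal in-memory sketch of the fixed window idea, to make the boundary behavior concrete. The class name `FixedWindowLimiter` and its `allow` method are illustrative assumptions, not the service's actual API, and a real deployment would keep the counter in an external store rather than in a `Map`.

```typescript
// Hypothetical sketch: one counter per (identifier, window start).
class FixedWindowLimiter {
  private counts = new Map<string, number>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if a request from `id` at time `nowMs` is allowed.
  allow(id: string, nowMs: number): boolean {
    // Every request in the same window shares a key, so the count
    // "resets" simply because the next window uses a different key.
    const windowStart = Math.floor(nowMs / this.windowMs) * this.windowMs;
    const key = `${id}:${windowStart}`;
    const count = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, count);
    return count <= this.limit;
  }
}

const fixedLimiter = new FixedWindowLimiter(2, 60000); // 2 requests per 60 s
const fixedResults = [
  fixedLimiter.allow("user-1", 59000), // first window
  fixedLimiter.allow("user-1", 59500), // first window
  fixedLimiter.allow("user-1", 59900), // first window, over the limit
  fixedLimiter.allow("user-1", 60000), // second window, fresh counter
  fixedLimiter.allow("user-1", 60100), // second window
];
console.log(fixedResults); // [ true, true, false, true, true ]
```

Four requests are allowed within about one second even though the limit is 2 per minute, which is the boundary effect described above.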
Sliding window is more accurate. Instead of fixed intervals, it tracks individual request timestamps and counts requests within a rolling window. A request at time T checks how many requests occurred between T minus the window size and T. No boundary problem. The trade-off: it stores more data (individual timestamps instead of a single counter) and requires more computation per check.
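A sketch of the timestamp-log variant of a sliding window, under the assumption that only allowed requests are recorded. `SlidingWindowLimiter` is a hypothetical name, and the in-memory `Map` stands in for whatever external store holds the timestamps.

```typescript
// Hypothetical sketch: keep the timestamps of allowed requests per identifier.
class SlidingWindowLimiter {
  private log = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(id: string, nowMs: number): boolean {
    // Drop timestamps that have aged out of the rolling window.
    const cutoff = nowMs - this.windowMs;
    const recent = (this.log.get(id) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(nowMs); // assumption: denied requests are not recorded
    this.log.set(id, recent);
    return allowed;
  }
}

const slidingLimiter = new SlidingWindowLimiter(2, 60000); // 2 requests per 60 s
const slidingResults = [
  slidingLimiter.allow("user-1", 59000),  // allowed
  slidingLimiter.allow("user-1", 59500),  // allowed
  slidingLimiter.allow("user-1", 60000),  // denied: two requests in the last 60 s
  slidingLimiter.allow("user-1", 119200), // allowed: the 59000 entry has aged out
];
console.log(slidingResults); // [ true, true, false, true ]
```

The request at 60000 is denied here, whereas a fixed window starting at 60000 would have let it through.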
Token bucket allows bursts. The bucket starts full (e.g., 100 tokens). Each request consumes a token. Tokens refill at a constant rate (e.g., 10 per second). If the bucket is empty, the request is denied. This allows short bursts of traffic (up to the bucket capacity) while enforcing a long-term average rate. The trade-off: more complex state management (token count, last refill time).
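The two pieces of state mentioned above (token count and last refill time) can be sketched as follows. `TokenBucket` and `tryConsume` are assumed names for illustration; the refill is computed lazily on each call rather than by a timer, which is one common way to do it.

```typescript
// Hypothetical sketch: lazy refill based on elapsed time since the last call.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;

  constructor(capacity: number, refillPerSec: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // the bucket starts full
    this.lastRefillMs = nowMs;
  }

  // Tries to take `cost` tokens at time `nowMs`; returns false if too few remain.
  tryConsume(cost: number, nowMs: number): boolean {
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

const uploadBucket = new TokenBucket(3, 1, 0); // capacity 3, refills 1 token per second
const burstResults = [
  uploadBucket.tryConsume(1, 0),
  uploadBucket.tryConsume(1, 0),
  uploadBucket.tryConsume(1, 0),
  uploadBucket.tryConsume(1, 0), // bucket is empty
];
const afterTwoSeconds = uploadBucket.tryConsume(1, 2000); // two tokens have refilled
console.log(burstResults, afterTwoSeconds); // [ true, true, true, false ] true
```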
The algorithm is configured per action. Authentication endpoints use fixed window (simple, low overhead). API endpoints use sliding window (accurate, prevents boundary abuse). File upload endpoints use token bucket (allows burst uploads while limiting sustained throughput).
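As an illustration of per-action configuration, the mapping might look something like the following. The action names, field names, and numbers are all assumptions made for this sketch; the source only states that the algorithm is chosen per action.

```typescript
type LimiterAlgorithm = "fixed-window" | "sliding-window" | "token-bucket";

// Hypothetical config shape. For token bucket, `limit` is read here as the
// bucket capacity, refilled evenly over `windowSec` (an assumption).
interface ActionConfig {
  algorithm: LimiterAlgorithm;
  limit: number;
  windowSec: number;
}

const actionConfigs: Record<string, ActionConfig> = {
  "auth.login": { algorithm: "fixed-window", limit: 5, windowSec: 3600 },
  "api.read": { algorithm: "sliding-window", limit: 100, windowSec: 60 },
  "files.upload": { algorithm: "token-bucket", limit: 100, windowSec: 60 },
};

// Looks up the config for an action and fails loudly if none is defined.
function configFor(action: string): ActionConfig {
  const config = actionConfigs[action];
  if (!config) throw new Error(`No rate limit configured for action "${action}"`);
  return config;
}

console.log(configFor("files.upload").algorithm); // token-bucket
```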
🔧 Fixed window for simplicity. Sliding window for accuracy. Token bucket for burst tolerance. Each action gets the algorithm that fits its traffic pattern.
Not all rate limits count requests. Some count bytes.
Request count is the default — limit the number of requests per time window. This is what most people think of when they hear rate limiting. 100 requests per minute. 5 login attempts per hour.
Bandwidth limiting counts the total data transferred. A user might be allowed 10MB of bandwidth per minute. Each request consumes bandwidth proportional to its response size. This prevents a single user from saturating the network by making many large requests.
Payload size limiting counts the total upload size. A user might be allowed 50MB of uploads per hour. Each upload consumes tokens proportional to its file size. This prevents storage abuse without limiting the number of small requests.
The limit type is configured per action. The calculateTokensNeeded method determines how many tokens each request consumes based on the type. For request count, every request costs 1 token. For bandwidth, the cost is proportional to the response size. For payload size, the cost is proportional to the upload size.
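A sketch of how the token cost could be derived from the limit type. Only the method name `calculateTokensNeeded` comes from the description above; the signature, the `RequestSizes` shape, and the one-token-per-KiB granularity are assumptions chosen for illustration.

```typescript
type LimitType = "request_count" | "bandwidth" | "payload_size";

// Hypothetical input shape: sizes in bytes for the current request.
interface RequestSizes {
  responseBytes: number;
  uploadBytes: number;
}

const BYTES_PER_TOKEN = 1024; // assumption: one token per KiB transferred

function calculateTokensNeeded(type: LimitType, sizes: RequestSizes): number {
  switch (type) {
    case "request_count":
      return 1; // every request costs exactly one token
    case "bandwidth":
      // Cost grows with response size; never less than one token.
      return Math.max(1, Math.ceil(sizes.responseBytes / BYTES_PER_TOKEN));
    case "payload_size":
      // Cost grows with upload size; never less than one token.
      return Math.max(1, Math.ceil(sizes.uploadBytes / BYTES_PER_TOKEN));
  }
}

console.log(calculateTokensNeeded("bandwidth", { responseBytes: 5000, uploadBytes: 0 })); // 5
```

The returned cost could then be passed to whichever algorithm the action uses, for example as the amount to deduct from a token bucket.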
Not all traffic is equal. A country with a large, active user base should get generous rate limits. A country with no registered users generating a burst of signup attempts is suspicious.
The rate limiting service supports geographic rules — per-country and per-region overrides that adjust the base rate limit. The country code comes from the CloudFront-Viewer-Country header (set by CloudFront at the edge) or from the geolocation service.
Geographic rules are stored in DynamoDB alongside the rate limit configuration. They can be updated without redeploying any service — the operations team can respond to emerging threats by tightening limits for specific countries in real time.
The applyGeographicRules method checks if the request's country matches any geographic rule. If it does, the rule's limit overrides the base limit. If no rule matches, the base limit applies. This means a global API endpoint can have different effective limits in different countries — stricter in high-risk regions, more generous in regions with established user bases.
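The override logic described above might be sketched like this. The `GeoRule` shape and the function signature are assumptions (only the name `applyGeographicRules` appears in the source), per-region rules are left out for brevity, and `XA` and `XB` are placeholder country codes.

```typescript
// Hypothetical rule shape: one override per ISO country code.
interface GeoRule {
  country: string;
  limit: number;
}

// Returns the rule's limit if the country matches a rule, else the base limit.
function applyGeographicRules(
  baseLimit: number,
  country: string | undefined,
  rules: GeoRule[],
): number {
  if (!country) return baseLimit; // header missing: fall back to the base limit
  const code = country.toUpperCase();
  const rule = rules.find((r) => r.country === code);
  return rule ? rule.limit : baseLimit;
}

const geoRules: GeoRule[] = [
  { country: "XA", limit: 20 },  // stricter than the base limit
  { country: "XB", limit: 200 }, // more generous than the base limit
];

console.log(applyGeographicRules(100, "XA", geoRules)); // 20
console.log(applyGeographicRules(100, "XC", geoRules)); // 100
```

In a Lambda handler, the `country` argument would typically be read from the `CloudFront-Viewer-Country` request header.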
🌍 Geographic rules adjust rate limits by country: stricter in high-risk regions, more generous where users are established. Updated in real time via DynamoDB, no redeployment needed.
Some identifiers should bypass rate limits entirely. Internal service accounts, load testing tools, and trusted partners need unrestricted access. The whitelist handles this — whitelisted identifiers skip all rate limit checks.
Some identifiers should be blocked entirely. Known attack IPs, compromised accounts, and abuse sources should never get through. The blacklist handles this — blacklisted identifiers are immediately denied with a rate limit error, regardless of their actual request count.
Both lists are managed through the rate limiting service API: addToWhitelist, removeFromWhitelist, addToBlacklist, removeFromBlacklist. Bulk operations (bulkAddToWhitelist, bulkAddToBlacklist) handle mass updates efficiently. The checkAccess method returns the current status of any identifier: whitelisted, blacklisted, or allowed.
Whitelists and blacklists are stored persistently via the config manager and cached in memory for fast lookups. Cache invalidation happens automatically when lists are modified. The lists are per-action — an identifier can be whitelisted for one action and blacklisted for another.
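A small in-memory sketch of per-action lists with a `checkAccess` lookup. The method names follow those listed above, but the signatures and the `AccessLists` class are assumptions, persistence and cache invalidation are omitted, and the choice to let the blacklist win when an identifier is on both lists is a guess rather than documented behavior.

```typescript
type AccessStatus = "whitelisted" | "blacklisted" | "allowed";

// Hypothetical sketch: one set of identifiers per action, per list.
class AccessLists {
  private whitelist = new Map<string, Set<string>>();
  private blacklist = new Map<string, Set<string>>();

  addToWhitelist(action: string, id: string): void {
    this.add(this.whitelist, action, id);
  }

  addToBlacklist(action: string, id: string): void {
    this.add(this.blacklist, action, id);
  }

  checkAccess(action: string, id: string): AccessStatus {
    // Assumption: the blacklist takes precedence if an identifier is on both.
    if (this.blacklist.get(action)?.has(id)) return "blacklisted";
    if (this.whitelist.get(action)?.has(id)) return "whitelisted";
    return "allowed";
  }

  private add(list: Map<string, Set<string>>, action: string, id: string): void {
    const ids = list.get(action) ?? new Set<string>();
    ids.add(id);
    list.set(action, ids);
  }
}

const accessLists = new AccessLists();
accessLists.addToWhitelist("api.read", "svc-internal");
accessLists.addToBlacklist("auth.login", "203.0.113.7");

console.log(accessLists.checkAccess("api.read", "svc-internal"));   // whitelisted
console.log(accessLists.checkAccess("auth.login", "svc-internal")); // allowed
console.log(accessLists.checkAccess("auth.login", "203.0.113.7"));  // blacklisted
```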
The standard rate limiting algorithms work well when all requests flow through a single counter. But in serverless, requests are distributed across many Lambda instances. Each instance updates the counter independently, and with eventually consistent reads or a separate read-then-write, two instances might read the same counter value before either writes its increment.
For most use cases, this slight under-counting is acceptable. If the limit is 100 requests per minute and two instances each read 99, both allow one more request and the user gets 101. Close enough.
For use cases where accuracy matters — billing-related limits, security-critical endpoints — the distributed mode provides stronger coordination. Each Lambda instance maintains a local counter identified by a unique instance ID. Periodically, the local counters are aggregated into a global count. The getGlobalCount method sums all instance counters within the time window.
This approach reduces DynamoDB writes (local counters batch before writing) while maintaining accurate global counts. The trade-off is slightly delayed enforcement — a burst might briefly exceed the limit before the global count catches up. For security-critical limits, the delay is configured to be very short (1-2 seconds).
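The aggregation step can be sketched as a sum over per-instance records. Only the name `getGlobalCount` comes from the description above; the `InstanceCounter` record shape and the signature are assumptions, and the array stands in for rows that instances would periodically flush to the shared store.

```typescript
// Hypothetical record: one row per (instance, time bucket) in the shared store.
interface InstanceCounter {
  instanceId: string;
  bucketStartMs: number;
  count: number;
}

// Sums the counters of all instances whose bucket falls inside the window.
function getGlobalCount(counters: InstanceCounter[], nowMs: number, windowMs: number): number {
  const cutoff = nowMs - windowMs;
  return counters
    .filter((c) => c.bucketStartMs > cutoff && c.bucketStartMs <= nowMs)
    .reduce((sum, c) => sum + c.count, 0);
}

const flushedCounters: InstanceCounter[] = [
  { instanceId: "a", bucketStartMs: 10000, count: 40 },
  { instanceId: "b", bucketStartMs: 12000, count: 35 },
  { instanceId: "c", bucketStartMs: 14000, count: 30 },
  { instanceId: "a", bucketStartMs: 2000, count: 99 }, // older than the window at t = 65000
];

const globalCount = getGlobalCount(flushedCounters, 65000, 60000);
console.log(globalCount); // 105
```

With a limit of 100, the global count of 105 shows the limit has been exceeded even though no single instance saw more than 40 requests.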
📊 Distributed mode: each Lambda instance tracks locally, aggregates globally. Reduces DynamoDB writes while maintaining accurate counts. Configurable sync delay for security-critical limits.
What happens when the rate limiting storage fails? DynamoDB is down. The counter cannot be read or written. Do you allow the request or deny it?
The answer depends on the endpoint. For a public API that serves content, failing open (allowing the request) is safer — a brief period without rate limiting is better than a complete outage. For a login endpoint, failing closed (denying the request) is safer — a brief period of denied logins is better than unlimited brute-force attempts.
The failure mode is configured per action: open or closed. The default is closed (deny on failure) because security is the more common concern. Services that prefer availability over security can set failureMode to open.
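A sketch of how the two failure modes could wrap a counter read. The `failureMode` values match the text; the function name `allowRequest` and the synchronous `readCount` callback are simplifying assumptions (a real store call would be asynchronous).

```typescript
type FailureMode = "open" | "closed";

// Decides whether to allow a request; `readCount` stands in for the storage call.
function allowRequest(
  readCount: () => number,
  limit: number,
  failureMode: FailureMode = "closed", // default: deny when storage fails
): boolean {
  try {
    return readCount() < limit;
  } catch {
    // Storage is unavailable: fall back to the configured policy.
    return failureMode === "open";
  }
}

// Simulates the counter store being down.
const storeDown = (): number => {
  throw new Error("store unavailable");
};

console.log(allowRequest(() => 3, 5));           // true: under the limit
console.log(allowRequest(storeDown, 5, "open")); // true: fail open
console.log(allowRequest(storeDown, 5));         // false: fail closed by default
```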
When a rate limit is exceeded, the service can optionally block the identifier for a configurable duration (blockDurationSec). This is useful for login endpoints — after 5 failed attempts, block the IP for 1 hour. The block is stored in DynamoDB with a TTL. Subsequent requests from the blocked identifier are immediately denied without checking the counter.
The blocking mechanism blunts slow-drip attacks, where an attacker paces requests around the limit and resumes as soon as the window resets. Once the limit is hit, the block enforces a meaningful cooldown period before the attacker can try again.
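The block check might look like the following sketch. `blockDurationSec` is the configuration name from the text; the `BlockList` class and its methods are assumptions, and the `Map` stands in for the DynamoDB item. Because DynamoDB TTL deletion is not instantaneous, the sketch compares the stored expiry on read instead of relying on the item having been removed.

```typescript
// Hypothetical sketch: identifier -> time (ms) at which the block expires.
class BlockList {
  private blockedUntilMs = new Map<string, number>();

  block(id: string, nowMs: number, blockDurationSec: number): void {
    this.blockedUntilMs.set(id, nowMs + blockDurationSec * 1000);
  }

  isBlocked(id: string, nowMs: number): boolean {
    const until = this.blockedUntilMs.get(id);
    if (until === undefined) return false;
    if (nowMs >= until) {
      this.blockedUntilMs.delete(id); // expired: clean up and let the request through
      return false;
    }
    return true;
  }
}

const blockList = new BlockList();
blockList.block("203.0.113.7", 0, 3600); // block for 1 hour starting at t = 0

console.log(blockList.isBlocked("203.0.113.7", 1000));    // true
console.log(blockList.isBlocked("203.0.113.7", 3600000)); // false: block has expired
```

A caller would check `isBlocked` before touching the counter, so blocked identifiers are rejected without any counter read.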
🛡️ Fail open for content APIs (availability over security). Fail closed for auth endpoints (security over availability). Configurable per action. Blocking with TTL prevents slow-drip attacks.

Rate limiting is the invisible guardian of every API. Users never see it when it works — their requests flow through without delay. They only notice it when it activates — a 429 response with a Retry-After header telling them to slow down. At TCTF, the rate limiting service protects every endpoint across all 34 microservices with the same consistent behavior: three algorithms for different traffic patterns, geographic rules for regional intelligence, whitelist and blacklist for access control, distributed coordination for accuracy, and configurable failure modes for the right balance of security and availability. One service, one configuration, consistent protection everywhere.