
How we classify API endpoints by type (admin vs user), track response times with performance categorization, monitor auth failures and rate limit violations, and use business metrics to know if the platform is healthy — all through a single GlobalMonitoring service.
A platform with 25 services in production generates a lot of metrics. Lambda duration, DynamoDB read units, SQS message age, error counts — the raw numbers are overwhelming. The question is not how many metrics you have, but which ones matter. A 500ms response on a health check endpoint is fine. A 500ms response on a payment processing endpoint is a problem. A failed login from a known user is a password mistake. Ten failed logins from the same IP in a minute is an attack. SLA monitoring is about knowing the difference — classifying endpoints by importance, tracking the metrics that indicate real problems, and alerting before users notice. This article covers the GlobalMonitoring service and the EndpointClassifier that power SLA tracking across all TCTF services.
A platform like TCTF has hundreds of API endpoints across 25 services. Some are critical — authentication, payment processing, escrow release. A failure on these endpoints means users cannot log in, money is stuck, or trust is broken. Some are important but not critical — profile updates, feed loading, search queries. A slowdown is noticeable but not catastrophic. Some are background — health checks, metrics collection, cache warming. Nobody notices if these are slow.
Treating all endpoints equally means either over-alerting (paging the team for a slow health check) or under-alerting (missing a payment processing degradation because it is averaged with hundreds of fast health checks). Neither is useful.
The solution is endpoint classification — tagging each endpoint with its type and importance, then applying different SLA targets, different alerting thresholds, and different monitoring granularity based on the classification. A payment endpoint gets a 200ms SLA with aggressive alerting. A health check gets a 2000ms SLA with no alerting. The monitoring system knows the difference because the endpoints are classified.
⚡ Not all endpoints are equal. Payment processing gets a 200ms SLA with aggressive alerting. Health checks get a 2000ms SLA with no alerting. Classification makes the difference.
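As a rough illustration of how classification can drive SLA targets, the sketch below maps three importance tiers to a latency target and an alerting flag. The 200ms and 2000ms targets come from the text above. The tier names, the 1000ms middle tier, the route paths, and the strict fallback for unknown routes are assumptions made for the example, not the platform's actual configuration.

```typescript
type Importance = "critical" | "important" | "background";

interface SlaPolicy {
  targetMs: number; // latency target for the tier
  alert: boolean;   // whether a breach should notify the on-call team
}

// 200 ms and 2000 ms are quoted in the article; the middle tier is assumed.
const SLA_POLICIES: Record<Importance, SlaPolicy> = {
  critical: { targetMs: 200, alert: true },
  important: { targetMs: 1000, alert: true },
  background: { targetMs: 2000, alert: false },
};

// Hypothetical route table; real paths would come from the services themselves.
const ENDPOINT_IMPORTANCE = new Map<string, Importance>([
  ["/payments/charge", "critical"],
  ["/feed", "important"],
  ["/health", "background"],
]);

function evaluate(
  path: string,
  durationMs: number,
): { breached: boolean; shouldAlert: boolean } {
  // Assumption: unknown routes fall back to the strictest tier.
  const importance = ENDPOINT_IMPORTANCE.get(path) ?? "critical";
  const policy = SLA_POLICIES[importance];
  const breached = durationMs > policy.targetMs;
  return { breached, shouldAlert: breached && policy.alert };
}
```

With this table, a 500ms health check is within target, while a 500ms payment request breaches its SLA and raises an alert.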

The EndpointClassifier determines whether a Lambda function is an admin endpoint or a user endpoint. This classification drives different SLA targets, different monitoring dashboards, and different alerting rules.
The classifier uses four strategies in priority order. First, it checks an explicit environment variable (IS_ADMIN_ENDPOINT). If set to true, the endpoint is admin. If set to false, it is user. This is the most reliable strategy — the CDK stack sets the variable when deploying the Lambda function.
Second, it checks an explicit allowlist of function names. If the function name is in the admin list, it is admin. This handles cases where the environment variable is not set but the function is known to be admin.
Third, it uses pattern matching on the function name. Patterns like contains-admin-, starts-with-admin-, ends-with-admin, and contains-AdminStack catch admin functions by naming convention. This is a fallback for functions that follow the naming convention but are not explicitly configured.
Fourth, if none of the above match, the classifier defaults to user. This is the fail-safe — an unclassified endpoint is treated as user-facing, which means it gets the stricter SLA targets. Better to over-monitor than under-monitor.
The classifier also reports whether its classification is reliable — based on explicit configuration (env var or allowlist) versus pattern matching or default. This lets the monitoring system flag endpoints that need explicit classification.
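The four strategies can be sketched as a single function that returns the endpoint type, the strategy that produced it, and the reliability flag. This is a minimal sketch, not the actual EndpointClassifier: the allowlist entry, the exact regular expressions, and the `classifyEndpoint` name are hypothetical, and the environment is passed in as a plain object (in a Lambda you would pass `process.env`).

```typescript
type EndpointType = "admin" | "user";
type Strategy = "env" | "allowlist" | "pattern" | "default";

interface Classification {
  type: EndpointType;
  strategy: Strategy;
  reliable: boolean; // true only when the result comes from explicit configuration
}

// Hypothetical allowlist and patterns; the real lists are not shown in the article.
const ADMIN_FUNCTIONS = new Set<string>(["user-management-handler"]);
const ADMIN_PATTERNS: RegExp[] = [/-admin-/i, /^admin-/i, /-admin$/i, /AdminStack/];

function classifyEndpoint(
  functionName: string,
  env: Record<string, string | undefined>,
): Classification {
  // 1. An explicit environment variable wins over everything else.
  const flag = env["IS_ADMIN_ENDPOINT"]?.toLowerCase();
  if (flag === "true") return { type: "admin", strategy: "env", reliable: true };
  if (flag === "false") return { type: "user", strategy: "env", reliable: true };

  // 2. Explicit allowlist of known admin function names.
  if (ADMIN_FUNCTIONS.has(functionName)) {
    return { type: "admin", strategy: "allowlist", reliable: true };
  }

  // 3. Naming-convention fallback; works, but is flagged as unreliable.
  if (ADMIN_PATTERNS.some((pattern) => pattern.test(functionName))) {
    return { type: "admin", strategy: "pattern", reliable: false };
  }

  // 4. Fail-safe default: treat the endpoint as user-facing.
  return { type: "user", strategy: "default", reliable: false };
}
```

Returning the strategy alongside the type makes it straightforward to list every function whose classification rests on a pattern match or the default, which are the ones that need an explicit setting.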
The GlobalMonitoring class is a static utility that every service uses to emit structured metrics to CloudWatch. It provides six tracking methods, each designed for a specific monitoring concern.
trackResponseTime records the duration of a Lambda handler execution with performance categorization. It calculates the duration from a start timestamp, categorizes it (fast, normal, slow, critical based on configurable thresholds), and emits metrics with the endpoint type (admin or user) as a dimension. This means CloudWatch dashboards can show admin endpoint latency separately from user endpoint latency.
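A minimal sketch of the categorization step might look like the following. The threshold values, the metric name, and the dimension names are assumptions for illustration; the article only states that the thresholds are configurable and that the endpoint type is emitted as a dimension.

```typescript
type EndpointType = "admin" | "user";
type PerformanceCategory = "fast" | "normal" | "slow" | "critical";

interface Thresholds {
  fastMs: number;   // at or below: fast
  normalMs: number; // at or below: normal
  slowMs: number;   // at or below: slow, above: critical
}

// Assumed defaults; the real thresholds are configurable and not given in the article.
const DEFAULT_THRESHOLDS: Thresholds = { fastMs: 100, normalMs: 500, slowMs: 1000 };

function categorize(
  durationMs: number,
  thresholds: Thresholds = DEFAULT_THRESHOLDS,
): PerformanceCategory {
  if (durationMs <= thresholds.fastMs) return "fast";
  if (durationMs <= thresholds.normalMs) return "normal";
  if (durationMs <= thresholds.slowMs) return "slow";
  return "critical";
}

interface ResponseTimeMetric {
  name: "ResponseTime"; // hypothetical metric name
  value: number;
  unit: "Milliseconds";
  dimensions: { EndpointType: EndpointType; PerformanceCategory: PerformanceCategory };
}

// Computes the duration from a start timestamp and attaches both dimensions.
function buildResponseTimeMetric(
  startMs: number,
  endpointType: EndpointType,
  nowMs: number = Date.now(),
): ResponseTimeMetric {
  const value = nowMs - startMs;
  return {
    name: "ResponseTime",
    value,
    unit: "Milliseconds",
    dimensions: { EndpointType: endpointType, PerformanceCategory: categorize(value) },
  };
}
```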
trackAuthFailure records authentication failures with the error type, endpoint type, and user identifier (masked for privacy). This feeds into security dashboards that detect brute-force attacks, credential stuffing, and account takeover attempts.
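The masking step could be as simple as the sketch below, which keeps the first two characters of an identifier (and the domain, for email addresses) and hides the rest. The masking rule, the `maskIdentifier` helper, and the event shape are assumptions; the article only says the identifier is masked.

```typescript
type EndpointType = "admin" | "user";

interface AuthFailureEvent {
  errorType: string;        // e.g. "INVALID_PASSWORD" (hypothetical value)
  endpointType: EndpointType;
  maskedUser: string;       // never the raw identifier
}

// Keeps at most two leading characters; for emails the domain stays visible
// so dashboards can still group failures by provider.
function maskIdentifier(identifier: string): string {
  const at = identifier.indexOf("@");
  const local = at > 0 ? identifier.slice(0, at) : identifier;
  const domain = at > 0 ? identifier.slice(at) : "";
  return `${local.slice(0, 2)}***${domain}`;
}

function buildAuthFailure(
  errorType: string,
  endpointType: EndpointType,
  userIdentifier: string,
): AuthFailureEvent {
  return { errorType, endpointType, maskedUser: maskIdentifier(userIdentifier) };
}
```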
trackRateLimitViolation records rate limit hits with the endpoint type, action (which rate limit was triggered), and severity (low, medium, high based on how far over the limit the request was). This helps distinguish between legitimate users hitting limits (low severity) and attacks (high severity).
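One way to derive the severity is from the ratio of observed requests to the limit. The cut-off ratios below (1.5x and 3x) are assumptions chosen for the example, since the article does not give the actual boundaries.

```typescript
type Severity = "low" | "medium" | "high";

// Severity grows with how far the caller is over the limit.
// The 1.5x and 3x boundaries are illustrative assumptions.
function rateLimitSeverity(requestCount: number, limit: number): Severity {
  if (limit <= 0) throw new RangeError("limit must be positive");
  const ratio = requestCount / limit;
  if (ratio <= 1.5) return "low";
  if (ratio <= 3) return "medium";
  return "high";
}
```

A user who sends 110 requests against a limit of 100 lands in the low bucket, while a client sending 1000 requests against the same limit is reported as high.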
trackError records errors with categorization. It uses six error categories: AUTHENTICATION, AUTHORIZATION, VALIDATION, EXTERNAL_SERVICE, DATABASE, and INTERNAL. Each category feeds into a different CloudWatch alarm, so a spike in EXTERNAL_SERVICE errors triggers a different response than a spike in VALIDATION errors.
trackSuccess records successful operations with the operation name, endpoint type, and optional duration. This provides the baseline for success rate calculations — you need to know the success count to calculate the error rate.
trackBusinessMetric records domain-specific metrics — daily active users, campaigns sent, messages delivered, projects created. These are the metrics that tell you if the platform is healthy from a business perspective, not just a technical one.
📊 Six tracking methods: response time, auth failures, rate limit violations, errors (6 categories), successes, and business metrics. All emitted to CloudWatch with endpoint type dimensions.
Lambda enforces its own execution timeout (3 seconds by default, configurable up to 15 minutes), and API Gateway cuts off an integration after roughly 30 seconds. But an SLA target is usually much shorter: 200ms for critical endpoints, 1000ms for standard endpoints. The timeout wrapper provides application-level SLA enforcement.
The wrapper wraps a Lambda handler with a configurable timeout. If the handler does not complete within the timeout, the wrapper aborts the operation and returns a fallback response — a 408 Request Timeout with a correlation ID. The handler's execution is cancelled, and the timeout is recorded as a metric.
The wrapper integrates with the endpoint classifier. It reads the endpoint type and applies the appropriate timeout — shorter for critical endpoints, longer for background tasks. It also integrates with GlobalMonitoring, recording the response time and the timeout event.
The fallback response is configurable. For read endpoints, the fallback might return cached data. For write endpoints, the fallback returns an error asking the client to retry. For health checks, the fallback returns a degraded status.
This is different from Lambda's built-in timeout. Lambda's timeout kills the entire execution: no cleanup, no response from the handler, no application-level metrics. The application-level timeout returns a proper response, records metrics, and allows cleanup code to run. The user gets a fast error instead of a hanging request.
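A simplified version of such a wrapper can be built on `Promise.race`. The sketch below is an illustration under stated assumptions, not the actual TCTF wrapper: the `withTimeout` name, the options shape, and the fallback body are hypothetical. It also only stops waiting for the handler; racing a promise does not cancel the work behind it, so real cancellation would need something like an `AbortSignal` passed into the handler.

```typescript
interface HttpResponse {
  statusCode: number;
  body: string;
}

type Handler<E> = (event: E) => Promise<HttpResponse>;

interface TimeoutOptions<E> {
  timeoutMs: number;
  correlationId: (event: E) => string;    // how to find the ID is an assumption
  onTimeout?: (elapsedMs: number) => void; // hook for recording the timeout metric
}

function withTimeout<E>(handler: Handler<E>, options: TimeoutOptions<E>): Handler<E> {
  return async (event: E): Promise<HttpResponse> => {
    const startedMs = Date.now();
    const timerRef: { id?: ReturnType<typeof setTimeout> } = {};

    // Resolves with the fallback response once the SLA budget is spent.
    const fallback = new Promise<HttpResponse>((resolve) => {
      timerRef.id = setTimeout(() => {
        options.onTimeout?.(Date.now() - startedMs);
        resolve({
          statusCode: 408,
          body: JSON.stringify({
            message: "Request Timeout",
            correlationId: options.correlationId(event),
          }),
        });
      }, options.timeoutMs);
    });

    try {
      // Whichever settles first wins; the handler itself keeps running if it loses.
      return await Promise.race([handler(event), fallback]);
    } finally {
      clearTimeout(timerRef.id); // avoid a stray timer when the handler wins
    }
  };
}
```

The timeout value would typically be chosen from the endpoint classification, and `onTimeout` is where the timeout event would be handed to the monitoring layer.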
The six error categories in GlobalMonitoring are not arbitrary — they map to different operational responses.
AUTHENTICATION errors (wrong password, expired token) are expected in normal operation. A baseline rate of 2-5% is normal. An alert triggers when the rate exceeds 10% — indicating a possible credential stuffing attack or a broken auth flow.
AUTHORIZATION errors (insufficient permissions) should be rare in production. Any sustained rate above 1% triggers an alert — it usually means a frontend is calling an endpoint it should not, or a permission configuration is wrong.
VALIDATION errors (bad input) are common and expected. The rate varies by endpoint — a signup form might have 15% validation errors (users mistyping emails). Alerts trigger on sudden spikes, not absolute rates.
EXTERNAL_SERVICE errors (Cognito down, Stripe timeout, ipinfo.io failure) trigger immediate alerts because they indicate a dependency failure that affects multiple endpoints. The circuit breaker handles the immediate response; the alert ensures the team investigates.
DATABASE errors (DynamoDB throttling, conditional check failures) trigger alerts based on the error type. Throttling is a capacity issue — scale up. Conditional check failures are logic issues — investigate.
INTERNAL errors (unhandled exceptions, null references) are bugs. Any internal error triggers an alert because it means something unexpected happened that the error handling architecture did not anticipate.
🚨 Each error category has different alerting thresholds. Auth errors: alert at 10%. Authorization: alert at 1%. External service: alert immediately. Internal: alert on every occurrence.
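The per-category thresholds can be expressed as a small rule table. The 10% and 1% rates and the alert-on-any behaviour for EXTERNAL_SERVICE and INTERNAL follow the descriptions above. The spike factor of 3 for VALIDATION and the alert-on-any rule for DATABASE are assumptions, since the article gives no numbers for them, and in practice these rules would live in CloudWatch alarm definitions rather than application code.

```typescript
type ErrorCategory =
  | "AUTHENTICATION"
  | "AUTHORIZATION"
  | "VALIDATION"
  | "EXTERNAL_SERVICE"
  | "DATABASE"
  | "INTERNAL";

type AlertRule =
  | { kind: "rate"; maxRate: number } // alert when errors / requests exceeds maxRate
  | { kind: "spike"; factor: number } // alert when the rate exceeds factor x baseline
  | { kind: "any" };                  // alert on any occurrence

const ALERT_RULES: Record<ErrorCategory, AlertRule> = {
  AUTHENTICATION: { kind: "rate", maxRate: 0.10 },
  AUTHORIZATION: { kind: "rate", maxRate: 0.01 },
  VALIDATION: { kind: "spike", factor: 3 }, // assumed factor
  EXTERNAL_SERVICE: { kind: "any" },
  DATABASE: { kind: "any" },                // assumed; the article splits by error type
  INTERNAL: { kind: "any" },
};

interface ErrorWindow {
  errors: number;       // errors of this category in the evaluation window
  requests: number;     // total requests in the same window
  baselineRate: number; // typical error rate for this endpoint and category
}

function shouldAlert(category: ErrorCategory, window: ErrorWindow): boolean {
  if (window.errors === 0) return false;
  const rate = window.requests > 0 ? window.errors / window.requests : 1;
  const rule = ALERT_RULES[category];
  switch (rule.kind) {
    case "rate":
      return rate > rule.maxRate;
    case "spike":
      return rate > rule.factor * window.baselineRate;
    case "any":
      return true;
  }
}
```

A signup form with a steady 15% validation error rate stays quiet under this table, while a single INTERNAL error in a thousand requests still raises an alert.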
Technical metrics tell you if the platform is running. Business metrics tell you if the platform is working.
trackBusinessMetric records domain-specific numbers: daily active users, messages sent per hour, campaigns delivered, projects created per day, proposals submitted, milestones completed, escrow funds released. These metrics have nothing to do with Lambda duration or DynamoDB read units — they measure whether users are actually using the platform.
A platform can be technically healthy (all endpoints responding, all databases available) and business-unhealthy (no one is signing up, no one is creating projects, no one is sending messages). Business metrics catch this.
The metrics are emitted with dimensions for segmentation — by endpoint type, by user tier (free, pro, premium), by region. This lets dashboards show business health per segment. If free-tier signups drop but premium signups are stable, the problem is in the onboarding flow, not the platform.
publishMetrics flushes all accumulated metrics to CloudWatch at the end of each Lambda invocation. This batches the CloudWatch API calls for efficiency — instead of one API call per metric, all metrics from a single invocation are published in one batch.
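The buffering behaviour can be sketched with a small class that accumulates metrics during an invocation and flushes them in batches. This is an assumption-laden illustration: the real GlobalMonitoring class is described as static, while this sketch uses an instance with an injected `send` function so it runs without the AWS SDK. In a Lambda, `send` would wrap a CloudWatch `PutMetricData` call, and the batch size should be checked against the current CloudWatch quota.

```typescript
interface MetricDatum {
  name: string;
  value: number;
  unit: "Count" | "Milliseconds";
  dimensions: Record<string, string>; // e.g. EndpointType, UserTier, Region
}

// Injected transport; hypothetical stand-in for a CloudWatch client call.
type Sender = (batch: MetricDatum[]) => Promise<void>;

class MetricBuffer {
  private pending: MetricDatum[] = [];
  private readonly send: Sender;
  private readonly maxBatchSize: number;

  constructor(send: Sender, maxBatchSize: number = 1000) {
    if (!Number.isInteger(maxBatchSize) || maxBatchSize < 1) {
      throw new RangeError("maxBatchSize must be a positive integer");
    }
    this.send = send;
    this.maxBatchSize = maxBatchSize;
  }

  // Records a domain-level number with its segmentation dimensions.
  trackBusinessMetric(
    name: string,
    value: number,
    dimensions: Record<string, string> = {},
  ): void {
    this.pending.push({ name, value, unit: "Count", dimensions });
  }

  // Flushes everything accumulated so far; returns the number of send calls made.
  async publishMetrics(): Promise<number> {
    const toSend = this.pending;
    this.pending = [];
    let calls = 0;
    for (let i = 0; i < toSend.length; i += this.maxBatchSize) {
      await this.send(toSend.slice(i, i + this.maxBatchSize));
      calls += 1;
    }
    return calls;
  }
}
```

A call such as `buffer.trackBusinessMetric("ProjectsCreated", 1, { UserTier: "free", Region: "eu-west-1" })` costs nothing at the time it is made; the network round trip happens once, when `publishMetrics` runs at the end of the invocation.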
The monitoring architecture forms a pipeline: classify the endpoint, track the metrics, alert on anomalies, respond to incidents.
Every Lambda handler starts by classifying its endpoint type. The withErrorHandling wrapper records the start time. Business logic runs. On success, trackSuccess and trackResponseTime are called. On failure, trackError is called with the appropriate category. On auth failure, trackAuthFailure is called. On rate limit, trackRateLimitViolation is called. At the end, publishMetrics flushes everything to CloudWatch.
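The handler flow described above can be sketched as a higher-order function. The `Monitor` interface, the `CategorizedError` class, and the method signatures are hypothetical stand-ins for GlobalMonitoring, whose real signatures the article does not show; auth-failure and rate-limit tracking are left out to keep the example short.

```typescript
type ErrorCategory =
  | "AUTHENTICATION" | "AUTHORIZATION" | "VALIDATION"
  | "EXTERNAL_SERVICE" | "DATABASE" | "INTERNAL";

// Hypothetical subset of the monitoring API used by the wrapper.
interface Monitor {
  trackSuccess(operation: string): void;
  trackResponseTime(operation: string, startMs: number): void;
  trackError(operation: string, category: ErrorCategory): void;
  publishMetrics(): Promise<void>;
}

// Errors thrown by business logic carry their category with them.
class CategorizedError extends Error {
  readonly category: ErrorCategory;
  constructor(category: ErrorCategory, message: string) {
    super(message);
    // Keeps instanceof working when compiling to older JavaScript targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.category = category;
  }
}

function withErrorHandling<E, R>(
  operation: string,
  monitor: Monitor,
  logic: (event: E) => Promise<R>,
): (event: E) => Promise<R> {
  return async (event: E): Promise<R> => {
    const startMs = Date.now(); // start time recorded before business logic runs
    try {
      const result = await logic(event);
      monitor.trackSuccess(operation);
      monitor.trackResponseTime(operation, startMs);
      return result;
    } catch (err) {
      // Anything without an explicit category is treated as a bug.
      const category = err instanceof CategorizedError ? err.category : "INTERNAL";
      monitor.trackError(operation, category);
      throw err;
    } finally {
      await monitor.publishMetrics(); // flush once per invocation, success or failure
    }
  };
}
```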
CloudWatch dashboards show the metrics by endpoint type, by error category, by service. Alarms trigger on thresholds — different thresholds for different categories, different endpoint types. PagerDuty or SNS notifications reach the on-call team.
The result: the team knows within minutes when something is wrong, what kind of wrong it is, and which endpoints are affected. A payment processing degradation is detected and alerted differently from a feed loading slowdown. A brute-force attack is detected and alerted differently from a spike in validation errors. The monitoring system knows the difference because every metric carries its classification.
🔄 Classify → Track → Alert → Respond. Every metric carries its endpoint type and error category. The monitoring system knows the difference between a payment failure and a feed slowdown.

SLA monitoring is not about collecting more metrics. It is about collecting the right metrics with the right context. An error count without an error category is noise. A response time without an endpoint classification is misleading. A business metric without segmentation is incomplete. The GlobalMonitoring service and EndpointClassifier give every metric the context it needs — so the team can focus on what matters, ignore what does not, and respond to problems before users report them.