Framework Deep Dives · Framework Series #16

SLA Monitoring and Endpoint Classification: Knowing What Matters Most

How we classify API endpoints by type (admin vs user), track response times with performance categorization, monitor auth failures and rate limit violations, and use business metrics to know if the platform is healthy — all through a single GlobalMonitoring service.

November 10, 2026 · 11 min read

TCTF Editorials

TCTF Newsletter

Endpoint Types: Admin / User
Error Categories: 6
Tracking Methods: 6
Classification: Multi-Strategy
Metrics Backend: CloudWatch
Services Using: 25+

A platform with 25 services in production generates a lot of metrics. Lambda duration, DynamoDB read units, SQS message age, error counts — the raw numbers are overwhelming. The question is not how many metrics you have, but which ones matter. A 500ms response on a health check endpoint is fine. A 500ms response on a payment processing endpoint is a problem. A failed login from a known user is a password mistake. Ten failed logins from the same IP in a minute is an attack. SLA monitoring is about knowing the difference — classifying endpoints by importance, tracking the metrics that indicate real problems, and alerting before users notice. This article covers the GlobalMonitoring service and the EndpointClassifier that power SLA tracking across all TCTF services.

01 Why Not All Endpoints Are Equal

A platform like TCTF has hundreds of API endpoints across 25 services. Some are critical — authentication, payment processing, escrow release. A failure on these endpoints means users cannot log in, money is stuck, or trust is broken. Some are important but not critical — profile updates, feed loading, search queries. A slowdown is noticeable but not catastrophic. Some are background — health checks, metrics collection, cache warming. Nobody notices if these are slow.

Treating all endpoints equally means either over-alerting (paging the team for a slow health check) or under-alerting (missing a payment processing degradation because it is averaged with hundreds of fast health checks). Neither is useful.

The solution is endpoint classification — tagging each endpoint with its type and importance, then applying different SLA targets, different alerting thresholds, and different monitoring granularity based on the classification. A payment endpoint gets a 200ms SLA with aggressive alerting. A health check gets a 2000ms SLA with no alerting. The monitoring system knows the difference because the endpoints are classified.
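To make the idea concrete, here is a minimal sketch of a per-tier SLA policy table. The tier names, the `SlaPolicy` shape, and both helper functions are illustrative assumptions, not the actual TCTF configuration. The 200ms and 2000ms targets come from the text above; the middle row is a placeholder.

```typescript
// Hypothetical importance tiers, mirroring the critical / important / background split above.
type Tier = "critical" | "important" | "background";

interface SlaPolicy {
  targetMs: number; // latency target for the tier
  alerting: "aggressive" | "standard" | "none";
}

// 200 ms and 2000 ms are the figures quoted in the text; the "important" row is assumed.
const SLA_POLICIES: Record<Tier, SlaPolicy> = {
  critical: { targetMs: 200, alerting: "aggressive" },
  important: { targetMs: 1000, alerting: "standard" },
  background: { targetMs: 2000, alerting: "none" },
};

// True when a single response exceeded the tier's latency target.
function breachesSla(tier: Tier, durationMs: number): boolean {
  return durationMs > SLA_POLICIES[tier].targetMs;
}

// True when the breach should also notify someone; tiers with alerting "none" never do.
function shouldAlert(tier: Tier, durationMs: number): boolean {
  return SLA_POLICIES[tier].alerting !== "none" && breachesSla(tier, durationMs);
}
```

With this table, the same 500ms response breaches the SLA on a critical endpoint and is well within target on a background one.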

⚡
Not all endpoints are equal. Payment processing gets a 200ms SLA with aggressive alerting. Health checks get a 2000ms SLA with no alerting. Classification makes the difference.

02 The Endpoint Classifier: Multi-Strategy Classification

The EndpointClassifier determines whether a Lambda function is an admin endpoint or a user endpoint. This classification drives different SLA targets, different monitoring dashboards, and different alerting rules.

The classifier uses four strategies in priority order. First, it checks an explicit environment variable (IS_ADMIN_ENDPOINT). If set to true, the endpoint is admin. If set to false, it is user. This is the most reliable strategy — the CDK stack sets the variable when deploying the Lambda function.

Second, it checks an explicit allowlist of function names. If the function name is in the admin list, it is admin. This handles cases where the environment variable is not set but the function is known to be admin.

Third, it uses pattern matching on the function name. A name that contains "-admin-", starts with "admin-", ends with "-admin", or contains "AdminStack" is treated as admin by naming convention. This is a fallback for functions that follow the naming convention but are not explicitly configured.

Fourth, if none of the above match, the classifier defaults to user. This is the fail-safe — an unclassified endpoint is treated as user-facing, which means it gets the stricter SLA targets. Better to over-monitor than under-monitor.

The classifier also reports whether its classification is reliable — based on explicit configuration (env var or allowlist) versus pattern matching or default. This lets the monitoring system flag endpoints that need explicit classification.
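The four strategies and the reliability flag can be sketched as a single function. This is an illustration under assumptions, not the EndpointClassifier source: the allowlist entries are made up, the regular expressions are one possible reading of the naming patterns, and the environment is passed in as a plain object so the function stays easy to test.

```typescript
type EndpointType = "admin" | "user";
type Strategy = "env" | "allowlist" | "pattern" | "default";

interface Classification {
  type: EndpointType;
  strategy: Strategy;
  reliable: boolean; // true only for explicit configuration (env var or allowlist)
}

// Hypothetical allowlist entries, for illustration only.
const ADMIN_ALLOWLIST = new Set<string>(["user-management-console", "audit-log-export"]);

// One possible reading of the naming patterns described above.
const ADMIN_PATTERNS: RegExp[] = [/-admin-/i, /^admin-/i, /-admin$/i, /AdminStack/];

function classifyEndpoint(
  functionName: string,
  env: Record<string, string | undefined>,
): Classification {
  // 1. Explicit environment variable wins in both directions.
  const flag = env["IS_ADMIN_ENDPOINT"]?.toLowerCase();
  if (flag === "true") return { type: "admin", strategy: "env", reliable: true };
  if (flag === "false") return { type: "user", strategy: "env", reliable: true };

  // 2. Explicit allowlist of known admin function names.
  if (ADMIN_ALLOWLIST.has(functionName)) {
    return { type: "admin", strategy: "allowlist", reliable: true };
  }

  // 3. Naming-convention fallback; flagged as not reliable.
  if (ADMIN_PATTERNS.some((pattern) => pattern.test(functionName))) {
    return { type: "admin", strategy: "pattern", reliable: false };
  }

  // 4. Fail-safe default: treat as user-facing, which carries the stricter targets.
  return { type: "user", strategy: "default", reliable: false };
}
```

In a Lambda this would be called with `process.env` and the function name from the invocation context. Endpoints that come back with `reliable: false` are the ones worth configuring explicitly.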

03 GlobalMonitoring: The Metrics Hub

The GlobalMonitoring class is a static utility that every service uses to emit structured metrics to CloudWatch. It provides six tracking methods, each designed for a specific monitoring concern.

trackResponseTime records the duration of a Lambda handler execution with performance categorization. It calculates the duration from a start timestamp, categorizes it (fast, normal, slow, critical based on configurable thresholds), and emits metrics with the endpoint type (admin or user) as a dimension. This means CloudWatch dashboards can show admin endpoint latency separately from user endpoint latency.
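A rough sketch of the categorization step follows. The threshold values, the `Thresholds` shape, and the metric object are assumptions chosen for illustration; the article only states that the thresholds are configurable and that the endpoint type is emitted as a dimension.

```typescript
type PerformanceCategory = "fast" | "normal" | "slow" | "critical";

interface Thresholds {
  fastMs: number;   // at or below this is "fast"
  normalMs: number; // at or below this is "normal"
  slowMs: number;   // at or below this is "slow"; anything above is "critical"
}

// Assumed defaults, not the production values.
const DEFAULT_THRESHOLDS: Thresholds = { fastMs: 100, normalMs: 500, slowMs: 2000 };

function categorizeDuration(
  durationMs: number,
  thresholds: Thresholds = DEFAULT_THRESHOLDS,
): PerformanceCategory {
  if (durationMs <= thresholds.fastMs) return "fast";
  if (durationMs <= thresholds.normalMs) return "normal";
  if (durationMs <= thresholds.slowMs) return "slow";
  return "critical";
}

interface ResponseTimeMetric {
  name: "ResponseTime";
  valueMs: number;
  dimensions: { EndpointType: "admin" | "user"; Category: PerformanceCategory };
}

// Builds the metric from a start timestamp; the clock value is passed in to keep this pure.
function buildResponseTimeMetric(
  startMs: number,
  nowMs: number,
  endpointType: "admin" | "user",
  thresholds: Thresholds = DEFAULT_THRESHOLDS,
): ResponseTimeMetric {
  const valueMs = Math.max(0, nowMs - startMs);
  return {
    name: "ResponseTime",
    valueMs,
    dimensions: { EndpointType: endpointType, Category: categorizeDuration(valueMs, thresholds) },
  };
}
```

Because the endpoint type travels as a dimension, a dashboard can chart admin and user latency as separate series from the same metric name.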

trackAuthFailure records authentication failures with the error type, endpoint type, and user identifier (masked for privacy). This feeds into security dashboards that detect brute-force attacks, credential stuffing, and account takeover attempts.
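The article does not say how identifiers are masked, so the following is only one plausible approach: keep the first and last two characters and replace the rest. The function names and the record shape are hypothetical.

```typescript
interface AuthFailureRecord {
  errorType: string;
  endpointType: "admin" | "user";
  maskedUser: string;
}

// Keeps the first and last two characters; short identifiers are masked entirely.
function maskIdentifier(id: string): string {
  if (id.length <= 4) return "*".repeat(id.length);
  return id.slice(0, 2) + "*".repeat(id.length - 4) + id.slice(-2);
}

// Assembles the record so the raw identifier never reaches the metrics pipeline.
function buildAuthFailure(
  errorType: string,
  endpointType: "admin" | "user",
  userId: string,
): AuthFailureRecord {
  return { errorType, endpointType, maskedUser: maskIdentifier(userId) };
}
```

A masked value still lets a dashboard group repeated failures for the same account without exposing the identifier itself.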

trackRateLimitViolation records rate limit hits with the endpoint type, action (which rate limit was triggered), and severity (low, medium, high based on how far over the limit the request was). This helps distinguish between legitimate users hitting limits (low severity) and attacks (high severity).
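Severity "based on how far over the limit the request was" could be computed from the ratio of observed requests to the limit. The cut-offs below (1.5x and 3x) are assumptions picked for the example, not the values used in production.

```typescript
type Severity = "low" | "medium" | "high";

// Assumed cut-offs: up to 1.5x the limit is low, up to 3x is medium, beyond that is high.
function rateLimitSeverity(requestCount: number, limit: number): Severity {
  if (limit <= 0) throw new RangeError("limit must be positive");
  const ratio = requestCount / limit;
  if (ratio <= 1.5) return "low";
  if (ratio <= 3) return "medium";
  return "high";
}
```

A user who sends 110 requests against a limit of 100 lands in "low"; a client sending 1,000 lands in "high", which is the pattern worth treating as an attack.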

trackError records errors with categorization. Six error categories: AUTHENTICATION, AUTHORIZATION, VALIDATION, EXTERNAL_SERVICE, DATABASE, and INTERNAL. Each category feeds into a different CloudWatch alarm — a spike in EXTERNAL_SERVICE errors triggers a different response than a spike in VALIDATION errors.

trackSuccess records successful operations with the operation name, endpoint type, and optional duration. This provides the baseline for success rate calculations — you need to know the success count to calculate the error rate.

trackBusinessMetric records domain-specific metrics — daily active users, campaigns sent, messages delivered, projects created. These are the metrics that tell you if the platform is healthy from a business perspective, not just a technical one.

📊
Six tracking methods: response time, auth failures, rate limit violations, errors (6 categories), successes, and business metrics. All emitted to CloudWatch with endpoint type dimensions.

04 The Timeout Wrapper: Application-Level SLA Guarantees

Every Lambda function has a configurable execution timeout (3 seconds by default, up to 15 minutes), and API Gateway applies its own integration timeout (29 seconds by default). An SLA target is usually much shorter: 200ms for critical endpoints, 1000ms for standard endpoints. The timeout wrapper provides application-level SLA enforcement.

The wrapper wraps a Lambda handler with a configurable timeout. If the handler does not complete within the timeout, the wrapper aborts the operation and returns a fallback response — a 408 Request Timeout with a correlation ID. The handler's execution is cancelled, and the timeout is recorded as a metric.

The wrapper integrates with the endpoint classifier. It reads the endpoint type and applies the appropriate timeout — shorter for critical endpoints, longer for background tasks. It also integrates with GlobalMonitoring, recording the response time and the timeout event.

The fallback response is configurable. For read endpoints, the fallback might return cached data. For write endpoints, the fallback returns an error asking the client to retry. For health checks, the fallback returns a degraded status.

This is different from Lambda's built-in timeout. Lambda's timeout kills the entire execution — no cleanup, no response, no metrics. The application-level timeout returns a proper response, records metrics, and allows cleanup code to run. The user gets a fast error instead of a hanging request.
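One common way to build such a wrapper is to race the handler against a timer. The sketch below assumes a simplified handler signature and a hypothetical `onTimeout` callback for recording the metric; it is not the TCTF wrapper. Note that cancellation in JavaScript is cooperative: the `AbortSignal` only stops work if the handler (or the client it calls) observes it.

```typescript
interface HttpResponse {
  statusCode: number;
  body: string;
}

// Simplified handler shape: the signal lets the handler stop work once the deadline passes.
type Handler<E> = (event: E, signal: AbortSignal) => Promise<HttpResponse>;

function withTimeout<E>(
  handler: Handler<E>,
  timeoutMs: number,
  onTimeout: (elapsedMs: number) => void = () => {},
): (event: E, correlationId: string) => Promise<HttpResponse> {
  return async (event, correlationId) => {
    const started = Date.now();
    const controller = new AbortController();

    // Capture the resolver so the timer callback can settle the fallback promise.
    let fire: (response: HttpResponse) => void = () => {};
    const fallback = new Promise<HttpResponse>((resolve) => {
      fire = resolve;
    });

    const timer = setTimeout(() => {
      controller.abort(); // cooperative cancellation
      onTimeout(Date.now() - started); // e.g. record a timeout metric
      fire({
        statusCode: 408,
        body: JSON.stringify({ message: "Request Timeout", correlationId }),
      });
    }, timeoutMs);

    try {
      // Whichever settles first wins: the handler result or the 408 fallback.
      return await Promise.race([handler(event, controller.signal), fallback]);
    } finally {
      clearTimeout(timer); // avoid a stray timer after a fast response
    }
  };
}
```

The per-endpoint timeout would come from the classification, and a configurable fallback (cached data for reads, a retry hint for writes) would replace the fixed 408 body used here.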

05 Error Categories and Alerting

The six error categories in GlobalMonitoring are not arbitrary — they map to different operational responses.

AUTHENTICATION errors (wrong password, expired token) are expected in normal operation. A baseline rate of 2-5% is normal. An alert triggers when the rate exceeds 10% — indicating a possible credential stuffing attack or a broken auth flow.

AUTHORIZATION errors (insufficient permissions) should be rare in production. Any sustained rate above 1% triggers an alert — it usually means a frontend is calling an endpoint it should not, or a permission configuration is wrong.

VALIDATION errors (bad input) are common and expected. The rate varies by endpoint — a signup form might have 15% validation errors (users mistyping emails). Alerts trigger on sudden spikes, not absolute rates.

EXTERNAL_SERVICE errors (Cognito down, Stripe timeout, ipinfo.io failure) trigger immediate alerts because they indicate a dependency failure that affects multiple endpoints. The circuit breaker handles the immediate response; the alert ensures the team investigates.

DATABASE errors (DynamoDB throttling, conditional check failures) trigger alerts based on the error type. Throttling is a capacity issue — scale up. Conditional check failures are logic issues — investigate.

INTERNAL errors (unhandled exceptions, null references) are bugs. Any internal error triggers an alert because it means something unexpected happened that the error handling architecture did not anticipate.
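The per-category rules above can be expressed as one decision function over a counting window. This is a simplified sketch under assumptions: the window shape is invented, "sustained" is reduced to a single window, a validation "spike" is assumed to mean at least three times the previous window, and DATABASE errors alert on any occurrence without the routing by error type that the text describes. In practice these rules would live in CloudWatch alarms rather than application code.

```typescript
type ErrorCategory =
  | "AUTHENTICATION"
  | "AUTHORIZATION"
  | "VALIDATION"
  | "EXTERNAL_SERVICE"
  | "DATABASE"
  | "INTERNAL";

interface WindowCounts {
  errors: number;          // errors of this category in the current window
  total: number;           // successes plus errors in the current window
  previousErrors?: number; // same category, previous window (used for spike detection)
}

function shouldAlertOn(category: ErrorCategory, w: WindowCounts): boolean {
  // Error rate needs the success count too, which is why successes are tracked.
  const rate = w.total > 0 ? w.errors / w.total : 0;
  switch (category) {
    case "AUTHENTICATION":
      return rate > 0.1; // 2-5% is the normal baseline; alert above 10%
    case "AUTHORIZATION":
      return rate > 0.01; // should be rare; alert above 1%
    case "VALIDATION":
      // Spike relative to the previous window, not an absolute rate (3x is assumed).
      return w.previousErrors !== undefined && w.previousErrors > 0 && w.errors >= 3 * w.previousErrors;
    case "EXTERNAL_SERVICE":
    case "DATABASE":
    case "INTERNAL":
      return w.errors > 0; // any occurrence alerts
  }
}
```

A signup form with a steady 15% validation error rate stays quiet under this rule, while a single INTERNAL error does not.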

🚨
Each error category has different alerting thresholds. Auth errors: alert at 10%. Authorization: alert at 1%. External service: alert immediately. Internal: alert on every occurrence.

06 Business Metrics: Beyond Technical Health

Technical metrics tell you if the platform is running. Business metrics tell you if the platform is working.

trackBusinessMetric records domain-specific numbers: daily active users, messages sent per hour, campaigns delivered, projects created per day, proposals submitted, milestones completed, escrow funds released. These metrics have nothing to do with Lambda duration or DynamoDB read units — they measure whether users are actually using the platform.

A platform can be technically healthy (all endpoints responding, all databases available) and business-unhealthy (no one is signing up, no one is creating projects, no one is sending messages). Business metrics catch this.

The metrics are emitted with dimensions for segmentation — by endpoint type, by user tier (free, pro, premium), by region. This lets dashboards show business health per segment. If free-tier signups drop but premium signups are stable, the problem is in the onboarding flow, not the platform.

publishMetrics flushes all accumulated metrics to CloudWatch at the end of each Lambda invocation. This batches the CloudWatch API calls for efficiency — instead of one API call per metric, all metrics from a single invocation are published in one batch.
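The buffering behaviour can be sketched with an in-memory queue and an injected sender. Everything here is hypothetical: the `MetricBuffer` class, the `Sender` type, and the datum shape are stand-ins, and the real service would call CloudWatch where this sketch calls `send`. The default batch size of 1000 is an assumption based on the per-request metric limit of `PutMetricData`, so check the current quota before relying on it.

```typescript
interface MetricDatum {
  name: string;
  value: number;
  unit: "Count" | "Milliseconds";
  dimensions: Record<string, string>; // e.g. EndpointType, UserTier, Region
}

// Stand-in for the CloudWatch call; injected so the buffer can be exercised without AWS.
type Sender = (batch: MetricDatum[]) => Promise<void>;

class MetricBuffer {
  private pending: MetricDatum[] = [];
  private readonly send: Sender;
  private readonly maxBatchSize: number;

  constructor(send: Sender, maxBatchSize: number = 1000) {
    this.send = send;
    this.maxBatchSize = maxBatchSize;
  }

  // Called by the tracking methods during the invocation; no network traffic yet.
  record(datum: MetricDatum): void {
    this.pending.push(datum);
  }

  // Flushes everything recorded so far and returns the number of send calls made.
  async publishMetrics(): Promise<number> {
    const toSend = this.pending;
    this.pending = [];
    let calls = 0;
    for (let i = 0; i < toSend.length; i += this.maxBatchSize) {
      await this.send(toSend.slice(i, i + this.maxBatchSize));
      calls++;
    }
    return calls;
  }
}
```

For a typical invocation that records a handful of metrics, this results in a single send at the end rather than one call per metric.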

07 How It All Connects

The monitoring architecture forms a pipeline: classify the endpoint, track the metrics, alert on anomalies, respond to incidents.

Every Lambda handler starts by classifying its endpoint type. The withErrorHandling wrapper records the start time. Business logic runs. On success, trackSuccess and trackResponseTime are called. On failure, trackError is called with the appropriate category. On auth failure, trackAuthFailure is called. On rate limit, trackRateLimitViolation is called. At the end, publishMetrics flushes everything to CloudWatch.
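The per-invocation flow described above might look like the following. The `Monitoring` interface is a hypothetical instance-based stand-in for the static GlobalMonitoring class, the method signatures are guesses, and the `categorize` callback is an assumed hook for mapping exceptions to error categories.

```typescript
// Hypothetical, simplified view of the monitoring calls used in the flow.
interface Monitoring {
  trackSuccess(operation: string, durationMs: number): void;
  trackResponseTime(startMs: number): void;
  trackError(category: string, error: unknown): void;
  publishMetrics(): Promise<void>;
}

interface Result {
  statusCode: number;
  body: string;
}

function withErrorHandling<E>(
  operation: string,
  monitoring: Monitoring,
  logic: (event: E) => Promise<Result>,
  categorize: (error: unknown) => string = () => "INTERNAL",
): (event: E) => Promise<Result> {
  return async (event) => {
    const startMs = Date.now(); // the wrapper records the start time
    try {
      const result = await logic(event); // business logic runs
      monitoring.trackSuccess(operation, Date.now() - startMs);
      monitoring.trackResponseTime(startMs);
      return result;
    } catch (error) {
      monitoring.trackError(categorize(error), error);
      return { statusCode: 500, body: JSON.stringify({ message: "Internal error" }) };
    } finally {
      await monitoring.publishMetrics(); // flush once, on success and on failure
    }
  };
}
```

Putting the flush in `finally` means failed invocations still publish their metrics, which are the ones the alarms depend on most.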

CloudWatch dashboards show the metrics by endpoint type, by error category, by service. Alarms trigger on thresholds — different thresholds for different categories, different endpoint types. PagerDuty or SNS notifications reach the on-call team.

The result: the team knows within minutes when something is wrong, what kind of wrong it is, and which endpoints are affected. A payment processing degradation is detected and alerted differently from a feed loading slowdown. A brute-force attack is detected and alerted differently from a spike in validation errors. The monitoring system knows the difference because every metric carries its classification.

🔄
Classify → Track → Alert → Respond. Every metric carries its endpoint type and error category. The monitoring system knows the difference between a payment failure and a feed slowdown.

SLA monitoring is not about collecting more metrics. It is about collecting the right metrics with the right context. An error count without an error category is noise. A response time without an endpoint classification is misleading. A business metric without segmentation is incomplete. The GlobalMonitoring service and EndpointClassifier give every metric the context it needs — so the team can focus on what matters, ignore what does not, and respond to problems before users report them.

Editor's Note: This is Framework Series #16 in the TCTF Newsletter. Next in the series: Account Lifecycle as a State Machine — from signup to deletion.
