
Most microservices advice sounds great in conference talks and falls apart in production. At TCTF, we run 34 microservices on AWS Lambda — and the number is still growing. This article shares what we have learned so far about service boundaries, async communication, data isolation, failure handling, and observability. These are lessons from an active, evolving platform, not a polished retrospective.
Microservices are not a silver bullet — they are a tradeoff. You give up the simplicity of a single deployable unit and get back independent deployment, team autonomy, and the ability to scale and fail in isolation. But you also inherit a new class of problems: network failures between services, data consistency across boundaries, cascading outages, and operational complexity that grows with every service you add.
At TCTF, we started with a monolith, extracted services incrementally as the team and domain expanded, and learned along the way which patterns hold up under real traffic and which collapse under operational pressure. We have already made mistakes (splitting too aggressively, underinvesting in observability early on) and course-corrected. The architecture is not finished — it is evolving, and that is by design.
This article shares the patterns that are working for us right now: not theoretical patterns from architecture textbooks, but practical approaches to service decomposition, inter-service communication, data management, failure handling, and observability, shaped by real production experience on a platform that is actively growing. We also cover the most important pattern of all: knowing when microservices are the wrong choice.
The hardest part of microservices is deciding where to draw the boundaries. Too fine-grained and you have a distributed monolith with network overhead. Too coarse-grained and you have a monolith with extra steps.
Domain-Driven Design's bounded contexts provide the best heuristic: each service owns a business domain with clear boundaries. The user service owns authentication and profiles. The billing service owns payments and subscriptions. The social service owns posts, comments, and reactions.
At TCTF, we started with 8 services and now have 34 — with more on the horizon as the platform expands. The key rule: a service should be owned by one team and deployable independently. If two services always deploy together, they should be one service.
One early mistake we have already corrected: splitting too aggressively. We initially created separate services for user profiles, user preferences, and user settings. They were always deployed together, always queried together, and owned by the same team. We merged them back into a single user service — reducing network calls, simplifying the data model, and eliminating three deployment pipelines. The lesson: start coarse, split when you have a reason, not when you have an abstraction.
🏗️ If two services always deploy together, they should be one service. Start coarse, split when you have a reason. We merged 3 user-related services back into one — fewer network calls, simpler data model, three fewer pipelines.

Synchronous communication (HTTP/REST, gRPC) is simple but creates coupling. If Service A calls Service B synchronously, Service A's availability depends on Service B's availability. Chain three services together and your availability is the product of all three — 99.9% × 99.9% × 99.9% ≈ 99.7%, which is the difference between roughly 9 hours and 26 hours of downtime per year.
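The arithmetic is worth making concrete. A back-of-the-envelope calculation (plain TypeScript, nothing TCTF-specific) shows how quickly chained synchronous calls eat into an availability budget:

```typescript
// Availability of a synchronous call chain is the product of each hop's availability.
const perService = 0.999; // 99.9% per service
const hops = 3;
const chained = Math.pow(perService, hops); // ~0.997

const hoursPerYear = 24 * 365; // 8,760 hours
const downtimeSingle = (1 - perService) * hoursPerYear; // ~8.8 hours/year
const downtimeChained = (1 - chained) * hoursPerYear;   // ~26.3 hours/year

console.log({ chained, downtimeSingle, downtimeChained });
```

Every additional hop multiplies in another 0.999, so a five-hop chain is already down to about 99.5%.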
Asynchronous communication (SNS/SQS, EventBridge, Kafka) decouples services in time. Service A publishes an event and continues. Service B processes the event when it is ready. If Service B is down, the event waits in the queue.
At TCTF, we use synchronous calls for queries (read operations where the caller needs an immediate response) and asynchronous events for commands (write operations where the caller does not need to wait for completion). This pattern gives us the responsiveness of sync calls for reads and the resilience of async events for writes. (As we explain below when we get to data management, many cross-service reads skip the API call entirely and go straight to the data.)
The rule of thumb: if the caller needs the result to continue, use sync. If the caller just needs to know the work will eventually happen, use async. When in doubt, default to async — it is easier to add a sync wrapper around an async operation than to make a sync operation resilient to downstream failures.
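As a concrete example of the async side, here is a minimal sketch of publishing a command as an EventBridge event with the AWS SDK for JavaScript v3. The bus name, event source, and detail shape are illustrative placeholders, not our actual schema:

```typescript
import { EventBridgeClient, PutEventsCommand } from "@aws-sdk/client-eventbridge";

const eventBridge = new EventBridgeClient({});

// Fire-and-forget command: the caller only needs to know the work will eventually happen.
export async function requestInvoiceGeneration(orderId: string): Promise<void> {
  await eventBridge.send(
    new PutEventsCommand({
      Entries: [
        {
          EventBusName: "platform-bus", // placeholder bus name
          Source: "orders-service",     // placeholder event source
          DetailType: "InvoiceRequested",
          Detail: JSON.stringify({ orderId, requestedAt: new Date().toISOString() }),
        },
      ],
    })
  );
  // No waiting on the consumer: if it is down, the event is delivered when it recovers.
}
```

The caller returns as soon as EventBridge accepts the event; whichever service handles invoice generation picks it up on its own schedule.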
📡 If each service is up 99.9% of the time, chaining 3 sync calls drops your overall uptime to about 99.7% — roughly 17 extra hours of downtime per year. The more services in the chain, the worse it gets. Use sync only when the caller needs an immediate answer. For everything else, use async — the message waits in a queue, and no one is blocked.

The textbook rule says each service should own its data and no service should read another service's database directly. In theory, this prevents coupling. In practice, with DynamoDB single-table design and a small number of well-structured tables, the cost of enforcing strict API boundaries between every service is often higher than the coupling risk it prevents.
Here is the pragmatic reality at TCTF: making an API call from one service to another means a network round trip — Lambda invocation, API Gateway routing, cold start risk, retry logic, error handling. That is a lot of overhead when the data you need is sitting in a DynamoDB table that your function could read directly in single-digit milliseconds. With single-table design, we use fewer tables than we have services. The data model is designed around access patterns, not service boundaries. When a service needs data that lives in another service's table, we often grant read access directly rather than adding an API call that adds latency, complexity, and a failure point.
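In code, the difference is stark. Here is a minimal sketch of a direct read with the DynamoDB Document Client; the table name and the PK/SK key conventions are illustrative single-table placeholders, not our real schema:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Read another service's item directly instead of calling its API.
export async function getUserProfile(userId: string) {
  const { Item } = await ddb.send(
    new GetCommand({
      TableName: "platform-table",               // placeholder single-table name
      Key: { PK: `USER#${userId}`, SK: "PROFILE" }, // placeholder key convention
    })
  );
  return Item ?? null; // single-digit-millisecond read, no Lambda-to-Lambda hop
}
```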
Yes, this is technically an anti-pattern. We know. But we also know our table structure, our access patterns, and our team. The coupling risk is manageable because the tables are few, the schema is well-documented, and the team that owns the data is the same team that reads it. The tradeoff is worth it: lower latency, fewer failure points, and simpler code.
When services do need to stay in sync — especially for write operations and cross-service workflows — we use EventBridge. A user signs up, the auth service publishes an event, and EventBridge routes it to the services that need to react: the notification service sends a welcome email, the onboarding service creates the workspace, the analytics service records the signup. Each consumer processes the event independently. If one fails, the others are unaffected. EventBridge gives us the decoupling we need for writes without the overhead of service-to-service API calls.
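In CDK, wiring that fan-out is a single rule with multiple targets. A sketch, with the bus, event source, and function names as placeholders:

```typescript
import { Construct } from "constructs";
import * as events from "aws-cdk-lib/aws-events";
import * as targets from "aws-cdk-lib/aws-events-targets";
import * as lambda from "aws-cdk-lib/aws-lambda";

// One event, several independent consumers: each target receives its own copy.
export function wireSignupFanOut(
  scope: Construct,
  bus: events.IEventBus,
  welcomeEmailFn: lambda.IFunction,
  onboardingFn: lambda.IFunction,
  analyticsFn: lambda.IFunction
): void {
  new events.Rule(scope, "UserSignedUpRule", {
    eventBus: bus,
    eventPattern: { source: ["auth-service"], detailType: ["UserSignedUp"] },
    targets: [
      new targets.LambdaFunction(welcomeEmailFn),
      new targets.LambdaFunction(onboardingFn),
      new targets.LambdaFunction(analyticsFn),
    ],
  });
}
```

If one target fails, EventBridge retries only that target; deliveries to the other consumers are unaffected.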
We rarely make synchronous API calls between services. The pattern is: read directly from DynamoDB when you need data, publish events through EventBridge when you need coordination. This is not the textbook approach, but it is the approach that works for a platform with 34 services, single-table design, and a team that values simplicity over architectural purity.
🗄️ Direct DynamoDB reads between services: technically an anti-pattern, practically the right call with single-table design and fewer tables than services. EventBridge handles cross-service coordination. We rarely make service-to-service API calls — the overhead is not worth it when the data is a millisecond away.
In a monolith, a function call either succeeds or throws an exception. In microservices, a call can succeed, fail, time out, return garbage, or succeed on the remote end but fail to deliver the response. Every inter-service call is a potential failure point, and the failure modes are creative.
The first line of defense is the Lambda timeout itself. Every Lambda function at TCTF has a configured timeout — regardless of what happens downstream, the function will terminate after the timeout period. This is a hard boundary that prevents any single request from hanging indefinitely and consuming resources. It does not matter if a downstream service is slow, a DynamoDB query is stuck, or an external API is unresponsive — the Lambda will stop, return an error, and free the execution slot. This is the universal safety net that every service gets by default.
Retry with exponential backoff is the second line. Transient failures — network blips, cold starts, temporary throttling — resolve themselves if you wait and try again. But naive retries can amplify failures: if a service is overloaded, retrying immediately adds more load. Exponential backoff with jitter spreads retries over time and prevents thundering herds.
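A minimal retry helper with exponential backoff and full jitter looks something like this (generic TypeScript, not our production implementation):

```typescript
// Generic retry helper with exponential backoff and full jitter.
// Delay ceilings: 200ms, 400ms, 800ms, ... capped at maxDelayMs.
export async function withBackoff<T>(
  fn: () => Promise<T>,
  { maxAttempts = 5, baseDelayMs = 200, maxDelayMs = 5_000 } = {}
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
      const delay = Math.random() * ceiling; // full jitter spreads out retries
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```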
Circuit breakers are the third line. When a downstream service is consistently failing, the circuit breaker trips — subsequent calls fail immediately without hitting the downstream service. This prevents cascading failures where one unhealthy service drags down everything that depends on it. After a cooldown period, the circuit breaker allows a test request through. If it succeeds, the circuit closes and normal traffic resumes.
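The state machine behind a circuit breaker is small. Here is an illustrative in-memory sketch; a real Lambda fleet would typically keep the breaker state somewhere shared, since each execution environment has its own memory:

```typescript
type State = "CLOSED" | "OPEN" | "HALF_OPEN";

export class CircuitBreaker {
  private state: State = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  constructor(private failureThreshold = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("Circuit open: failing fast"); // skip the downstream call entirely
      }
      this.state = "HALF_OPEN"; // cooldown elapsed: allow one test request through
    }
    try {
      const result = await fn();
      this.state = "CLOSED"; // success closes the circuit and resets the failure count
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```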
But retries create a problem: if a POST or PUT request is retried, you risk creating duplicate records or applying the same operation twice. This is where our idempotency handler comes in. Every Lambda — not just GET handlers — is wrapped with an idempotency layer. When a request comes in, the handler checks if the same request (identified by a unique idempotency key) has been processed before. If it has, the cached response is returned immediately without re-executing the business logic. If it has not, the request is processed normally and the response is cached for subsequent calls. This means retries are safe for every HTTP method — POST, PUT, PATCH, DELETE — not just GET. A user who submits a payment form twice because of a slow network gets one charge, not two. A retried event that creates a project gets one project, not a duplicate. The idempotency handler turns what would be a dangerous retry into a free cache hit.
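The mechanics are simpler than they sound. The sketch below shows the core idea with a DynamoDB-backed cache; it is not our actual handler, and the table name and key schema are placeholders. A production version would also use a conditional write (or an off-the-shelf utility such as the Powertools for AWS Lambda idempotency module) so that two concurrent retries cannot race past the initial lookup:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, PutCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "idempotency-cache"; // placeholder table name

// Wrap a handler so that a repeated idempotency key returns the cached response
// instead of re-running the business logic.
export function makeIdempotent<TIn, TOut>(handler: (input: TIn) => Promise<TOut>) {
  return async (idempotencyKey: string, input: TIn): Promise<TOut> => {
    const cached = await ddb.send(
      new GetCommand({ TableName: TABLE, Key: { PK: idempotencyKey } })
    );
    if (cached.Item) return cached.Item.response as TOut; // retry becomes a cache hit

    const response = await handler(input); // first time: run the business logic
    await ddb.send(
      new PutCommand({
        TableName: TABLE,
        Item: {
          PK: idempotencyKey,
          response,
          expiresAt: Math.floor(Date.now() / 1000) + 86_400, // DynamoDB TTL, 24 hours
        },
      })
    );
    return response;
  };
}
```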
Dead-letter queues (DLQs) are the safety net for async processing. When an SQS message fails processing after the maximum retry count, it moves to the DLQ instead of being lost. We monitor DLQ depth with CloudWatch alarms — a growing DLQ means something is broken and needs attention. We also have automated DLQ reprocessing that retries messages after the underlying issue is fixed.
At TCTF, these patterns are not optional — they are built into our reusable CDK constructs. Every new service gets Lambda timeouts, retry policies, circuit breakers, idempotency handling, and DLQ monitoring automatically. The engineer building the service focuses on business logic. The infrastructure handles resilience.
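To give a flavor of what such a construct can look like, here is a hypothetical, trimmed-down version that bundles a hard timeout, a dead-letter queue, and a DLQ-depth alarm. Names, thresholds, and the exact set of defaults are illustrative rather than our real construct:

```typescript
import { Construct } from "constructs";
import { Duration } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as sqs from "aws-cdk-lib/aws-sqs";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";

export class ResilientFunction extends Construct {
  public readonly handler: lambda.Function;

  constructor(scope: Construct, id: string, props: lambda.FunctionProps) {
    super(scope, id);

    const dlq = new sqs.Queue(this, "Dlq", { retentionPeriod: Duration.days(14) });

    this.handler = new lambda.Function(this, "Fn", {
      timeout: Duration.seconds(10), // hard boundary: no request hangs forever
      deadLetterQueue: dlq,          // failed async invocations land here
      retryAttempts: 2,              // async retries before a message hits the DLQ
      ...props,                      // service-specific code, runtime, and overrides
    });

    new cloudwatch.Alarm(this, "DlqDepthAlarm", {
      metric: dlq.metricApproximateNumberOfMessagesVisible(),
      threshold: 1,
      evaluationPeriods: 1,
      alarmDescription: "Messages are accumulating in the DLQ",
    });
  }
}
```

The point is less the specific defaults and more that every new service inherits them without the engineer having to remember each one.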
🔌 Every Lambda has a configured timeout — the universal safety net. Every request (GET, POST, PUT, DELETE) is wrapped with an idempotency handler that caches responses, making retries safe and fast. Circuit breakers prevent cascading failures. DLQs catch async failures. All built into CDK constructs — not optional, automatic.

You cannot operate what you cannot observe. In a monolith, a stack trace tells you what went wrong. In microservices, a request might touch 5 services, and the error might be in any of them.
Three pillars of observability: structured logging (JSON logs with correlation IDs that trace a request across services), distributed tracing (X-Ray or Jaeger traces that show the full request path), and metrics (latency, error rate, and throughput per service).
At TCTF, every service uses AWS Powertools for structured logging and X-Ray for distributed tracing. Every log entry includes a correlation ID that is passed between services. When something goes wrong, we search for the correlation ID and see the full request journey across all 34 services.
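For illustration, a minimal handler using the Powertools Logger with a propagated correlation ID might look like this; the service name and header are placeholders rather than our exact conventions:

```typescript
import { randomUUID } from "node:crypto";
import { Logger } from "@aws-lambda-powertools/logger";
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

const logger = new Logger({ serviceName: "billing-service" }); // placeholder service name

export async function handler(event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> {
  // Reuse the caller's correlation ID if present, otherwise mint one,
  // and attach it to every log line emitted during this invocation.
  const correlationId = event.headers?.["x-correlation-id"] ?? randomUUID();
  logger.appendKeys({ correlationId });

  logger.info("Processing request", { path: event.path });
  // ... business logic ...

  return {
    statusCode: 200,
    headers: { "x-correlation-id": correlationId }, // pass it along to downstream calls
    body: JSON.stringify({ ok: true }),
  };
}
```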
We also track four golden signals per service: latency (p50, p95, p99), error rate (4xx and 5xx separately), throughput (requests per second), and saturation (concurrent executions as a percentage of the limit). CloudWatch dashboards show these signals for all 34 services on a single screen. Alarms fire when any signal deviates from its baseline by more than 2 standard deviations.
The investment in observability pays for itself the first time you debug a production issue. Without correlation IDs and distributed tracing, finding the root cause of a failure across 34 services is like finding a needle in a haystack. With them, it takes minutes.
🔍 Four golden signals per service: latency, error rate, throughput, saturation. Correlation IDs trace requests across 34 services. Without observability, debugging microservices is impossible. With it, root cause takes minutes.

The most important pattern in microservices is knowing when not to use them.
Microservices make sense when: you have multiple teams that need to deploy independently, your domains are well-understood and have clear boundaries, you need different technology choices for different parts of the system, and you have the operational maturity to manage distributed systems (observability, CI/CD, incident response).
Microservices do not make sense when: you have a small team (fewer than 10 engineers), your domain is not well-understood (you are still figuring out what to build), you do not have CI/CD and observability in place, or your application is simple enough that a monolith handles it well.
The honest truth: most applications should start as a monolith. A well-structured monolith with clear module boundaries is easier to develop, test, deploy, and debug than a poorly structured microservices architecture. You can always extract services later when you have a clear reason — team scaling, independent deployment needs, or technology divergence.
At TCTF, we started with a monolith. We extracted the first service (authentication) when the team grew and deployment conflicts became a daily problem. We are now at 34 services — each extraction driven by a specific need, not by architectural ambition — and the number will likely grow as the platform evolves.
The worst microservices architectures are the ones built on day one by a team of three. The best are the ones that evolve from a monolith as the team and domain grow.
🎯 Most applications should start as a monolith. Extract services when you have a reason: team scaling, deployment conflicts, technology divergence. The worst microservices are built on day one by a team of three.
Microservices are a tradeoff, not an upgrade. They make sense when you need independent deployment, team autonomy, and technology flexibility. They do not make sense when you have a small team, a simple domain, or no operational capacity for distributed systems. The patterns that work for us — bounded contexts, sync for reads / async for writes, pragmatic data access with EventBridge coordination, idempotent retries with circuit breakers, and deep observability — are not complex individually. The complexity is in applying them consistently across 34 services (and counting), training every engineer to follow them, and building the tooling that makes them automatic. The platform is still growing, and so are the lessons. Choose the patterns that match your constraints, not the patterns that match the conference talks.