Can cloud native go 20,000 transactions per second?
The short answer is: Yes!
This article provides several techniques and procedures for understanding bottlenecks and scaling your architecture to reach your target application SLA, focusing on performance and latency optimization.
FYI: a Cloud-Native solution can easily process more than 100,000 events per second, but every use case might bring scalability challenges or opportunities.
Prerequisite
This article is intended for an audience of cloud practitioners dealing with mission-critical applications.
We assume you are proficient in the following concepts/technologies:
- Build cloud native architecture in Azure
- API Management
- Azure Functions
- Cosmos DB
- JMeter
- A problem solving mindset [recommended: Apollo 13]
The problem
It was a typical day at Azure Space Center [this place might not exist], when I received an email from a customer having trouble managing more than 4,000 transactions per second (TPS) in a solution that was supposed to work with 20,000 TPS on both sunny and rainy days.
The most relevant observations on their architecture were:
- APIM and Function App using Premium SKU, but no Virtual Network integration or private endpoints
- The customer manually scaled up APIM, Function App and Cosmos DB to support a high volume of requests
- Load testing Pipeline with JMeter, ACI and Terraform
Solution
The reality is that there is no one-size-fits-all solution, and over time, when requirements change, your perfect solution becomes obsolete. Managing a mission-critical architecture requires constant monitoring, improvements, and architectural trade-offs.
We have been working for weeks on these issues, investigating, improving and monitoring the entire architecture, component by component.
The challenges required the expertise of an extended team: product experts, network experts, developers, performance engineers, principal architects and support teams from Microsoft.
First step: “I Was Blind But Now I See” John 9:25
The only information we were using to investigate our issue was the monitoring feature of APIM.
It was helpful to understand that we had a problem, but not enough to understand the root cause. In this situation, you need full observability of your end-to-end architecture. Application Insights helps you here.
In this case, it is essential to enable diagnostic logs and Application Insights in all the services and make sure they are pointing to the same workspace.
Second step: Be precise and read logs
Being precise is vital in this exercise. For every performance test, record when it started, when it finished, how many nodes you used, and so on. I recommend keeping all this information in a spreadsheet rather than relying on memory.
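If a spreadsheet feels too manual, the same annotations can be captured by a small script. The following is a minimal sketch and not part of the original setup: the class name `LoadTestRun`, its fields and the CSV layout are assumptions chosen for illustration, so adapt them to whatever you actually track.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import datetime, timezone


@dataclass
class LoadTestRun:
    """One row of the load-test log (all field names are illustrative)."""
    started_at: str
    finished_at: str
    nodes: int
    virtual_users_per_node: int
    notes: str = ""

    @property
    def total_virtual_users(self) -> int:
        # Derived value, not stored in the CSV.
        return self.nodes * self.virtual_users_per_node


def utc_now() -> str:
    """Timestamp in UTC so runs from different machines stay comparable."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


def append_run(stream, run: LoadTestRun, write_header: bool = False) -> None:
    """Append one run to a CSV stream, e.g. a file opened in append mode."""
    writer = csv.DictWriter(stream, fieldnames=[f.name for f in fields(LoadTestRun)])
    if write_header:
        writer.writeheader()
    writer.writerow(asdict(run))


if __name__ == "__main__":
    # Demo with an in-memory buffer; use open("runs.csv", "a", newline="") in practice.
    buffer = io.StringIO()
    start = utc_now()
    # ... the load test would run here ...
    run = LoadTestRun(start, utc_now(), nodes=10, virtual_users_per_node=500,
                      notes="baseline, no VNet integration")
    append_run(buffer, run, write_header=True)
    print(buffer.getvalue())
```

Keeping the log in a plain CSV file means it can still be opened in a spreadsheet when you compare runs.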
After each test, investigate performance and failures component by component.
We strongly recommend you open a bug or a task in your favourite project management tool for each of the issues you discover while investigating logs and exceptions. There might be different root causes for those issues: misconfiguration, development issues, networking, etc.
An additional source of information during the investigation is logs: Traces and Sessions.
Third step: Performance testing tools
While running tests using JMeter and the ACI architecture, several issues were encountered:
- Understanding the correct number of virtual users for each node (it depends on the computational power of your node)
- Understanding the maximum number of nodes that your controller can support. The maximum was just ten nodes in the initial load testing architecture: insufficient to generate enough load.
- Deploying the performance testing cluster as close as possible to your application to minimize latency
- Estimating the total number of virtual users needed, once several tests have established the average latency:
- (concurrent users) x (requests per user per second) = total requests per second
- requests per user per second depends on your overall processing time and latency
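The formula above can be turned into a quick calculation. This is a minimal sketch under a simplifying assumption: each virtual user sends a request, waits for the response, optionally pauses for a think time, and repeats, so it completes one request per cycle. The function names and the `think_time_s` parameter are ours, not JMeter settings.

```python
import math


def requests_per_user_per_second(avg_latency_s: float, think_time_s: float = 0.0) -> float:
    """A user completes one request per (latency + think time) cycle."""
    cycle_s = avg_latency_s + think_time_s
    if cycle_s <= 0:
        raise ValueError("latency plus think time must be positive")
    return 1.0 / cycle_s


def virtual_users_needed(target_rps: float, avg_latency_s: float,
                         think_time_s: float = 0.0) -> int:
    """Rearranged formula: users = total requests per second x cycle time."""
    cycle_s = avg_latency_s + think_time_s
    if cycle_s <= 0:
        raise ValueError("latency plus think time must be positive")
    # round() guards against float noise such as 1000.0000000000001
    return math.ceil(round(target_rps * cycle_s, 9))


def nodes_needed(total_users: int, users_per_node: int) -> int:
    """Load-generator nodes required for a given per-node capacity."""
    if users_per_node <= 0:
        raise ValueError("users_per_node must be positive")
    return math.ceil(total_users / users_per_node)
```

With a hypothetical 50 ms average latency and no think time, `virtual_users_needed(20_000, 0.05)` gives 1,000 virtual users. If each node could sustain 300 of them (again a made-up figure), `nodes_needed(1000, 300)` gives 4 nodes. Because latency usually grows under load, re-run the estimate with the latency measured during the test itself.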
After encountering several issues with the above architecture, we decided to leverage a new PaaS service to run performance testing: Azure Load Testing.
The key advantages of using Azure Load Testing are:
- PaaS solution: no need to provision or scale the load testing cluster
- Supporting JMeter
- Integrated monitoring features for real-time monitoring during a test
- Low latency
Fourth step: Ask for help
After days of running tests and checking logs, our crucial takeaway was that 40–50% of requests reaching APIM were not forwarded to the Azure Function. At that point, I learned that not all logs are available in Application Insights. Moreover, some internal logs (mainly networking related) of App Service, APIM and other PaaS services are not shared with the customer, to avoid data leakage and reverse engineering of IP. So, we engaged Microsoft Support.
We checked several diagnostic logs, and we learned that additional information on the platform can be gathered from the “Diagnose and solve problems” tab of each service.
We didn’t see any problem with the CPU or memory consumption of APIM, but the support team found a networking issue: they helped us understand that we were experiencing SNAT port exhaustion in APIM. What is it?
Ephemeral ports used for PAT are an exhaustible resource, as described in Standalone VM without a Public IP address and Load-balanced VM without a Public IP address. You can monitor your usage of ephemeral ports and compare with your current allocation to determine the risk of or to confirm SNAT exhaustion using this guide.
If you know that you’re initiating many outbound TCP or UDP connections to the same destination IP address and port, and you observe failing outbound connections or are advised by support that you’re exhausting SNAT ports (preallocated ephemeral ports used by PAT), you have several general mitigation options. Review these options and decide what is available and best for your scenario. It’s possible that one or more can help manage this scenario.
From Microsoft Documentation
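To build intuition for why this can happen at a few thousand requests per second, a back-of-the-envelope estimate helps. The sketch below is our own simplification and is not taken from the Microsoft documentation: it applies Little's law and assumes that every request opens a new outbound connection. The function names, the hold time and the `port_budget` parameter are hypothetical, so check the actual allocation and port reclaim behaviour for your service and SKU.

```python
def ports_held(new_connections_per_s: float, port_hold_time_s: float) -> float:
    """Little's law: ports held at steady state = arrival rate x hold time.

    port_hold_time_s is how long a port stays unavailable per connection
    (connection lifetime plus any delay before the platform reclaims it).
    """
    if new_connections_per_s < 0 or port_hold_time_s < 0:
        raise ValueError("rate and hold time must be non-negative")
    return new_connections_per_s * port_hold_time_s


def budget_utilisation(new_connections_per_s: float, port_hold_time_s: float,
                       port_budget: int) -> float:
    """Fraction of an assumed port budget in use; above 1.0 means exhaustion."""
    if port_budget <= 0:
        raise ValueError("port_budget must be positive")
    return ports_held(new_connections_per_s, port_hold_time_s) / port_budget
```

For example, with made-up numbers, `budget_utilisation(4000, 0.5, 1000)` returns 2.0: twice as many ports would be needed as are available, so roughly half of the outbound connections could not be opened. Reusing connections lowers the rate of new connections, which is why pooling is one of the usual mitigations.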
Reza (the most senior architect in our team) recommended understanding how to solve SNAT port exhaustion for each component: unblocking APIM alone would only have caused the Function App to hit the same issue afterwards.
Two additional interesting readings were:
- Troubleshooting client response timeouts and errors with API Management
- Troubleshooting intermittent outbound connection errors in Azure App Service
Fifth step: A better networking architecture
The SNAT port exhaustion issue could be resolved by introducing private endpoints between the services. Private endpoints bypass the internal load balancer and are not subject to SNAT port exhaustion, at least at this scale.
The target architecture required us to re-deploy APIM and the Function App inside Virtual Networks and expose Cosmos DB via a private endpoint.
In the diagram below you can find some implementation annotation that might be helpful if you are planning to do a similar change:
Implementing this change solved the SNAT port issues.
Sixth step: CI/CD and private service
After changing the Function App to private, the customer started to experience failures in their CI/CD pipeline. Troubleshooting was easy: Azure DevOps can’t deploy to a private network with no inbound connectivity. The customer needed to deploy a self-hosted agent on a virtual machine inside their VNet.
After provisioning and configuring a new self-hosted agent, the last step required is to update the pipeline to leverage the self-hosted agent pool instead of the default one.
Seventh step: Look inside the code
After implementing the VNet integration and private endpoints, we hoped to go back home and focus on a new customer issue, but it was not the case. We could see from the new logs that the Function App had become the new bottleneck.
We finally had to understand the whole picture and look at the code.
First of all, let’s describe the use case in more detail. The endpoint that we were testing is a user-facing endpoint. It should perform reads and writes in the database and then return a value to the user in less than 100 milliseconds.
A single execution of the endpoint, with no stress test ongoing, took from 10 to 15 milliseconds. The problem, of course, changes when you start to have thousands of concurrent executions.
We can’t share the code at this point, but we would like to highlight the areas that you should consider when scanning your code for performance issues:
- Profile your code;
- Introduce cache wherever it is possible (best scenario: cache at APIM level);
- Utilize connection pooling, reuse client connections for external services as much as possible (typically static HttpClient in Functions and singleton CosmosClient);
- Read carefully all the guidelines on any SDK you are using. In our case, using Cosmos DB, we found this helpful resource;
- Investigate every integration point with other services and their SLAs; if possible, remove external dependencies;
- Consider splitting a read and write endpoint into two different endpoints and introducing an asynchronous mechanism. Please consider the following patterns: CQRS pattern and Asynchronous Request-Reply pattern;
- Reduce the processing time as much as possible;
- If there is any retry mechanism, implement it less aggressively;
- Ensure that Cosmos DB has enough RUs;
- Implement error handling and logging for any exception: it will help you understand corner cases;
- Avoid any blocking call.
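Since we can’t share the customer code, here is a minimal, generic sketch of two items from the list above: reusing a single client instance, and retrying less aggressively with capped exponential backoff and jitter. Everything in it is illustrative: `FakeClient` is a hypothetical stand-in for a real SDK client (such as an HTTP or Cosmos DB client), and the default delays are assumptions, not recommendations from any SDK.

```python
import random
import time
from functools import lru_cache


class FakeClient:
    """Hypothetical stand-in for an SDK client that is expensive to create."""
    instances_created = 0

    def __init__(self) -> None:
        FakeClient.instances_created += 1


@lru_cache(maxsize=None)
def get_client() -> FakeClient:
    """Create the client once per process and reuse it on every invocation."""
    return FakeClient()


def backoff_delays(max_retries: int, base_s: float = 0.1, cap_s: float = 5.0,
                   rng=random.random):
    """Yield one delay per retry: exponential growth, capped, with full jitter."""
    for attempt in range(max_retries):
        yield rng() * min(cap_s, base_s * 2 ** attempt)


def call_with_retry(operation, max_retries: int = 3, sleep=time.sleep, **backoff):
    """Run operation(); on failure wait and retry, up to max_retries retries."""
    delays = backoff_delays(max_retries, **backoff)
    while True:
        try:
            return operation()
        except Exception:  # in real code, catch only errors known to be transient
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

A handler would call `get_client()` on every invocation instead of constructing a new client, and wrap outbound calls as `call_with_retry(lambda: do_request(get_client()))`, where `do_request` is whatever call you need to protect. The jitter spreads retries out so that thousands of concurrent executions do not retry at the same instant and amplify the original spike.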
Writing atomic functions is a best practice that can help you avoid this kind of long troubleshooting.
Improving the customer code is still ongoing as of today. Every week we improve something and, step by step, we get closer to our target. You might think: why are they so slow? Some customers might have another 1,000 endpoints with similar issues and limited time to optimize them.
Conclusion
Optimization of architecture is a never-ending process. This article covers some of the problem solving and methodology required to improve performance in a cloud-native architecture.
The most important lessons learnt that we would like to share are:
- Make sure your architecture is fully observable (Application Insights is a must-have tool, always enable it in every service you deploy);
- Performance testing is a matter of process and precision;
- Troubleshooting an issue is not a job for a lone wolf: when needed, ask for help!
- Your source code quality will always be the most tedious bottleneck to achieving your performance target, and increasing resources is not the answer…
About the author
Since 2012, Francesco has managed seven mobile projects as a Mobile Lead or CTO. He has covered several industries, from banking to lifestyle. Francesco has led teams of one to seventy developers on four continents. He has used technologies such as JavaScript, PhoneGap, Objective-C, Java, Kotlin, Swift, ReactNative and Flutter in his projects. His most successful project has almost 8 million active users a month.
Today, Francesco is working as Cloud Solution Architect for one of the major cloud providers, providing advice and supporting the customer journeys on application innovation and developer velocity.
Disclaimer
This article reflects my personal opinions, and it should not be attributed to the view of any of my past or present employers.