Can cloud native go 20,000 transactions per second?

Azure Load Testing stressing a cloud native architecture (API Management, Function App and Cosmos DB)

Prerequisite

Apollo 13: Problem solving workshop

The problem

Initial architecture
  • APIM and Function App using Premium SKU, but no Virtual Network integration or private endpoints
  • The customer manually scaled up APIM, Function App and Cosmos DB to support a high volume of requests
  • Load testing Pipeline with JMeter, ACI and Terraform

Solution

A picture of our team at the Azure Space Center

First step: “I Was Blind But Now I See” John 9:25

APIM built-in monitoring
Add Azure Monitor and Application Insights

Second step: Be precise and read logs

1. App Insights Performance tab; 2. Select the exact time and date of your test
You can investigate specific instances. To do so: 1. Click on Roles; 2. Check or uncheck the instances that you would like to investigate
1. App Insights Failure tab; 2. Summary of your endpoint/functions triggered; 3. Overview of the most common errors: recommended to explore each of those issues.
When you investigate a specific issue, you can explore your requests, the transactions (1), the details about a request (2) and the call stack (3)
Querying “traces” returns all the console logs of your application

Third step: Performance testing tools

Initial load testing architecture
  • Understand the correct number of virtual users for each node (it depends on the computational power of your node)
  • Understand the maximum number of nodes that your controller can support. The maximum was just ten nodes in the initial load testing architecture: insufficient to generate enough load.
  • Deploy the performance testing cluster as closely as possible to your application to minimize latency
  • After several tests, once you know the average latency, estimate the total number of virtual users needed:
  • (concurrent users) x (requests per user per second) = total requests per second
  • Requests per user per second depends on your overall processing time and latency
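
As a rough sketch of the estimate above, the following Python snippet turns a target throughput and an average response time into a number of virtual users and load-generator nodes. The function names and the figures used (250 ms average response time, 250 virtual users per node) are illustrative assumptions, not measurements from this project. It also assumes each virtual user sends requests back to back unless a think time is given.

```python
import math


def required_virtual_users(target_rps: float,
                           avg_response_s: float,
                           think_time_s: float = 0.0) -> int:
    """Concurrent virtual users needed to sustain target_rps.

    One user completes a request every (avg_response_s + think_time_s)
    seconds, so requests per user per second = 1 / that cycle time, and
    users = target_rps / (requests per user per second).
    """
    cycle_s = avg_response_s + think_time_s
    if cycle_s <= 0:
        raise ValueError("response time plus think time must be positive")
    # round() guards against floating-point noise before rounding up
    return math.ceil(round(target_rps * cycle_s, 9))


def required_nodes(virtual_users: int, users_per_node: int) -> int:
    """Load-generator nodes needed if each node can drive users_per_node."""
    if users_per_node <= 0:
        raise ValueError("users_per_node must be positive")
    return math.ceil(virtual_users / users_per_node)


# Hypothetical inputs: 20,000 requests/s target, 250 ms average response
# time, and nodes that can each drive 250 virtual users.
users = required_virtual_users(target_rps=20_000, avg_response_s=0.25)
nodes = required_nodes(users, users_per_node=250)
print(users, nodes)  # 5000 20
```

With these assumed inputs the estimate calls for more nodes than a ten-node controller limit would allow, which is the kind of gap worth spotting before the test runs.
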
Screenshot from Azure Load Testing (dummy data)
Screenshot from Azure Load Testing (dummy data): detail on monitored resources
  • PaaS solution: no need to provision or scale the load testing cluster
  • Supporting JMeter
  • Integrated monitoring features for real-time monitoring during a test
  • Low latency

Fourth step: Ask for help

APIM, Diagnose and solve problems
SNAT Port exhaustion in the internal LB of APIM

Fifth step: A better networking architecture

SNAT exhaustion solution: private endpoint!
Networking architecture (Target)
Function App Networking configuration
Networking architecture (Target): implementation details. Point 2: Configure NSG

Sixth step: CI/CD and private service

1. Install the agent and configure ADO; 2. Communication between the self-hosted agent and ADO requires outbound traffic on ports 443 (TCP) and 53 (UDP). The best practice is to use a firewall to manage outbound traffic.
A quick reference on how to update the pipeline in Azure DevOps

Seventh step: Look inside the code

Like Gary in Apollo 13, we had to get our hands dirty with coding
  1. Profile your code;
  2. Introduce caching wherever possible (best scenario: cache at APIM level);
  3. Use connection pooling and reuse client connections for external services as much as possible (typically a static HttpClient in Functions and a singleton CosmosClient);
  4. Read carefully all the guidelines on any SDK you are using. In our case, using Cosmos DB, we found this helpful resource;
  5. Investigate every integration point with other services and their SLAs; if possible, remove outside dependency;
  6. Consider splitting reads and writes into two different endpoints and introducing an asynchronous mechanism. Consider the following patterns: the CQRS pattern and the Asynchronous Request-Reply pattern;
  7. Reduce processing time as much as possible;
  8. If there is any retry mechanism, implement it less aggressively;
  9. Ensure that Cosmos DB has enough RUs;
  10. Implement error handling and logging for any exception: it will help you understand corner cases;
  11. Avoid any blocking call.
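
Point 8 is often implemented as exponential backoff with jitter, so that clients that failed together do not all retry at the same instant. The snippet below is a minimal, language-agnostic sketch written in Python, not the code used in this project: `TransientError`, `backoff_delays` and `call_with_backoff` are hypothetical names, and the default delays are arbitrary. Many SDKs ship their own retry policies, so check those before adding a custom one.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. throttling)."""


def backoff_delays(max_retries: int, base_s: float = 0.2,
                   cap_s: float = 10.0) -> list[float]:
    """Upper bound of the wait before each retry: base * 2^attempt, capped."""
    return [min(cap_s, base_s * (2 ** attempt)) for attempt in range(max_retries)]


def call_with_backoff(operation: Callable[[], T],
                      max_retries: int = 4,
                      base_s: float = 0.2,
                      cap_s: float = 10.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Run operation, retrying only TransientError with jittered backoff.

    sleep and rng are injectable so the behaviour can be tested without
    real waiting or randomness.
    """
    delays = backoff_delays(max_retries, base_s, cap_s)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retry budget spent: surface the error to the caller
            # "Full jitter": wait a random fraction of the current ceiling
            sleep(rng() * delays[attempt])
    raise AssertionError("unreachable")
```

Capping both the number of retries and the delay keeps a struggling dependency from being hit harder exactly when it is already overloaded.
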

Conclusion

  1. Make sure your architecture is fully observable (Application Insights is a must-have tool: always enable it in every service you deploy);
  2. Performance testing is a matter of process and precision;
  3. Troubleshooting an issue is not a job for a lone wolf: when needed, ask for help!
  4. Your source code quality will always be the most tedious bottleneck to achieving your performance target, and increasing resources is not the answer…
Being successful is a matter of teamwork, proper processes, problem solving and a bit of creativity ❤

About the author

Problem solver, CTO, customer-centric mindset. Hoping to raise awareness on the complexity of being a bridge between business and technology

Francesco De Liva
