Performance and Capacity by CMG 2014

Earlier this year I met Alex Podelko and contributed a few comments to his blog. A few months later came an invite to speak at CMG's Performance and Capacity conference (CMG Performance and Capacity 2014) about our take on performance engineering and testing here at Netflix. Keeping in mind that one of our main goals is to "move fast", and that performance engineers can struggle in such a constantly changing environment, I decided to focus my talk on "How to Ensure Performance in a Fast-Paced Environment". Here's the full abstract:

Netflix accounts for more than a third of all traffic heading into American homes at peak hours. Making sure users are getting the best possible experience at all times is no simple feat, and performance is at the core of this experience. In order to ensure performance and maintain development agility in a highly decentralized environment, Netflix employs a multitude of strategies, such as production canary analysis, fully automated performance tests, simple zero-downtime deployments and rollbacks, auto-scaling clusters and a fault-tolerant stateless service architecture. We will present a set of use cases that demonstrate how and why different groups employ different strategies to achieve a common goal: great performance and stability. We will also detail how these strategies are incorporated into development, test and DevOps with minimal overhead.

Since most of my effort today goes into developing new performance-focused tools and techniques, in order to be more productive, evangelize performance engineering and scale our efforts, it made sense to focus the presentation on the new things we are building. It took me a while (and many revisions) to get the presentation the way I wanted. As usual, I changed half the content the night before the event.


The overall feedback was really good, better than expected actually. I decided to go over a few things we do that are big no-nos in many large (and old) companies, and sometimes that is not well received. Attendees were really interested in the tools and how we leverage all of them to achieve great performance, especially Canary Analysis, the performance test framework, automated analysis, the Monkeys and Scryer. There were lots of great comments about the presentation itself, which was more "lively" than others, and about the content as well. They liked the fact that we do things differently from other organizations, think outside the box and develop things on our own.

I was also scheduled to participate in three panels. The first one was about new workloads, "Measuring New Workloads: Cloud Analytics, Mobile, Social", hosted by Elisabeth Stahl, with Steve Weisfeldt from Neotys and me as panelists. The panel was really interesting and we got a lot of questions around AWS and how we run all* our streaming infrastructure there. There were also many questions on big data and how we leverage it to analyze user data and understand behavior, as well as lots of questions around client devices and how we do real user monitoring (RUM) on them.

The second panel was "Modern Industry Trends and Performance Assurance", hosted by Alex Podelko, with Mohit Verma (Tufts Health Plan), Steve Weisfeldt (Neotys), Ellen Friedman (MUFG Union Bank) and me as panelists. We had a great discussion around performance testing: when, why and how to test systems; automating performance tests and automated analysis; what can and cannot be automated; A/B testing; and the value of testing in production and leveraging real user load. Again, there were lots of questions around our take on performance testing and the tools and techniques, especially the test framework, plus some questions around the size of our tests and environment. We are pushing the boundaries of performance testing and engineering, and learning along the way. It was clear that we are trying things other organizations would not even consider, and that puts us in a great place for innovation. One interesting question we got was around automated analysis: what should or should not be automated? My first response was, obviously, Automate All The Things! But for multiple reasons that's not really effective, so I came up with a simple way of finding good candidates. If your test goal is to VALIDATE something, a pass or fail, that's a great candidate for automation. If your test goal is to LEARN something about a system, it is not. What do you think?

Automate all the things!

The last panel was around APM, "APM Tools and Technologies: What Do You Need?", also hosted by Alex, with David Halbig (First Data), Craig Hyde (Rigor), Charles Johnson (Metron) and me as panelists. It focused mostly on how to analyze, choose and buy APM tools, what they should or should not include, and so on. I have to admit that I didn't have a lot to add to the tool-buying discussion, but I tried to point out how we tried a few different tools and none worked really well for us, for one reason or another, so we decided to fill the gaps and build our own set of tools to achieve the same goal: transactional and deep-stack performance monitoring. I don't like the idea of spending a lot of effort trying to make a tool work for us when we can create something of our own and make it adapt to us. We already have great monitoring tools in place, like Atlas, and we are creating others to give us more insight into user transactions and demand. Creating our own tools gives us the flexibility to collect only what we need, from the right sources, and to act on it easily, manually or in an automated fashion. It also allows us to consume the data the way we see fit and that makes sense for us. Obviously, such an endeavor doesn't make sense for everyone; you need the scale to support it.

I also attended a few interesting sessions. Alex's talk on load testing tools shed some light on the various aspects that should be taken into account when choosing a tool. Open source or commercial? Availability of experienced professionals? Protocols? Environment? Features? Kudos for mentioning many great open source tools. Another interesting session was Peter Johnson's (Unisys) workshop-like CMG-T on Java. It was geared towards beginners, but it had great content on Java tuning, especially garbage collection.

Besides all the presentations and panels, I met so many amazing people there and had great conversations. I can't mention everyone here, but I wanted to at least give a shout-out to Kevin Mobley, from CMG's board. I think we share the same view on performance engineering, and we had a great chat about his vision for the future of CMG as a group and for the conference. I'm happy to collaborate more in the future!

Were you there? What were your thoughts on the presentation and panels? Any interesting questions you would like to bring up for discussion? Just leave a comment!

P.S.: You can find references to the tools and articles in the slide deck. There are also a few backup slides covering things I could not fit into the presentation.

Scaling Performance Tests in the Cloud with JMeter

I have very heterogeneous performance test use cases, from simple performance regression tests executed from a Jenkins node to occasional large-ish stress tests that run at over 100K requests per second with more than 100 load generators. At higher loads, many problems arise: feeding data to load generators, retrieving results, getting a real-time view, analyzing huge data sets and so on.

JMeter is a great tool, but it has its limitations. In order to scale, I had to work around a few of them and built a test framework to help me execute tests at scale on Amazon's EC2.

Having a central data feeder was a problem. Using JMeter's master node is impossible at that scale, and a single shared data source might become a bottleneck, so having a way of distributing the data was important. I thought about using a feeder model similar to Twitter's Iago, or a clustered, load-balanced resource, but settled for something simpler. Since most tests only use a limited data set and loop over it, I decided to bzip the data files and upload them to each load generator before the test starts. This way I avoid making an extra request for data during execution and requesting the same data multiple times because of the loop. One problem with this approach is that I don't have centralized control over the data set, since every load generator uses the same input. I mitigate that by managing the data locally on each load generator, with a hash function or by introducing random values. I also considered distributing different files to different load generators based on a hash function, but so far there has been no need.
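Here is a minimal sketch of what that pre-upload step could look like, assuming plain scp access to the load generators; the hostnames, file names and paths are placeholders, not the actual framework code.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Sketch: push a pre-compressed data file to every load generator before the
// test starts. Each generator gets the full data set and loops over it
// locally, so no central feeder is needed during the run.
public class DataFeedUploader {

    public static void main(String[] args) throws IOException, InterruptedException {
        List<String> loadGenerators = Arrays.asList("lg-1.example.com", "lg-2.example.com");
        String localFile = "testdata.csv.bz2";                 // bzipped input data
        String remotePath = "/mnt/jmeter/testdata.csv.bz2";    // hypothetical path on the generator

        for (String host : loadGenerators) {
            Process scp = new ProcessBuilder("scp", localFile, host + ":" + remotePath)
                    .inheritIO()
                    .start();
            if (scp.waitFor() != 0) {
                throw new IOException("Upload failed for " + host);
            }
        }
    }
}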

Retrieving results was tricky too. Again, using JMeter's master node was impossible because of the amount of traffic. I tried having a poller fetch raw results (only timestamp, label, success and response time) in real time, but that affected the results. Downloading everything at the end of the test worked, by checking the status of the test (running or not) every minute and downloading after completion, but I settled on a custom sampler in a tearDown thread group that compresses the results and uploads them to Amazon's S3. This could definitely be a plugin too. It works reasonably well, but I lose the real-time view and have to manually add a file writer and the sampler to tests.
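For reference, here's a rough sketch of what such a tearDown sampler could look like, using JMeter's Java sampler API and the AWS SDK for Java. The parameter names, the bucket and the choice of gzip are illustrative assumptions, not the exact code I run.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.jmeter.config.Arguments;
import org.apache.jmeter.protocol.java.sampler.AbstractJavaSamplerClient;
import org.apache.jmeter.protocol.java.sampler.JavaSamplerContext;
import org.apache.jmeter.samplers.SampleResult;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

// Sketch of a custom Java sampler meant for a tearDown thread group: it
// compresses the local results file and uploads it to S3 once the test ends.
public class ResultUploadSampler extends AbstractJavaSamplerClient {

    @Override
    public Arguments getDefaultParameters() {
        Arguments args = new Arguments();
        args.addArgument("resultsFile", "/mnt/jmeter/results.jtl");  // local JTL file
        args.addArgument("bucket", "my-perf-results");               // hypothetical bucket
        return args;
    }

    @Override
    public SampleResult runTest(JavaSamplerContext context) {
        SampleResult result = new SampleResult();
        result.sampleStart();
        try {
            File raw = new File(context.getParameter("resultsFile"));
            File compressed = new File(raw.getPath() + ".gz");
            gzip(raw, compressed);

            // Uses the default AWS credentials chain on the load generator.
            AmazonS3 s3 = new AmazonS3Client();
            s3.putObject(context.getParameter("bucket"), compressed.getName(), compressed);
            result.setSuccessful(true);
        } catch (Exception e) {
            result.setSuccessful(false);
            result.setResponseMessage(e.toString());
        } finally {
            result.sampleEnd();
        }
        return result;
    }

    private static void gzip(File in, File out) throws Exception {
        FileInputStream fis = new FileInputStream(in);
        GZIPOutputStream gzos = new GZIPOutputStream(new FileOutputStream(out));
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = fis.read(buffer)) > 0) {
                gzos.write(buffer, 0, read);
            }
        } finally {
            fis.close();
            gzos.close();
        }
    }
}

In practice something like this just needs to be dropped into JMeter's lib/ext and referenced from a Java Request sampler inside the tearDown thread group.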

For the real-time view, I started with the same approach as jmeter-ec2, polling aggregated data (average response time, rps, etc.) from each load generator and printing it, but that proved useless with a large number of load generators. For now, in Java samplers, I'm using Netflix's Servo to publish metrics in real time (averaged over a minute) to our monitoring system, and I'm considering writing a listener plugin that could use the same approach to publish data from any sampler. From the monitoring system I can then analyze and plot real-time data with minor delays. Another option I'm considering is the same approach, but with StatsD and Graphite.
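To illustrate the StatsD/Graphite option, a bare-bones publisher only needs to fire UDP packets in the StatsD wire format; a listener could call it for every sample it receives. The host, port and metric names below are placeholders, and this is not the Servo-based publisher I currently use.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

// Minimal fire-and-forget StatsD timing publisher: the StatsD/Graphite stack
// does the per-minute aggregation and plotting on the other end.
public class StatsdPublisher implements AutoCloseable {

    private final DatagramSocket socket;
    private final InetAddress host;
    private final int port;

    public StatsdPublisher(String host, int port) throws Exception {
        this.socket = new DatagramSocket();
        this.host = InetAddress.getByName(host);
        this.port = port;
    }

    // Publish one sampler response time, e.g. from a JMeter listener.
    public void timing(String metric, long millis) throws Exception {
        byte[] payload = (metric + ":" + millis + "|ms").getBytes("UTF-8");
        socket.send(new DatagramPacket(payload, payload.length, host, port));
    }

    @Override
    public void close() {
        socket.close();
    }

    public static void main(String[] args) throws Exception {
        StatsdPublisher statsd = new StatsdPublisher("statsd.example.com", 8125);
        try {
            statsd.timing("jmeter.homepage.responseTime", 123);
        } finally {
            statsd.close();
        }
    }
}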

Analyzing huge result sets was the biggest challenge, I believe. For that, I developed a web-based analysis tool. It doesn't store raw results, but mostly time-based aggregated statistical data from both JMeter and the monitoring systems, allowing some data manipulation for analysis and automatic comparison of result sets. Aggregating and analyzing tests with over 1B samples is a problem, even after constant tuning. Loading all data points into memory to sort them and calculate percentiles is practically impossible, simply because the amount of memory needed is impractical, even with small objects. For now, on large tests, I settled on aggregating data while loading results (second/minute data points) and accepting the statistical problems, like averages of averages. Another option would be to analyze results from each load generator independently and aggregate at the end. In the future, I'm considering putting results on a Hadoop cluster and using Map/Reduce to get the aggregated statistical data back.
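The "aggregate while loading" idea boils down to keeping only per-interval buckets (count, sum, min, max, errors) instead of raw samples, so memory use stays flat no matter how many samples there are. Here's a simplified sketch; the CSV layout (timestamp, elapsed, label, success) is assumed for illustration, and percentiles are intentionally left out.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;
import java.util.TreeMap;

// Stream raw results and fold them into per-second statistical buckets.
public class StreamingAggregator {

    static final class Bucket {
        long count, sum, errors;
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;

        void add(long elapsed, boolean success) {
            count++;
            sum += elapsed;
            min = Math.min(min, elapsed);
            max = Math.max(max, elapsed);
            if (!success) {
                errors++;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Map<Long, Bucket> perSecond = new TreeMap<Long, Bucket>();

        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                long second = Long.parseLong(fields[0]) / 1000L;     // timestamp in ms
                long elapsed = Long.parseLong(fields[1]);            // response time in ms
                boolean success = Boolean.parseBoolean(fields[3]);

                Bucket bucket = perSecond.get(second);
                if (bucket == null) {
                    bucket = new Bucket();
                    perSecond.put(second, bucket);
                }
                bucket.add(elapsed, success);
            }
        } finally {
            reader.close();
        }

        for (Map.Entry<Long, Bucket> entry : perSecond.entrySet()) {
            Bucket b = entry.getValue();
            System.out.printf("%d count=%d avg=%.1f min=%d max=%d errors=%d%n",
                    entry.getKey(), b.count, (double) b.sum / b.count, b.min, b.max, b.errors);
        }
    }
}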

The framework also helps me automate most of the test process, like creating a new load generator cluster on EC2, copying test artifacts to load generators, executing and monitoring the test while it’s running, collecting results and logs, triggering analysis, tearing down the cluster and cleaning up after the test completes.
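The cluster lifecycle part is mostly straightforward EC2 API calls. The sketch below shows only the launch and teardown pieces using the AWS SDK for Java; the AMI id and instance type are placeholders, and everything in between (copying artifacts, running the test, collecting results) is omitted.

import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

// Launch a set of load generator instances before the test and terminate
// them afterwards.
public class LoadGeneratorCluster {

    private final AmazonEC2 ec2 = new AmazonEC2Client();

    // Launch the requested number of load generators and return their ids.
    public List<String> launch(String amiId, int count) {
        RunInstancesRequest request = new RunInstancesRequest()
                .withImageId(amiId)                // hypothetical pre-baked JMeter AMI
                .withInstanceType("m3.xlarge")
                .withMinCount(count)
                .withMaxCount(count);

        List<String> instanceIds = new ArrayList<String>();
        for (Instance instance : ec2.runInstances(request).getReservation().getInstances()) {
            instanceIds.add(instance.getInstanceId());
        }
        return instanceIds;
    }

    // Tear the cluster down once results and logs have been collected.
    public void terminate(List<String> instanceIds) {
        ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(instanceIds));
    }
}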

Most of this was written in Java or Groovy and I hope to open-source the analysis tool in the future.

Running on PageSpeed again

Just finished setting it up on Nginx.


amber:~ mojo$ curl -I 'http://overloaded.io/' | grep X-Page-Speed
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
X-Page-Speed: 1.5.27.3-3005

Since I moved from Apache to Nginx, PageSpeed had been disabled, or rather, not even installed.

Should be writing something about PageSpeed and the install process soon!