Earlier this year I met Alex Podelko and contributed a few comments to his blog. A few months later came the invite to speak at CMG’s Performance and Capacity conference (CMG Performance and Capacity 2014) about our take on Performance Engineering and Testing here at Netflix. Keeping in mind that one of our main goals here is to “move fast”, and that performance engineers can struggle in a constantly changing environment like that, I decided to focus my talk on “How to Ensure Performance in a Fast-Paced Environment”. Here’s the full abstract:
Netflix accounts for more than a third of all traffic heading into American homes at peak hours. Making sure users are getting the best possible experience at all times is no simple feat, and performance is at the core of this experience. In order to ensure performance and maintain development agility in a highly decentralized organization, Netflix employs a multitude of strategies, such as production canary analysis, fully automated performance tests, simple zero-downtime deployments and rollbacks, auto-scaling clusters and a fault-tolerant stateless service architecture. We will present a set of use cases that demonstrate how and why different groups employ different strategies to achieve a common goal, great performance and stability, and detail how these strategies are incorporated into development, test and DevOps with minimal overhead.
Since most of my effort today goes into developing new performance-focused tools and techniques in order to be more productive, evangelize performance engineering and scale our efforts, it made sense to focus the presentation on the new things we are developing. It took me a while (and many revisions) to get the presentation the way I wanted. As usual, I changed half the content the night before the event.
[slideshare id=41167973&doc=cmg2014publish-141105115540-conversion-gate01]
The overall feedback was really good. Better than expected, actually. I decided to go over a few things we do that are big no-nos in many large (and old) companies, and sometimes that is not well received. Attendees were really interested in the tools and how we leverage all of them to achieve great performance, especially Canary Analysis, the performance test framework, automated analysis, the Monkeys and Scryer. I got lots of great comments about the presentation itself, which was more “lively” than other presentations, as well as about the content. People liked the fact that we do things differently from other organizations, think outside the box and develop things on our own.
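Since Canary Analysis came up so often, here is a minimal sketch of the general idea, just to illustrate: a small canary cluster runs the new build alongside a control cluster running the current build, and their metrics are compared before the rollout proceeds. This is not our actual implementation; the metric names, values and thresholds below are all hypothetical.

```python
# Hypothetical illustration of canary analysis: compare a canary cluster
# (new build) against a control cluster (current build) metric by metric.
# All names and numbers are made up for illustration only.

CONTROL = {"latency_p99_ms": 118.0, "error_rate": 0.0010, "cpu_util": 0.62}
CANARY  = {"latency_p99_ms": 125.0, "error_rate": 0.0011, "cpu_util": 0.64}

# Maximum acceptable canary/control ratio per metric (illustrative thresholds).
MAX_RATIO = {"latency_p99_ms": 1.15, "error_rate": 1.25, "cpu_util": 1.20}

def canary_score(control, canary, max_ratio):
    """Return the fraction of metrics where the canary stays within bounds."""
    ok = 0
    for metric, limit in max_ratio.items():
        ratio = canary[metric] / control[metric]
        status = "ok" if ratio <= limit else "regressed"
        print(f"{metric}: canary/control = {ratio:.2f} (limit {limit}) -> {status}")
        ok += ratio <= limit
    return ok / len(max_ratio)

if __name__ == "__main__":
    score = canary_score(CONTROL, CANARY, MAX_RATIO)
    # A real system would gate the rollout (or trigger a rollback) on a score like this.
    print("Proceed with rollout" if score == 1.0 else "Roll back the canary")
```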
I was also scheduled to participate in three panels. The first one was about new workloads, “Measuring New Workloads: Cloud Analytics, Mobile, Social”, hosted by Elisabeth Stahl, with Steve Weisfeldt (Neotys) and me as panelists. The panel was really interesting and we had a lot of questions around AWS and how we run all* our streaming infrastructure there. There were also many questions on big data and how we leverage it to analyze user data and understand user behavior, as well as lots of questions around client devices and how we do real user monitoring (RUM) on them.
The second panel was “Modern Industry Trends and Performance Assurance”, hosted by Alex Podelko, with Mohit Verma (Tufts Health Plan), Steve Weisfeldt (Neotys), Ellen Friedman (MUFG Union Bank) and me as panelists. We had a great discussion around performance testing: when, why and how to test systems; automating performance tests and automated analysis; what can and cannot be automated; A/B testing; and the value of testing in production and leveraging real user load. Again, there were lots of questions around our take on performance testing and our tools and techniques, especially the test framework, plus some questions around the size of our tests and environment. We are pushing the boundaries of performance testing and engineering, and learning along the way. It was clear that we are trying things other organizations would not even consider, and that puts us in a great place for innovation. One interesting question we got was around automated analysis: what should and should not be automated? My first response was, obviously, Automate All The Things! But for multiple reasons that is not really effective. I came up with a simple way of finding good candidates: if your test goal is to VALIDATE something, a pass or fail, it is a great candidate for automation; if your test goal is to LEARN something about a system, it is not. What do you think?
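To make the VALIDATE case concrete, here is a rough sketch of what an automated pass/fail check on a test run could look like. Again, this is not our actual framework; the metrics, baselines and tolerances are made up for illustration.

```python
# Hypothetical example: automated pass/fail validation of a performance test.
# Each metric from the test run is compared against a baseline, and the test
# fails if the regression exceeds an agreed tolerance. Names and numbers are
# illustrative only.

BASELINE  = {"latency_p99_ms": 120.0, "error_rate": 0.001, "cpu_util": 0.65}
TOLERANCE = {"latency_p99_ms": 1.10, "error_rate": 1.50, "cpu_util": 1.15}  # allowed ratio vs. baseline

def validate(results: dict) -> bool:
    """Return True if every metric stays within tolerance of its baseline."""
    passed = True
    for metric, baseline in BASELINE.items():
        observed = results[metric]
        limit = baseline * TOLERANCE[metric]
        if observed > limit:
            print(f"FAIL {metric}: {observed} > {limit:.4f} (baseline {baseline})")
            passed = False
        else:
            print(f"PASS {metric}: {observed} <= {limit:.4f}")
    return passed

if __name__ == "__main__":
    # Results from a (hypothetical) test run
    run = {"latency_p99_ms": 131.0, "error_rate": 0.0009, "cpu_util": 0.70}
    print("Overall:", "PASS" if validate(run) else "FAIL")
```

A LEARN-style test, by contrast, usually needs a human looking at profiles, trends and unexpected behavior, which is exactly why it is a poor fit for this kind of automation.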
The last panel was around APM, “APM Tools and Technologies: What Do You Need?”, also hosted by Alex, with David Halbig (First Data), Craig Hyde (Rigor), Charles Johnson (Metron) and me as panelists. It focused mostly on how to analyze, choose and buy APM tools, what they should or should not include, and so on. I have to admit that I didn’t have a lot to add to the tool-buying discussion, but I tried to point out how we tried a few different tools and none worked really well for us, for one reason or another, so we decided to fill the gaps and build our own set of tools that would achieve the same goal: transactional and deep-stack performance monitoring. I don’t like the idea of spending a lot of effort trying to make a tool work for us when we can create something on our own and make it adapt to us. We already have great monitoring tools in place, like Atlas, and we are creating others to give us more insight into user transactions and demand. Creating our own tools gave us the flexibility we needed to collect only what we need, from the right sources, and easily act on it, manually or in an automated fashion. It also allows us to consume the data in the way we see fit and that makes sense for us. Obviously, such an endeavor doesn’t make sense for everyone. You need the scale to support it.
I also attended a few interesting sessions. Alex’s talk on load testing tools shed some light on the various aspects that should be taken into account when choosing a tool. Open source vs. commercial? Availability of experienced professionals? Protocols? Environment? Features? Kudos for mentioning many great open source tools. Another interesting session was Peter Johnson’s (Unisys) workshop-like CMG-T on Java. It was geared towards beginners, but had great content on Java tuning, especially Garbage Collection.
Besides all the presentations and panels, I met so many amazing people there and had great conversations. I can’t mention everyone here, but I wanted to at least give a shout-out to Kevin Mobley, from CMG’s board. I think we share the same view on performance engineering, and we had a great chat about his vision for the future of CMG as a group and of the conference. I’m happy to collaborate more in the future!
Were you there? What were your thoughts on the presentation and panels? Any interesting questions you would like to bring up for discussion? Just leave a comment!
p.s.: You can find references to the tools and articles in the slide deck. There are also a few backup slides with things I could not fit into the presentation.