The Open Proxy Saga

A couple of weeks ago I was messing with a few Apache configs, trying a few things that could improve server performance. Everything was fine until late last week, when I noticed the site was really slow. Initially I thought it was a connectivity issue, but after a couple of hours I decided to troubleshoot it. The first thing to do was check the logs for any possible explanation. I found two interesting things:

[error] server reached MaxClients setting, consider raising the MaxClients setting

That is interesting, especially because I tweaked the MaxClients setting not too long ago and traffic has not increased significantly since then. The second interesting piece of information was the number of GETs to external domains. That can’t be right. Why would users be requesting pages from other domains?

The first thing I thought was ‘Damn, I’m serving as an open proxy!’, and I was right! I went to check the Apache configs and found:

ProxyRequests On

ProxyRequests was set to On, meaning that Apache was serving as an open proxy.

The second thing I checked was the server statistics. Interestingly, three days ago the memory usage increased significantly, and so did the bandwidth utilization. The extra memory was coming from more Apache processes, which showed exactly when it all started. But how did I start getting so many requests so quickly? I Googled it. My IP was listed on several open proxy lists, including the status, the latency and even the last check time. That is awesome! They probably have bots port scanning all around. One of these bots found my IP and published it somewhere, and the list was replicated and replicated from here to Japan!

Obviously I don’t want to be serving as an open proxy, for several reasons, so I went and changed ProxyRequests back to Off. Right after I changed it, I saw the logs growing enormously. That’s when I noticed the extent of the problem. I had been serving hundreds of concurrent users, a pretty good burn test for the server. And guess what, after days like that, it was still rock solid!
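
For reference, after the fix the relevant bit of the configuration looks roughly like this. It is a minimal sketch assuming an Apache 2.2-style setup with mod_proxy still loaded, not my exact file:

# Disable forward proxying; mod_proxy can stay loaded (e.g. for reverse
# proxying) without acting as an open proxy.
ProxyRequests Off

# Belt and braces: even if ProxyRequests were ever switched back on,
# deny forward-proxy access to everyone (Apache 2.2 access control syntax).
<Proxy *>
    Order deny,allow
    Deny from all
</Proxy>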

Now for the second part of the saga. After turning ProxyRequests back to Off, besides the huge increase in logging (errors only), the CPU spiked to a load average of 22 on a four-processor server. That’s a lot, for those not familiar with Linux. An increase in logging was expected, since we’re getting far more errors now that user requests for other domains are failing. An increase in CPU usage was also expected, since the number of requests to my main page increased significantly (failed proxy requests end up at the default Apache site), but not as much as 22.

Checking the logs again, I noticed a huge number of errors stating that the URL was too long. All of these ‘long’ URLs had the same format: an external domain followed by ‘http’ repeated over and over, like ‘http://www.google.com/httphttphttphttphttphttphttphttphttphttp…’. That was strange; why would someone request a website like this? Then I decided to try using my server as a proxy. The same thing happened: I tried google.com and was automatically sent into a redirect loop that kept appending ‘http’ to my requests, until hitting a limit of 20 or so redirects. This means that every proxy request by a user was generating over 20 requests on my server. The next step was to check why that was happening. Time to ‘telnet’ my server on port 80:

GET http://www.google.com/ HTTP/1.1
Host: www.google.com

That returned an HTTP 301 (Moved Permanently) response, pointing to the same domain but with ‘http’ appended to the address. Good, the same behavior we had in the browser. Now, why is this happening? Looking a little further into it, I found that when Apache gets a request for a domain that is not in your virtual host list, it responds with the default virtual host, or with the first virtual host loaded if you have not defined one explicitly. My default virtual host is my main website, a WordPress-based site. Analyzing this further, I found that when WordPress receives a request for an unknown page, it redirects to a standard page instead of returning an HTTP 404 (Not Found) error. That is called canonical URL redirection, and it is used for a number of reasons, from enabling alternative URLs to ‘fancy’ permalinks. That explains the loop: Apache hands the request to the default website running WordPress, which automatically redirects to the same non-existent domain, appending the requested page to the address. Since the ‘:’ character is not valid in an address, WordPress stops there. And since the user still has my server set as a proxy, the process starts again, this time with an extra ‘http’, and so on, in an infinite loop. So how do we disable that?

I found a simple how-to at velvetblues.com. You simply have to add the following line to your theme’s ‘functions.php’:

// Drop WordPress's canonical URL redirection so unknown requests get a 'page not found' response instead of a redirect.
remove_filter('template_redirect','redirect_canonical');

I tried that and it worked! Now when I try to use my Apache server as a proxy, all requests return a WordPress page saying that the page was not found. Problem solved!

Not quite; we still have part three of the saga. I waited a few minutes and checked the server statistics again. CPU usage had dropped significantly, to a load average of 5. Still a lot, but much better than the 22 we had before. The server was responding quickly, but I was still not satisfied. I don’t like the fact that a lot of leechers are consuming a lot of resources on my server. How can I improve that, assuming that leechers will keep trying to use my server as a proxy for a while before figuring out it is not working anymore? To solve this we have plenty of options, from simple ones to more complex ones, like adding modules to Apache or using ‘iptables’ to block users that request domains that are not in the virtual host list. I didn’t want to waste too much time on this since it’s not critical, so I opted for a very simple solution. Dynamic pages are very resource-intensive compared to static pages, and I don’t really care about serving a ‘nice’ page to users that are trying to use my server as a proxy. So why not show these users a simple HTML page instead of my WordPress website? Since Apache serves the default website for requests that do not match any virtual host on the list, I decided to simply change the default website to the well-known Apache ‘It works!’ page. To do so, I just had to enable the default Apache site that was already there, just not enabled.
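
For illustration, here is a minimal sketch of what that default catch-all site can look like; the paths follow a Debian-style layout and are assumptions, not my exact setup:

# Default (catch-all) virtual host: because it is the first one loaded, any
# request whose Host header matches no configured ServerName lands here and
# gets the static 'It works!' page instead of WordPress.
<VirtualHost *:80>
    ServerAdmin webmaster@localhost
    DocumentRoot /var/www
</VirtualHost>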

Guess what? It worked. Requests to non-mapped domains were now served a simple ‘It works!’ page. I waited a few minutes, checked the server statistics again, and wow, the load average went down to 0.1. Problem solved. Serving simple static pages reduced the CPU usage drastically. Now I just had to deal with the error log file.

That was easy: since all ‘undesired’ users were now being ‘redirected’ to the default Apache website, it was just a matter of changing the error log level. I went and changed the following line in the default site’s configuration:

LogLevel crit

This will only log critical errors, which are not the errors we’re getting now, solving the log file issue. Oh, and remember to comment out the CustomLog line too, to avoid access logging, which is even worse.
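
Put together, the logging section of the default site’s configuration ends up looking roughly like this; a sketch only, with the usual Debian-style log paths assumed:

# Only log critical errors for the catch-all default site; the flood of
# failed proxy requests never reaches this level.
LogLevel crit
ErrorLog /var/log/apache2/error.log

# Access logging commented out for this site so the log does not explode.
#CustomLog /var/log/apache2/access.log combined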

Cheers, Martin

How MySpace Tested Their Live Site with 1 Million Concurrent Users

In December of 2009, MySpace launched a new wave of streaming music video offerings in New Zealand, building on the previous success of MySpace Music. These new features included the ability to watch music videos, search for an artist’s videos, create lists of favorites, and more. The anticipated load increase from a feature like this on a popular site like MySpace is huge, and they wanted to test these features before making them live.

If you manage the infrastructure that sits behind a high-traffic application, you don’t want any surprises. You want to understand your breaking points, define your capacity thresholds, and know how to react when those thresholds are exceeded. Testing the production infrastructure with actual anticipated load levels is the only way to understand how things will behave when peak traffic arrives.

For MySpace, the goal was to test an additional 1 million concurrent users on their live site stressing the new video features. The key word here is ‘concurrent’. Not over the course of an hour or a day… 1 million users concurrently active on the site. It should be noted that 1 million virtual users are only a portion of what MySpace typically has on the site during its peaks. They wanted to supplement the live traffic with test traffic to get an idea of the overall performance impact of the new launch on the entire infrastructure. This requires a massive amount of load generation capability, which is where cloud computing comes into play. To do this testing, MySpace worked with SOASTA to use the cloud as a load generation platform.

Here are the details of the load that was generated during testing. All numbers relate to the test traffic from virtual users and do not include the metrics for live users:

  • 1 million concurrent virtual users
  • Test cases split between searching for and watching music videos, rating videos, adding videos to favorites, and viewing artist’s channel pages
  • Transfer rate of 16 gigabits per second
  • 6 terabytes of data transferred per hour
  • Over 77,000 hits per second, not including live traffic
  • 800 Amazon EC2 large instances used to generate load (3200 cloud computing cores)

Test Environment Architecture

SOASTA CloudTest™ manages calling out to cloud providers, in this case Amazon, and provisioning the servers for testing. The process for grabbing 800 EC2 instances took less than 20 minutes. Calls were made to the Amazon EC2 API, requesting servers in chunks of 25. In this case, the team was requesting EC2 Large instances with the following specs to act as load generators and results collectors:

  • 7.5 GB memory
  • 4 EC2 Compute Units (2 virtual CPU cores with 2 EC2 Compute Units each)
  • 850 GB instance storage (2×420 GB plus 10 GB root partition)
  • 64-bit platform
  • Fedora Core 8

In addition, there were 2 EC2 Extra-Large instances to act as the test controller instance and the results database, with the following specs:

  • 15 GB memory
  • 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
  • 1,690 GB instance storage (4×420 GB plus 10 GB root partition)
  • 64-bit platform
  • Fedora Core 8
  • PostgreSQL Database

Once it has all of the servers that it needs for testing, it begins doing health checks on them to ensure that they are responding and stable. As it finds dead servers, it discards them and requests additional servers to fill in the gaps. Provisioning the infrastructure was relatively easy. The diagram (figure 1) below shows how the test cloud on EC2 was set up to push massive amounts of load into MySpace’s datacenters.

While the test is running, batches of load generators report their performance test metrics back to a single analytics service. Each of the analytics services connects to the PostgreSQL database to store the performance data in an aggregated repository. This is part of the way that tests of this magnitude can scale to generate and store so much data: by limiting access to the database to only the metrics aggregators and scaling out horizontally.

Challenges

Because scale tends to break everything, there were a number of challenges encountered throughout the testing exercise.

The test was limited to using 800 EC2 instances

SOASTA is one of the largest consumers of cloud computing resources, routinely using hundreds of servers at a time across multiple cloud providers to conduct these massive load tests. At the time of testing, the team was requesting the maximum number of EC2 instances that it could provision. The limitation in available hardware meant that each server needed to simulate a relatively large number of users. Each load generator was simulating between 1,300 and 1,500 users. This level of load was about 3x what a typical CloudTest™ load generator would drive, and it put new levels of stress on the product that took some creative work by the engineering teams to solve. Some of the tactics used to alleviate the strain on the load generators included:

  • Staggering every virtual user’s requests so that the hits per load generator were not all firing at once
  • Paring down the data being collected to only include what was necessary for performance analysis

A large portion of MySpace assets are served from Akamai, and the testing repeatedly maxed out the service capability of parts of the Akamai infrastructure

CDNs typically serve content to site visitors based on their geographic location, from a point of presence closest to them. If you generate all of the test traffic from, say, Amazon’s East Coast availability zone, then you are likely going to be hitting only one Akamai point of presence. Under load, the test was generating a significant amount of data transfer and connection traffic towards a handful of Akamai datacenters. This equated to more load on those datacenters than would probably be generated during typical peaks, but that would not necessarily be unrealistic given that this feature launch was happening for New Zealand traffic only. This stress resulted in new connections being broken or refused by Akamai at certain load levels, generating lots of errors in the test.

This is a common hurdle that needs to be overcome when generating load against production sites. Large-scale production tests need to be designed to take this into account and accurately stress entire production ecosystems. This means generating load from multiple geographic locations so that the traffic is spread out over multiple datacenters. Ultimately, understanding the capacity of geographic POPs was a valuable takeaway from the test.

Because of the impact of the additional load, MySpace had to reposition some of their servers on-the-fly to support the features being tested

During testing, the additional virtual user traffic was stressing some of the MySpace infrastructure pretty heavily. MySpace’s operations team was able to grab underutilized servers from other functional clusters and use them to add capacity to the video site cluster in a matter of minutes. Probably the most amazing thing about this is that MySpace was able to actually do it. They were able to monitor capacity in real time across the whole infrastructure and elastically shrink and expand where needed. People talk about elastic scalability all of the time, and it’s a beautiful thing to see in practice.

Lessons Learned

  1. For high traffic websites, testing in production is the only way to get an accurate picture of capacity and performance.  For large application infrastructures there are far too many ‘invisible walls’ that can show up if you only test in a lab and then try to extrapolate.
  2. Elastic scalability is becoming an increasingly important part of application architectures.  Applications should be built so that critical business processes can be independently monitored and scaled.  Being able to add capacity relatively quickly is going to be a key architecture theme in the coming year and the big players have known this for a long time.  Facebook, eBay, Intuit, and many other big web names have evangelized this design principle.  Keeping things loosely coupled has a whole slew of benefits that have been advertised before, but capacity and performance are quickly moving to the front of that list.
  3. Real-time monitoring is critical.  In order to react to capacity or performance problems, you need real-time monitoring in place.  This monitoring should tie in to your key business processes and functional areas, and needs to be as real time as possible.

Via highscalability.com