Scaling Web Applications: My "3 Questions" Philosophy


As I see it, any reasonable plan for scaling web applications is going to address 3 questions:

  1. How can we accomplish more work?
  2. How can we give the appearance of accomplishing more work?
  3. How can we avoid doing work that doesn’t matter?

So, how do we address these questions? Well, first, we remember that to the end user, the perception that the application is functioning correctly and with reasonable speed is of utmost importance. Everything else lines up behind these two considerations. Since perception is reality, we quickly discover that we can cheat, and the challenge comes in determining where, when, and how to do so.

Accomplishing more work

  1. Vertical scaling - This is the simplest way to scale: we increase the specs of the servers doing the work. It is also an expensive way to scale, but that expense is offset by the hours of developer time we might save. Of course, at some point, a more powerful single server is either unavailable or fiscally irresponsible to acquire.
  2. Horizontal scaling - A more cost-effective way to accomplish more work is to distribute it among more servers. At a minimum, this generally involves adding web servers on the front end, and it usually includes the application servers (which may or may not be an integrated part of the web front end). It may also include storage tiers, such as space for files (S3, perhaps) and database servers or key/value stores. A toy sketch of the request-distribution idea follows this list.
  3. Optimizations - We can optimize software. And by “optimize,” I mean the most honest kind of optimization: refactoring code to eliminate bottlenecks without changing the user-facing functionality in any way except to make it faster. This might include tuning database keys, denormalizing the database, or tweaking an algorithm in the application code or a configuration setting on our server. This also costs money, of course, but the cost is measured in hours of developer time. The second sketch after this list shows what such a refactoring can look like.
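
To make the request-distribution idea in item 2 concrete, here is a minimal TypeScript sketch of round-robin selection over a pool of application servers. The upstream addresses and the pickUpstream helper are illustrative; in practice this job belongs to a load balancer such as nginx or HAProxy, not hand-rolled code.

```typescript
// Toy round-robin selection over a pool of application servers.
// The addresses below are placeholders for real upstream hosts.
const upstreams = [
  "http://app1.internal:8080",
  "http://app2.internal:8080",
  "http://app3.internal:8080",
];

let next = 0;

// Pick the next upstream in rotation, spreading requests evenly.
function pickUpstream(): string {
  const chosen = upstreams[next];
  next = (next + 1) % upstreams.length;
  return chosen;
}

// Each incoming request is forwarded to a different server in turn.
for (let i = 0; i < 6; i++) {
  console.log(`request ${i} -> ${pickUpstream()}`);
}
```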
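
And for item 3, here is the flavor of an “honest” optimization: the same user-facing result, computed faster. The Customer and Order shapes are hypothetical; the refactoring replaces a repeated linear scan with a one-time lookup table, much as adding a database index would.

```typescript
interface Customer { id: number; name: string; }
interface Order { id: number; customerId: number; }

// Before: for each order, scan the whole customer list -- O(n * m).
function customerNamesSlow(orders: Order[], customers: Customer[]): string[] {
  return orders.map(
    (o) => customers.find((c) => c.id === o.customerId)?.name ?? "unknown"
  );
}

// After: build the lookup table once, then do O(1) lookups -- O(n + m).
// The output is identical; only the speed changes.
function customerNamesFast(orders: Order[], customers: Customer[]): string[] {
  const byId = new Map(customers.map((c): [number, string] => [c.id, c.name]));
  return orders.map((o) => byId.get(o.customerId) ?? "unknown");
}
```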

Keeping up appearances

  1. Deferred execution - Not every process that results from a web request needs to be executed at the time the request is made. For instance, we might deploy a queueing system to handle sending new signup e-mails in the background. This allows the web server to respond immediately to the signup request while queueing up the e-mail to be sent after the HTTP response is finished. This can be implemented as a fault-tolerance measure (when the SMTP server hiccups, we don’t want to make the user wait to finish the signup request), or it can be handled by a dedicated worker tier. That, of course, lends itself to horizontal scaling as well. A minimal sketch follows this list.
  2. Lazy loading - It’s possible to load the basic outline of a page, and then “fill in” any blanks left by longer-running tasks via AJAX. With JavaScript, you can even defer the loading until the part of the page containing the data is displayed. The Disqus comment system does this – you’ll notice that sites which use Disqus don’t load their comments until you scroll down to view them, thus saving load on the Disqus servers and allowing the initial page load to complete more quickly. A browser-side sketch follows this list.
  3. Caching - The question when it comes to caching is almost never whether to cache, but what to cache, where to cache it, and for how long. A caching strategy should exist early, because the chosen strategy may dictate future network topology or server configuration requirements. The important thing to note is that a cache hit is orders of magnitude faster than the operation required to populate the cache, and within certain tolerances the cached result is just as relevant and acceptable to the user. For a cache populated by a truly complicated database query, no amount of hardware we can throw at the problem will produce a response as fast as a cache hit. The third sketch after this list shows the basic get-or-compute pattern.
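
Here is a minimal sketch of the deferred-execution idea from item 1, using an in-process queue. In production this would more likely be a Redis- or RabbitMQ-backed queue drained by a separate worker tier; sendWelcomeEmail stands in for a real SMTP call.

```typescript
type Job = () => Promise<void>;

const queue: Job[] = [];

function enqueue(job: Job): void {
  queue.push(job);
}

// Stand-in for a real SMTP call, which may be slow or flaky.
async function sendWelcomeEmail(address: string): Promise<void> {
  console.log(`sending welcome e-mail to ${address}`);
}

// The request handler responds immediately and defers the slow work.
function handleSignup(email: string): { status: string } {
  enqueue(() => sendWelcomeEmail(email));
  return { status: "ok" }; // the user never waits on SMTP
}

// The worker drains the queue outside the request/response cycle.
async function worker(): Promise<void> {
  while (true) {
    const job = queue.shift();
    if (job) {
      try {
        await job();
      } catch {
        queue.push(job); // naive retry; real queues track attempts
      }
    } else {
      await new Promise((r) => setTimeout(r, 100)); // idle poll
    }
  }
}

void worker();
console.log(handleSignup("new.user@example.com"));
```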
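
For item 2, a browser-side sketch: nothing is fetched until the comments container scrolls into view, which is roughly how Disqus-style embeds behave. The #comments selector and /api/comments endpoint are made-up names for illustration.

```typescript
const container = document.querySelector<HTMLElement>("#comments");

// Fill in the blank left in the page via an AJAX call.
async function loadComments(el: HTMLElement): Promise<void> {
  const res = await fetch(
    `/api/comments?page=${encodeURIComponent(location.pathname)}`
  );
  el.innerHTML = await res.text();
}

if (container) {
  const observer = new IntersectionObserver((entries) => {
    for (const entry of entries) {
      if (entry.isIntersecting) {
        observer.disconnect();        // load at most once
        void loadComments(container); // fetch only now that it's visible
      }
    }
  });
  observer.observe(container); // no request is made until the user scrolls here
}
```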
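
And for item 3, the basic get-or-compute (cache-aside) pattern in miniature. The Map stands in for a shared cache such as memcached or Redis, and expensiveReport for a slow database query.

```typescript
const cache = new Map<string, string>();

// Placeholder for a complicated, slow database query.
async function expensiveReport(key: string): Promise<string> {
  await new Promise((r) => setTimeout(r, 500)); // simulate 500ms of work
  return `report for ${key}`;
}

async function getReport(key: string): Promise<string> {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;        // cache hit: effectively free
  const value = await expensiveReport(key); // cache miss: the slow path
  cache.set(key, value);
  return value;
}

// The first call pays the full cost; later calls return almost instantly.
getReport("q3-sales").then(() => getReport("q3-sales"));
```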

Avoiding pointless work

  1. Caching (again) - How stale is too stale, and how many users would a potentially stale piece of data affect? For instance, let’s say that a page is being accessed by hundreds of users, thousands of times a minute. One part of that page lists recently updated items in the database under a header of “most recent items.” The vast majority of users don’t know precisely what time these items are being created, nor do they care, so long as there are some sufficiently recent items listed there. Now, let’s say the queries required to populate that list are computationally expensive. Does it make sense to force that query to run every time the page is accessed, just so that one user out of hundreds can see that his or her item is on the list? Maybe it’s sufficient to display a notice to users when they create an item, telling them that it has been created, and to word the header above the list as “some recent items,” thus setting expectations appropriately. The answer depends on the situation, but the question is worth considering. A TTL-based sketch follows this list.
  2. Eventual consistency - Have you ever looked at your Twitter stream and seen a response to a tweet that comes earlier in the stream than the tweet it’s responding to? I have, but only rarely. This is because Twitter propagates tweets through its system with a goal of eventual consistency. This just means that while everyone may not have all of the data at any given instant, they will all have all of the data relevant to them in due time. This is “good enough” in Twitter’s case, and it allows them to scale much more effectively than if they required absolute consistency – if everyone’s stream had to include a tweet before anyone could see it, and all tweets in a stream needed to be in perfect chronological order. That would take a lot more horsepower to accomplish in a timely manner, for negligible gain. This is relevant when considering any type of replication or sharding. A toy model follows this list.
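
To make the “how stale is too stale?” question from item 1 concrete, here is a sketch that answers it with a 60-second TTL. queryRecentItems stands in for the computationally expensive query, and 60 seconds is an illustrative tolerance, not a universal one.

```typescript
interface CacheEntry { value: string[]; expiresAt: number; }

const TTL_MS = 60_000; // our chosen staleness tolerance
let recentItems: CacheEntry | null = null;

// Stand-in for the computationally expensive query.
async function queryRecentItems(): Promise<string[]> {
  return ["item 41", "item 40", "item 39"];
}

async function someRecentItems(): Promise<string[]> {
  const now = Date.now();
  if (recentItems && recentItems.expiresAt > now) {
    return recentItems.value; // possibly up to 60s stale -- acceptable here
  }
  const value = await queryRecentItems(); // at most one query per minute
  recentItems = { value, expiresAt: now + TTL_MS };
  return value;
}
```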
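
For item 2, a toy model of eventual consistency: writes are fanned out to each follower’s timeline asynchronously, so a reader may briefly see timelines that disagree, but all of them converge once the delivery queue drains. This is a simplification for illustration, not Twitter’s actual architecture.

```typescript
interface Tweet { id: number; text: string; at: number; }

const timelines = new Map<string, Tweet[]>([["alice", []], ["bob", []]]);
const deliveries: { user: string; tweet: Tweet }[] = [];

// Fan out on write: enqueue one delivery per follower and return
// immediately, instead of waiting for every timeline to update.
function postTweet(tweet: Tweet, followers: string[]): void {
  for (const user of followers) deliveries.push({ user, tweet });
}

// Apply one pending delivery; repair ordering as tweets arrive.
function deliverOne(): void {
  const d = deliveries.shift();
  if (!d) return;
  timelines.get(d.user)?.push(d.tweet);
  timelines.get(d.user)?.sort((a, b) => a.at - b.at);
}

postTweet({ id: 1, text: "original", at: 1 }, ["alice", "bob"]);
postTweet({ id: 2, text: "reply", at: 2 }, ["alice", "bob"]);
deliverOne(); // alice sees the original; bob sees nothing yet -- inconsistent
while (deliveries.length > 0) deliverOne(); // ...and now everyone agrees
console.log(timelines);
```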

Questions, corrections, or fundamental flaws in my general premise? Put ‘em in the comments. Thanks in advance!
