Scaling Instrumental with Scala

Background

Instrumental, the product I've been working on at Fastest Forward for the past few years, is written in Ruby. In general, Ruby's been a great win for us: easy to write readable, testable app logic, lots of tools available to help with deploy and infrastructure automation, and, of course, great for developing ideas quickly.

Ruby does have some well known drawbacks, however: concurrency is not a major area of effort for the community (or for the most widely used Ruby implementation, MRI), CPU-intensive tasks are typically slower than in other languages, and performance tooling support is pretty lackluster.

About a year ago, we were forced to migrate from Linode to AWS due to unpredictably variable IO performance for our primary database (along with a host of smaller issues). While AWS had some great tools to help us scale our infrastructure to meet our performance needs exactly (provisioned IOPS, SNS+SQS), we found that the cost to keep the same performance level was about 5x higher.

The Problem

The largest contributor to our high hosting cost was the money we were paying for high CPU performance boxes. During our migration, we made the decision to do a 1:1 machine transition, such that for the N boxes we had hosted on Linode, an equivalent N was hosted on AWS, with roughly the same CPU/RAM allocation.

It should be noted here how cost efficient Linode can be in comparison to AWS, if CPU performance alone is your major consideration. Separate testing of DigitalOcean on our part showed an even greater $/operation savings.

So, we were paying 5x the cost for the same level of performance. One of our largest cost contributors was our collectors: daemons written in Ruby and EventMachine that accept and queue incoming metrics data. These processes are the most performance-sensitive part of the app, as a hiccup at their level can mean dropped customer data. One Ruby collector would process around 150,000 updates per second on a c1.xlarge, and an audit of the Ruby code didn't reveal many places where we could make drastic performance improvements. The process's behavior was simple enough that there weren't many architectural changes that would yield significant gains: we buffer data in memory, then flush it to the filesystem at regular intervals to be queued by a separate process.
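
For context, the collector's core loop is roughly the buffer-and-flush pattern below, shown here as a minimal Scala sketch. The names, types, and flush interval are illustrative only, not our production code.

```scala
import java.nio.file.{Files, Paths, StandardOpenOption}
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch of the collector's core loop: buffer incoming
// updates in memory, then periodically flush them to a spool file for
// a separate queueing process to pick up. All names are hypothetical.
object BufferedCollector {
  private val buffer  = new ArrayBuffer[String]()
  private val flusher = Executors.newSingleThreadScheduledExecutor()

  def record(update: String): Unit = buffer.synchronized {
    buffer += update
  }

  def start(spoolPath: String, intervalSeconds: Long): Unit = {
    flusher.scheduleAtFixedRate(new Runnable {
      def run(): Unit = flush(spoolPath)
    }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS)
  }

  private def flush(spoolPath: String): Unit = {
    // Swap the buffer contents out under the lock, write outside it.
    val batch = buffer.synchronized {
      val copy = buffer.toList
      buffer.clear()
      copy
    }
    if (batch.nonEmpty) {
      val bytes = (batch.mkString("\n") + "\n").getBytes("UTF-8")
      Files.write(Paths.get(spoolPath), bytes,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND)
    }
  }
}
```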

Months earlier, I had written a prototype implementation of the collector in C, both as a return to a language I enjoy programming in and as an experiment to see how fast we could make the front end. Under our test conditions we saw a greater than 3x improvement in performance, and expected there were more gains to be had should we commit to rewriting the collector. We chose not to because, at the time, a high-performing collector was a theoretical curiosity; now it was a financial necessity.

Choosing Scala

Scala was not our first choice. The aforementioned C prototype seemed an obvious first pick, but a combination of code rot and architectural changes meant a full rewrite would likely be necessary, which removed the value of having an existing prototype. C++ or Java seemed promising in that there'd be a well tested standard library available for the data structures we had gotten "for free" in Ruby (and would not have in C), but prior experience led us to believe that development speed would suffer if we chose either.

JRuby initially seemed like it might be an obvious win, but we only saw a 1.7x improvement over the Ruby collector, and some odd behaviors in gems we relied on made us believe we might spend more time than we'd like fixing compatibility issues. Both node.js and Go seemed like attractive candidates, but node.js was disqualified for having roughly the same performance level as JRuby, and Go was disqualified for no good reason other than my being bored by the syntax.

During the course of this initial language testing, we created a prototype Scala version based on Twitter's Finagle framework; we had heard lots of good comments from other engineers about the usefulness of Finagle, and it seemed like a good time to try it out. Unfortunately, our initial prototype server performed 50% slower than the Ruby version. This, along with the oddly poor quality of the Scala database and JSON libraries, caused us to initially discount Scala as a choice.

I'll admit we returned to Scala on a whim, but the extremely poor performance we saw using Scala + Finagle made me question whether we had misused Finagle, and whether I should retry the experiment with one less layer of abstraction: instead of using Finagle, I'd just write a Netty app in Scala.

I was pleasantly surprised by the 2.5x increase in performance over our Ruby collector; the removal of Finagle and the migration to the (then beta) 4.x release of Netty were a great initial win.
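
To give a sense of how little code the "plain Netty" approach involves, here's a rough sketch of a Netty 4.x TCP server in Scala. The handler, port, and thread counts are hypothetical stand-ins, not our actual pipeline.

```scala
import io.netty.bootstrap.ServerBootstrap
import io.netty.buffer.ByteBuf
import io.netty.channel._
import io.netty.channel.nio.NioEventLoopGroup
import io.netty.channel.socket.SocketChannel
import io.netty.channel.socket.nio.NioServerSocketChannel

// Hypothetical skeleton: a boss event loop accepts connections, a
// worker group runs each channel's pipeline. The handler just inspects
// and releases incoming buffers.
class CollectorHandler extends ChannelInboundHandlerAdapter {
  override def channelRead(ctx: ChannelHandlerContext, msg: AnyRef): Unit = {
    val buf = msg.asInstanceOf[ByteBuf]
    try {
      // A real collector would parse metric updates out of the buffer here.
      buf.readableBytes()
    } finally {
      buf.release()
    }
  }
}

object CollectorServer {
  def main(args: Array[String]): Unit = {
    val boss    = new NioEventLoopGroup(1)
    val workers = new NioEventLoopGroup() // defaults to 2 * available cores
    try {
      val bootstrap = new ServerBootstrap()
        .group(boss, workers)
        .channel(classOf[NioServerSocketChannel])
        .childHandler(new ChannelInitializer[SocketChannel] {
          override def initChannel(ch: SocketChannel): Unit = {
            ch.pipeline().addLast(new CollectorHandler())
          }
        })
      val channel = bootstrap.bind(8000).sync().channel()
      channel.closeFuture().sync()
    } finally {
      workers.shutdownGracefully()
      boss.shutdownGracefully()
    }
  }
}
```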

Architectural Wins

While Scala was an awkward language to begin working with, after a week of concerted effort it began to feel more natural. Twitter's very useful Effective Scala style guide was a great cheat sheet for avoiding pitfalls, and the nearly identical Java interop calling style made it easy to bring in Java libraries without having to learn idiosyncratic syntax rules.

Netty was an especially great win, as its thread-to-eventloop-to-pipeline mapping allowed us to move to a single process per machine (_as opposed to the process-per-core x ?? calculation familiar to so many Ruby deployments_) that better used all available resources.
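
In practice that means one JVM sized to the box rather than one OS process per core. Something along the lines of the sketch below, where the thread count is illustrative (Netty's default worker group is already 2x the available processors):

```scala
import io.netty.channel.nio.NioEventLoopGroup

object EventLoops {
  // One JVM per box: size the worker event loop group to the machine's
  // cores instead of running one process per core as in an MRI deployment.
  // Illustrative only; we did not tune this beyond Netty's defaults.
  val cores   = Runtime.getRuntime.availableProcessors()
  val workers = new NioEventLoopGroup(cores)
}
```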

During development, it was also great to be able to rely on popular Java performance tools like YourKit to help prove the effectiveness of different techniques in Scala, as well as simply have the visibility into app performance that only good profilers can give.

Additionally, we were able to bypass a number of deployment automation tasks early on by creating a single jar with the sbt-onejar plugin and adjusting our process automation to simply start the server with java -jar .... While we likely won't stick with this as our deployment solution, it was incredibly nice that Scala's only deployment requirement was a JRE on the server.
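
For reference, the packaging setup amounts to a couple of lines of sbt configuration. The plugin coordinates and settings incantation below are from memory and may not match the exact sbt-onejar version we used; check the plugin's README before copying them.

```scala
// project/plugins.sbt -- coordinates approximate; see the sbt-onejar README
addSbtPlugin("com.github.retronym" % "sbt-onejar" % "0.8")

// build.sbt -- wire in the plugin's settings; `sbt one-jar` then emits a
// single runnable jar we can start with `java -jar ...`
seq(com.github.retronym.SbtOneJar.oneJarSettings: _*)
```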

Finally, the performance wins with the Scala collector were such that we could expect decent performance from even a single m1.small. It's now feasible for us to field a much larger number of collector boxes across different availability zones and better tolerate box and zone failures. Thanks to switching to Scala and Netty, the current bill for all the collector machines in our infrastructure is less than the cost of a single c1.xlarge.

Gotchas and Annoyances

As previously mentioned, the Scala database and JSON libraries we were able to find seemed quite lackluster. We initially found a few wrappers for JDBC that seemed like they might be promising, but in the end we found that it just made more sense to do the (_relatively simple_) JDBC interaction ourselves. Regarding JSON libraries, we found them to either be too slow or have an unenjoyably obtuse syntax.
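
For the curious, "doing the JDBC interaction ourselves" is only a handful of lines per query. A hedged sketch follows; the driver, connection URL, and schema are hypothetical, not Instrumental's actual tables.

```scala
import java.sql.DriverManager

// Hypothetical example of the plain-JDBC approach: URL, credentials,
// table, and columns are illustrative only.
object Db {
  def insertMetric(name: String, value: Double, recordedAt: Long): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost/collector", "collector", "secret")
    try {
      val stmt = conn.prepareStatement(
        "INSERT INTO metrics (name, value, recorded_at) VALUES (?, ?, ?)")
      try {
        stmt.setString(1, name)
        stmt.setDouble(2, value)
        stmt.setLong(3, recordedAt)
        stmt.executeUpdate()
      } finally {
        stmt.close()
      }
    } finally {
      conn.close()
    }
  }
}
```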

Scala's feature set can be stymying, in that as a beginner to the language, I often found myself caught in indecision as to the right Scala feature to use to accomplish a task. I've met other programmers who compare Scala to C++'s intimidating surface area, and I can't help but find myself agreeing with them.

If I Had To Do It Again...

Were I to repeat this process, I'd happily choose Scala again. While there are some warts, it still feels like the best fit for our needs. Thankfully, this particular performance problem came with well-defined acceptance criteria: better performance, lower infrastructure cost, and good developer productivity; Scala satisfied all of them.

The one piece of this process I'd do better next time is keeping a record of performance benchmarks. Had we kept a list of the different implementations and their performance numbers, we could have avoided redoing much of the performance work required to validate this decision; unfortunately, much of that data ended up in ephemeral places like chat rooms, temporary spreadsheets, and temp text files. That benchmark data (along with descriptions of how each test was performed) would be invaluable now in making educated decisions about future optimizations; with it gone, we're forced to recreate our performance tests, slowing development in the process.

The Final Reduction

So did we end up returning to our Linode-era hosting costs? Yes, as a matter of fact. By replacing our expensive c1.xlarge boxes with more affordable m1.smalls, eliminating redundant servers, consolidating others onto underutilized machines, and leaning on cost-effective AWS services like SQS, we reduced our costs to the point that buying a large number of 3-year Heavy Use Reserved Instances made sense. At that point, our amortized costs returned to the good ol' Linode days.