19 February 2017
When I was learning to be an engineer, people told me two different stories about application performance. The first was that while it's fun to work on, don't spend too much time on it unless you have to - premature optimization and all that. The second was that really, though, it's good to have a fast system because users like fast systems - that is, performance is a feature. But when it's all said and done, I've mostly been doing web application development, where performance is often seen as a problem to be solved only when encountered in an extremely obvious way ("oh, this request is timing out, better add an index").
While sensible enough on their own, these stories are incomplete: they don't fully capture the impact application speed can have on an engineering team. Specifically, I consistently see teams underestimate the hidden costs of maintaining a slow system. Slow systems impose huge costs on teams, sometimes without the team even realizing that speed is the fundamental problem. To give a few examples I have encountered:
A Rails deployment with a collection of particularly slow routes required an enormous number of Unicorn processes - and therefore web servers - to meet a modest but increasing workload. Running so many servers had high second-order costs for the team. Deployments required much more effort; database connection pooling went from optional to completely required, prompting a major project to roll it out; and hardware maintenance costs, in both dollars and time, increased far faster than they would have otherwise.
A team maintaining a batch job pipeline with strict daily SLAs ("the pipeline must be done by 3 PM!") found they were spending so much time monitoring the daily stability of the pipeline that they had difficulty making progress on anything else. On high-volume days, the team needed to respond instantly to failed jobs, forcing them to invest heavily in monitoring, health checks, and job retry mechanisms rather than new products or features.
Similarly, a team maintaining a slow ETL system found it so difficult to re-run or tweak jobs that they asked upstream teams for weeks of advance notice before any change to the system's input. Management concurred that the disruption suffered by the ETL team was so severe that all upstream teams needed to carefully coordinate major changes with them. This prompted the introduction of several new layers of change control management and general overhead.
In cases like these, the immediate focus of the team is handling something that is otherwise going to break: "we need to use connection pooling or the database will fall over" or "we need to detect and retry failed jobs immediately or we'll miss our SLA". These responses are totally understandable and, in the moment, absolutely necessary. The critical thing is to step back and assess: where are these problems actually coming from? Is it possible that you don't have a scaling problem as much as a performance problem? What can you do about it?
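One rough way to make that question concrete for the Rails example is Little's law: the average number of busy workers equals the request rate multiplied by the mean time per request. The sketch below is a back-of-the-envelope illustration, not a capacity-planning tool. The function name, the 70% utilization target, and the traffic numbers are all made up for illustration.

```python
import math


def workers_needed(requests_per_second, mean_latency_seconds, target_utilization=0.7):
    """Estimate how many single-request workers a workload needs.

    Little's law: average busy workers = arrival rate * mean service time.
    Dividing by a target utilization (an assumed 70% here) leaves headroom
    for bursts instead of running every worker flat out.
    """
    busy_workers = requests_per_second * mean_latency_seconds
    return math.ceil(busy_workers / target_utilization)


# Hypothetical workload: 200 requests/second at two different mean latencies.
slow = workers_needed(200, 0.5)  # 500 ms per request
fast = workers_needed(200, 0.1)  # 100 ms per request

print(f"workers at 500 ms: {slow}")
print(f"workers at 100 ms: {fast}")
```

With these made-up numbers, the same traffic needs 143 workers at 500 ms per request and 29 at 100 ms. Nothing about the workload changed; only the speed did, and the fleet size (along with everything that comes with it) shrinks roughly in proportion.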
When viewed from this angle, performance is more like an operational quality such as idempotence than it is a product-level feature. Most engineers will tell you: you can usually work around a lack of idempotence or atomic transactions, it just takes a lot more effort. Teams working on slow systems might be making a similar tradeoff without even realizing it. And unlike idempotence, performance isn't all-or-nothing - it's often pretty easy to make significant gains on the margin. So, next time you're starting to engineer around a problem, ask yourself: would I still be doing this if the damn thing was just faster?