A few weeks ago I ran a survey asking a dozen or so questions about Java EE production updates. I’d like to thank the 607 individuals who took some time from their busy lives to help me out. Today, I’ll return the favor by analyzing the resulting data for your pleasure. If for some reason you don’t trust me, feel free to check the calculations on the original data in this spreadsheet.
I’m sorry I couldn’t do this before, but life was too busy to do a proper analysis in the meantime. As it is, I’m writing these words from the CDG airport in Paris and will likely finish the article in the Porto-Lisbon train (or as it turned out in the Lisbon-Porto train).
Act I: The Pool
Running surveys is always a tricky business. Some amount of bias in a non-random selection is inevitable, so the first thing we need to do is to understand the selection pool. The survey included some questions that help with that.
One of the survey questions was “What industry are the applications you’re working on for?” (for those who cringe at the sight of prepositions at the end of a sentence: ha!). Luckily, the answers were spread pretty evenly across all the options, with Technology, Finance, Online, Banking, Media and Telco each having over 10% representation, so we can assume that the industries are well represented across the board.
Another question concerned the size of deployment: “About how many servers/instances do you have in production?”. The answers are represented by this graph:
About 93% have 50 servers or less, so the large-deployment community is represented by only 43 respondents, just 4 of whom have more than a thousand servers. It’s up to you to decide whether this represents the community at large, but at the very least it’s highly relevant if you have 50 servers or less in your organization or department.
I asked “What application servers (or other servers) do you use in production?”. This may be interpreted as the popularity lineup of production servers (at least in the 50 servers or less category) or just a categorization of respondents, unrepresentative of the industry – interpret as you like.
Unlike the surveys that ZeroTurnaround ran for the development environment, I allowed respondents to select multiple servers and got a somewhat different lineup. It’s clear that OSS servers are leading the pack, but segmenting the respondents by OSS vs. non-OSS servers didn’t produce any interesting results.
Act II: Redeploys Considered Broken?
As for the data itself, it revealed some interesting patterns and some even more interesting points where it lacked any pattern whatsoever. One of the hypotheses I was testing was that the “Update” or “Redeployment” functionality provided by containers is used little or not at all. My thinking was that because it’s very hard, if not impossible, to stop redeployment from leaking memory (check out this article & conference presentation), a redeployed app will quickly throw an OutOfMemoryError. However, only 52% of respondents reported OutOfMemoryErrors.
On the other hand, only 24% answered that they allow redeploys in production at all, and only 13% use them as the primary means of production updates. It turns out that a score of additional problems prevents respondents from using container redeployment, including:
- Lack of proper facilities for doing database updates
- Thread races and deadlocks in Java EE containers
- Thread and resource leaks
- Security concerns
- Native library issues
- Various app server caching issues
- Performance overhead when redeployment is enabled (I’m not sure what this could be, unless hot redeployment in production with class/resource scanning was meant)
- Rollback difficulties
- Process or organizational issues
- My favourite: Fear. There is very little trust in the reliability of redeployment, based on multiple past issues
- Bonus quote: “Java: Write once, run away”
The majority of these issues point to the same underlying problem: the apps running in the container and the new app versions created by redeploying are not well isolated from each other. Apps can leak references to other apps or versions, take locks on global monitors, leak resources and produce other unwanted side effects. In that context, asking how soon the OutOfMemoryError happens isn’t so relevant anymore, as other issues both overshadow and mask it.
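To make the isolation problem concrete, here is a minimal, hypothetical Java sketch (not a real container API) of the most common leak pattern: container-scoped shared state retaining references to objects registered by an app version that has since been undeployed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of why redeployed apps are hard to garbage-collect:
// shared, container-scoped state keeps references to old app versions.
public class RedeployLeakSketch {

    // Stands in for a shared library or container singleton that
    // outlives individual app deployments.
    static final List<Runnable> SHARED_LISTENERS = new ArrayList<>();

    // "Deploy" an app version: it registers a listener with the shared
    // registry but has no undeploy hook to ever remove it.
    static void deployAppVersion(String version) {
        SHARED_LISTENERS.add(() -> System.out.println("event in app " + version));
    }

    public static void main(String[] args) {
        deployAppVersion("v1");
        deployAppVersion("v2"); // redeploy: v1's listener is still pinned

        // In a real container each retained listener would pin its app's
        // entire classloader, so the old version can never be collected.
        System.out.println("retained listeners: " + SHARED_LISTENERS.size());
        // prints: retained listeners: 2
    }
}
```

In a real Java EE container the retained object is rarely this obvious; it hides in JDBC driver registries, ThreadLocals, shutdown hooks and the like, which is exactly why the leak is so hard to eliminate.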
Act III: The Update Process
The next few questions concerned the update process, the importance of downtime and the amount of automation. Some very interesting patterns emerged.
Nearly 73% replied that they allow downtime during their production update. For them, restarting all servers in the off hours is the simplest and cheapest method of update. Only 54% of respondents answered that downtime doesn’t cost anything to them, leaving 19% losing money on every update.
How much money? 37% replied that downtime is “Priceless!”. Of those who gave a number, the average price per minute was $30,016. However, this included two extreme responses of $1M and $500K per minute. Excluding those two, the average is only $3,230 per minute.
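The effect of a couple of outliers on an average can be sketched in a few lines of Java. The numbers below are made up for illustration and are not the actual survey responses; they just reproduce the same shape, where two huge values dominate the mean.

```java
import java.util.Arrays;

// Illustrates how a couple of extreme responses skew an average
// (hypothetical per-minute downtime costs, not the survey data).
public class DowntimeCost {

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0);
    }

    // Drop the k largest values, then average the rest.
    static double meanExcludingTop(double[] xs, int k) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        return mean(Arrays.copyOfRange(sorted, 0, sorted.length - k));
    }

    public static void main(String[] args) {
        double[] costs = {1_000, 2_000, 5_000, 500_000, 1_000_000};
        System.out.printf("raw mean: $%.0f/min%n", mean(costs));
        // prints: raw mean: $301600/min
        System.out.printf("excluding top 2: $%.0f/min%n", meanExcludingTop(costs, 2));
        // prints: excluding top 2: $2667/min
    }
}
```

The trimmed mean drops by two orders of magnitude, which is the same roughly tenfold effect the two extreme survey responses had on the $30,016 figure.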
I asked how long it usually takes for a production update to complete. About 10% replied that it takes “Forever!”. The average time was 1.6h, but the maximum was 60h and the standard deviation was 4.3h. Interestingly enough, the correlation between the time it takes to update and the number of servers was very near zero, so the length of an update doesn’t depend much on the size of the deployment.
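The “very near zero” claim is just the Pearson correlation coefficient of the two columns. Here is a self-contained Java sketch of that computation, run on hypothetical (server count, update hours) pairs rather than the actual survey data.

```java
// Pearson correlation of two samples, applied to hypothetical
// (server count, update hours) pairs, not the real survey data.
public class UpdateCorrelation {

    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] servers = {2, 5, 10, 25, 50, 200};
        double[] hours   = {2.0, 0.5, 3.0, 1.0, 4.0, 1.5};
        // A value near zero means update time doesn't track deployment size.
        System.out.printf("r = %.2f%n", pearson(servers, hours));
        // prints: r = -0.05
    }
}
```

With a correlation this close to zero, knowing how many servers a respondent runs tells you essentially nothing about how long their updates take.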
I asked the respondents to rate how much of a hack their update process is. In the survey the actual numbers weren’t present, but the answers implied a 1 (hack) to 5 (automated) rating.
The average rating was 3.7, which implies that for the majority the process is pretty well understood and mostly automated, with some manual labor required. This is a better result than I expected and explains why it takes only 1.6h on average to complete the update.
In another question I asked if the update process in use was ideal, and the replies were overwhelmingly negative, with only 27% replying affirmatively; downtime and lack of automation were quoted as the largest areas of concern. Quite a few mentioned that although downtime is OK in the off hours, it also means that the team has to be up at night.
The next set of questions concerned the method and tools used to update the production. I asked which way the production update happens and the layout was as follows:
Under 12% of respondents do updates that are risk-free and completely unnoticed by the users. By far the most popular update method, at 46%, is just taking the servers down. There is a large disconnect here: this means that 88% do updates in a manner that is likely to impact the users, whereas only 54% allow downtime during production updates. So at least 34% of respondents currently have to violate their own policies to some extent.
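The 34% figure is simple set arithmetic: if 88% of respondents update in a user-visible way but only 54% are allowed downtime, then at minimum 88 - 54 = 34 percentage points must be breaking their own policy. A tiny sketch of that lower bound:

```java
// Lower bound on policy violators from two overlapping percentages:
// even if everyone who allows downtime does disruptive updates, the
// remainder of the disruptive group must be violating policy.
public class PolicyGap {

    static int minPolicyViolators(int disruptivePct, int downtimeAllowedPct) {
        return Math.max(0, disruptivePct - downtimeAllowedPct);
    }

    public static void main(String[] args) {
        int disruptive = 88; // % doing user-visible updates (survey)
        int allowed    = 54; // % allowing downtime (survey)
        System.out.println("at least " + minPolicyViolators(disruptive, allowed)
                + "% violate their own downtime policy");
        // prints: at least 34% violate their own downtime policy
    }
}
```

It is only a lower bound: the real overlap could be smaller, in which case even more respondents are disrupting users against their own rules.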
This is also illustrated by the tools used to update production.
By far the most popular tools used to update production apps are Unix shell commands. App servers are used more than I expected, considering the previous answers, but perhaps starting/stopping the servers was included in this category. Hudson is gathering popularity, and although not included in the chart, Maven and Ant were the top choices in the Other category.
This means that even though production updates are fairly automated on average, the automation is mostly scripted by hand and little out-of-the-box support is provided.
So what conclusions can I make from this data?
- Container redeploys are indeed mostly broken, but for a much wider range of reasons than I supposed
- The most popular way to update production, at least in the 50 servers or less deployments category, is to take all servers down in the off hours.
- Lack of automation and downtime during updates is the biggest concern with the current processes and tools.
- Command line utilities are the tools of choice for doing an update. This points to a lack of ready-made solutions in the area.
- Although it is not directly visible in the charts, from analyzing the text of replies and looking at the (lack of) correlations between various data points, it seems that there is a good amount of chaos in the area. There is a lack of easy-to-use, standardized terminology, processes and tools to support the update process. Everyone comes up with their own solutions, and often their own terminology and roles.
In total the survey points out a range of technical issues with the current production update methods and tools.
There are some solutions on the horizon that might help to make it better. I’ll describe here the two main categories, which I’ll call Autobot and Decepticon.
Autobot solutions aim to automate rolling restarts with no downtime, making them easy as pie to do. They are still mostly in their infancy, but solutions like Puppet, Chef, RunDeck, DeployIt and JClouds are actively working in this direction. There are still a lot of challenges ahead, but they are making great strides.
On a side note, the ideal solution in that area would be the one employed by .NET and many dynamic languages, which isolate each app in a separate OS process. The almost-perfect isolation provided by the OS process model would mean that any app version can be terminated at any moment, without any lasting side effects.
Unfortunately one issue the Autobots are still subject to is the migration of application and user state, which can be challenging and time-consuming. Another challenge is updating the database and other remote dependencies without downtime, which can be hard to achieve when the db/remote changes are incompatible with the current app version and so gradual transition isn’t possible.
The Decepticon solutions improve hot redeployment and at the moment are represented solely by ZeroTurnaround’s LiveRebel. Instead of creating a new instance of the app on every redeploy, LiveRebel applies the code, resource and configuration updates inside the app, preserving all state and avoiding side effects. It also allows instant rollback of broken updates and will even automatically wait for the database or other remote updates to complete. The goal is to make small updates super cheap and severely decrease time-to-production for minor changes and fixes; it is actually (unlike its cartoon counterpart) complementary to the Autobots.
Updating live apps is currently quite challenging and will probably never be trivial, but both Autobots and Decepticons are on the way to a little blue planet near you and together they’ll be able to handle any danger threatening your production environment.