So, we launched an Operations Productivity survey late last month to try to get into the hearts and minds of IT Operations professionals (e.g. infra engineers, sysadmins, release managers) and we’ve got some interesting findings already:
- Devs care more about Ops than Ops? Around 60% of the responses came from developers or software architects. Does Dev care more about ops than IT Ops? ;)
- These production failures can be prevented, folks! The top reasons for production failures are “software quality” and “human error.” Do we need better release processes?
- Too much time needed to recover from failures. Around 55% of failures take more than 30 min to resolve; close to 30% take more than 60 min! This is hugely disruptive.
If you haven’t already done so, please help the Operations field and make new friends by taking our survey. It should take 10 minutes max. The results are yours and can help you understand where peers in your community stand. I figure we need 250 more responses for our survey to really rock.
Now to the charts for the findings above. As 24/7-connected experiences become increasingly important, production failures are highly undesirable. I’ll hold off on drawing firm conclusions, since we are still receiving responses, so let the charts speak for themselves!
From the chart above, we can see that “human error” typically causes 60% of failures when we release software into production. It is second only to software quality issues. Uh, shouldn’t we be able to mitigate that somehow?
When failures occur in production, more than half the time they become known ONLY AFTER customers/end users alert software teams that something went wrong. That is alarmingly high, since disruptions in service impact the user experience worse than you might think, driving users to competitors. Ideally, failures should be identified and remedied before end users are impacted.
Over 50% of respondents cannot recover from a failure within 30 minutes, and close to 30% of respondents indicated that it takes more than an hour to recover. Where does your organization stand? Where can you improve? In which areas can you share best practices with others in the community?
If you haven’t already completed the survey, please do check it out. Especially if you are in IT Ops, Release Management or a DevOps’y role. Your community, company and profession will thank you… :)