August 13, 2008

VMware ESX Postmortem Thoughts

Filed under: Uncategorized — numist @ 8:27 pm

So, yesterday there was a lot of talk about VMware’s ESX bug. Yes, there was a time bomb. Yes, we didn’t mean to ship it. I can’t really provide any information you don’t already know on this topic, and I won’t.

The most surreal part of seeing something like this from the inside, though, is watching the actual effect it has. A lot of people reported this to us, starting in Australia. Big people. With affected systems like… “Payroll”, and some of them with names containing “Government”. Wow. I’m not sure folks have any idea how bad we actually feel about this bug (I really do want everyone to get paid on Friday), but a few friends of mine were able to point out a silver lining that is worth mentioning.

We are a business, and this kind of bug is unlikely to happen in free software. That’s been beaten to death. The point is that this is a people-forgot bug, not a there’s-a-bug-in-our-code bug. ESX sells because it is stable. We have good code practices internally, and bugs have a hard time making it into new code. The result is a fantastically stable product that does what we say it will do, which (for me) is a breath of fresh air from a software company. It’s why I’m here.

VMware has the best code practices of all the companies and projects I’ve worked with, including (especially?) my own. A clear engineering guidelines document. Reviews for all commits (made especially better by Review Board, which had its start here). We get the highest score on the Joel test of any place I’ve worked or visited. Our release branches are all super-stable, and you can tell: just try using one of our products. Even the betas go through rigorous testing and release procedures (check your release notes!). There’s always room for improvement, but the things we have written are very reliable.

It really sucks that this has affected so many people. Hopefully people will forgive us the mistake and judge us based on our reaction as a company. After all, it’s how you solve problems (especially in a crisis) that shows your true colours. If you’re affected, patches are already available.

This incident has had two effects for me: first, I am just as confident in our code as I was on Monday; second, we all get to eat some humble pie. It’s not a bad thing to do sometimes.

Delicious, delicious pie.

Now where did I leave that code I was working on?

