[Made a few minor edits] [added an additional clip suggested by Jim Geraghty]
There’s a running threat/gag in The Empire Strikes Back involving the hyperdrive of the Millennium Falcon. Various attempts are made to fix it, but it keeps not working, usually with Han Solo saying, “It’s not my fault!” Towards the very end of the movie, when it looks as though the Falcon and its passengers will be captured by the Empire, R2D2 — based on information he got while accessing the computers on Cloud City — makes one last adjustment, and zoom! The ship makes the jump into hyperspace, and the group (minus Han, who’s frozen in carbonite and on his way to Tatooine) escapes to fight another day.
Jim Geraghty, over at National Review, used this sequence as a brilliant analogy for the apparent expectations of the Obama Administration for the “tech surge” intended to repair the Healthcare.gov website:
Of course, the administration can’t do that [i.e., shut the website down and not reopen it until it has been through proper repair and testing]. They need to heave Hail Mary passes from here on out, and hope the thing suddenly and miraculously starts working like the hyperdrive of the Millennium Falcon at the end of The Empire Strikes Back.
I had never thought of this analogy before in respect to large IT system failure, but it’s spot on and, as I said, brilliant, largely because that false hope is understandable (if, well, wrong). As I wrote to Jim this morning:
The reason for that thinking is that on a small scale, this is what often happens with a program — it doesn’t work, it doesn’t work, and then you fix that one last bug, and boom! it works. (Ask any Computer Science undergrad, such as my daughter Crystal.) The problem with large, complex software is that each bug you fix usually just uncovers some number of new bugs that you weren’t encountering before. (This is also the problem with performance improvement on large systems: when you discover and fix the worst performance bottleneck, you merely uncover the next performance bottleneck.)
But wait! It gets worse! It has long been well known in software engineering circles that any change to a program’s source code runs a measurable and not-insignificant risk of introducing new bugs (or re-introducing old ones). This is why thorough regression testing (in effect, re-running all your existing test scripts after each code change) is so important, particularly in the late stages of system development. Without rigorous regression testing (and source code change control, limiting who can make changes to code and when they are allowed into the system), you can in effect be chasing your tail for months without achieving a stable, acceptable system.
Which is exactly why I see so many large IT projects that never stabilize: lots of work is done, lots of bugs are fixed, and yet the project never gets any closer to being able to go live. One of the most important metrics (and one of the few useful ones) towards the end of a software project is the find/fix ratio, that is, the number of bugs you are finding each week divided by the number you are fixing or deferring. If that ratio is greater than or equal to 1.0 (if you are finding at least as many new bugs each week as you are fixing or deferring), you’ll never ship. However, if the ratio is less than 1.0, then you can start to project it out on a timeline and figure roughly when you’ll be ready to release the system.
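For the curious, here is a rough sketch in Python of what "projecting it out" can look like. The function names and the sample numbers are mine, made up for illustration, and the projection assumes (unrealistically) that the weekly find and fix rates stay constant. Treat it as back-of-the-envelope arithmetic, not a real forecasting model.

```python
import math
from typing import Optional


def find_fix_ratio(found: int, resolved: int) -> float:
    """Bugs found in a week divided by bugs fixed or deferred that week.

    Returns infinity when nothing was resolved, since the backlog
    cannot shrink in that case.
    """
    if resolved == 0:
        return math.inf
    return found / resolved


def weeks_to_clear(open_bugs: int,
                   found_per_week: float,
                   resolved_per_week: float) -> Optional[int]:
    """Rough number of weeks until the open-bug backlog reaches zero.

    Assumption (illustrative only): both weekly rates stay constant.
    Returns None when the ratio is >= 1.0, i.e. the backlog never shrinks.
    """
    net_reduction = resolved_per_week - found_per_week
    if net_reduction <= 0:
        return None  # finding bugs as fast as (or faster than) resolving them
    return math.ceil(open_bugs / net_reduction)


# Hypothetical numbers: 120 open bugs, 30 found and 40 resolved per week.
ratio = find_fix_ratio(30, 40)
estimate = weeks_to_clear(120, 30, 40)
print(f"find/fix ratio: {ratio:.2f}, weeks to clear: {estimate}")
```

In practice the rates move from week to week, so you would rerun this with fresh numbers each week and watch whether the estimate is converging or drifting.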
And this is why — as I discuss in this extract from a document I wrote 15 years ago about software quality assurance — you need to have strict controls on what changes are allowed to be made to the source code through the alpha and beta phases just before release. If you do not handle this process rigorously, you’ll just keep oscillating for weeks, months, or even years. I’ve seen it happen time and again.
My father — who worked in electronics in the Navy for nearly 30 years — used to say to all us kids when we were growing up, “If you don’t have time to do it right, will you have time to do it over?” That is exactly the problem that the Obama Administration caused for itself (“We needed five years but only had two”), and it is exactly the problem that they face going forward.
One last observation: I don’t know all the people and groups that the Obama Administration brought in for its “tech surge”, but most Silicon Valley projects are not large-scale IT projects. It is unclear if these people know what they are getting themselves into and what the likely pitfalls are in trying to save an existing wounded behemoth.
Finally, for those of you wondering, “But why does the title say, ‘Aluminum Falcon’?”, this is why. (Clip warning: this is Robot Chicken.) I think this clip works wonderfully as a reflection of what must have been some awkward conversations between President Obama and Kathleen Sebelius post-launch.
Also, I just think the post’s title sounds better this way.
[UPDATE] Jim Geraghty (at National Review), who loved the above clip, said that this clip fits as well. I fully agree:
Stay tuned. ..bruce w..