How healthcare.gov failed: the programmatic aspects

The Thinker by Rodin

(Also read parts 1, 3 and 4.)

I am getting some feedback: healthcare.gov isn’t really a failure. People are using the website to get health insurance, albeit not without considerable hassle at times. I’ll grant you that. I’ll also grant you that this was a heck of a technical challenge, the sort I would have gladly taken a pass on, even for ten times my salary. It’s a failure in that it failed to measure up to expectations. President Obama said there would be “glitches”, but these were far more than glitches. If this were a class project, a very generous professor might give it a D. I’d give it a D-, and that’s only after a few beers. Since I don’t drink to imbibe, I give it an F.

In the last post, I looked at the political mistakes that were made. Today I’ll look at the programmatic mistakes. I’m talking about how in general the program was managed.

Some of it is probably not the fault of the program or project manager. This is because they were following the law, or at least regulation. And to follow the law you have to follow the FAR, i.e. the Federal Acquisition Regulation. It’s the set of rules for buying stuff in the federal government, including contracted services. Violating the FAR can put you in prison, which is why any project of more than tiny size has a contracting officer assigned to it. In general, the government wants to get good value when it makes a purchase. Value usually, but not always, translates into the lowest price. With some exceptions, the government considers having contractors construct a national portal for acquiring health care to be the same as building a bridge. Put out the requirements for an open bid and select the cheapest source. Do this and taxpayers will rejoice.

This contract had a lot of uncertainty, which meant it had red flags. The uncertainty was manifested in many areas, but certainly demonstrated in requirements that were not locked down until this year. I’d not want to waste my time coding something that I might have to recode because the requirements changed. This uncertainty was reflected in how the contract was bid. It’s hard to bid it as a fixed price contract when you don’t know exactly what you are building. If you were building a house where every day the owner was directing changes to the design you wouldn’t expect builders to do it using a fixed price contract. Same thing here. It appears the contract was largely solicited as “time and materials”. This accounts in part for total costs, which at the moment are approaching half a billion dollars. This kind of work tends to be expensive by its nature. CGI Federal probably had the lowest cost per hour, which let it win the bid.

There is some flexibility in choosing a contractor based on their experience constructing things a lot like what you want built. CGI Federal is a big, honking contractor that gets a lot of its business in government contracts. Like most of these firms, it has had its share of failures. A system the size of healthcare.gov is a special animal. I am not sure that any of the typical prime contractors in the government software space were qualified to build something like this, at least not if you wanted it done right.

There is some flexibility allowed in the statement of work (SOW), generally put together by the program manager with help from a lot of others. I don’t know precisely what rules applied to the contracting process here, but it was likely possible, probably by expending a lot of political capital, to create a SOW that would have properly framed the contracting process so something actually usable could be constructed. A proper SOW should have included criteria for the contractor like:

  • Demonstrated experience successfully creating and managing very large, multi-vendor software projects on time that meet requirements that change late in the system life cycle
  • Demonstrated ability to construct interactive web-based software systems capable of scaling seamlessly on demand and interacting quickly with disparate data sources supplied by third parties

The right SOW would have excluded a lot of vendors, probably including CGI Federal but very possibly also some of the big players in this game like Unisys, IBM and Northrop Grumman. Yes, many of these vendors have built pretty big systems, but their records are often spotty at best, and their mistakes are often overlooked. Until recently I used a Northrop Grumman system, govtrip.com, for my federal travel. They did build it, but not successfully. For more than a year the system was painfully slow and the user interface truly sucked.

Having successfully built a system of this type, one that was highly usable upon initial deployment, should be what qualifies a contractor to bid on this work. Offhand I don’t know who would qualify. I do know whom I would have wanted to do the work: Amazon.com. They know how to create large interactive and usable websites that scale on demand. Granted even Amazon Web Services is not perfect, with occasional outages of its cloud network, but we’re talking a hassle factor of maybe 0.1% compared to what users have experienced with healthcare.gov. They used to do this for other retailers but may have gotten out of that business. I would have appealed to their patriotic senses, if they had any, to get them to bid on this work. In any event, even if they had bid they did not get the contract. So there was a serious problem either with the SOW or with the “one size fits all” federal contracting regulations that the doubtless very serious contracting officer for this project followed.

The size of this project though really made building it in-house not an option. So a board consisting of the best in-house web talent and program management talent in the government should have overseen it. Others have noted that the team that constructed President Obama’s websites, used to win two elections, would have been great in this role. In any event, the project needed this kind of panel from the moment the statement of work (SOW) was put together through the life of the project, and that includes post deployment.

Probably what they would have told those in charge were things they did not want to hear, but should have heard. The project should be delivered incrementally, not all at once. It should not be deadline driven. Given the constantly changing requirements, risk management strategies should have been utilized throughout. When I talk about architectural and technical mistakes in future posts, I’ll get into some of these.

In short, this project was a very different animal: highly visible, highly risky, with requirements hard to lock down and with technical assumptions (like most states would build their own exchanges) far off the mark. You cannot build a system like this successfully and meet every rule in the FAR. It needed waivers from senior leaders in the administration to do it in a way that would actually work in the 21st century, rather than following contracting procedure modeled on the thrifty acquisition of commodities like toilet paper. An exception might even have needed to be written into the ACA bill that became law.

Next: architectural mistakes.

Healthcare.gov and the problems with interactive federal websites


Today’s Washington Post highlights problems with the new healthcare.gov site, the website used by citizens to get insurance under the Affordable Care Act. The article also talks about the problems the federal government is having in general managing information technology (IT). As someone who just happens to manage such a site for the government, I figure I have something unique to contribute to this discussion.

Some of the problems the health care site is experiencing were predictable, but some were embarrassingly unnecessary. Off the top of my head I can see two clear problems: splitting the work between multiple contractors and the hard deadline of bringing the website up on October 1, 2013, no matter what.

It’s unclear why HHS chose to have the work done by two contractors. The presentation (web-side) was done by one contractor and the back end (server-side) was done by another. This likely had something to do with federal contracting regulations. It perhaps was seen as a risk mitigation strategy at the contracting level, or a way to keep the overall cost low. It’s never a great idea for two contractors to do their work mostly unmindful of the other’s work. Each was doing subsystem development, and as subsystems it’s possible that each worked optimally. But from the public’s perspective it is just one system. What clearly got skipped was serious system testing. System testing is designed to test how the system behaves from a user’s perspective. A subset of system testing is load testing. Load testing sees how the system reacts when it is under a lot of stress. Clearly some of the requirements for initial use of the system wildly underestimated the traffic the site actually experienced. But it also looks like, in an effort to meet an arbitrary deadline, load testing and correcting the problems it uncovered could not happen in time.
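For readers who have never seen a load test, here is a minimal sketch of the idea in Python. It is not how healthcare.gov was or should have been tested. The `fake_request` function is a hypothetical stand-in for fetching a real page, and the request count and concurrency are arbitrary assumptions. The point is simply to fire many requests at once and look at how slow the slowest ones get.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def fake_request(_: int) -> float:
    """Hypothetical stand-in for one page request; returns its latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # pretend the server needs about 10 ms to answer
    return time.perf_counter() - start


def run_load_test(request_fn, total_requests: int, concurrency: int) -> dict:
    """Issue total_requests calls using `concurrency` workers and summarize latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(request_fn, range(total_requests)))
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
    p95 = quantiles(latencies, n=20)[-1]
    return {
        "count": len(latencies),
        "median": latencies[len(latencies) // 2],
        "p95": p95,
        "max": latencies[-1],
    }


if __name__ == "__main__":
    summary = run_load_test(fake_request, total_requests=50, concurrency=10)
    print(summary)
```

In a real test you would swap `fake_request` for a call against a staging copy of the whole system, front end and back end together, and keep raising the concurrency until the 95th percentile latency becomes unacceptable.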

It also looks like the use cases, i.e. user interaction stories that describe how the system would be used, were bad. It turned out that most initial users were just shopping around and trying to find basic information. It resulted in a lot of browsing but little in the way of actual buying. Most consumers, particularly when choosing something as complex as health insurance, will want to have some idea of the actual costs before they sign up. The cost of health care is obviously a lot more than just the cost of premiums. Copays can add thousands of dollars a year to the actual cost of insurance. This requires reading, study, asking questions of actual human beings in many cases, and then making an informed decision. It will take days or weeks for the typical consumer to figure out which policy will work best for them, which means a lot of traffic to the web site, even when it is working optimally.

The Post article also mentions something I noticed more in my last job than in my current one: that federal employees who manage web sites really don’t understand what they are managing. This is because most agencies don’t believe federal employees actually need experience developing and maintaining web sites. Instead, this work is seen as something that should be contracted out. I was fortunate enough to bring hands-on skills to my last job, and it was one of the reasons I was hired. In general, the government sees the role of a federal employee as to “manage” the system and the role of contractors as to “develop and maintain” the system. This typically leaves the federal employee deficient in the technical skills needed, and thus he or she can easily make poor decisions. Since my last employer just happened to be HHS, I can state this is how they do things. Thus, it’s not surprising the site is experiencing issues.

Even if you do have a federal staff developing and maintaining the site, as I happen to have in my current job, it’s no guarantee that they will have all the needed skills either. Acquiring and maintaining those skills requires an investment in time and training, and adequate training money is frequently in short supply. Moreover, the technology changes incredibly quickly, which leads to mistakes. These bite me from time to time.

We recently extended our site to add controls that give the user more powerful ways to view data. One of these is a jQuery table sorter library. It allows long displays of data in tables to be filtered and sorted without going back to the server to refresh the data. It’s a neat feature but it did not come free. The software was free but it added marginally to the time it took the page to fully load. It also takes time to put the data into structures where this functionality can work. The component gets slow with large tables or multiple tables on the same page. Ideally we would have tested this prior to deployment, but we didn’t. It did not occur to me, to my embarrassment. I like to think that I usually catch stuff like this. This is not a fatal problem in our case. It is a little embarrassing, but only to the tune of a second or two extra for certain web pages to load. Still, those who have tried it love the feature. We’re going to go back and reengineer this work so that we only use it with appropriately sized tables. Still, the marginal extra page load time may be so annoying for some that they choose to leave the site.
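One way to limit the sorter to "appropriately sized tables" is to decide on the server which tables get it before the page is rendered. The sketch below is only an illustration of that idea, not our actual code. The `Table` class, the function name and both limits are hypothetical, and real limits should come from measuring page load times.

```python
from __future__ import annotations

from dataclasses import dataclass

# Both limits are assumed values for illustration only.
MAX_ROWS_PER_TABLE = 500
MAX_SORTABLE_TABLES_PER_PAGE = 3


@dataclass
class Table:
    name: str
    row_count: int


def sortable_tables(tables: list[Table]) -> list[str]:
    """Return the names of the tables that should get the client-side sorter."""
    small_enough = [t for t in tables if t.row_count <= MAX_ROWS_PER_TABLE]
    # If too many tables qualify, prefer the smallest ones, since they cost the least
    small_enough.sort(key=lambda t: t.row_count)
    return [t.name for t in small_enough[:MAX_SORTABLE_TABLES_PER_PAGE]]


if __name__ == "__main__":
    page = [Table("daily", 120), Table("history", 25000), Table("sites", 40)]
    print(sortable_tables(page))
```

The page template would then attach the sorter only to the tables named in the result, so a 25,000 row table is shown as a plain table rather than slowing down the whole page.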

Our site, like healthcare.gov, is also highly trafficked. I expect that healthcare.gov will get more traffic than our site, which gets 30 to 40 million successful page requests per month. Still, scaling web sites is not easy. The latest theory is to put redundant servers “in the cloud” (commercial hosting sites) to use as needed on demand. Unfortunately, “the cloud” itself is an emerging technology. Its premier provider, Amazon Web Services, regularly has embarrassing issues managing its cloud. Using the cloud should be simple but it is not. There is a substantial learning curve and it all must work automatically and seamlessly. The federal government is pushing use of the cloud for obvious benefits including cost savings, but it is really not ready for prime time, mission-critical use. Despite the hassles, if high availability is an absolute requirement, it’s better to host the servers yourself.
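To put 30 to 40 million page requests per month in perspective, the small Python calculation below converts a monthly total into an average rate per second. The 30-day month and the peak factor are assumptions of mine. Real traffic is bursty, so the peak figure is a rough planning number, not a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rps(monthly_requests: int, days_in_month: int = 30) -> float:
    """Average requests per second, assuming traffic is spread evenly over the month."""
    return monthly_requests / (days_in_month * SECONDS_PER_DAY)


def estimated_peak_rps(monthly_requests: int, peak_factor: float = 10.0) -> float:
    """Rough peak estimate; peak_factor is an assumed multiplier, not a measured one."""
    return average_rps(monthly_requests) * peak_factor


if __name__ == "__main__":
    for monthly in (30_000_000, 40_000_000):
        # Works out to roughly 11.6 and 15.4 requests per second on average
        print(monthly, round(average_rps(monthly), 1), round(estimated_peak_rps(monthly), 1))
```

An average of a dozen or so requests per second sounds modest. It is the peaks, such as everyone arriving on launch day, that decide how many servers you actually need.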

The key nugget from the Post’s article is that the people managing these systems in many cases don’t have the technical expertise to do so. It’s sort of like expecting a guy in the front office of a dealership to disassemble and reassemble a car on the lot. The salesman doesn’t need this knowledge, but to competently manage a large federal website you really do need this experience. You need to come up from the technical trenches and then add managerial skills to your talents. In general, I think it’s a mistake for federal agencies to outsource web site development. Many of these problems were preventable, although not all of them were. Successful deployment of these kinds of sites depends to a large extent on having a federal staff that knows the right questions to ask. And to really keep up to date on a technology that changes so quickly, it’s better to have federal employees develop these sites for themselves. Contractors might still be needed, but more for advice and coaching.

Each interactive federal web site is its own unique system, as healthcare.gov certainly is. The site shows the perils of placing too much trust in contractors and in having a federal managerial staff with insufficient technical skills. Will we ever learn? Probably not. Given shrinking budgets and the mantra that contracting out is always good, it seems we are doomed to repeat these mistakes in the future.

Don’t say I didn’t warn you.

(Update: 10/28/13. This initial post spawned a series of posts on this topic where I looked at this in more depth. You may want to read them, parts 1, 2, 3 and 4.)