How healthcare.gov failed: the technical aspects

(Also read parts 1, 2 and 3.)

A lot of how healthcare.gov works is opaque. This makes it hard to say authoritatively where all the problems lie, and harder still to say how they can be solved. My knowledge is imperfect, so my critiques will be imperfect too. I am left to work from what the press has reported and what has come out in public statements and hearings. I can make reasonable inferences, but because of the project’s opacity my judgment will sometimes be off the mark.

It didn’t have to be this way. The site was constructed with the typical approach used in Washington, which is to contract out the mess to a bunch of highly paid beltway bandit firms: give them lots of money and hope that, given their impressive credentials, something usable will emerge. It was done this way because that’s how it’s always done. Although lots of major software projects follow an open source approach, this thinking hasn’t permeated the government yet, at least not much of it. In an open source project, the software (code) is available for anyone to read, critique and suggest improvements to. It’s posted on the web. It can be downloaded, compiled and installed by anyone with the right tools and equipment.

It’s not a given that open sourcing this project was the right way to go. Open source works best for software that is generic and used broadly. For every successful open source project like the Apache Web Server, many more are proposed, attract little attention and are abandoned. Sites like sourceforge.net are full of these.

In the case of healthcare.gov, an open source approach likely would have worked, and resulted in a system that would have cost magnitudes less and been much more usable. It would still have needed an architectural committee, some governance structure and programmers as well, principally to write a first draft of the code. Given its visibility and importance to the nation, it would have naturally attracted many of our most talented programmers and software engineers, almost all of whom would have donated their time. Contractors would still have been needed, but many of them would have been engaged in selecting and integrating suggested code changes submitted by the public.

If this model had been used, there probably would have been a code repository on github.com. Programmers would have submitted changes through its version control system. In general, the open source model works because the more eyes that can critique and suggest changes to code, the better the end result is likely to be. It would have given a sense of national ownership to the project. Programmers like to brag about their genius, and some of our best and brightest doubtless would have tooted their own horns about their contributions to the healthcare.gov code.

It has been suggested that it is not too late to open source the project. I signed a petition on whitehouse.gov asking the president to do just this. Unfortunately, the petition process takes time. Even assuming it attracts enough signatures to require a response from the White House, the issue is likely to be moot by the time it is taken up.

I would have liked to see the system’s architecture put out for public comment as well. As I noted in my last post, the architecture is the scaffolding on which the drywall (code) is hung. It too was largely opaque. We were asked to trust that overpaid beltway bandits had chosen the right solutions. Had these documents been posted early in the process, professionals like me could have added public comments, and maybe a better architecture would have resulted. It was constructed instead inside the comfort of a black box known as a contract.

After its deployment, we can look at what was rolled out and critique it. What programmers can critique is principally its user interface, because we can inspect it in detail. The user interface is important, but its mistakes are also relatively easy to fix, and they have been fixed to some extent. For example, the user interface now allows you to browse insurance plans without first establishing an account. Requiring an account first was one of those mistakes that are obvious to most people: you don’t need to create an account on amazon.com to shop for a book. It was a poorly informed political decision. Ideally someone with user interface credentials would have pushed back on it.

What we saw with the system’s rollout was a nice-looking screen full of many things that had to be fetched through separate calls back to the web server. For example, every image on a screen has to be fetched as a separate request. Each JavaScript library and cascading style sheet (CSS) file also has to be fetched separately. In general, all of these have to complete before the page can be usable. So to speed up the page load time, the idea is to minimize how many fetches are needed, and to fetch only what is needed and nothing else. Each image does not have to be fetched separately: a composite image can be sent as one file, and through the magic of CSS each image can be a snippet of that larger image yet appear as a separate image. JavaScript libraries can be collapsed into one file and shrunk using a process called minification.
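
To make the bundling idea concrete, here is a minimal sketch of such a build step, assuming Node.js and the terser npm package; the file names are invented for illustration, not taken from healthcare.gov:

```typescript
// Illustrative build step, not the actual healthcare.gov tooling.
// Assumes Node.js with the "terser" npm package installed.
import { readFileSync, writeFileSync } from "fs";
import { minify } from "terser";

async function bundle(inputs: string[], output: string): Promise<void> {
  // Concatenate the libraries so the browser makes one fetch, not many.
  const combined = inputs
    .map((path) => readFileSync(path, "utf8"))
    .join("\n;\n"); // the separator guards against missing semicolons

  // Minify: strip whitespace and comments, shorten variable names.
  const result = await minify(combined);
  writeFileSync(output, result.code ?? combined);
}

// Hypothetical file names for illustration only.
bundle(["jquery.js", "validation.js", "app.js"], "bundle.min.js")
  .catch((err) => console.error("bundle failed:", err));
```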

When the rollout first happened, there was a lot of uproarious laughter from programmers because of all the obvious mistakes. For example, images can be sent with information that tells the browser “you can keep this in your local storage instead of fetching it every time the user reloads the page”. If you don’t expect the contents of a file to change, don’t keep sending it over and over again! There are all sorts of ways to speed up the presentation of a web page. Google has figured out some tricks, for example, and has even published a PageSpeed module for the Apache web server. Considering these pages will be seen by tens or hundreds of millions of Americans, you would expect a contractor would have thought through these things, but they either didn’t or didn’t have the time to finish. (These techniques are not difficult, so that should not be an excuse.) It suggests that at least for the user interface portion of the project, a bunch of junior programmers were used. Tsk tsk.
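
Here is what that caching advice looks like in practice, as a minimal sketch assuming Node.js; the file name and the 24-hour lifetime are invented for illustration:

```typescript
// Minimal sketch of HTTP caching headers, assuming Node.js.
import { createServer } from "http";
import { readFileSync } from "fs";

createServer((req, res) => {
  if (req.url === "/logo.png") {
    res.writeHead(200, {
      "Content-Type": "image/png",
      // Tell the browser it may reuse its cached copy for 24 hours
      // instead of re-fetching the image on every page load.
      "Cache-Control": "public, max-age=86400",
    });
    res.end(readFileSync("logo.png"));
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);
```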

Until the code for healthcare.gov is published, it’s really hard to assess the technical mistakes of the project, but clearly there were many. What professionals like me can see, though, is pretty alarming, which may explain why the code has not been released. It probably will be in time, as all government software is in theory in the public domain. Most likely, when all the sloppy code behind the system is revealed at last, programmers will be amazed that the system worked at all. So consider this post preliminary. Many of healthcare.gov’s dirty secrets are still to be revealed.

How healthcare.gov failed: the architectural aspects

(Read parts 1, 2 and 4.)

If you were building a house without quite knowing how it would turn out, one strategy would be to build a house made of Lego. Okay, not literally, as it would not be very livable. But you might borrow the idea of Lego. Lego parts are interchangeable with each other. You press pieces into the shape you want. If you find out halfway through the project that it’s not quite what you want, you can break off some of the Lego and restart that part, while keeping the part that you liked.

The architects of healthcare.gov had some of this in their architecture: a “data hub” that would be a big and common message broker. You need something like this because to qualify someone for health insurance you have to verify a lot of facts against various external data sources. A common messaging system makes a lot of sense, but it apparently wasn’t built quite right. For one thing, it did not scale very well under peak demand. A messaging system is only as fast as its slowest component. If the pipe is not big enough you install a bigger pipe. Even the biggest pipe won’t be of much use if the response time to an external data source is slow. This is made worse because generally an engineer cannot control aspects of external systems. For example, the system probably needs to check a person’s adjusted gross income from their last tax return to determine their subsidy. However, the IRS system may only support ten queries per second. Throw a thousand queries per second at it and the IRS computer is going to say “too busy!” if it says anything at all and the transaction will fail. From the error messages seen on healthcare.gov, a lot of stuff like this was going on.
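
To illustrate the failure mode, here is a minimal sketch with invented numbers: a stand-in external source that can service only ten queries at once, hit by a thousand callers. Nothing here is the actual data hub; it just shows why the “too busy” errors pile up:

```typescript
// Sketch of the bottleneck: a stand-in external source (say, an IRS
// income check) that can service only a fixed number of queries at once.
class LimitedSource {
  private inFlight = 0;
  constructor(private readonly maxConcurrent: number) {}

  async query(id: string): Promise<string> {
    if (this.inFlight >= this.maxConcurrent) {
      throw new Error("too busy!"); // the transaction fails
    }
    this.inFlight++;
    try {
      // Stand-in for a slow remote call.
      await new Promise((resolve) => setTimeout(resolve, 100));
      return `AGI record for ${id}`;
    } finally {
      this.inFlight--;
    }
  }
}

// Throw 1,000 concurrent queries at a 10-query source: most fail.
const irs = new LimitedSource(10);
Promise.allSettled(
  Array.from({ length: 1000 }, (_, i) => irs.query(`person-${i}`))
).then((results) => {
  const failed = results.filter((r) => r.status === "rejected").length;
  console.log(`${failed} of 1000 queries failed`);
});
```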

There are solutions to problems like these, and they lie in fixing the system’s architecture. The general solution is to replicate the data from these external sources inside the system, where you can control them, and query the replicas instead of querying the external sources directly. For each data source, you can also architect it so that new instances can be spawned on increased demand. Of course, this implies that you can acquire the information from the source. Since most of these were federal sources, it was possible, providing the Federal Chief Technology Officer used his leverage. Most likely, the currency of these data is not a critical concern. Every new tax filing that came into the IRS would not have to be instantly replicated into a cloned instance. Updating the replica once a day was probably plenty, and once a month likely would have sufficed.
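
A minimal sketch of the replica idea, with invented names and a hypothetical bulk-extract feed; the real arrangement with the IRS would have had to be negotiated and is not public:

```typescript
// Sketch of a local replica: eligibility queries read a daily snapshot
// of tax data instead of hitting the live IRS system on every request.
type TaxRecord = { taxpayerId: string; adjustedGrossIncome: number };

class IrsReplica {
  private snapshot = new Map<string, TaxRecord>();

  // Bulk-load a fresh snapshot from the authoritative source.
  async refresh(fetchAll: () => Promise<TaxRecord[]>): Promise<void> {
    const records = await fetchAll();
    this.snapshot = new Map(records.map((r) => [r.taxpayerId, r]));
  }

  // Reads never touch the external system, so they scale with our
  // own hardware rather than with the IRS's query limits.
  lookup(taxpayerId: string): TaxRecord | undefined {
    return this.snapshot.get(taxpayerId);
  }
}

// Hypothetical bulk-extract call, invented for this sketch.
declare function fetchAllTaxRecords(): Promise<TaxRecord[]>;

// Refresh once a day; a day-old AGI is plenty current for subsidy checks.
const replica = new IrsReplica();
setInterval(() => replica.refresh(fetchAllTaxRecords), 24 * 60 * 60 * 1000);
```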

The network itself was almost certainly a private and encrypted network given that privacy data traverses it. A good network engineer will plan for traffic ten to a hundred times as large as the maximum anticipated in the requirements, and make sure that redundant circuits with failover detection and automatic switchover are engineered in too. In general, it’s good to keep this kind of architecture as simple as possible, but bells and whistles certainly were possible: for example, using message queues to transfer the data and strict routing rules to handle priority traffic.

When requirements arrive late, it introduces big problems for software engineers. Based on what you do know, though, it is possible to run simulations of system behavior early in the life cycle of the project. You can create a pretend data source for IRS data that, for example, always returns an “OK” while you test the basic functionality of the system. I have no idea if something like this was done early on, but I doubt it. It should have been if it wasn’t. Once the interaction with these pretend external data sources was simulated, complexity could be added to the information returned by each source, perhaps error messages or messages like “No such tax record exists for this person”, to see how the system would behave, with attention to the user experience through the web interface as well. The handshake with these external data sources has to be carefully defined. Using a common protocol is a smart way to go for this kind of messaging. Some sort of message broker on an application server probably holds the business logic to order the sequence of calls. This too had to be made scalable so that multiple instances could be spawned based on demand.
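
A minimal sketch of such a pretend data source, with invented names and messages. The point is that the eligibility logic is written against an interface, so a stub can stand in for the real IRS early on and grow more realistic failure modes later:

```typescript
// Sketch of stubbed external sources for early system testing.
interface DataSource {
  check(id: string): Promise<{ status: "OK" | "ERROR"; message?: string }>;
}

// Phase 1: always answer OK, so the basic end-to-end flow can be
// exercised before any real interface to the IRS exists.
const happyIrsStub: DataSource = {
  check: async () => ({ status: "OK" }),
};

// Phase 2: add realistic failure modes to see how the system, and
// the user experience, hold up.
const faultyIrsStub: DataSource = {
  check: async (_id) =>
    Math.random() < 0.1
      ? { status: "ERROR", message: "No such tax record exists for this person" }
      : { status: "OK" },
};

// The caller never knows whether it is talking to a stub or the real
// thing, so the real client can be swapped in late with no changes here.
async function checkEligibility(id: string, irs: DataSource): Promise<boolean> {
  const result = await irs.check(id);
  return result.status === "OK";
}
```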

This stuff is admittedly pretty hard to engineer. It is not the sort of systems engineering that is done every day, and probably not by a vendor like CGI Federal. But the firms and the talent are out there to do these things, and they would have been done with the proper kind of systems engineer in charge. This kind of architecture also allows business rule changes to be centralized, allowing for the introduction of different data sources late in the life cycle. Properly architected, this is one way to handle changing requirements, provided a business-rules server running business-rules software is used.

None of this is likely to be obvious to a largely non-technical federal staff groomed for management rather than systems engineering. So a technology advisory board filled with people who understand these advanced topics certainly was needed from project inception. Any project of sufficient size, scope, cost or political significance needs a body with teeth like this.

Today at a Congressional hearing, officials at CGI Federal unsurprisingly declared that they were not at fault: their subsystems all met the specifications. It’s unclear if these subsystems were also engineered to be scalable on demand. The crux of the architectural problem, though, was clearly in message communications between these system components, as that is where it seems to break down.

A lesson to learn from this debacle is that as much effort needs to go into engineering a flexible system as goes into engineering each component. Testing the system early under simulated conditions, then, as it matures, under more complex conditions and higher loads, would have detected these problems earlier. Presumably there would then have been time to address them before the system went live, because the problems would have been visible. System architecture and system testing are thus vital for complex message-based systems like healthcare.gov, and a top-notch systems engineering plan needed to have been their centerpiece, particularly since the work was split among multiple vendors, each responsible for its own subsystem.

Technical mistakes will be discussed in the last post on this topic.

How healthcare.gov failed: the programmatic aspects

(Also read parts 1, 3 and 4.)

I am getting some feedback: healthcare.gov isn’t really a failure. People are using the website to get health insurance, albeit not without considerable hassle at times. I’ll grant you that. I’ll also grant you that this was a heck of a technical challenge, the sort I would have gladly taken a pass on, even for ten times my salary. It’s a failure in that it failed to measure up to expectations. President Obama said there would be “glitches”, but these were far more than glitches. If this were a class project, a very generous professor might give it a D. I’d give it a D-, and only then after a few beers. Since I don’t imbibe, I give it an F.

In the last post, I looked at the political mistakes that were made. Today I’ll look at the programmatic mistakes. I’m talking about how in general the program was managed.

Some of it is probably not the fault of the program or project manager. This is because they were following the law, or at least regulation. And to follow the law you have to follow the FAR, i.e. the Federal Acquisition Regulation. It’s the rules for buying stuff in the federal government, including contracted services. Violating the FAR can put you in prison, which is why any project of more than tiny size has a contracting officer assigned to it. In general, the government wants to get good value when it makes a purchase. Value usually, but not always, translates into lowest price. With some exceptions, the government considers having contractors construct a national portal for acquiring health care to be the same as building a bridge. Put out the requirements for an open bid and select the cheapest source. Do this and taxpayers will rejoice.

This contract had a lot of uncertainty, which should have raised red flags. The uncertainty was manifested in many areas, but certainly in requirements that were not locked down until this year. I’d not want to waste my time coding something that I might have to recode because the requirements changed. This uncertainty was reflected in how the contract was bid. It’s hard to bid as a fixed-price contract when you don’t know exactly what you are building. If you were building a house where every day the owner was directing changes to the design, you wouldn’t expect builders to do it under a fixed-price contract. Same thing here. It appears the contract was largely solicited as “time and materials”. This accounts in part for total costs, which at the moment are approaching half a billion dollars. This kind of work tends to be expensive by its nature. CGI Federal probably had the lowest cost per hour, which let it win the bid.

There is some flexibility in choosing a contractor based on their experience constructing things a lot like what you want built. CGI Federal is a big, honking contractor that gets a lot of its business in government contracts. Like most of these firms, it has had its share of failures. A system the size of healthcare.gov is a special animal. I am not sure that any of the typical prime contractors in the government software space were qualified to build something like this, at least not if you wanted it done right.

There is some flexibility allowed in the statement of work (SOW), generally put together by the program manager with help from a lot of others. I don’t know precisely what rules applied to the contracting process here, but it likely was possible, probably by expending a lot of political capital, to create a SOW that would have properly framed the contracting process so something actually usable could be constructed. A proper SOW should have included criteria for the contractor like:

  • Demonstrated experience successfully creating and managing very large, multi-vendor software projects on time that meet requirements that change late in the system life cycle
  • Demonstrated ability to construct interactive web-based software systems capable of scaling seamlessly on demand and interacting quickly with disparate data sources supplied by third parties

The right SOW would have excluded a lot of vendors, probably including CGI Federal but very possibly also some of the big players in this game like Unisys, IBM and Northrop Grumman. Yes, many of these vendors have built pretty big systems, but they often come with records that are spotty at best and whose mistakes are often overlooked. Until recently I used a Northrop Grumman system, govtrip.com, for my federal travel. They did build it, but not successfully. For more than a year the system was painfully slow and the user interface truly sucked.

Having successfully built a system of this type that was highly usable upon initial deployment should have been what qualified a contractor to bid on this work. Offhand I don’t know who would qualify. I do know whom I would have wanted to do the work: Amazon.com. They know how to create large, interactive and usable websites that scale on demand. Granted, even Amazon Web Services is not perfect, with occasional outages of its cloud network, but we’re talking a hassle factor of maybe 0.1% compared to what users have experienced with healthcare.gov. They used to build sites for other retailers but may have gotten out of that business. I would have appealed to their patriotic senses, if they had any, to get them to bid on this work. In any event, if they bid, they did not get the contract. So there was a serious problem either with the SOW or with the “one size fits all” federal contracting regulations that the doubtless very serious contracting officer for this project followed.

The size of this project, though, really made building it in-house not an option. So a board consisting of the best in-house web talent and program management talent in the government should have overseen it. Others have noted that the team that constructed President Obama’s websites, used to win two elections, would have been great in this role. In any event, the project needed this kind of panel from the moment the SOW was put together through the life of the project, including post deployment.

What they would have told those in charge was probably things they did not want to hear, but should have heard: the project should be delivered incrementally, not all at once. It should not be deadline driven. Given the constantly changing requirements, risk management strategies should have been employed throughout. When I talk about architectural and technical mistakes in future posts, I’ll get into some of these.

In short, this project was a very different animal: highly visible, highly risky, with requirements hard to lock down and with technical assumptions (like most states would build their own exchanges) far off the mark. You cannot build a system like this successfully and meet every rule in the FAR. It needed waivers from senior leaders in the administration to do it in a way that would actually work in the 21st century, rather than follow contracting procedures modeled on the penny-pinching acquisition of commodities like toilet paper. An exception might even have needed to be written into the ACA bill that became law.

Next: architectural mistakes.

How healthcare.gov failed: the political aspects

(Also read parts 2, 3 and 4.)

You know a federal IT manager has a problem when the President of the United States is dissing the very web site he was paid to manage. That’s what President Obama was doing today with the healthcare.gov site, the rollout of which was botched by any standard. Also botched was the obscene amount of money paid for the site, obscene even if it had worked. The Canadian contractor CGI Federal got the award, initially $93.7M, but with extra work is now at more than $292M. This is a crazy amount of money to pay for an interactive site and may be the most expensive site of its kind ever purchased with tax dollars.

I wrote a week or so back about my initial critique of the website. It is easy to criticize in hindsight. I can’t claim to know all of the site’s requirements. From news reports it is not too hard to infer a lot of them. There were a number of external data sources such as at the IRS and Social Security Administration that had to be queried to do things like figure out your eligibility for a subsidy, if any. There were many business rules that had to be followed. There were tight security rules to follow because Privacy Act data had to be stored. And there were accessibility rules required of any federal or federally funded website, to ensure access to the visually impaired. All this plus the site had to scale to meet demand.

As a certified software engineer (MS Software Systems Engineering, 1999, George Mason University) and federal employee with more than twenty-five years’ experience designing, maintaining and managing systems and websites, I can speak with some authority, in part because I have made many of the mistakes I will allude to, just not so spectacularly. I learned from my mistakes. There are many dimensions to engineering a site like this: political, programmatic, architectural and technical. I plan to take each of these in turn in various posts.

Today: the political dimension.

All work for the government is inherently political. This is true even in a science organization where I work. You can’t avoid it because politics are built into the rules and regulations you must follow, such as the Privacy Act and accessibility requirements (Section 508 of the Rehabilitation Act, to be specific). Projects of a certain size, like healthcare.gov, fall into the bucket of a program. A program is basically one or more projects that are interrelated which, because of their overall size, need to be packaged, managed and sold politically and which typically continue indefinitely. Managing a program requires a fistful of certifications. Having the certifications though is not enough. The effective program manager has to really understand all the power players at work and market to them. It’s probably the toughest job out there, particularly for very large scale or high visibility programs. I am sure the program manager for this project tried his or her best, but they got the wrong person. Someone with a lot of experience, a proven ability to manage a program this large successfully, and with the right political skills was needed.

The right program manager would have spoken truth to power, tactfully of course. There were red flags all over this project. Few things are more controversial than health care. He or she probably reported directly to HHS Secretary Kathleen Sebelius. To start, he or she should have mentioned the triple constraint. It affects all projects and it is basically this: a project is naturally bounded by cost, schedule and scope. What this means in practice is that if the project was deadline driven, then scope would have to be reduced. This means not all the features of the website could be delivered by October 1, 2013. If the minimal scope was too big, it may have been technically impossible to deliver by the deadline. The typical political response is to throw money at the problem, which is probably why CGI Federal has billed more than $200M so far. Unfortunately, at some point throwing more money at a project is counterproductive. It actually makes the project worse. This means there is an upper limit to what money can buy you as far as features for a given deadline. Someone was probably being dishonest to power by not laying these facts on the table because it was politically incorrect to do so. It was either that or someone in power refused to listen. If that was what happened, then Secretary Sebelius should resign. If it was the program manager, he or she should resign.

The White House has some blame here too. This is the Obama Administration’s signature initiative. The Chief Technology Officer for the government should have been all over this project. He should have found the best talent inside and outside the government and brought these resources to bear for HHS, which doesn’t often handle projects like this. Instead, it was developed largely hands-off. The CTO should have warned the White House of the high probability of failure, and recommended early on ways to head it off. Either he did not do this or his warning fell on deaf ears. The Federal CTO wields enormous political capital. It’s hard to imagine that, had he squawked, the White House Chief of Staff would have ignored him.

In any event, those in the chain of command must have largely acted in CEO mode. “Tut, tut, don’t bother me with details. No excuses, just get it done,” was probably their mentality. Given the prominence of this initiative, everyone from the president on down should have been engaged. They were not.

So a good part of the failure of healthcare.gov is simply an absence of the right kind of leadership. This was a problem that required getting out of the ivory tower and getting your hands dirty. Shame on all who stayed aloof instead.

I don’t operate at the program level, but I know enough about it to know I don’t want to. I don’t have the requisite people skills. But if I did, I would not have taken responsibility for this work without written and personal assurances from these stakeholders that they would provide the resources to let the project succeed. I’d also want assurances that they would empower me and support me to the maximum extent possible to make it succeed.

Next: the programmatic missteps.

Healthcare.gov and the problems with interactive federal websites

Today’s Washington Post highlights problems with the new healthcare.gov site, the website used by citizens to get insurance under the Affordable Care Act. The article also talks about the problems the federal government is having in general managing information technology (IT). As someone who just happens to manage such a site for the government, I figure I have something unique to contribute to this discussion.

Some of the problems the health care site is experiencing were predictable, but some were embarrassingly unnecessary. Off the top of my head I can see two clear problems: splitting the work between multiple contractors, and the hard deadline of bringing the website up on October 1, 2013, no matter what.

It’s unclear why HHS chose to have the work done by two contractors. The presentation (web side) was done by one contractor and the back end (server side) was done by another. This likely had something to do with federal contracting regulations. It perhaps was seen as a risk mitigation strategy at the contracting level, or a way to keep the overall cost low. It’s never a great idea for two contractors to do their work mostly blind to each other’s work. Each was doing subsystem development, and as subsystems it’s possible that each worked optimally. But from the public’s perspective it is just one system. What clearly got skipped was serious system testing. System testing is designed to test how the system behaves from a user’s perspective. A subset of system testing is load testing, which sees how the system reacts when it is under a lot of stress. Clearly some of the requirements for initial use of the system wildly underestimated the traffic the site actually experienced. But it also looks like, in an effort to meet an arbitrary deadline, load testing, and correcting the problems it revealed, could not happen in time.
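
Load testing doesn’t have to be elaborate to catch gross problems. Here is a minimal sketch, assuming Node.js 18 or later for its built-in fetch; the URL and request count are invented:

```typescript
// Crude load-test sketch: fire many concurrent requests and count
// how many succeed and how long the batch takes.
async function loadTest(url: string, concurrent: number): Promise<void> {
  const started = Date.now();
  const results = await Promise.allSettled(
    Array.from({ length: concurrent }, () => fetch(url))
  );
  const ok = results.filter(
    (r) => r.status === "fulfilled" && r.value.ok
  ).length;
  console.log(`${ok}/${concurrent} succeeded in ${Date.now() - started} ms`);
}

// Ramp the number up until success rates or response times degrade;
// that knee in the curve is the capacity you actually have.
loadTest("http://localhost:8080/", 500);
```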

It also looks like the use cases, i.e. the user interaction stories that describe how the system would be used, were off the mark. It turned out that most initial users were just shopping around and trying to find basic information. This resulted in a lot of browsing but little in the way of actual buying. Most consumers, particularly when choosing something as complex as health insurance, will want to have some idea of the actual costs before they sign up. The cost of health care is obviously a lot more than just the cost of premiums. Copays can add thousands of dollars a year to the actual cost of insurance. This requires reading, study, asking questions of actual human beings in many cases, and then making an informed decision. It will take days or weeks for the typical consumer to figure out which policy will work best for them, which means a lot of traffic to the web site, even when it is working optimally.

The Post article also mentions something I noticed more in my last job than in my current one: that federal employees who manage web sites really don’t understand what they are managing. This is because most agencies don’t believe federal employees actually need experience developing and maintaining web sites. Instead, this work is seen as something that should be contracted out. I was fortunate enough to bring hands on skills to my last job, and it was one of the reasons I was hired. In general, the government sees the role of a federal employee to “manage” the system and for contractors to “develop and maintain” the system. This typically leads to the federal employee being deficient in the technical skills needed and thus he or she can easily make poor decisions. Since my last employer just happened to be HHS, I can state this is how they do things. Thus, it’s not surprising the site is experiencing issues.

Even if you do have a federal staff developing and maintaining the site, as I happen to have in my current job, it’s no guarantee that they will have all the needed skills. Acquiring and maintaining those skills requires an investment in time and training, and adequate training money is frequently in short supply. Moreover, the technology changes incredibly quickly, leading to mistakes. These bite me from time to time.

We recently extended our site to add controls that give the user more powerful ways to view data. One of these is a jQuery table sorter library. It allows long displays of data in tables to be filtered and sorted without going back to the server to refresh the data. It’s a neat feature but it did not come free. The software was free, but it added marginally to the time it takes a page to fully load. It also takes time to put the data into structures where this functionality can work. The component gets slow with large tables or multiple tables on the same page. Ideally we would have tested this prior to deployment, but we didn’t. It did not occur to me, to my embarrassment. I like to think that I usually catch stuff like this. This is not a fatal problem in our case, just a little embarrassing, and only to the tune of a second or two extra for certain web pages to load. Still, those who have tried it love the feature. We’re going to go back and reengineer this work so that we only use it with appropriately sized tables, roughly as sketched below. Even so, the marginal extra page load time may be so annoying for some that they choose to leave the site.
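
Here is a sketch of that reengineering, assuming jQuery and the tablesorter plugin are already loaded on the page; the 500-row cutoff is an invented threshold, not our site’s actual number:

```typescript
// Only enable client-side sorting on tables small enough that the
// plugin's initialization won't noticeably slow the page load.
declare const $: any; // jQuery, with the tablesorter plugin loaded

const MAX_SORTABLE_ROWS = 500; // invented cutoff for illustration

$("table.sortable").each(function (this: HTMLTableElement) {
  const rows = $(this).find("tbody tr").length;
  if (rows <= MAX_SORTABLE_ROWS) {
    $(this).tablesorter(); // the plugin's standard initializer
  }
  // Larger tables fall back to plain, fast-loading HTML.
});
```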

Our site, like healthcare.gov, is highly trafficked. I expect that healthcare.gov will get more traffic than our site, which gets thirty to forty million successful page requests per month. Still, scaling web sites is not easy. The latest theory is to put redundant servers “in the cloud” (commercial hosting sites) to use as needed on demand. Unfortunately, “the cloud” itself is an emerging technology. Its premier provider, Amazon Web Services, regularly has embarrassing issues managing its cloud. Using the cloud should be simple, but it is not. There is a substantial learning curve, and it all must work automatically and seamlessly. The federal government is pushing use of the cloud for obvious benefits including cost savings, but it is really not ready for prime-time, mission-critical use. Despite the hassles, if high availability is an absolute requirement, it’s better to host the servers yourself.

The key nugget from the Post’s article is that the people managing these systems in many cases don’t have the technical expertise to do so. It’s sort of like expecting a guy in the front office of a dealership to disassemble and reassemble a car on the lot. The salesman doesn’t need this knowledge, but to manage a large federal website competently, you really do need this kind of hands-on experience. You need to come up from the technical trenches and then add managerial skills to your talents. In general, I think it’s a mistake for federal agencies to outsource web site development. Many of these problems were preventable, although not all of them were. Successful deployment of these kinds of sites depends to a large extent on having a federal staff that knows the right questions to ask. And to really keep up with a technology that changes so quickly, it’s better to have federal employees develop these sites themselves. Contractors might still be needed, but more for advice and coaching.

Each interactive federal web site is its own unique system, as healthcare.gov certainly is. The site shows the perils of placing too much trust in contractors and of having a federal managerial staff with insufficient technical skills. Will we ever learn? Probably not. Given shrinking budgets and the mantra that contracting out is always good, it seems we are doomed to repeat these mistakes in the future.

Don’t say I didn’t warn you.

(Update: 10/28/13. This initial post spawned a series of posts on this topic where I looked at this in more depth. You may want to read them, parts 1, 2, 3 and 4.)