Why the First Rollout of HealthCare.gov Crashed, an Architectural Assessment
The Washington Post called HealthCare.gov "one of the most complex pieces of software ever written for the federal government." From an IT perspective, that's exactly why the site didn't work.
First, do no harm! That edict - paraphrased from the Hippocratic Oath - pervades professional health care, as it has since the dawn of Western Medicine some 2,500 years ago. Anyone can appreciate the simplicity and meaning of this mantra. If you do nothing else as a health care practitioner, at least don’t hurt your patient.
Written into the undercurrent of that phrase, you can find an undeniable humility. In fact, for all the various and sundry avenues of science, there is a critical axiom: always be willing to question your assumptions. We only know what we know, and we sure don’t know everything yet, nor will we ever. Let that wisdom serve as a caution to your strongest prescriptions.
Then there’s the doing part. In any life endeavor, one hopes to know something of import, then take appropriate action. Careful is as careful does, and when caring for the lives of others, seriousness is requisite. With this perspective as our canvas, and an understanding of information technology (IT) under our belts, let’s take a look at the rollout of HealthCare.gov, the oft-characterized flagship of the Affordable Care Act, aka "Obamacare."
How blunt can I be? HealthCare.gov was dead on arrival. The collective transparency now says that all of six people signed up on its first day, October 1st. Six. Only 32,994 short of the 33,000 daily goal. And while "capacity" issues were touted as backhanded accolades of demand, anyone with knowledge of Web dynamics knew better.
In fact, the Dutch have been ahead of the game for two decades now, with many lessons learned. The Swiss also have some experience, and of course Massachusetts has MAHealthConnector.org, so-called "RomneyCare."
Bloor went on to say that 40 years of IT experience has proved that big projects always carry big risk.
"Do a big project, high risk [and] high risk of failure. To have three-and-a-half years [to build the HealthCare.gov site] sounds like, in a modern day, that would be enough, but here’s a high-risk project and it's all turned out badly," Bloor said.
He was most candid about the way integration testing was carried out for HealthCare.gov.
"The final thing that did me, almost had me burst out laughing, is no integration testing until two weeks before you go live - and that's just like, how could you ever do that with something like this? How could you?" Bloor said.
Sharing that perspective is a veteran federal contractor and fellow data scientist, Dr. Geoffrey Malafsky of Phasic Systems Inc. Malafsky recently offered an hour-long, detailed assessment of HealthCare.gov's rollout, and commented on both the strategic and tactical decisions made. Above all else, he points the finger at the acquisition protocol of the federal government.
"One of the critical failure points that permeates particularly government IT projects is this legacy, archaic, obsolete notion that you can articulate all the necessary business logic with some linear requirements process. That fundamentally does not work with large IT systems," he said.
His point is that large IT systems will bedevil even the smartest planners. You just never know whence problems will come, where you’ll need to provide extra support, or what kind of troubleshooting you’ll find yourself engaged in. Consequently, it’s a bad idea to constrain the design process by forcing project engineers to anticipate everything they’ll need upfront.
Complicating matters, Malafsky says, is the fact that procurement officials in the federal government have now become so powerful - due to the vast amounts of money that they control - that they are essentially in control of how major IT projects go forward. This puts departmental officials in the role of supplicant, and inserts an element of risk into a crucial procedure at the center of any significant IT initiative: choosing the right tools, technologies and contractors.
"The people who will most vociferously disagree with that statement are called acquisition professionals, and I encourage them to show up at my house and we will sit around and debate this, because I have a lot of empirical evidence to back that up," Malafsky said.
One big question to ask is why the government embraced such a comprehensive architecture for this website.
"If the overarching government program is set up such that the insurance companies actually own the client after they get a commitment, then why not just push the traffic off to the existing client interaction environment channel that the insurance companies already have? Yes, they might need to augment their own, but that would be a valid business reason because they're now going to get new clients," Malafsky said.
World-renowned (and now somewhat infamous) security software pioneer John McAfee also commented on this strategy just recently, making some controversial remarks on the "Neil Cavuto Show" on Fox News:
"Oh, it is seriously bad," McAfee said. "Somebody made a grave error, not in designing the program but in simply implementing the Web aspect of it. I mean, for example, anybody can put up a Web page and claim to be a broker for this system … any hacker can put a website up, make it look extremely competitive, and because of the nature of the system - and this is health care, after all - they can ask you the most intimate questions, and you’re freely going to answer them."
With respect to the Web architecture itself, Malafsky points to the obvious - that the Internet was not built to run complex applications. That was the job of the mainframe back in the days when the Web was in its infancy. Rather, the design point for the Internet was for simple information-sharing via individual pages distributed across a wide network of computers. In systems design, the goal is to build something that works. Incorporating complexity for its own sake is ill-advised, downright sacrilegious, and almost always a recipe for disaster.
In its own deep-dive on what went wrong with HealthCare.gov, The Washington Post published a now-famous graphic that depicted the various challenges experienced by the site. The language used by the paper to describe the site is actually quite revealing, especially when you consider that this is the established newspaper of Washington, D.C., the epicenter of the U.S. federal government:
HealthCare.gov, built by 55 contractors, is one of the most complex pieces of software ever created for the federal government. It communicates in real time with at least 112 different computer systems across the country. In the first 10 days, it received 14.6 million unique visits, according to the Obama administration.
Source: The Washington Post
Arguably, by definition, for someone to assert that they have a piece of software, it must be the case that the software actually works. Otherwise, you have a compilation of code that doesn’t yet constitute a piece of software. That tidbit aside, note the numbers listed, especially the part about communicating "in real time" with 112 different computer systems around the country. This is a perfect example of glorifying complexity for its own sake.
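To put rough numbers on that worry, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that a transaction needs every one of the 112 systems to answer and that each one answers independently with the same probability. The per-system availability figures are hypothetical, not measurements of HealthCare.gov.

```python
def chain_success_probability(per_system_availability: float, system_count: int) -> float:
    """Chance that every system in a serial chain answers, assuming independence."""
    return per_system_availability ** system_count


if __name__ == "__main__":
    SYSTEM_COUNT = 112  # the figure quoted by The Washington Post
    for availability in (0.999, 0.99):  # hypothetical per-system availability
        p = chain_success_probability(availability, SYSTEM_COUNT)
        print(f"{availability:.1%} per system -> {p:.1%} of transactions complete")
```

Under those assumptions, roughly one transaction in ten fails even when every system is up 99.9 percent of the time, and about two in three fail at 99 percent.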
The Graphic Graphic
Systems designers the world over must have cringed upon seeing that graphic. Let’s take a look at the different steps outlined, and in particular, the serious issues that arise with such an ambitious architecture. First and foremost, we’ll consider the number of potential transactions that have failed so far, most of them due to software timeouts - instances when one part of the transaction process doesn’t receive its necessary data within an acceptable time period.
"Every single piece of software in that graphic had its own timeouts, and it's not even one timeout. It can be more," Malafsy said. "The expiration of any one of those will kill the entire transaction. Some of those are easy to set up and monitor, like log files. Those are like the timeouts on the Web server and the app server. Some are more opaque. You have databases with concurrency and triggers, but they're multi-interaction. If you really do a deep dive into how databases work, it is not a pretty sight." (Learn the basics of how databases work in our Databases Tutorial.)
"The database servers love to say, 'We keep everything orderly." Not really," Malafsky said. The only way that they can get the performance up and truly manage it is that there is a series of time-stamped files that are created on the storage, persistent storage, and they are not rolled up into one comprehensive accurate set of data that's available for anyone at any time because that takes too long. That would kill the transactional latency. You have to look in those details and then that's rolled up through a management interface - and that goes by some very nice sophisticated names like triggers and concurrency - but it basically means it takes a bunch of time to go get the data, update the data, and if I can't do it before another request comes in, I'm just going to tell you, 'Forget it. I'm closed for business.'"
- "The Front Door"
The Washington Post's graphic includes a very curious piece of information right at the tippy-top in its first "problem" section, where it says that "the Obama administration decided in late September to exclude for now a feature that would have let people shop for health plans without first creating an online account."
Wow. First of all, is that really a "feature" that was excluded? We’re talking about fundamental site flow. Originally, the plan was to let people shop around, then at the appropriate time, consider registering an account.
Some critics have speculated that this last-minute change (in and of itself an incredibly risky move with a project this big) shows that the administration knew the site wasn’t working well in the last couple of weeks leading up to the October 1st launch. Instead, the idea became to capture all the information of those who needed insurance, such that marketing efforts could be made to them somewhere down the line once the site was functional.
From a usability and capacity perspective, this last-minute move put a tremendous strain on whatever database foundation the site had. This explains all the anecdotes of people not being able to register, or being forced to change their passwords. And let’s be honest here. Is there any problem more thoroughly solved all around the World Wide Web than the process of setting up a user account? Yahoo, Google, Microsoft, YouTube, Twitter and LinkedIn - even your grandmother’s knitting class - all have their own dynamic sign-up forms these days, with baked-in unsubscribe, forward and other fundamental features.
When it came time to register on HealthCare.gov, the contractors say, "The communication between some of these systems wasn’t working properly, meaning that many users weren’t able to successfully create an account."
What? Which systems? We’re talking about a customer database! The "systems" would then be the Web client, and the customer database. Which other systems were involved? This particular "explanation" makes no sense.
- Proof of Identity
Next up, proof of identity. For this step, no problems are listed, which is also curious. Experian is listed as the third-party agent which will "verify" someone’s identity. No doubt, identity resolution is a serious issue that must be addressed. Most insurance companies use your Social Security number, as well as third-party vendors like Experian. Are there really no problems with this step?
We know for sure from numerous anecdotes, verified by documentation presented, that HealthCare.gov has definitely experienced breaches of confidential information. Malafsky points out that the data quality issues are much more serious than the capacity issues. (And Bloor notes that if capacity issues really were the problem, they should have been solved in days, not weeks. You can add hardware, virtualize, do any number of things for capacity issues.)
No, the data quality issues are the really dangerous ones. And the most troubling aspect of all is the kinds of data quality issues that have arisen. There are stories of people signing up, then receiving confidential eligibility documents belonging to other registrants! This smacks of an absolutely dreadful design under the covers. Don’t they use some kind of universal identification code for each person?
"The smart move would be to create a universally unique identifier (UUID), store encrypted values - note plural - of what might be unique information (SSN, DOB, age, biometrics), and then assess these for evidence of unique personhood," Malafsky said.
That someone could receive a different person’s confidential documents is unspeakably bad, and demonstrates some very serious mapping issues deep in the belly of the beast.
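What might Malafsky's suggestion look like in practice? Here is a minimal Python sketch under some loud assumptions. It swaps in keyed hashes (HMAC-SHA256) for the encryption he mentions, because matching only needs equality checks. The key, the attribute names and the "at least two matching attributes" rule are all hypothetical choices made for illustration, not anything known about the real system.

```python
import hashlib
import hmac
import uuid
from typing import Dict

# Hypothetical demo key; a real system would keep this in a key vault.
SECRET_KEY = b"demo-key-not-for-production"


def protect(value: str) -> str:
    """Return a keyed digest so the raw identifier is never stored."""
    normalized = value.strip().lower()
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


class PersonRegistry:
    """Maps sets of protected identifiers to one UUID per person."""

    def __init__(self, min_matches: int = 2):
        self.min_matches = min_matches  # assumed evidence threshold
        self._people: Dict[uuid.UUID, Dict[str, str]] = {}

    def resolve(self, attributes: Dict[str, str]) -> uuid.UUID:
        protected = {name: protect(value) for name, value in attributes.items()}
        for person_id, stored in self._people.items():
            matches = sum(
                1 for name, digest in protected.items() if stored.get(name) == digest
            )
            if matches >= self.min_matches:
                stored.update(protected)  # remember any newly seen attributes
                return person_id
        person_id = uuid.uuid4()  # no sufficient match: treat as a new person
        self._people[person_id] = protected
        return person_id
```

Every document, eligibility result and enrollment record would then hang off that one UUID rather than off a name or a session, which is the kind of mapping that keeps one registrant's paperwork from landing in another's inbox.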
OK, folks. Here’s where life gets interesting! If your transaction hadn’t timed out by now, it almost surely did on this step. According to The Washington Post's graphic, "The system must determine eligibility for financial help by sending the consumer’s personal information to a Data Hub that contracts dozens of federal and state agencies."
Trying to execute a transaction across three or four key systems is a genuine challenge. Trying to hit "dozens" of state and federal agencies "in real time" is off the charts, and wholly unnecessary. Malafsky took just one interaction point to make his case:
"One of the obvious ones here is getting financial data per person to determine if they deserve a subsidy or what their price point would be, so we go off to the IRS. Now, we have some link over there, but that link is live. That means as the user is sitting there waiting at their computer screen, that has to make a link over to the IRS systems. In a perfect world, that link happens, the computers talk, I get my result, and I come back.
"What about in the real world? What about when the IRS systems are overloaded? What about when they are at capacity? What about when maybe they're doing maintenance? What about it's a network between the network operating center of the entry-level Web page that the client sees to the IRS center? Maybe there's some problems there. Maybe there's a virus. Maybe there's a Trojan horse running around and the telecoms have shut down things to solve that problem. That will kill the transaction from the point of view of the user. That is just one of many such points in this architecture," Malafsky said.
His point is that every one of those systems, as this Web architecture was designed for HealthCare.gov, is a potential Achilles’ heel. That’s a no-win situation. And again, it’s unnecessary from a workflow perspective. There are any number of points along the way where the workflow could be augmented with near-real-time data marts, right-time data marts, even human intervention to address the main failure points of automation.
The big strategic error, therefore, was trying to achieve such an incredibly complex site.
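The alternative workflow described above, with data marts and human intervention as backstops, can be sketched roughly as follows in Python. Everything named here is hypothetical: `live_lookup` stands in for the real-time call to an agency such as the IRS, `data_mart` for a batch-loaded snapshot, and `review_queue` for a list that a human caseworker would work through.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class IncomeResult:
    income: Optional[int]
    source: str  # "live", "data_mart" or "manual_review"


def lookup_income(
    person_id: str,
    live_lookup: Callable[[str], int],
    data_mart: Dict[str, int],
    review_queue: List[str],
) -> IncomeResult:
    """Try the live link first, then degrade gracefully instead of failing."""
    try:
        return IncomeResult(live_lookup(person_id), "live")
    except (TimeoutError, ConnectionError):
        pass  # the remote system is slow or down; do not kill the transaction
    if person_id in data_mart:
        # Slightly stale data from the last batch load is better than nothing.
        return IncomeResult(data_mart[person_id], "data_mart")
    review_queue.append(person_id)  # last resort: hand off to a person
    return IncomeResult(None, "manual_review")
```

The design choice is simply that a failed remote call changes where the answer comes from, not whether the user gets to continue.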
- Shopping for a Plan
Remember: This was supposed to be the original site flow. Web surfers would first shop for an insurance plan. Then, when they found something of interest, they could register for an account, check for subsidies if they wished and ultimately purchase a plan.
According to the graphic, "some individuals with low incomes are being told they are not eligible for subsidies or don’t qualify for Medicaid, even though they should." The question here becomes: Why is this problem listed under Step 5 instead of Step 4? This is a problem associated with the previous step not being calculated appropriately, and thus not being correctly factored into Step 5.
- Insurance Translation
In our world, we call this part ETL. It’s as solved a problem as site registration.
- Insurance Enrollment
The Holy Grail! But wait, there’s one last "glitch," according to HealthCare.gov's contractors: "The reports, known as 834s, are sometimes confusing and duplicative, making it difficult for insurance companies to know who their new customers really are."
Let’s take a moment of silence to appreciate this one …
So, yes, in actual fact, an insurance company must know who it is truly insuring. That’s a rather critical component. The same goes for an emergency worker knowing which person to treat, or a doctor knowing into whose chest a heart should be transplanted. In the media business, we might characterize this little ditty as a case of our federal contractors quite successfully burying the lede.
Last but not least, the graphic states that "administration officials say shoppers have filed more than 700,000 health insurance applications. Some of those have come through HealthCare.gov and others through state marketplaces. But officials refuse to say how many people have enrolled in a plan."
Perhaps the sharpest curveball thrown into the mix just recently was the move to promote paper applications due to the site’s functionality challenges. Unfortunately, even the paper forms must be submitted into the non-functioning site. By definition, that’s not a manual override. By definition, a manual override must allow someone or something to manually override the automated system.
And now, at the time of this article being published, we hear that for the relaunch of HealthCare.gov, the administration is relying more heavily on insurance companies to fix the problems. Guess what that means - I’ll bet you doughnuts to dollars (yes, it used to be the other way around) that what’s happening right now is a case of widespread rip-and-replace. Specifically, programmers and engineers have likely ripped out many of the "real time connections" and other intensely expensive middleware that got The Washington Post’s editors so excited. Replacing all that complex code are much simpler, higher-latency connections that are fed by a range of data marts linked via more of a batch environment to the various state and federal systems.
In other words, the kind of solution that Malafsky, Bloor and McAfee suggest is where we’re going. And all that fancy spaghetti code that these federal contractors spent half a billion dollars building for the past three-and-a-half years? Into the sharps container.
And one final note: According to testimony before Congress by Henry Chao, the Centers for Medicare and Medicaid Services deputy chief information officer, the payment system that will reimburse insurance companies with all those federal subsidies? It hasn’t been built yet! That means this might just be the first large-scale e-commerce site ever launched without a working means for transferring money.