Poor management helped cause most of the Australian government’s recent string of technology failures. Stilgherrian says that’s often down to cultural issues.
When it comes to managing big IT projects, the Australian government has been showering itself in something other than glory. Since the middle of 2016 we’ve seen three highly visible train wrecks, and a plethora of smaller disasters.
In November 2016, the Australian Taxation Office (ATO) suffered a systems outage lasting a couple of hours. It was merely the precursor to a series of major outages, some lasting for days, that continued through to February 2017. It’s still not fixed.
From December 2016, it became obvious that Centrelink’s automated debt identification and recovery system was a mess — organisationally, socially, and ethically. It has inflicted distress and financial pressures on the very people that Centrelink is meant to be serving, and it’s unlikely to achieve its budget goals. But the warning signs were there at least a year earlier. If only someone had known to look.
But before those two came the most public failure of all, and that’s today’s case study.
In August 2016, the Australian Bureau of Statistics (ABS) had planned to run the 2016 Census online. But their systems collapsed under the load of a series of distributed denial of service (DDoS) attacks. The eCensus became Censusfail. Humiliated, the ABS was forced to shut it all down.
All three cases have something in common. They didn’t involve a failure of technology, but a failure to manage the design and implementation of technology, and a failure to understand and to mitigate the risks.
Boards and executive management teams need to get this stuff right.
The eCensus dream that became a nightmare
According to the ABS, the Census is “Australia’s largest peacetime logistical operation”. It’s a long-trusted exercise involving 40,000 employees, 20 million postal items, and two million phone calls. Choosing to move all this data collection online was a no-brainer.
The ABS had already run an online system for the 2011 Census, but in 2016 the eCensus would become the primary mechanism. Paper was to be a last resort.
And so a massive communication campaign urged Australians to go digital, and fill in the eCensus on Census night, Tuesday 9 August. And so they did. Until it fell over.
If schadenfreude is to your taste, then the two government inquiries into Censusfail will feed even the biggest gluttons.
- 2016 Census: issues of trust by the Senate Economics References Committee.
- More readably, and my personal recommendation, the Review of the Events Surrounding the 2016 eCensus (PDF) by Alastair MacGibbon, the Prime Minister’s Special Advisor on Cyber Security.
MacGibbon’s 93-page report includes a 15-page timeline. It details a cavalcade of technical failures, communication failures, and poor crisis management.
Read it in full, if you have the time. Meanwhile, here are the highlights.
- Mid-morning, hours before Census night’s post-dinner peak times, the eCensus system started failing under DDoS attacks. This wasn’t a flood of malicious traffic, though. It was a trickle, less than one percent of the volume that other organisations deflect every day.
- The ABS and their prime contractor IBM implemented their agreed mitigation procedure. It worked at first, but a third of the malicious traffic was still getting through to the eCensus system, which couldn’t cope with even this small extra load.
- The Australian Signals Directorate (ASD), the nation’s cyber defence organisation, was called in for technical advice. That advice was incorporated into the increasingly ad hoc and rapidly failing crisis plan.
- When the evening peak period began, the eCensus system simply couldn’t keep up, even though legitimate traffic was within the ABS’s expectations. At one point, a router had to be rebooted, but somehow it lost its settings, and had to be reconfigured manually. The eCensus was overwhelmed.
- Social media was flooded with citizens complaining they couldn’t complete the eCensus. But the ABS social media team continued to tell them that everything was working “smoothly as expected”.
- While all this was happening, IBM’s network monitoring detected an unexplained spike in outbound traffic. Could it be malicious activity, perhaps someone stealing the Census data? With that fear on top of all the failures, at 9.15pm the ABS decided to shut it all down.
- The eCensus wasn’t back online until 1 day 18 hours and 44 minutes later. All up, the outages totalled 43 hours.
eCensus failures on the night, there were many
- There was no clearly identified and tested cybersecurity incident response procedure. That led to messy ad hoc decision making, and improvised fixes.
- Crisis communications were inadequate. It took too long to inform Ministers, stakeholders, and the public. There was no contingency plan or draft talking points for adverse events.
Citizens trying to do their eCensus saw an error message, and it was wrong:
“Please be advised that the 2016 Census online form is currently experiencing high volumes. Please try again in 15 minutes.”
- ASD didn’t provide timely notifications to the Australian Cyber Security Centre (ACSC), nor to the Department of Prime Minister and Cabinet.
- The government’s Cyber Incident Management Arrangements (CIMA) simply weren’t up to the task.
As you might expect, all these failures on Census night stemmed from failures in the planning stages.
eCensus cybersecurity planning failures, we’ve got a few
- There were gaps in cybersecurity arrangements. There wasn’t a comprehensive security framework, and there wasn’t an independent security assessment.
The ABS’s Request for Tender had specified an independent assessment, to be done by an Information Security Registered Assessor Program (IRAP) assessor. IBM agreed to do this, but it never happened.
- The ABS didn’t have a formal process for accepting responsibility for system security, including identifying and accepting any residual risks.
- The DDoS protection was completely inadequate, and hadn’t been tested. My understanding is that an independent test and assessment would have revealed the flaws well in advance.
- Exchanges between ABS, IBM, and ASD suggest a lack of clarity in capacity, roles, and responsibilities.
eCensus procurement, contracting, and governance failures, we’ve got a few
MacGibbon lists a dozen issues here, but nearly all of them relate to the “cosy relationship” that existed between ABS and IBM.
ABS is highly dependent on IBM’s technology. Moving to another vendor was seen as expensive, and difficult to achieve in what they saw as tight timelines. As a result, the ABS hadn’t conducted any market testing or an open approach to the market since 2008.
This “vendor lock-in” can cause problems.
- With no open approach to market, how can an organisation be sure they’re getting the best solutions or the best value for money?
- ABS held the “unrealistic assumption” that a supplier who had performed well in the past would continue to perform well in the future, even when the scope of services had changed.
- Because the ABS trusted IBM, their scrutiny and independent assessment of IBM’s security solutions were inadequate.
- ABS did not have an effective IBM outsourcing oversight framework. It appears that ABS had such long-standing trust in IBM that their assurances were taken at face value.
There seems to have been a complete breakdown of the risk management process. As just one example, MacGibbon explained how the process worked for the DDoS protection.
The 2016 online Census Risk Management Plan — dual-badged to the ABS and IBM, but owned by IBM — specifically identifies DDoS risks and associated planning. Loss of system availability via a ‘technical’ or distributed denial of service attack is identified as ‘possible’ in likelihood, ‘major’ in consequence and ‘high’ exposure.
Risk mitigations were put in place, which reduced the risk assessment to ‘unlikely’ in likelihood, ‘major’ in consequence and ‘medium’ in exposure.
In both these instances of risk planning, the high initial risk ratings do not appear to have driven a resultant focus on the effectiveness of implementation of the identified controls. Nor do they appear to have shaped preparedness for incident management and potential supporting communications strategies on Census night.
In short, if a risk was still possible and it could cause major consequences, why wasn’t that planned for?
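To make that question concrete, here is a minimal sketch of a likelihood-by-consequence risk matrix. Only the two DDoS ratings quoted above come from MacGibbon’s report. The 5x5 scales, the scoring thresholds, and the `needs_incident_plan` rule are illustrative assumptions, not the contents of the ABS and IBM plan.

```python
from dataclasses import dataclass

# Hypothetical 5x5 scales. The real plan's scales are not given in the report extract.
LIKELIHOOD = {"rare": 1, "unlikely": 2, "possible": 3, "likely": 4, "almost certain": 5}
CONSEQUENCE = {"insignificant": 1, "minor": 2, "moderate": 3, "major": 4, "severe": 5}


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: str
    consequence: str


def exposure(risk: Risk) -> str:
    """Map likelihood x consequence to an exposure band (assumed thresholds)."""
    score = LIKELIHOOD[risk.likelihood] * CONSEQUENCE[risk.consequence]
    if score <= 4:
        return "low"
    if score <= 9:
        return "medium"
    if score <= 16:
        return "high"
    return "extreme"


def needs_incident_plan(risk: Risk) -> bool:
    """Assumed policy: any risk whose consequence is major or worse needs a
    tested incident response plan, however unlikely mitigation makes it."""
    return CONSEQUENCE[risk.consequence] >= CONSEQUENCE["major"]


# The two ratings quoted from the report: before and after mitigation.
initial = Risk("DDoS loss of availability", "possible", "major")
residual = Risk("DDoS loss of availability", "unlikely", "major")

print(exposure(initial), exposure(residual), needs_incident_plan(residual))
```

Under these assumptions, mitigation moves the exposure down a band, but the consequence is unchanged. A rule keyed on consequence rather than exposure would still have demanded a rehearsed response for Census night.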
In October 2016, I wrote that Censusfail was an omnishambles of fabulous proportions.
“ABS failed to make sure its contractors were doing their job. As we’ve said so many times, you can outsource work, but you can’t outsource responsibility. Yet the ABS submission to the inquiry is studded with claims that they’d received assurances which they didn’t cross-check.”
eCensus privacy planning failures, we’ve got a few
Starting with the 2016 Census, the ABS planned to extend the time it holds on to the name and address information linked to each census record, from 18 months to four years. But the ABS mismanaged the entire process.
The Privacy Impact Assessment process was limited, with little community input outside stakeholder organisations — a problem, given the added privacy concerns of an eCensus. The ABS failed to explain their justification for this extension to the public. A grassroots protest began, which soon evolved into an ongoing media story, with politicians adding leverage.
The ABS response, however, was to tell those with concerns that they were in the minority and not to worry. As MacGibbon put it:
This coverage created overwhelming ‘noise’ making it difficult for the ABS to remain on message.
The ABS’s planned communications were being drowned out. But rather than trying to adapt its approach to limit the impact the reporting had on the public sentiment toward the Census, the ABS stuck to planned messaging ignoring the public relations storm brewing around them.
The failings of the ABS to address issues of concern in the media extend to its use of social media. Analysis conducted on ABS Twitter and Facebook accounts shows that at no point did the ABS significantly change its planned posting schedule or content as a result of critical media reporting.
The ABS did have a social media strategy, of sorts, but it wasn’t flexible. It simply couldn’t react to the groundswell of concern.
- The ABS’s ‘qualifiers’ (thresholds that had to be met to raise concern) were too high. A ‘red level scenario,’ the highest categorisation for negative conversation, was enacted only if someone had 10,000 plus followers or a post had over 30 engagements.
- The ABS’s response to a ‘red scenario’ was simply to hold all social media communications.
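A short sketch shows why per-post thresholds like these can miss a groundswell. The two threshold values come from the report as quoted above. The `Post` type, the sample numbers, and the `aggregate_alert` check are hypothetical, added only to illustrate the gap, and are not anything the ABS used.

```python
from dataclasses import dataclass

# Thresholds as described in the report.
FOLLOWER_THRESHOLD = 10_000   # "10,000 plus followers"
ENGAGEMENT_THRESHOLD = 30     # "over 30 engagements"


@dataclass(frozen=True)
class Post:
    author_followers: int
    engagements: int


def is_red_level(post: Post) -> bool:
    """Per-post qualifier: red only for big accounts or widely shared posts."""
    return (post.author_followers >= FOLLOWER_THRESHOLD
            or post.engagements > ENGAGEMENT_THRESHOLD)


def aggregate_alert(posts: list[Post], total_threshold: int = 1_000) -> bool:
    """Hypothetical alternative: alert on total engagement across all critical
    posts in a time window, so many small voices still add up."""
    return sum(p.engagements for p in posts) >= total_threshold


# Invented example: 500 critical posts from small accounts, 5 engagements each.
groundswell = [Post(author_followers=200, engagements=5) for _ in range(500)]

print(any(is_red_level(p) for p in groundswell))  # no single post qualifies
print(aggregate_alert(groundswell))               # but the total is 2,500 engagements
```

In this invented scenario no individual post reaches the red level, even though the combined conversation is large. That is the kind of blind spot a fixed, per-post qualifier creates.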
“While the ABS did eventually start engaging in the mainstream media, it was too little, too late,” MacGibbon wrote.
If you ever need a case study in how not to communicate a potentially controversial policy, this is it.
The ABS made one fundamental mistake. They continued to talk about the Census and privacy the same way they’d always done. The message was that the ABS and the Census were trusted, and that the government would take good care of your personal data.
But citizens are now seeing news stories about nation-state digital surveillance, and data breaches, including some by governments. Citizens have learned first-hand how social networks and online advertising companies collect and use their data, and increasingly they don’t like it.
Citizens have also learned how to use social media to organise themselves, amplify their message, and effect change.
The ABS simply failed to adapt to these changing times.
ABS organisational culture failures, we’ve got a few
“The ABS’s business model is old, outdated and in need of renewal. The ABS is almost missing the potential of the digital age by clinging to past practice,” MacGibbon wrote, citing an ABS-initiated review of stakeholder relationship health published in June 2016.
Consultancy firm CapDA reviewed the ABS’s ICT capacity and capability to conduct the 2016 Census, with a final report in May 2014. It wasn’t exactly positive.
- Although the ABS has project management expertise, rigorous project management is not strongly embedded within the culture and behaviours of the ABS.
- The way the ABS uses agile software development methods means that security, high performance and accessibility are considered late in the cycle.
- It was unclear where, and in whom, the responsibility and authority is vested for making key architectural decisions.
- There was no evidence that any application or data centre performance monitoring is in place.
To judge by subsequent reports, things haven’t improved much since.
If in doubt, blame the vendor
In the wake of Censusfail came a barrage of finger-pointing. The ABS blamed IBM. IBM blamed contractors. Contractors pointed to advice not taken, or options not purchased.
In their submission to the Senate inquiry, the ABS blamed IBM for the failure. The head of the ABS, Australian Statistician David Kalisch, was the captain who wasn’t taking responsibility for navigating his ship of stats.
Two months after that, when the reports came out, Kalisch had changed his tune.
“The ABS underestimated the nature, complexity and risk of the change process,” Kalisch told The Mandarin. Part of it was organisational pride: the ABS didn’t ask for enough help, because they wanted to do it all themselves.
Kalisch also acknowledged that the ABS had failed to adapt. “We expected the public, media and politicians to respond like they had in the previous censuses, and they didn’t,” he said.
The risk of outsourced IT, that’s the big one
As I wrote at ZDNet in November 2016, the two Census reports highlight government IT incompetence. Section 6.70 of the Senate report quotes the CapDA review:
“CapDA’s report highlights the professionalism and dedication of the staff at the ABS, but in the end recommends that the ABS did not have the internal capacity to develop and deploy an eCensus. If they did not have the ability to develop a solution themselves, it stands to reason that they would only have a limited capacity to question and challenge a contractor employed to develop such a solution.”
Exactly. And this is the real lesson.
If no one in your organisation understands the technology, then you can’t manage technology projects, whether they’re executed by internal staff or external contractors.
- Full video: David Kalisch and Alastair MacGibbon on Census 2016.