ATO launches tech rebuild
The Australian Taxation Office will rebuild its internal IT infrastructure after serious failures of its outsourced storage environment.
The ATO is responding to the outages of its HPE 3PAR storage area network (SAN) in December 2016 and February 2017, and will “enhance [its] IT capability pertaining to infrastructure design and implementation planning (particularly relating to resiliency and availability)”.
“This should be done having regard to recruitment, engagement of contractors, and whole‑of‑government strategies,” the ATO said in a systems report.
The report cited high-level technical causes and system design issues including improperly fitted cables, inactive monitoring tools, and a SAN design that prioritised performance over stability and resilience.
A later major outage was the result of human error when HPE technicians tried to replace SAN cabling.
“Unfortunately, during one replacement exercise, we were informed that data cards attached to the SAN were dislodged,” the ATO said.
“This caused the 3PAR SAN to act in a similar way to that noted during the December outage. This included unsuccessful steps to automatically remediate, followed by a system shut‑down to preserve data integrity.”
But there are bigger issues in the outsourced arrangement itself, in particular HPE’s system design decisions.
The ATO said its IT staff had “no direct access” to the SANs.
“Analysis of SAN log data for the six months preceding the [December 2016] incident indicated potential issues with the Sydney SAN similar to those experienced during the December outage,” the ATO said.
“Specifically since May 2016, at least 77 events related to components that were observed to fail in the December 2016 incident were logged in our incident resolution tool.
“In addition at least 159 alerts were recorded in SAN device monitoring and management logs (SNMP logs).”
HPE replaced cables connecting the SAN, but the ATO said the alerts continued.
“We were not made fully aware of the significance of the continuing trend of alerts, nor the broader systems impacts that would result from the failure of the 3PAR SAN,” the ATO claimed.
“The SAN was neither designed nor built to cater for greater than single drive failure or single cage failure.
“The SAN build [also] included ‘daisy‑chain’ cage configuration which exacerbated the risk of errors spreading across cages as occurred during the incident.”
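To illustrate what tolerating only a single drive failure means in practice, here is a minimal sketch that assumes single-parity (RAID 5 style) protection. The ATO report does not say which protection scheme the SAN used, so the scheme, the block contents and the helper name `xor_blocks` are all hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical data blocks, one per drive, plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# One drive lost (drive 1): XOR of the survivors and parity rebuilds it.
rebuilt = xor_blocks([data[0], data[2], parity])
print("single failure recovered:", rebuilt == data[1])

# Two drives lost (drives 1 and 2): the survivors only yield the XOR of the
# two missing blocks, which is not enough to recover either one.
two_lost = xor_blocks([data[0], parity])
print("double failure recovered:", two_lost in (data[1], data[2]))
```

Under this assumption, one missing block can always be recomputed, but two missing blocks in the same parity group cannot.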
The ATO said HPE had not evaluated other configuration options during setup.
“This particular SAN configuration leverages a feature known as wide-striping which is designed to significantly improve performance by reading and writing blocks of data to and from multiple drives at the same time, preventing single-drive performance bottlenecks,” the ATO said.
“When several physical disk drives were impacted by a drive firmware issue which prevented those drives from re-booting, the result was that a small number of drives temporarily and in some cases permanently prevented access to a significant amount of application data.
“This also had the effect of extending the duration and complexity of the recovery effort.”
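The effect the ATO describes can be sketched with a toy placement model. This is an illustration only: the drive counts, volume counts and function names below are hypothetical, the model ignores redundancy entirely, and it is not a description of how 3PAR actually places data. It simply counts how many volumes have at least one block on a failed drive.

```python
def place_blocks(num_volumes, blocks_per_volume, num_drives, stripe_width):
    """Return {volume: set of drives holding its blocks}.

    Each volume is striped round-robin across `stripe_width` consecutive
    drives, starting at a volume-specific offset.
    """
    layout = {}
    for vol in range(num_volumes):
        start = (vol * stripe_width) % num_drives
        layout[vol] = {
            (start + block % stripe_width) % num_drives
            for block in range(blocks_per_volume)
        }
    return layout

def volumes_hit(layout, failed_drives):
    """List the volumes with at least one block on a failed drive."""
    failed = set(failed_drives)
    return sorted(vol for vol, drives in layout.items() if drives & failed)

# Hypothetical sizes chosen for illustration only.
NUM_VOLUMES, BLOCKS, NUM_DRIVES = 8, 64, 16
FAILED = [3, 4]

narrow = place_blocks(NUM_VOLUMES, BLOCKS, NUM_DRIVES, stripe_width=2)
wide = place_blocks(NUM_VOLUMES, BLOCKS, NUM_DRIVES, stripe_width=NUM_DRIVES)

narrow_hit = volumes_hit(narrow, FAILED)
wide_hit = volumes_hit(wide, FAILED)
print(f"narrow striping: {len(narrow_hit)} of {NUM_VOLUMES} volumes touched")
print(f"wide striping:   {len(wide_hit)} of {NUM_VOLUMES} volumes touched")
```

In this model, two failed drives touch two of the eight volumes under narrow striping but all eight under wide striping, which is the trade-off between performance and blast radius that the report points to.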
The ATO cautioned that it was still not sure of the “root cause” of the issues that downed the SAN.
“Root cause examination cannot be completed until the SAN is physically removed and taken back for forensic testing. This process may not be completed until late 2017,” it said.
The ATO plans to decommission the current SAN in July, allowing HPE to ship it elsewhere for forensic analysis.
The decommissioned unit will be replaced with a newer 3PAR SAN with better data replication, failover, backup and monitoring than the prior system.