
Amazon S3 outage on Tuesday – the growth of complexity.

So if you experienced problems accessing some web sites and platforms on Tuesday 28th February 2017 between 09:37 and 13:54 PST, you may have been a victim of Amazon's S3 storage outage in their US-East-1 region data centres.

According to The Register, the cloud storage platform was experiencing 'increased error rates'. A later article reports that even the status indicators for Amazon's infrastructure couldn't be updated, because the data behind them was stored on the very platform that had failed.

The whole point of the S3 infrastructure is that data is mirrored between systems and data centres, so that if there is a failure, it can be recovered using resources elsewhere.

This didn't quite happen, and we'll pull apart Amazon's explanation to find out why manual processes should be reduced and minimised as complexity increases.

Amazon have completed and published a post-mortem of the event, which covers the original issue and the restoration process.

So the original issue was caused by running a process to remove some capacity from the S3 billing subsystem, which was having performance problems. However, the command as entered removed far more servers than intended, taking out of action servers that were running other parts of the S3 service.

The removal of the index subsystem left S3 without access to its object metadata, and the placement subsystem, which maps new storage allocations, depends on the index subsystem to do that job. So two key pieces of S3 infrastructure were running on platforms that also supported the billing system. Given the number of servers removed, both subsystems required a full restart (it's not clear whether this needed manual intervention or happened automatically). S3 is itself a key building block for other Amazon services, which in turn underpin some of their key cloud applications.
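That knock-on effect on dependent services is worth dwelling on. As a sketch only (the bucket names and regions are hypothetical, and it assumes your data is already replicated to a second region), an application that treats S3 as a building block can retry with backoff and fall back to another regional replica when one region is returning elevated error rates:

```python
# Sketch of a client-side fallback for an application that depends on S3.
# Assumes the object is already replicated to a bucket in a second region;
# the bucket names and regions here are hypothetical.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointConnectionError

RETRY_CONFIG = Config(retries={"max_attempts": 5, "mode": "standard"})

REPLICAS = [
    ("us-east-1", "example-data-us-east-1"),
    ("us-west-2", "example-data-us-west-2"),
]

def fetch_object(key: str) -> bytes:
    """Try each regional replica in turn before giving up."""
    last_error = None
    for region, bucket in REPLICAS:
        s3 = boto3.client("s3", region_name=region, config=RETRY_CONFIG)
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except (ClientError, EndpointConnectionError) as err:
            last_error = err  # elevated error rates here, try the next region
    raise last_error
```

This doesn't remove the dependency on S3, but it does stop a single-region event taking the application down with it, which is exactly the failure mode that caught out the services built on US-East-1.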

Amazon state that, given the growth they have experienced, they haven't had to perform a full restart of the indexing or placement subsystems in the larger regions for many years. It's nice to know that their systems are resilient enough to absorb the day-to-day failures within the environment, but because those restarts had never been exercised at current scale, Amazon didn't know how long the services would take to restart and complete their validation checks. Once the index service was back in operation, the placement subsystem was restarted; once that had recovered, the rest of the impacted Amazon infrastructure also began to recover.
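A scheduled recovery drill is one way to avoid that surprise. The sketch below is purely illustrative: restart_subsystem() and validate_subsystem() are hypothetical hooks into your own orchestration, and the point is simply to measure both phases at realistic scale before an outage forces the question.

```python
# Sketch of a recovery drill that records restart and validation time
# for each subsystem. The two subsystem hooks are placeholders.
import time

def restart_subsystem(name: str) -> None:
    ...  # placeholder: trigger a controlled restart in a test partition

def validate_subsystem(name: str) -> None:
    ...  # placeholder: run the integrity checks the restart requires

def timed(label: str, fn, *args) -> float:
    start = time.monotonic()
    fn(*args)
    elapsed = time.monotonic() - start
    print(f"{label}: {elapsed:.1f}s")
    return elapsed

for subsystem in ("index", "placement"):
    total = timed(f"{subsystem} restart", restart_subsystem, subsystem)
    total += timed(f"{subsystem} validation", validate_subsystem, subsystem)
    print(f"{subsystem} total recovery: {total:.1f}s")
```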

So there are a few questions here, and they are:

1. How do you stop a single mis-entered command from removing enough capacity to take a subsystem down?
2. How do you restart and validate very large subsystems quickly when a full restart is eventually needed?
3. How do you keep your status dashboard working when it depends on the platform that has just failed?

For the first, Amazon is making the capacity-removal tooling work more slowly and adding a safeguard that prevents capacity being removed if doing so would take any subsystem below its minimum required level. They are also auditing their other operational tools to ensure they have similar checks.
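A minimal sketch of that kind of safeguard, assuming a hypothetical fleet inventory and per-subsystem minimums (this is not Amazon's actual tooling), might look like this:

```python
# Sketch of a minimum-capacity safeguard on a removal tool.
# Fleet layout and thresholds are hypothetical, not Amazon's implementation.

FLEET = {
    "billing":   ["billing-01", "billing-02", "billing-03", "billing-04"],
    "index":     ["index-01", "index-02", "index-03"],
    "placement": ["placement-01", "placement-02"],
}

MIN_CAPACITY = {"billing": 2, "index": 3, "placement": 2}

class CapacityError(RuntimeError):
    """Raised when a removal would breach a subsystem's minimum capacity."""

def remove_capacity(subsystem: str, count: int) -> list[str]:
    """Refuse any removal that would drop a subsystem below its minimum."""
    hosts = FLEET[subsystem]
    remaining = len(hosts) - count
    if remaining < MIN_CAPACITY[subsystem]:
        raise CapacityError(
            f"refusing to remove {count} host(s) from {subsystem}: "
            f"{remaining} would remain, minimum is {MIN_CAPACITY[subsystem]}"
        )
    removed, FLEET[subsystem] = hosts[:count], hosts[count:]
    return removed

print(remove_capacity("billing", 2))  # allowed: 2 hosts remain, minimum is 2

try:
    remove_capacity("index", 2)       # blocked: only 1 host would remain
except CapacityError as err:
    print("blocked:", err)
```

A fat-fingered count now fails loudly instead of quietly gutting the fleet, which is the behaviour the post-mortem describes adding.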

For the second, they are looking at ways to restore systems to service more quickly, by breaking services into a larger number of smaller cells and further partitioning the indexing system, so that each cell can be restarted and validated independently.
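As a rough illustration of the cell idea (nothing Amazon-specific here: just a stable hash of object keys into a fixed number of partitions), smaller cells mean each one holds less state to rebuild and validate after a restart:

```python
# Sketch of cell partitioning: object keys map to one of N independent
# cells, so a restart only has to rebuild and validate one cell's worth
# of state at a time. The cell count and keys are illustrative only.
import hashlib

NUM_CELLS = 16

def cell_for_key(key: str, num_cells: int = NUM_CELLS) -> int:
    """Stable mapping from an object key to a cell."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells

# Recovery can then proceed cell by cell; each cell holds roughly
# 1/NUM_CELLS of the index, so its restart-and-validate time shrinks too.
for key in ("photos/cat.jpg", "logs/2017-02-28.gz", "invoices/0001.pdf"):
    print(key, "-> cell", cell_for_key(key))
```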

For the third, the Amazon Service Health Dashboard will now operate across multiple regions, so that it can be updated during a major outage.
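The same principle applies to anyone's status page, and can be sketched simply: publish every status update to copies of the page hosted in more than one region, so that losing a single region doesn't silence the dashboard. The bucket names and regions below are hypothetical.

```python
# Sketch: write a status update to static-site buckets in several regions
# so the status page stays updatable if any one region is down.
import json
import boto3

STATUS_BUCKETS = [
    ("us-east-1", "status-example-us-east-1"),
    ("eu-west-1", "status-example-eu-west-1"),
]

def publish_status(message: str, severity: str) -> None:
    body = json.dumps({"message": message, "severity": severity}).encode()
    failures = []
    for region, bucket in STATUS_BUCKETS:
        s3 = boto3.client("s3", region_name=region)
        try:
            s3.put_object(Bucket=bucket, Key="status.json", Body=body,
                          ContentType="application/json")
        except Exception as err:  # keep going: other regions may still accept it
            failures.append((region, err))
    if len(failures) == len(STATUS_BUCKETS):
        raise RuntimeError(f"status update failed everywhere: {failures}")

publish_status("Increased error rates in one region", "major")
```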

In summary, this event is a product of the emergent complexity that builds up rapidly as systems grow and expand over time, and that can remain hidden until it is too late.

So the key takeaway here is that as you scale, you still need to test how quickly your subsystems recover. Similarly, in large-scale, always-on deployments you must have safe means of removing individual elements from service, and minimising cross-system dependencies is critical to keeping the service running.

But it also helps if you don’t remove the key infrastructure elements completely at any point, as that’s tantamount to sitting on the tree branch whilst wielding the chainsaw between you and the tree trunk.

 
