Taking the positives out of a negative situation

02 March 2017 | Author: Aaron


This week, Amazon Web Services (AWS) reported problems with S3 in their US-East-1 region that affected customers using that region. The usual ‘online chaos’ headlines quickly appeared, putting AWS in the news.

As issues go, a problem with a fundamental tool like S3 will always appear greater than it is – because it affects so many users.

Our own monitoring at base2Services alerted our US team to the issue. This enabled them to contact customers and proactively review potential impacts.

AWS were very proactive about the issue. We were soon contacted by an AWS Solutions Architect, who alerted us to the problem and began discussing solutions straight away.

The impact on our clients was relatively minor. All our customers’ sites are cached via CloudFront, so to the end user everything appeared and worked as expected, apart from some mild impact on streaming videos.

For our clients, the only real issue was the inconvenience of not being able to make updates or deployments. Overall, there was no downtime to their services.

When outage issues happen, a lot of people panic, and discussion inevitably moves to cloud versus physical infrastructure.

In reality, outages, even though rare, are going to happen. In our experience the Cloud is still a far more stable environment than physical infrastructure. As an industry we continue to evolve, and incidents like this provide us with a chance to learn, adapt and respond. Had this same issue happened three years ago, the impact would have been far greater.

This week, the Personal Health Dashboard, launched by AWS last year at re:Invent in Las Vegas, proved pivotal in identifying the exact resources being impacted. This enabled us to isolate problems, react accordingly, and reduce impact.

We look upon any issue as an opportunity to analyse where we can improve. We will be integrating the Personal Health Dashboard more and more into our monitoring to further enhance our system.
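As a rough illustration of what that kind of integration could look like, here is a minimal sketch that reads open issues from the AWS Health API (the API behind the dashboard) using boto3 and groups them by service and region. It is not our production monitoring code. The function names and the sample event type codes are hypothetical, and it assumes an account with a support plan that includes Health API access and the API's us-east-1 endpoint.

```python
from collections import defaultdict


def fetch_open_issues(services, regions):
    """Return open 'issue' events from the AWS Health API (hypothetical helper)."""
    import boto3  # imported lazily so the grouping helper works without boto3

    # Assumption: the Health API is reached through its us-east-1 endpoint.
    client = boto3.client("health", region_name="us-east-1")
    events, token = [], None
    while True:
        kwargs = {
            "filter": {
                "services": services,
                "regions": regions,
                "eventTypeCategories": ["issue"],
                "eventStatusCodes": ["open"],
            }
        }
        if token:
            kwargs["nextToken"] = token
        page = client.describe_events(**kwargs)
        events.extend(page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return events


def group_by_service_and_region(events):
    """Group event type codes by (service, region) so alerts can be routed."""
    grouped = defaultdict(list)
    for event in events:
        key = (event.get("service", "unknown"), event.get("region", "global"))
        grouped[key].append(event.get("eventTypeCode", "unknown"))
    return dict(grouped)


# Hypothetical sample data shaped like describe_events results, for a dry run.
sample_events = [
    {"service": "S3", "region": "us-east-1", "eventTypeCode": "EXAMPLE_S3_ISSUE"},
    {"service": "S3", "region": "us-east-1", "eventTypeCode": "EXAMPLE_S3_ERRORS"},
    {"service": "EC2", "eventTypeCode": "EXAMPLE_EC2_ISSUE"},
]
summary = group_by_service_and_region(sample_events)
```

In a live setup, `fetch_open_issues(["S3"], ["us-east-1"])` would replace the sample data, and the grouped result could feed whatever alerting a team already uses.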

If you have any questions or wish to discuss DevOps as a Service, please contact us.

Update 3rd March 2017

AWS released a full statement detailing the human error behind the outage, and apologising for the effects on its clients.