Cause of the 12 February 2018 outage

Further down this page you will find a relatively detailed and technical description of the cause of the outage. Because we understand that not everyone needs that much detail, here is first a shorter summary of the problem:

There was a problem with the devices (switches) that handle the (redundant) internet (network) connections to our data center. While this isolated problem was being resolved, a feedback loop was unfortunately created, causing looping ("echoing") traffic to congest these multiple connections.

We have these multiple connections to the data center precisely to avoid losing connectivity to our machines; this problem, however, disrupted all of these connections, so we had to physically travel to the data center to restore the connection.

After the connection was restored, we noticed that during the outage the servers inside the data center had been unable to connect to the storage servers. As a result, we had to restart all servers to restore this connection.

Full description
As a result of a malfunction originating in one of our switches, the Greenhost network in Amsterdam (Haarlem) was unreachable between roughly 15:15 and 16:30 CET. After network connectivity was restored, it became apparent that access to the storage systems could not be re-established for all systems. The only way to restore this was to reboot all virtual machines in the network, impacting all Greenhost services.

It took until 18:00 to restore all major services, and until 18:30 before all virtual machines were rebooted and available. Individual (cloud) VPS clients might have faced additional issues related to their machines rebooting.

There has been no data loss, and all delayed e-mail was delivered afterwards.

Our network has a fully redundant design to be fault tolerant. However, one of the mechanisms in this design, the Spanning Tree Protocol (STP), was the root cause of the failure.
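
To illustrate the mechanism involved: STP makes all switches elect a single root bridge by comparing bridge IDs, which consist of a configurable priority plus the switch's MAC address, and every other switch then decides which of its ports forward and which block based on its best path to that root. The Python sketch below uses made-up switch names and priorities (not our actual configuration) to show how a re-election can move the root and thereby change which ports still carry a given VLAN's traffic.

    # A minimal sketch of 802.1D-style root bridge election, for illustration only.
    # The switch names, priorities and MAC addresses are hypothetical.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Bridge:
        name: str
        priority: int   # lower wins; 32768 is the common default
        mac: str        # tiebreaker when priorities are equal

        @property
        def bridge_id(self):
            # Bridge IDs are compared as (priority, MAC); the lowest wins.
            return (self.priority, self.mac)

    def elect_root(bridges):
        """Return the bridge every other switch will compute its path towards."""
        return min(bridges, key=lambda b: b.bridge_id)

    switches = [
        Bridge("core-1",   priority=4096,  mac="00:11:22:33:44:01"),
        Bridge("core-2",   priority=8192,  mac="00:11:22:33:44:02"),
        Bridge("access-1", priority=32768, mac="00:11:22:33:44:03"),
    ]
    print(elect_root(switches).name)   # core-1

    # If core-1 drops out or its priority changes, a re-election picks a new
    # root and every switch recomputes which ports forward and which block.
    # A port that used to forward a VLAN can end up discarding its traffic.
    switches[0] = Bridge("core-1", priority=61440, mac="00:11:22:33:44:01")
    print(elect_root(switches).name)   # core-2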

10:30 – Our engineers confirmed a problem in the traffic propagation of one of our customer VLANs through the network infrastructure.

14:00 – After several hours of tracking down the source of the problem, it was found that one of our core switches had experienced an STP re-election earlier that morning and had since been discarding the traffic of the VLAN in question.

14:30 – After consulting the documentation, possible courses of action to remedy the issue were discussed.

15:00 – After agreeing on a possible solution, an attempt was made to trigger a new STP root switch re-election, which should in theory have restored proper propagation of traffic for the VLAN in question.

15:15 – While attempting to trigger the STP root re-election, a network loop was inadvertently created over one of the access switches. This caused network congestion that left most of our infrastructure unable to communicate with the rest of the platform or the outside world. Unfortunately, the out-of-band access failed, so reverting the configuration was not immediately possible.
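
For readers wondering how a single loop can congest an entire network: Ethernet frames carry no hop limit, and a switch floods broadcasts and frames for unknown destinations out of every port except the one they arrived on. Once a loop exists, flooded frames keep circulating, and every new broadcast adds to the circulating traffic until the links saturate. The toy simulation below uses a hypothetical three-switch ring, not our actual topology.

    # A toy model of why an accidental switching loop congests the network.
    # Three switches are wired in a ring; each entry maps the port a frame is
    # sent out of to the (switch, port) where it arrives. Frames have no TTL,
    # so once a broadcast is flooded onto the ring it never leaves.
    links = {
        ("sw1", "right"): ("sw2", "left"),
        ("sw2", "right"): ("sw3", "left"),
        ("sw3", "right"): ("sw1", "left"),
        ("sw1", "left"):  ("sw3", "right"),
        ("sw3", "left"):  ("sw2", "right"),
        ("sw2", "left"):  ("sw1", "right"),
    }

    def flood(switch, in_port):
        """Send a copy out of every ring port except the one the frame came in on."""
        return [links[(switch, port)] for port in ("left", "right") if port != in_port]

    circulating = []
    for tick in range(8):
        # one fresh broadcast (ARP, DHCP, ...) enters sw1 from an edge port per tick
        circulating.append(("sw1", "edge"))
        # every frame on the ring is flooded onward; with a loop, none ever expires
        circulating = [hop for sw, port in circulating for hop in flood(sw, port)]
        print(f"tick {tick}: {len(circulating)} frames circulating")
    # the count grows every tick and never falls, which is what saturates the links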

15:25 – A team of our engineers was dispatched to our data center to fix the problem on site, while our front office tweeted about the outage and prepared a voice message for clients calling about the issue. The engineers at headquarters kept trying to find a way to restore communications remotely.

16:00 – The engineers arrived at the data center, but entry was delayed because it was too crowded there at that point in time.

16:10 – Once on site, the problematic switch was reverted to its original configuration, which restored the network. The team at headquarters immediately started verifying the state of the platform.

16:15 – At this point it was apparent that some of our infrastructure had not been able to recover from the network outage automatically. With modern approaches to web, e-mail and VPS hosting, the underlying storage depends heavily on the network to function, which makes these systems more sensitive to network disruptions, especially when timeouts occur.

16:25 – After a quick planning session, engineers set out to restart and verify the operation of the infrastructure elements that were showing erratic behaviour after the outage. This affected a large part of our services, including e-mail, web hosting and most of our VPS hosting customers.
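
Conceptually, that restart-and-verify pass looked like the sketch below: for each hypervisor, check whether its guests' storage sessions came back on their own, and restart the guests where they did not. The host names and the two remote commands are hypothetical placeholders, not our actual tooling.

    # A rough sketch of the restart-and-verify pass described above. The host
    # names and the remote commands "check-guest-storage" and "restart-guests"
    # are hypothetical placeholders, not Greenhost's actual tooling.
    import subprocess

    HYPERVISORS = ["hv-01.example.net", "hv-02.example.net"]

    def run_ok(host, command, timeout=60):
        """Run a command on a host over SSH and report whether it succeeded."""
        try:
            result = subprocess.run(["ssh", host, command],
                                    capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

    def recover(hypervisors):
        restarted = []
        for hv in hypervisors:
            # did this hypervisor's guests re-establish their storage sessions?
            if run_ok(hv, "check-guest-storage"):
                continue
            # if not, restart the guests and verify the storage check afterwards
            run_ok(hv, "restart-guests")
            if run_ok(hv, "check-guest-storage"):
                restarted.append(hv)
        return restarted

    if __name__ == "__main__":
        print("guests restarted on:", recover(HYPERVISORS))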

17:20 – Most of our hosting services were back online. An outage notice was published on our website and later updated as the situation changed.

18:30 – Everything was confirmed to be back online and functioning as expected, including the problematic VLAN that initially sparked this issue. The final update to our website and Twitter account was published.

Corrective actions:
Based on the events of yesterday, several improvements are scheduled for our operational procedures and network infrastructure design.

  • The out-of-band equipment failed. This made it necessary to send staff to the data center, delaying recovery. The design of our out-of-band facility will be adjusted.
  • Although a strict change-management policy is already in place, this policy will be improved further.
  • We are in contact with the data center about how to make sure entry delays can be eliminated. Although the delay was brief, we think it should not have happened in the first place.