There was a problem with the devices (switches) that handle the (redundant) internet (network) connections to our data center. While resolving this isolated problem, a feedback loop was unfortunately created, through which looping traffic congested these multiple connections.
We have these multiple connections to the data center precisely to avoid losing connectivity to our machines; this problem, however, disrupted all of them, so we had to travel to the data center physically to restore the connection.
After the connection was restored, we noticed that the servers inside the data center had been unable to reach the storage servers during the disruption. As a consequence, we had to restart all servers to restore that connection.
As a result of a malfunction originating in one of our switches, the Greenhost network in Amsterdam (Haarlem) was unreachable between roughly 15:15 and 16:30 CET. After network connectivity was restored, it was apparent that access to the storage systems could not be re-established for all systems. The only way to restore this was to reboot all virtual machines in the network, impacting all Greenhost services.
It took until 18:00 to restore all major services, and until 18:30 before all machines were rebooted and available. Individual (cloud) VPS clients have faced additional issues related to their machines rebooting.
There has been no data loss and all delayed e-mail was delivered afterwards.
10:30 – Our engineers confirmed a problem with the traffic propagation of one of our customer VLANs through the network infrastructure
14:00 – After several hours in pursuit of the source of the problem, it was found that one of our core switches had experienced an STP event earlier that morning and had since been discarding the traffic of the VLAN in question
14:30 – After consulting the documentation, possible courses of action to remedy the issue were discussed
15:00 – After agreeing on a possible solution, an attempt was made to trigger a new STP root switch re-election, which would in theory have enabled proper propagation of traffic for the VLAN in question (a sketch of checking which root each switch reports follows the timeline)
15:15 – While attempting to trigger the STP root re-election, a network loop was inadvertently created over one of the access switches. The resulting congestion left most of our infrastructure unable to communicate with the rest of the platform or the outside world (a broadcast-storm detection sketch also follows the timeline)
15:25 – A team of our engineers was dispatched to our data center to fix the problem on-site, while our front office tweeted about the outage and prepared a voice message for clients calling about the issue. The engineers at headquarters kept trying to find a way to restore communications remotely
16:00 – The engineers arrived at the data center and restored network connectivity on-site
16:10 – The team at headquarters immediately started verifying the state of the platform
16:15 – At this point it was apparent that some of our infrastructure was not able to recover from the network outage on its own. With modern approaches to Web, e-mail and VPS hosting, the underlying storage depends heavily on the network to function, which makes these services more sensitive to network disruptions
16:25 – After a quick planning session, engineers set out to restart and verify the operation of the infrastructure elements that were showing erratic behaviour after the outage (a reachability-check sketch follows the timeline). This affected a large part of our services, including e-mail, Web and most of our VPS hosting customers
17:20 – Most of our hosting services were back online; an outage notice had been put on our website, which was later updated as the situation changed
18:30 – Everything was confirmed to be back online and functioning as expected, including the problematic VLAN that sparked this issue initially. The final update to our website and Twitter account was published.
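For context on the kind of check involved around 14:00 and 15:00: the sketch below is a minimal, hypothetical way to ask each switch which bridge it currently considers the STP root, over SNMP. It assumes SNMPv2c read access, the pysnmp library, and the standard BRIDGE-MIB dot1dStpDesignatedRoot object; the switch hostnames and community string are placeholders, not a description of our actual setup. If switches disagree on the root, or the reported root is not the intended core switch, a re-election (for example by adjusting bridge priorities) may be needed.

```python
# Hypothetical STP root check over SNMP (not our actual tooling).
# Requires: pip install pysnmp  (the 4.x hlapi API is assumed here)
from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
)

# BRIDGE-MIB dot1dStpDesignatedRoot (scalar); verify the OID against your own MIBs.
DOT1D_STP_DESIGNATED_ROOT = "1.3.6.1.2.1.17.2.5.0"

def stp_root(switch, community="public"):
    """Return the bridge ID the switch reports as its STP root, or None on error."""
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(community, mpModel=1),           # SNMPv2c
        UdpTransportTarget((switch, 161), timeout=2),
        ContextData(),
        ObjectType(ObjectIdentity(DOT1D_STP_DESIGNATED_ROOT)),
    ))
    if error_indication or error_status:
        return None
    return var_binds[0][1].prettyPrint()

# Placeholder switch names; normally every switch should report the same root.
for switch in ["core1.example.net", "core2.example.net", "access1.example.net"]:
    print(f"{switch}: STP root = {stp_root(switch)}")
```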
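The loop created at 15:15 shows up on the wire as a broadcast storm: the same broadcast frames keep circulating and multiplying until links saturate. The sketch below is a minimal detector for that symptom, assuming Linux, root privileges, the scapy library, and placeholder values for the interface name and threshold; it is an illustration, not the monitoring we actually run.

```python
# Hypothetical broadcast-storm detector: counts broadcast frames per second on an
# interface. A sustained, abnormally high rate is a common symptom of a layer-2 loop.
# Requires: pip install scapy, and root privileges to sniff.
import time
from scapy.all import sniff, Ether

BROADCAST = "ff:ff:ff:ff:ff:ff"
IFACE = "eth0"        # assumed interface name
THRESHOLD = 5000      # assumed frames/second treated as a storm

def broadcast_rate(window=1.0):
    """Sniff for `window` seconds and return the observed broadcast frame rate."""
    frames = sniff(iface=IFACE, timeout=window, store=True)
    broadcasts = sum(1 for f in frames if f.haslayer(Ether) and f[Ether].dst == BROADCAST)
    return broadcasts / window

if __name__ == "__main__":
    while True:
        rate = broadcast_rate()
        status = "POSSIBLE LOOP" if rate > THRESHOLD else "ok"
        print(f"{time.strftime('%H:%M:%S')} broadcast frames/s: {rate:.0f} ({status})")
```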
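The verification step at 16:25 comes down to confirming that the machines can reach the storage network again before declaring a service healthy. The sketch below is a minimal TCP reachability check using only the Python standard library; the hostnames and ports (NFS 2049, iSCSI 3260) are placeholders and not a description of our actual storage setup.

```python
# Minimal reachability check for storage endpoints reached over TCP.
import socket

STORAGE_ENDPOINTS = [
    ("storage1.example.net", 2049),   # hypothetical NFS server
    ("storage2.example.net", 3260),   # hypothetical iSCSI portal
]

def reachable(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port in STORAGE_ENDPOINTS:
    state = "up" if reachable(host, port) else "UNREACHABLE"
    print(f"{host}:{port} {state}")
```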