Go North Wales service disruption last week and over the weekend.
Following the recent downtime of our main website, gonorthwales.co.uk, please see below an explanation of what caused the issue from Richard Veal, Managing Director of New Mind | tellUs, our website provider.
Following the recent service disruption, I am pleased to be able to write to you to say that a complete data restore took place over the weekend and all websites, software applications and web services are fully operational again.
I am, however, conscious that you will need to offer an explanation to the tourism providers who feature on the websites, and that you will also have concerns as to whether another outage is likely to happen in the future. To address both of these points, this letter gives a detailed breakdown of the event and also outlines broadly what our plans are going forwards.
Before I do that, however, I would like to offer a sincere apology for the disruption and for the length of time that it lasted. I would also like to thank you for your patience and trust in the team whilst we recovered the situation. Your emails of support were greatly appreciated.
As you know, New Mind tellUs has invested a great deal of time, money and effort into the hosting solution that we have built. We fully understand the importance of running a business-critical infrastructure and take great pride in the reliability of the service. To put this into context, we achieved an uptime of 99.94% in both 2017 and 2018.
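To give a sense of scale for the uptime figure above, the short sketch below converts an annual uptime percentage into the downtime it implies. It is illustrative arithmetic only: the function name `downtime_hours` is hypothetical, and the calculation assumes the percentage is measured over a full 365-day year (neither 2017 nor 2018 was a leap year).

```python
# Convert an annual uptime percentage into the downtime it implies.
# Assumption: uptime is measured over a full 365-day calendar year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def downtime_hours(uptime_percent: float,
                   hours_in_period: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime implied by an uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * hours_in_period


if __name__ == "__main__":
    hours = downtime_hours(99.94)
    print(f"99.94% uptime implies roughly {hours:.2f} hours of downtime per year")
```

On that assumption, 99.94% uptime corresponds to a little over five hours of downtime across a whole year.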
The reason that the system is usually so reliable is that it is multiply redundant at all levels. We have multiple connections to the internet, multiple power supplies and dual firewall configurations. Many years ago we also switched to the industry standard of a virtualised environment (VMware), where the database and web servers are managed independently of the hardware they are running on.
At some point, of course, the data does reside on physical disks, and we have also invested in a Storage Area Network (SAN) to provide a resilient, high-capacity storage solution. The SAN consists of 240 separate disks in a RAID array configuration, meaning that if any disks fail then the servers are immediately moved to other disks in the array with no downtime whatsoever. This configuration is actually the reason for the long periods of uptime and excellent service delivery.
The SAN is manufactured by a company called Tegile, which is in turn owned by Western Digital, one of the biggest suppliers of disk storage in the world. We pay Tegile for full hardware and software support of our SAN and they are fully responsible for its operation.
On Wednesday night, Tegile had scheduled a standard upgrade to the operating system that controls the SAN and we provided them with access to do this work. This upgrade had already been applied to many hundreds of SAN devices across the world and was therefore not perceived as a high-risk activity.
However, as soon as they applied the upgrade to the SAN, it rendered not just the disks but all the data on them inaccessible, and hence there was a total loss of service. Despite their engineers working throughout the night, they could not fix the problem, and we entered the first day of the outage with our own network engineers simply waiting for Tegile to restore access to the SAN.
Fortuitously, we have a second pool of disks in our SAN and of course we also had multiple backups of all our databases, websites and images, and so were able to use these to rebuild the hosting environment. Unfortunately, the sheer volume of data stored meant that, even with fast internet connections, processors and disk drives, restoring 30 terabytes (17 terabytes when compressed) of data took much longer than we had anticipated and there was no shortcut to achieving this, hence the extended period of downtime.
What I would say, however, is that I am very proud of how the New Mind tellUs team responded to the situation. We had staff working day and night and over the weekend to restore the service, and I could not have asked for more commitment and professionalism on their part.
Conversely, we are extremely disappointed with Tegile. Not only did a routine software upgrade inexcusably bring down all our client websites, but they did not have any roll-back plan or any way to mitigate the situation. This is something that we will be taking up with them at the highest level with a view to understanding what went wrong and gaining assurances that no repeat of the event could happen in the future.
For our own part, we are now reviewing our Disaster Recovery plans, which worked to the extent that no data was lost, but were clearly too slow to recover by your standards or ours. We have already started to think about ways in which to defend against a similar situation so that we are not so dependent upon a single supplier, and we will be sharing those plans with you ahead of implementation. We will also be taking full advantage of the know-how and resources of our new parent company, Simpleview, going forwards.
Needless to say, we are extremely sorry for the disruption and hope that it has not undermined your faith in the excellent service that you, quite rightly, have come to expect of us. We will write again once we have reviewed our Disaster Recovery plans and notify you of the positive changes we will make to reduce the risk of such an outage happening in the future.