Full disclosure: we had an issue on our server last week. A command run on the server altered a large number of files.
Initially we thought the issue was localised; however, it quickly became apparent that it was server-wide, which meant the server operating system was potentially compromised.
So… easy, we’ll restore our backups.
We use a backup system called Guardian, which is technically a very good restore tool. It takes incremental backups and lets you roll back ad hoc; in the process it runs a huge number of file and data checks, so the restored result is pretty much 100%. We were led to believe Guardian could handle any “catastrophic failure”.
When the issue was identified, we were advised to do a server restore, which was (allegedly) going to take about 2 hrs. This was well wide of the mark, as were all subsequent estimates.
Feedback was hard to come by, and we finally gave up on the restore, created a new server and started putting sites on it.
This created the next challenge: we did not have access to some companies’ DNS records, and by now it was comfortably after hours.
So what did we learn?
- The backup system we were sold was crap: it could in no way cope with “catastrophic” and it was pitifully slow.
- We were too reliant on other people and not in charge of our own destiny.
- We were too trusting of the “well tried and tested systems”.
- We were too reliant on everything being in one place (DNS, email, websites) – so one down = all down.
So what next?
First our goals:
- Restore ALL our sites in a max of 4 hrs in case of a catastrophic failure.
- Enable services to operate independently of each other (DNS vs Email vs Website)
- Ensure we are able to operate independently of the hosting company
- Ensure we are able to make DNS changes on behalf of our customers as decisions/actions need to be taken in different time zones
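The 4 hr goal above can be sanity-checked with simple arithmetic before a failure ever happens. The sketch below is only an illustration: the function names, the data volume and the throughput figure are hypothetical and not measurements from our systems. It estimates how long a restore would take at a sustained transfer rate and compares that to the goal.

```python
def estimated_restore_hours(total_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to copy total_gb of data at a sustained rate in MB/s."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (total_gb * 1024) / throughput_mb_per_s
    return seconds / 3600


def meets_restore_goal(total_gb: float, throughput_mb_per_s: float,
                       goal_hours: float = 4.0) -> bool:
    """True if the estimated restore time fits inside the goal (4 hrs by default)."""
    return estimated_restore_hours(total_gb, throughput_mb_per_s) <= goal_hours


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    total_gb = 500.0
    throughput = 50.0  # MB/s, an assumed sustained restore rate
    hours = estimated_restore_hours(total_gb, throughput)
    print(f"Estimated restore: {hours:.1f} hrs, "
          f"within goal: {meets_restore_goal(total_gb, throughput)}")
```

An estimate like this only counts raw transfer time. File checks, DNS changes and manual steps come on top, so a real test restore is still the only reliable measure.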
So our actions:
- Create a separate DNS server so it is quick to migrate sites/applications – Done
- Spread the “risk” by having several different servers and segment the sites/applications across several different servers – Done
- Create a system that enables us to restore/move sites from server to server quickly – Done
- Ensure access to all customer DNS settings or at the least explain the implications – Ongoing
- Rationalise our Amazon S3 account (off site backup) to ensure we can use it in a semi-automated fashion – Done
- Upgrade email systems to an off-server system – Ongoing
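To make the “spread the risk” action above more concrete, here is a minimal sketch of one way sites could be segmented across several servers. It is not a description of our actual tooling: the function name, site names, server names and sizes are all hypothetical. It uses a simple greedy approach, placing the largest sites first and always choosing the server that currently holds the least data, so that no single server ends up carrying most of the load.

```python
import heapq


def segment_sites(site_sizes_gb: dict[str, float],
                  servers: list[str]) -> dict[str, list[str]]:
    """Assign each site to the currently least-loaded server, largest sites first."""
    if not servers:
        raise ValueError("at least one server is required")
    # Min-heap of (current load in GB, server name).
    heap = [(0.0, name) for name in servers]
    heapq.heapify(heap)
    placement: dict[str, list[str]] = {name: [] for name in servers}
    # Sort by size descending, then by name so the result is deterministic.
    ordered = sorted(site_sizes_gb.items(), key=lambda kv: (-kv[1], kv[0]))
    for site, size in ordered:
        load, server = heapq.heappop(heap)
        placement[server].append(site)
        heapq.heappush(heap, (load + size, server))
    return placement


if __name__ == "__main__":
    # Hypothetical sites and sizes for illustration only.
    sites = {"shop.example": 10, "blog.example": 8,
             "intranet.example": 5, "landing.example": 3}
    for server, assigned in segment_sites(sites, ["server-a", "server-b"]).items():
        print(server, assigned)
```

Balancing by size is just one possible criterion. Grouping by customer, or by how critical a site is, may matter more in practice.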
What is interesting (or concerning, depending on your viewpoint) is that most hosting providers operate the way we did, so it is worth raising this with them ahead of that once-in-a-lifetime catastrophic failure.
Finally, now that we have set up our systems, we will be coming to you to update various things in your systems.