A Case Against Database as a Service
Recently, my current company was struck by a fairly unexpected availability crisis in our main database that lasted 4 days (thankfully, that included the 2 days of the weekend and "only" 2 business days; otherwise it might have been worse).
At the time we had a single critical production MySQL database running on a "managed" VM in the Google Cloud Platform (GCP) CloudSQL service. Managed in the sense that, although it is more or less a regular VM (it has a fixed number of CPUs and a fixed amount of RAM) running MySQL, the storage subsystem is more advanced: CloudSQL handles backups transparently and grows the storage for you as the database grows. It also does some automated tuning to keep performance up. None of us in the company is a full-time sysop, so this was a great service. Until it wasn't.
One day the database stopped responding. More precisely, you would get an error when trying to connect to it:
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
It needs to be said that CloudSQL databases do not run on your network as-is. You need to run cloud-sql-proxy, to which you feed an encryption key so that you are authorised to connect to the database from your regular mysql client or whatever you use. However, this was not where the failure was.
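In the MySQL protocol the server speaks first: right after the TCP connection is established it sends an initial handshake packet, whose 4-byte header is a 3-byte payload length followed by a 1-byte sequence number. The error above means the client got stuck waiting for exactly that packet. A minimal sketch of a probe that tells "nothing is listening" apart from "something accepted the connection but never sent a greeting" might look like the following. The function name, the result labels and the timeout are made up for illustration, and the address you point it at (for example wherever your local proxy listens) is an assumption about your setup.

```python
import socket


def _recv_exact(sock, n):
    """Read exactly n bytes, or return None if the peer closes first."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return None
        buf += chunk
    return buf


def probe_mysql_greeting(host, port, timeout=3.0):
    """Classify a MySQL endpoint (hypothetical helper, not a GCP or MySQL API).

    Returns one of:
      "unreachable"  - the TCP connection could not be established
      "no_greeting"  - connected, but no handshake packet header arrived
      "greeting"     - connected and a 4-byte packet header was received
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"

    with sock:
        sock.settimeout(timeout)
        try:
            # Header of the server's first packet: 3-byte length + 1-byte sequence id.
            header = _recv_exact(sock, 4)
        except OSError:
            # Timeout or connection reset while waiting for the greeting.
            return "no_greeting"
        return "greeting" if header is not None else "no_greeting"
```

Calling something like `probe_mysql_greeting("127.0.0.1", 3306)` against the local proxy port (3306 here is only an assumed port) and getting "no_greeting" would point at the far side of the connection rather than at your own network or client.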
Next we looked at the CloudSQL console on the GCP page. It has the dials to tune the database: basically start/stop controls, backup management, and the ability to set certain MySQL settings that are safe for a layman to tune. The page, however, said the database was "Under maintenance". AND you cannot stop, export or restart an instance while it is in "Under maintenance" mode.
Weird: the maintenance window for this database instance was set to Saturday evening, but the database had entered maintenance mode on Thursday evening.
At this stage there are not that many options: you can look at the logs or retrieve the last successful backup. However, backups are apparently no longer taken while the instance is in maintenance mode. So there was no easy bail-out for us, as we wanted to keep our one business day's worth of data.
Calling GCP support helped us register the issue; however, we needed a higher support tier than we had at the time to get someone from the GCP engineering team to actually take a look at it. You need to be on at least Silver (IIRC) to receive any promise of having your issue fixed by the GCP team. Customers on the lower tiers are left to their own devices: they are advised to find or open a question on StackOverflow about their CloudSQL issue. Neither StackOverflow nor a couple of Google searches were productive in our case, however.
Once we had upgraded our support tier, we contacted GCP support again and received a promise that someone would look at the issue within 24 hours. Well, there was no response during that time either. Yet another, more desperate call helped to get the issue passed along to the responsible engineers, and then, some 24 hours later, an engineer actually fixed the problem with our database instance and we finally got things working late on Monday morning.
GCP support informed us that the database had been stuck in maintenance mode because of a fix that was deployed to CloudSQL instances en masse and that, for some reason, crashed ours. The fix was deployed outside of the "maintenance window" set for our instance because they had expected the update to be innocuous. Well, it was not for us…
My retrospective notes on this are:
- Had we run the database on our own VM and done the required sysop work ourselves, the downtime could have been a couple of hours, not 4 days;
- There is no guarantee that an additional replica or an HA slave MySQL node would have helped: there is a good chance both would have received this poison-pill update and then crashed;
- You need at least the Silver support level on GCP (or the equivalent on other cloud providers) to get meaningful support response time and quality if you want to use any of the managed services;
- The previous point seems to be reflected by someone else's comment on StackOverflow.
Published: 2018-12-26