Monday 23 May 2011

How I handle "MySQL Server has gone away"

The Problem

I use Ruby and the Sinatra framework for developing web-based applications. I prefer Sinatra's lightweight approach over Rails. For similar reasons, I "roll-my-own" database classes rather than utilise something like ActiveRecord or DataMapper (the latter looks more tempting of the two for Ruby developers).

During development I've occasionally encountered the "MySQL Server has gone away" situation, which is a frequently raised issue, and various solutions exist. These range from periodic "false transactions" to keep the MySQL connection active; adding a 'before' filter to your Sinatra application to verify the connection status; and using ActiveRecord (which would appear to perform its own 'verify' prior to each call).
I'm against these solutions.

I don't like wasting resources, and these sound a lot like that. Okay, while my server is idling and under no stress then I have resources to burn, and that's typically the case. Certainly stray database access costs next to nothing when your database and application are on the same server, but imagine when your app takes off (dream of Twitter here!) and you have a distributed load-balanced farm of servers to work with. Okay - you won't be using the current solution, but to me these solutions just smack of quick-fixes without a lot of thought.

To put it another way - consider the 'verify before use' scenario. At 11 in the morning, when your application has been up for 3 hours, and served 10,000 MySQL requests, you will have asked (10,000 times) "are you there?", to which MySQL will have responded "yup". After about 9,999 queries, MySQL ought to respond with "yes ... didn't you listen the last 9,999 times!" and take its ball away. When we're working, we're working, and there's no need to check. It's only in those edge cases where we haven't been doing anything for a while we need to ask before using.

Another solution is to increase the 'wait_timeout' parameter of MySQL. By default it's 8 hours, but is increasing it a solution? What do you increase it to? 12 hours will probably get you from day-to-day, but not across a weekend. 48 hours? Okay - what about  a long weekend? And do you want the connection to remain open when nobody is going to use it for several days? No. If you aren't going to be used, don't be there.

-----

Okay. Back to how the issue arises. You connect to MySQL, and generally re-use that connection on subsequent calls, rather than open new connections for each request (because spread across a large user base, making many calls you soon run out of connections ... your solution doesn't scale that way). Effectively you're using some form of 'pooled thread manager' to handle connections between your application and the database. When both are being utilised, all is well; what happens when your application is quiet though, is that (after 8 hours / 'wait_timeout' seconds) MySQL (silently) closes the open connections. And then a disconnect exists between what your application believes is the state of the connection, and the actual state. So your application attempts a request on an inactive connection, and "MySQL Server has gone away".

Another solution is the inelegant 'try/catch' around any SQL call. Catch the error, verify it's a connection issue, and re-connect. This is not resource wasteful (on the MySQL side) but leads to sloppy coding practice. I've seen entire applications where every method is enclosed within a try/catch pair. If that were really a 'solution' (to everything) the language would already support it out-the-box, rather than demand a ubiquitous trap around every event. Prolific use of such techniques points to an architectural issue, beyond solely your database access mechanism. But enough of that ...

What we really want is to ensure our database connection state is accurately reflected within the application (when it needs to know it). It would be useful if MySQL triggered an event when it closed a connection (through timeout) rather than doing it silently (the .net modules do have such an event, but it is not for this situation - it is a reflection of a deliberate application event, not of one originating within MySQL) but it doesn't appear to - or if it does, I could find no documentation of it, nor anything in any logs - so we'll have to monitor the situation ourselves and react accordingly. However, once we've achieved this we can ensure that the application re-connects only when it needs to, yet also enable the connection to timeout naturally, based on its configuration. Moreover we would not need to unnecessarily verify the connection status prior to each call.

-----

My Solution

To implement a simple 'synchronisation' mechanism between application and database connection, I have written a simple server-side job which queries the database connection:

A cron job which runs each day at 23:30 (on a work day the connections will typically be open at this time).
  • It interrogates MySQL (SHOW PROCESSLIST / SHOW GLOBAL VARIABLES) to see how 'old' the database connections are, and when they are due to expire
  • Sleeps until this time (or 10 seconds after that time actually)
  • Re-awakens to check the connections again
  • If nothing has happened in the interim, the connections will now be inactive - in which case the application is signalled to update it's connection view to 'closed' (and hence the thread management will re-connect when next used)
  • If they are active (implying some database calls have been made) the new 'timeout' time is recalculated, and the process sleeps again, until that time.
This carries on until the new 'timeout' time is beyond the next day's process start time (in my case, 23:30 the next day) when the process quits.

On a typical week (work) day the process runs at 23:30; immediately sleeps until 03:00; wakes and tells the application that the connections are closed, and then exits. The application re-connects (around 07:00) and the day progresses as 'normal'. I have only a sleeping process which exists in the server for a few hours overnight.

At the weekends, the process runs at 23:30 then does nothing as the connections have been 'closed' the previous day.

When work continues overnight, the process can run around 03:00, then at 10:00, and 18:00. Each time it awakes briefly, tests the connection, and sleeps again.

-----

This solution keeps the application clean - it is concerned with using connections that it knows the state of, and does not add unnecessary verifications to connections that are being used and are active. The database connections can quiescence based on their timeout values and are not unnecessarily open for longer than they need be. The server has an extra process runnning, but not all of the time, and it sleeps for the majority of the time.

I need to improve the process to create 'monitor threads' on a per database level (I have isolated databases on a per system basis - at present they are all accessed synchronously, so they all timeout in harmony, but this isn't necessarily the case).

"My SQL Server has gone away" still happens - I just know when it does now.