The year is getting off to a great start already: early today the main server process on the Strathspey server crashed (i.e., was killed for lack of available memory), and I didn't notice until just now. This is especially aggravating because I spent quite some time over the last few days installing and configuring infrastructure to ensure that I find out about that sort of thing as quickly as possible. Bother.
The site needs some reliability work because – as the newly-installed uptime monitor tells us – every few hours it seems to stop cold for a minute or so before it recovers. The new infrastructure duly sends notifications about this, and I should have been concerned about a bunch of “the site is down” messages on my phone from this morning that didn't have corresponding “the site is up again” messages. I'm going to have to make sure that this sort of thing isn't missed in the future, preferably by installing a huge siren or something that is guaranteed to get my attention if my phone doesn't.
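For what it's worth, the “down without a matching up” check could be sketched roughly like this in Python. This is only an illustration: the `Alert` record, the `unresolved_outages` function and the five-minute grace period are all made up for the example, and the actual monitoring tool has its own notification format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    """Hypothetical notification record; not the monitoring tool's real format."""
    site: str
    status: str  # "down" or "up"
    at: datetime


def unresolved_outages(alerts, now, grace=timedelta(minutes=5)):
    """Return {site: time of the 'down' alert} for outages with no 'up' after `grace`."""
    open_since = {}
    for alert in sorted(alerts, key=lambda a: a.at):
        if alert.status == "down":
            # Remember the first "down" of an ongoing outage.
            open_since.setdefault(alert.site, alert.at)
        elif alert.status == "up":
            # A matching "up" closes the outage.
            open_since.pop(alert.site, None)
    # Short stalls that recover (or may still recover) within `grace` are ignored.
    return {site: at for site, at in open_since.items() if now - at >= grace}
```

Anything this function returns is an outage that has gone on longer than the usual one-minute stall, which is the case that deserves the siren rather than just another message on the phone.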
For the more technically interested, I've configured a new VM in the “PinguCloud” project which does uptime monitoring and notification for everything on my servers. This includes Strathspey and all the other stuff such as my own web site, the “Votomat” and game sites, Marie's home page, Eva's blog and so on. I'm also in the process of installing Zabbix, which does more detailed performance monitoring, in the hope of eventually figuring out exactly why the Strathspey server stalls every few hours.
Perhaps the solution is to make the server's VM bigger, which will cost a few euros – but on the other hand, I've been working on replacing Sentry with GlitchTip, which does less but is far less of a resource hog. Right now Sentry runs on its own VM in the Strathspey universe, together with a fairly huge external data volume to make sure it doesn't run out of disk space (running out makes it freeze, which is difficult to repair). GlitchTip is compatible with Sentry in that it uses the same client library for integration with Django and will (like Sentry) catch and catalogue various runtime errors that people encounter with the site, but it omits many of Sentry's very sophisticated performance analysis features, which I'm not using in any case. GlitchTip runs on the same new VM that does the monitoring, which means I will be able to get rid of the Sentry VM, by far the largest one in the Strathspey infrastructure. In effect, this will let me use a bigger VM for the Strathspey server itself and still save a little money every month.
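To illustrate what “the same client library” means in practice: with the `sentry-sdk` package, switching a Django project from Sentry to GlitchTip is mostly a matter of pointing the DSN at the new server. The following is a minimal sketch rather than the actual Strathspey configuration, and the `GLITCHTIP_DSN` environment variable is a name invented for this example.

```python
# settings.py (excerpt)
import os

import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration

sentry_sdk.init(
    # GLITCHTIP_DSN is a hypothetical variable name; the DSN itself comes from
    # the project settings in GlitchTip. If it is unset, nothing is sent.
    dsn=os.environ.get("GLITCHTIP_DSN"),
    integrations=[DjangoIntegration()],
)
```

Keeping the DSN outside the code means the same settings file works against either backend, which is handy while both are still running side by side.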