Gitlab Runners

At work we're using Gitlab – a software package that helps us organise our development. We have a fleet of “runners”, which we use to run automated tests and other tasks like packaging up our software. These can sometimes be an unruly bunch and it can be useful to be able to keep an eye on them.

When you're using Gitlab for software development, most of the work takes place on “feature branches”, where you can try stuff without interfering with the work your colleagues are doing. Once you're done with whatever bit of the code you're working on, the corresponding branch is “merged” back to the main version of the code. Just to make sure that things are on the up-and-up, Gitlab can run automated tests and let you actually do a merge only when these tests succeed.

The tests themselves are run by “Gitlab runners”, and to speed things up, we run a number of those in the Hetzner cloud. Technically there is only one runner, but it can spawn additional virtual machines in the cloud to share the work. The number of virtual machines varies dynamically between a minimum of one or two and a maximum of – in our case – 16; the runner stands up more of them when there is a lot of work to do and removes them again once things have calmed down. This is all very nice and works reasonably well, but the virtual machines can sometimes get stuck. Typically this happens because we recycle them rather than starting a fresh one for every job: Gitlab doesn't clean them up properly, and crud accumulates to the point where there isn't enough room to run new jobs. Since the Gitlab runner isn't particularly smart about detecting and fixing such a situation by itself, work can come to a virtual halt unless an administrator (i.e., me) does something manually.

I spent most of today polishing a “Gitlab runner monitor”, a program that runs in the background, checks on the Gitlab runner VMs, and kills them mercilessly when (a) their storage use is above a certain limit, such as 60% of available “disk” space, and (b) they're not actually running anything important just then. (Thankfully, the most recent version of the Gitlab runner is a lot better at detecting that a runner VM has been removed by outside forces; it no longer keeps trying to give it more work for ages until it eventually figures out that something's not doing what it should.) As a fringe benefit, the monitor also makes nice visualisations of what is running and where.

The monitor is a reasonably short Python program, based on FastAPI, and it works rather well if I say so myself. The nice thing is that HTMX makes it possible to get a very smooth UI that feels like one of the “single-page applications” people are building today at huge trouble and expense using JavaScript front-end frameworks such as Angular or React, without the hassle of having to deal with JavaScript and the overhead of megabytes of support code. HTMX lets you dynamically swap out and update parts of an HTML page based on attributes in the HTML page itself, with very little overhead. Very nice!
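To give a flavour of what this boils down to, here is a minimal sketch of the two ideas, not the actual monitor: the rule for deciding whether a VM should go, and an HTML fragment that HTMX keeps refreshing. The names RunnerVM, should_remove and status_fragment, the /vms URL and the 10-second interval are all made up for illustration; only the 60% limit and the "idle" condition come from the description above.

```python
from dataclasses import dataclass
from html import escape

DISK_LIMIT = 0.60  # example limit from the text: 60% of available disk space


@dataclass
class RunnerVM:
    name: str          # hypothetical VM name
    disk_used: float   # fraction of disk space in use, 0.0 to 1.0
    running_jobs: int  # number of jobs currently executing on the VM


def should_remove(vm: RunnerVM, limit: float = DISK_LIMIT) -> bool:
    """A VM is removed only if it is too full AND not running any job."""
    return vm.disk_used > limit and vm.running_jobs == 0


def status_fragment(vms: list[RunnerVM]) -> str:
    """Render a table that asks HTMX to re-fetch and replace it periodically.

    The hx-* attributes are standard HTMX; the /vms endpoint is an assumption.
    """
    rows = "".join(
        f"<tr><td>{escape(vm.name)}</td>"
        f"<td>{vm.disk_used:.0%}</td>"
        f"<td>{vm.running_jobs}</td>"
        f"<td>{'remove' if should_remove(vm) else 'keep'}</td></tr>"
        for vm in vms
    )
    return (
        '<table id="vms" hx-get="/vms" hx-trigger="every 10s" '
        f'hx-swap="outerHTML">{rows}</table>'
    )
```

In a FastAPI application, a route for /vms could presumably return the string from status_fragment() as an HTML response; HTMX would then swap the fresh table into the page without any hand-written JavaScript.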

As a matter of fact, I'm using Gitlab for my own stuff too, including a Gitlab runner. (I use the free version; the one we have at work is better and more expensive, but since I'm mostly working by myself, I don't need all the fancy teamwork-supporting stuff in the paid version.) I'm not banging on that runner nearly as hard as we do at the office, so a monitor isn't needed. But I could probably pinch it if I had to.