While we're talking about minor features of the Strathspey SCD Database, here's another topic we haven't really covered a lot so far. This is the “short links” service and, in particular, its relationship to The Reel, the newsletter put out by the RSCDS London Branch.
The Reel has been around for a very long time – it goes back to when Hugh Foss was an active dancer in the London branch after WWII – and London Branch has graciously made available a full archive of every issue (over 300 by now) on the Internet. Various items in the Strathspey SCD Database reference The Reel, typically dances which have appeared in its pages in the form of full dance descriptions, but also via explicit links in “Extra Info” type annotations. This includes, for example, obituaries of famous people from the SCD scene.
The London Branch web site used reasonably logical URLs for their back issues of The Reel until 2024, when there was a big redesign. Unfortunately now they look like
https://www.rscdslondon.org.uk/_files/ugd/4a642b_d692fc5cc9354460b46c49a99f593b12.pdf
where the bit between /ugd/ and .pdf looks like a hash of some sort, different from one issue to the next. (That, incidentally, is a link to The Reel issue 330.) This is obviously not something one wants to have to deal with on a regular basis, and we don't want abominations like this spread all over the database, just in case London Branch redesigns its web site again and all the links change once more.
In a nutshell, when we want to reference issue 330 of The Reel in
the database, we don't want to have to use the unwieldy URL shown
above. We would much rather use something like <<link:thereel/330>>
(using the “magic link” mechanism available in the SCDDB's dialect of
the Markdown language), which is (a) shorter and (b) more symbolic and
easier to work with. We can look up the actual link in a “short link”
database which is pre-filled with the links to the PDF files
currently available from the archives of The Reel. (RSCDS London
Branch graciously made available a file that contained issue numbers
and the corresponding PDF links, and we used that to seed the
short-link database.) So far, so good.
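To make the lookup a bit more concrete, here is a minimal sketch of how a magic link could be resolved. The dictionary standing in for the short-link database, the regular expression and the function name are our assumptions for illustration; the article does not show how the SCDDB actually renders a magic link, so the sketch simply substitutes the URL.

```python
import re

# Hypothetical stand-in for the short-link database: (namespace, tag) -> URL.
SHORT_LINKS = {
    ("thereel", "330"): "https://www.rscdslondon.org.uk/_files/ugd/"
                        "4a642b_d692fc5cc9354460b46c49a99f593b12.pdf",
}

# Matches magic links of the form <<link:NAMESPACE/TAG>>.
MAGIC_LINK = re.compile(r"<<link:(?P<namespace>\w+)/(?P<tag>[\w.-]+)>>")


def expand_short_links(text, links=SHORT_LINKS):
    """Replace every known magic link in `text` by its URL.

    Unknown links are left untouched so that they remain visible.
    """
    def substitute(match):
        key = (match["namespace"], match["tag"])
        return links.get(key, match.group(0))

    return MAGIC_LINK.sub(substitute, text)


print(expand_short_links("Published in <<link:thereel/330>>."))
```

The point of the indirection is that only the lookup table needs to change if the PDF URLs move again; the text in the database stays the same.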
The problem we need to deal with today is that the file we received from London Branch covered everything up to issue 325 of the newsletter, and of course they keep publishing more issues every so often. We can't expect London Branch to send us more links, so to update the short-link database we can “screen-scrape” the archive page on the London Branch web server to see whether new issues have appeared.
The tool of choice for screen-scraping in Python is a package called “Beautiful Soup”. This offers tools to ingest HTML pages and identify interesting bits in them. The raw HTML version of the Reel archive page is a huge unreadable mess of disgusting-looking HTML (again, we would like to emphasise that this is strictly a comment on the tool in use rather than the talents or intentions of whoever has designed the London Branch web site, but we still want to call a spade a spade), but fortunately Beautiful Soup doesn't care. We just need to hold our nose and use the Google Chrome web development tools to figure out how to find what we want to find. The HTML page basically contains a “card” for every issue of The Reel which features the issue number (in the shape of a text saying “Reel 330”) and a download button marked “Read more”. Roughly this looks like
… STUFF …
<div role="listitem">
… MORE STUFF …
<span style="font-weight: bold;" class="wixui-rich-text__text" … STUFF …>
Reel 330
</span>
… MORE STUFF …
<a href="https://www.rscdslondon.org.uk/_files/ugd/4a642b_d692fc5cc9354460b46c49a99f593b12.pdf">
Read more
</a>
… STILL MORE STUFF …
</div>
… EVEN MORE STUFF …
if we leave out lots of irrelevant clutter.
We want to be generic and therefore we add a number of attributes to the ShortLinkNamespace class we use to define types of short links in the database, such as thereel. (Currently there is another short link namespace, thethistle, which we use to refer to issues of another newsletter also archived by London Branch; we don't need to deal with new issues of The Thistle because it has been effectively “dead” for some decades, but it is still historically interesting.) These are:
- source_url: The URL for the page we want to screen-scrape.
- item_element_locator: A specification which helps us find the “card”, or more generally the repeated element which is used to present individual issues of the newsletter. This is basically an HTML element name followed by a list of attributes, so div role=listitem * will look for <div role="listitem"> elements in the page pointed to by the source_url. (The asterisk at the end is only a placeholder.)
- tag_element_locator and link_element_locator: These are used to describe the elements containing the issue number and the link to the issue. For thereel, the tag_element_locator looks like
  span style=font-weight:bold class=wixui-rich-text__text "0:TEXT:\\d+$"
  Again, the first bit is an element name that we'll be looking for inside every element that the item_element_locator dug up, and the other bits – except for the last – are the attributes we want to find on the element. (The challenge, of course, is to pick exactly the correct ones so we don't erroneously focus on something else.) Once we have located the “tag” element, we use the final bit to identify the actual issue number: the 0 refers to the first match for the tag_element_locator found (there should be only one, but it's just as well to be explicit), the TEXT refers to the textual content of the element (the text between the <span> and the </span>), and the \\d+$ is a regular expression which describes the part of the TEXT we're actually interested in (the digits at the end of Reel 330). The link_element_locator works just the same except for the link; the 0:href picks up the value of the href attribute of the first a element in the “card”.
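As an illustration of the locator syntax, here is a minimal sketch of how such a specification might be taken apart. It is not the SCDDB's actual parser: the Locator class, the function names and the use of shlex for splitting are our assumptions, as is the link locator a 0:href used at the end (the article only describes its 0:href part). With shell-style quoting, the doubled backslash inside the quotes collapses to a single one.

```python
import re
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass
class Locator:
    element: str            # HTML element name, e.g. "span"
    attrs: dict             # attributes the element is expected to carry
    index: Optional[int]    # which match to use (None for the "*" placeholder)
    source: Optional[str]   # "TEXT" or an attribute name such as "href"
    pattern: Optional[str]  # optional regular expression applied to the value


def parse_locator(spec):
    """Split a locator string into element name, attributes and extractor."""
    element, *attr_tokens, extractor = shlex.split(spec)
    attrs = dict(token.split("=", 1) for token in attr_tokens)
    if extractor == "*":
        return Locator(element, attrs, None, None, None)
    # At most two splits, so that colons inside the regular expression survive.
    index, source, *rest = extractor.split(":", 2)
    return Locator(element, attrs, int(index), source, rest[0] if rest else None)


def extract(locator, value):
    """Apply the locator's regular expression (if any) to a text or attribute value."""
    value = value.strip()
    if locator.pattern is None:
        return value
    match = re.search(locator.pattern, value)
    return match.group(0) if match else None


tag = parse_locator(
    r'span style=font-weight:bold class=wixui-rich-text__text "0:TEXT:\\d+$"'
)
print(tag)
print(extract(tag, "Reel 330"))  # prints 330
print(parse_locator("a 0:href"))  # hypothetical link locator
```

Note that the style attribute in the locator (font-weight:bold) is not spelled exactly like the one in the page (font-weight: bold;), so whatever does the matching has to be somewhat forgiving; the sketch above only parses the specification and leaves that question open.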
So in effect, the screen-scraping algorithm works like this:
- Download the page that the source_url points to.
- Use the item_element_locator to find all the “cards” on that page.
- For each “card” found, use the tag_element_locator and the link_element_locator to identify the issue number and the link to the PDF file for that issue. These go into the short-link database.
We do this for every issue of The Reel listed on the main archive page (50 or so); we don't bother to check whether the issues are already in the database, because (a) the PDF URL might have changed (although we hope not), (b) computers are fast and don't mind repeatedly doing the same work, and (c) this happens very early in the morning, anyway.
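The whole procedure can be sketched in a few lines. To keep the example runnable without extra packages, it uses the html.parser module from the Python standard library instead of Beautiful Soup, and it hard-wires the three locators for thereel rather than interpreting them; the class and function names, the dictionary standing in for the short-link database and the sample HTML are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class CardParser(HTMLParser):
    """Collect issue texts and links from <div role="listitem"> cards."""

    def __init__(self):
        super().__init__()
        self.cards = []        # one dict per card found
        self._card = None      # card currently being parsed, if any
        self._div_depth = 0    # nesting depth of <div> inside the current card
        self._in_span = False  # inside a matching <span>?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._card is None:
            if tag == "div" and attrs.get("role") == "listitem":
                self._card = {"texts": [], "links": []}
                self._div_depth = 1
            return
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            self._div_depth += 1
        elif tag == "span" and "wixui-rich-text__text" in classes:
            self._in_span = True
        elif tag == "a" and attrs.get("href"):
            self._card["links"].append(attrs["href"])

    def handle_endtag(self, tag):
        if self._card is None:
            return
        if tag == "span":
            self._in_span = False
        elif tag == "div":
            self._div_depth -= 1
            if self._div_depth == 0:  # the card's own <div> has been closed
                self.cards.append(self._card)
                self._card = None

    def handle_data(self, data):
        if self._card is not None and self._in_span and data.strip():
            self._card["texts"].append(data.strip())


def fetch(source_url):
    """Step 1: download the archive page."""
    with urlopen(source_url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(html, short_links, namespace="thereel"):
    """Steps 2 and 3: find the cards and store issue number -> PDF link."""
    parser = CardParser()
    parser.feed(html)
    parser.close()
    for card in parser.cards:
        if not card["texts"] or not card["links"]:
            continue
        match = re.search(r"\d+$", card["texts"][0])  # digits at the end of "Reel 330"
        if match:
            # Overwrite unconditionally: the PDF URL may have changed.
            short_links[(namespace, match.group(0))] = card["links"][0]
    return short_links


SAMPLE = """
<div role="listitem">
  <div><span style="font-weight: bold;" class="wixui-rich-text__text">Reel 330</span></div>
  <a href="https://www.rscdslondon.org.uk/_files/ugd/4a642b_d692fc5cc9354460b46c49a99f593b12.pdf">Read more</a>
</div>
"""
print(scrape(SAMPLE, {}))
```

Because existing entries are simply overwritten, running the scraper again over the same page is harmless, which is what lets us skip the “is this issue already in the database?” check.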
A fun side hustle is that we can use the short-link database to publish an RSS feed for the archive of The Reel. New issues will show up in the RSS feed as they are added to the archive, with links to their PDF version, so people like the proverbial Scots who are too cheap to subscribe to The Reel can subscribe to the RSS feed instead. They will get every issue a few months late, but hey, if you want to read The Reel when it comes out, the subscription page is here. All the RSS feed does is save you the hassle of having to check the archive page every day.
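For the curious, here is a rough sketch of how an RSS 2.0 feed could be generated from the short-link data with nothing but the Python standard library. The function name, the channel title and description, the channel link and the limit of 20 items are assumptions for illustration and not necessarily what the SCDDB does.

```python
import xml.etree.ElementTree as ET


def build_rss(short_links, namespace="thereel", limit=20):
    """Return an RSS 2.0 document listing the newest issues first."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # Channel metadata below is placeholder text, not the real feed's.
    ET.SubElement(channel, "title").text = "The Reel (archive)"
    ET.SubElement(channel, "link").text = "https://www.rscdslondon.org.uk/"
    ET.SubElement(channel, "description").text = (
        "Back issues of The Reel as they appear in the archive"
    )

    # Keep only numeric tags from the requested namespace, sorted by issue number.
    issues = sorted(
        (int(tag), url)
        for (ns, tag), url in short_links.items()
        if ns == namespace and tag.isdigit()
    )
    for number, url in reversed(issues[-limit:]):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"The Reel {number}"
        ET.SubElement(item, "link").text = url
        ET.SubElement(item, "guid").text = url

    return ET.tostring(rss, encoding="unicode")


print(build_rss({("thereel", "330"): "https://example.org/reel-330.pdf"}))
```

Since the feed is derived from the same table that the nightly scrape updates, a new issue would show up in the feed as soon as it has been picked up from the archive page.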