Short Links and “The Reel”

While we're talking about minor features of the Strathspey SCD Database, here's another topic we haven't really covered so far: the “short links” service and, in particular, its relationship to The Reel, the newsletter put out by the RSCDS London Branch.

The Reel has been around for a very long time – it goes back to when Hugh Foss was an active dancer in the London branch after WWII – and London Branch has graciously made available a full archive of every issue (over 300 by now) on the Internet. Various items in the Strathspey SCD Database reference The Reel, typically dances which have appeared in its pages in the form of full dance descriptions, but also via explicit links in “Extra Info” type annotations. This includes, for example, obituaries of famous people from the SCD scene.

The London Branch web site used reasonably logical URLs for their back issues of The Reel until 2024, when there was a big redesign. Unfortunately now they look like

https://www.rscdslondon.org.uk/_files/ugd/4a642b_d692fc5cc9354460b46c49a99f593b12.pdf

where the bit between /ugd/ and .pdf looks like a hash of some sort, different from one issue to the next. (That, incidentally, is a link to The Reel issue 330.) This is obviously not something one wants to have to deal with on a regular basis, and we don't want abominations like this spread all over the database, in case London Branch redesigns its web site again and all the links change once more.

At this point I would like to make very clear that the London Branch people have been very helpful and went well out of their way to ensure that we could deal efficiently with the fallout of the change. If we sound a little exasperated, this should not be construed as a criticism of London Branch or its web designer. It is more of a statement on the nature of various tools which are popular for building web sites.

In a nutshell, when we want to reference issue 330 of The Reel in the database, we don't want to have to use the unwieldy URL shown above. We would much rather use something like <<link:thereel/330>> (using the “magic link” mechanism available in the SCDDB's dialect of the Markdown language), which is (a) shorter and (b) more symbolic and easier to work with. We can look up the actual link in a “short link” database which is pre-filled with the links to the PDF files currently available from the archives of The Reel. (RSCDS London Branch graciously made available a file that contained issue numbers and the corresponding PDF links, and we used that to seed the short-link database.) So far, so good.
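To make the lookup idea concrete, here is a minimal sketch of how a magic link could be expanded. It is not the SCDDB's actual code: the SHORT_LINKS dictionary stands in for the short-link database, and the regular expression and the expand_magic_links function are hypothetical names made up for this illustration.

```python
import re

# Hypothetical in-memory stand-in for the short-link database:
# (namespace, tag) -> target URL.
SHORT_LINKS = {
    ("thereel", "330"): "https://www.rscdslondon.org.uk/_files/ugd/"
                        "4a642b_d692fc5cc9354460b46c49a99f593b12.pdf",
}

# Matches <<link:NAMESPACE/TAG>>. The syntax accepted by the real SCDDB
# Markdown dialect may well be richer than this (assumption).
MAGIC_LINK = re.compile(r"<<link:([a-z]+)/([^>\s]+)>>")


def expand_magic_links(text, links=SHORT_LINKS):
    """Replace every known magic link in `text` with its target URL."""

    def substitute(match):
        namespace, tag = match.group(1), match.group(2)
        # Unknown short links are left untouched rather than guessed at.
        return links.get((namespace, tag), match.group(0))

    return MAGIC_LINK.sub(substitute, text)


print(expand_magic_links("See <<link:thereel/330>> for the full description."))
```

The point of the indirection is that only the one table entry has to change if the PDF moves; every annotation that says thereel/330 stays as it is.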

The problem we need to deal with today is that the file we received from London Branch covered everything up to issue 325 of the newsletter, and of course they keep publishing more issues every so often. We can't expect London Branch to keep sending us new links, so to update the short-link database we “screen-scrape” the archive page on the London Branch web server to see whether new issues have appeared.

The tool of choice for screen-scraping in Python is a package called “Beautiful Soup”. This offers tools to ingest HTML pages and identify interesting bits in them. The raw HTML version of the Reel archive page is a huge unreadable mess of disgusting-looking HTML (again, we would like to emphasise that this is strictly a comment on the tool in use rather than the talents or intentions of whoever has designed the London Branch web site, but we still want to call a spade a spade), but fortunately Beautiful Soup doesn't care. We just need to hold our nose and use the Google Chrome web development tools to figure out how to find what we want to find. The HTML page basically contains a “card” for every issue of The Reel which features the issue number (in the shape of a text saying “Reel 330”) and a download button marked “Read more”. Roughly this looks like

  … STUFF …
  <div role="listitem">
     … MORE STUFF …
     <span style="font-weight: bold;" class="wixui-rich-text__text"  STUFF >
       Reel 330
     </span>
     … MORE STUFF …
     <a href="https://www.rscdslondon.org.uk/_files/ugd/4a642b_d692fc5cc9354460b46c49a99f593b12.pdf">
       Read more
     </a>
     … STILL MORE STUFF …
  </div>
  … EVEN MORE STUFF …

if we leave out lots of irrelevant clutter.

We want to be generic and therefore we add a number of attributes to the ShortLinkNamespace class we use to define types of short links in the database, such as thereel. (Currently there is another short link namespace, thethistle, which we use to refer to issues of another newsletter also archived by London Branch; we don't need to deal with new issues of The Thistle because it has been effectively “dead” for some decades, but it is still historically interesting.) These are:

  • source_url: The URL for the page we want to screen-scrape.
  • item_element_locator: A specification which helps us find the “card”, or more generally the repeated element which is used to present individual issues of the newsletter. This is basically an HTML element name followed by a list of attributes, so div role=listitem * will look for <div role="listitem"> elements in the page pointed to by the source_url. (The asterisk at the end is only a place holder.)
  • tag_element_locator and link_element_locator: These describe the elements containing the issue number and the link to the issue. For thereel, the tag_element_locator looks like span style=font-weight:bold class=wixui-rich-text__text "0:TEXT:\\d+$". Again, the first bit is the name of an element that we look for inside every element that the item_element_locator dug up, and the other bits, except for the last, are the attributes we want to find on that element. (The challenge, of course, is to pick exactly the right ones so we don't erroneously focus on something else.) Once we have located the “tag” element, we use the final bit to identify the actual issue number: the 0 refers to the first match for the tag_element_locator (there should be only one, but it's just as well to be explicit), TEXT refers to the textual content of the element (the text between the <span> and the </span>), and \\d+$ is a regular expression describing the part of the TEXT we're actually interested in (the digits at the end of Reel 330). The link_element_locator works the same way, except that it targets the link: its 0:href picks up the value of the href attribute of the first a element in the “card”.
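One way such a locator string could be taken apart is sketched below. This is an illustration only: the Locator class and the parse_locator function are hypothetical, and using shlex for the splitting is our assumption, not necessarily what the database does. (Conveniently, shlex's quoting rules turn the \\d inside double quotes into \d, but that may be a coincidence.) The link locator string used in the example, a 0:href, is also a guess at the full form.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class Locator:
    element: str                                # HTML element name, e.g. "span"
    attrs: dict = field(default_factory=dict)   # attribute values to look for
    index: int = 0                              # which match to use
    source: str = "TEXT"                        # "TEXT" or an attribute name
    pattern: str = ""                           # optional regular expression


def parse_locator(spec):
    """Split a locator string into element name, attributes and extractor."""
    *head, extractor = shlex.split(spec)
    element, *pairs = head
    attrs = dict(pair.split("=", 1) for pair in pairs)
    locator = Locator(element, attrs)
    if extractor != "*":                        # "*" is only a placeholder
        parts = extractor.split(":", 2)         # INDEX:SOURCE[:REGEX]
        locator.index = int(parts[0])
        locator.source = parts[1]
        if len(parts) == 3:
            locator.pattern = parts[2]
    return locator


print(parse_locator("div role=listitem *"))
print(parse_locator(
    r'span style=font-weight:bold class=wixui-rich-text__text "0:TEXT:\\d+$"'
))
print(parse_locator("a 0:href"))
```

Note that the attribute value font-weight:bold would still need some normalisation before it can be compared with the style="font-weight: bold;" that actually appears in the page; the sketch does not attempt that.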

So in effect, the screen-scraping algorithm works like this:

  1. Download the page that the source_url points to.
  2. Use the item_element_locator to find all the “cards” on that page.
  3. For each “card” found, use the tag_element_locator and the link_element_locator to identify the issue number and the link to the PDF file for that issue. These go into the short-link database.
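The three steps can be sketched roughly as follows. To keep the example runnable without installing anything, it uses html.parser from the Python standard library rather than Beautiful Soup, hard-codes the thereel locators instead of reading them from a ShortLinkNamespace, and feeds in a small inline sample instead of downloading the source_url. The CardParser class and the second card's example.org link are made up for the illustration.

```python
import re
from html.parser import HTMLParser

# Stand-in for the downloaded archive page (step 1). The second link is a
# placeholder, not a real URL.
SAMPLE_PAGE = """
<div role="listitem">
  <div><span style="font-weight: bold;" class="wixui-rich-text__text">
    Reel 330
  </span></div>
  <a href="https://www.rscdslondon.org.uk/_files/ugd/4a642b_d692fc5cc9354460b46c49a99f593b12.pdf">Read more</a>
</div>
<div role="listitem">
  <span style="font-weight: bold;" class="wixui-rich-text__text">Reel 329</span>
  <a href="https://example.org/reel-329.pdf">Read more</a>
</div>
"""


class CardParser(HTMLParser):
    """Collect issue number -> PDF link from the "cards" on the page."""

    def __init__(self):
        super().__init__()
        self.issues = {}
        self._depth = 0        # > 0 while inside a <div role="listitem">
        self._in_tag = False   # inside the span holding "Reel NNN"
        self._text = None      # text of the first matching span, if any
        self._link = None      # href of the first <a> in the card

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._depth:
                self._depth += 1               # nested div inside a card
            elif attrs.get("role") == "listitem":
                self._depth = 1                # step 2: a card starts
                self._text, self._link = None, None
        elif not self._depth:
            return
        elif tag == "span" and self._text is None:
            classes = (attrs.get("class") or "").split()
            if "wixui-rich-text__text" in classes:
                self._in_tag, self._text = True, ""
        elif tag == "a" and self._link is None:
            self._link = attrs.get("href")

    def handle_data(self, data):
        if self._in_tag:
            self._text += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_tag = False
        elif tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:               # step 3: the card is complete
                match = re.search(r"\d+$", (self._text or "").strip())
                if match and self._link:
                    self.issues[int(match.group())] = self._link


parser = CardParser()
parser.feed(SAMPLE_PAGE)
for number, url in sorted(parser.issues.items()):
    print(number, url)
```

With Beautiful Soup the same logic shrinks to a few find calls, which is much of the reason for using it on a page as cluttered as the real one.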

We do this for every issue of The Reel listed on the main archive page (50 or so); we don't bother to check whether the issues are already in the database, because (a) the PDF URL might have changed (although we hope not), (b) computers are fast and don't mind repeatedly doing the same work, and (c) this happens very early in the morning, anyway.
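The “don't bother to check” approach amounts to an unconditional insert-or-overwrite. A minimal sketch, assuming a plain SQLite table purely for illustration (the table layout, the column names and the store_links function are hypothetical, and the real database may look quite different):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE short_link ("
    " namespace TEXT NOT NULL,"
    " tag TEXT NOT NULL,"
    " url TEXT NOT NULL,"
    " PRIMARY KEY (namespace, tag))"
)


def store_links(conn, namespace, issues):
    """Insert one short link per scraped issue, overwriting existing rows."""
    conn.executemany(
        "INSERT OR REPLACE INTO short_link (namespace, tag, url)"
        " VALUES (?, ?, ?)",
        [(namespace, str(number), url) for number, url in issues.items()],
    )
    conn.commit()


# Running the scraper twice is harmless; the second run simply wins.
# Both URLs are placeholders.
store_links(conn, "thereel", {330: "https://example.org/old-330.pdf"})
store_links(conn, "thereel", {330: "https://example.org/new-330.pdf"})
rows = conn.execute("SELECT namespace, tag, url FROM short_link").fetchall()
print(rows)
```

Because the primary key is the namespace plus the issue number, a changed PDF URL replaces the old one instead of creating a duplicate entry.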

A fun side effect is that we can use the short-link database to publish an RSS feed for the archive of The Reel. New issues will show up in the RSS feed as they are added to the archive, with links to their PDF version, so people like the proverbial Scots who are too cheap to subscribe to The Reel can subscribe to the RSS feed instead. They will get every issue a few months late, but hey, if you want to read The Reel when it comes out, London Branch has a subscription page for that. All the RSS feed does is save you the hassle of having to check the archive page every day.
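For the curious, generating such a feed from issue numbers and links needs very little code. The sketch below uses xml.etree.ElementTree from the Python standard library; the build_feed function, the channel texts and the example.org URLs are placeholders invented for the illustration, not the feed the database actually publishes.

```python
import xml.etree.ElementTree as ET


def build_feed(issues, archive_url):
    """Return an RSS 2.0 document listing the given issues, newest first."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "The Reel (archive)"
    ET.SubElement(channel, "link").text = archive_url
    ET.SubElement(channel, "description").text = "Back issues of The Reel"
    for number in sorted(issues, reverse=True):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"The Reel {number}"
        ET.SubElement(item, "link").text = issues[number]
        # A symbolic guid keeps an issue from reappearing as "new" in feed
        # readers if its PDF URL ever changes.
        guid = ET.SubElement(item, "guid", isPermaLink="false")
        guid.text = f"thereel/{number}"
    return ET.tostring(rss, encoding="unicode")


feed = build_feed(
    {329: "https://example.org/reel-329.pdf",
     330: "https://example.org/reel-330.pdf"},
    archive_url="https://example.org/reel-archive",
)
print(feed)
```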