In these past few months I’ve been getting myself more and more into the rabbit hole of optimization. I’m discovering that the act of writing small programs that automate processes which are repetitive and boring is a very fun thing to do. I guess it’s good that I enjoy doing such things, since when it comes down to it, the main purpose of all computational technology is automation, and that’s what I will probably be doing at whatever work I’ll choose to do.

Since developing this habit of wanting to automate everything, I started to look more and more for things which could be automated in my life. I was not surprised when I found out that there were so many things that could be improved upon, each of which came with something new to learn. Learning in this way is both fun and rewarding, because your learning has a clear objective: improve this thing; do it better than it is already being done; provide a new service for everyone to use. Even the smallest thing can make you feel like you’ve created something cool, which pushes you to do more and more.

This blog post is about one such challenge I proposed to myself. The problem? Well, it is about the news-board of the official site of my university course.

I’m currently enrolled in a Computer Science program (master’s degree) at Tor Vergata, a university of Rome, and the official site for my degree can be found at http://www.informatica.uniroma2.it/. I have various issues with how the site is setup and managed (why no SSL/TLS?), too many for a single blog post. Without ranting too much I’ll get straight to the point: the way news and announcements are managed, in my opinion, is just plain horrible. Having said that, I don’t like to express my disappointment about something only to do nothing about it, and so I’ll now try to argue where are the flaws, and then I’ll offer a solution which, in my opinion, works better and, actually, should’ve been already implemented. We’ll see how the solution makes use of RSS, an amazingly simple but powerful technology which everyone should at least know about. So, let’s get started.

Table of Contents

The Problem

When you go to the official site of my course you are presented with this view

Starting page of www.informatica.uniroma2.it

What interests us is not Van Gogh’s painting, The Pink Beach Tree, a nice reference to the fact that Computer Science heavily uses tree-like data structures (although the trees that we use are rotated such that the root of the tree is at the top and the leaves are at the bottom), but rather the news section, the part that stands below the text “Primo piano”.

News section in detail

There are mainly two issues with the way these news are managed, and these are:

  1. Only the most “recent” news are shown, the one posted within a certain timeframe. As more and more news get added, the old ones are not shown anymore.

  2. There is no mechanism that notifies the students when news are posted.

While the second issue is one of efficiency, i.e. it forces the student to manually check the site everytime to know if something new has been posted, the first one is actually quite problematic, because in these news important informations might be found. I find it quite confusing to grapple with the fact that someone thinks this is a decent way to manage news and announcements.

Just the other day a colleague of mine, knowing about the solution I will shortly be discussing, asked me if I had saved an announcement that was posted on the 27 of April, 2020, detailing the guidelines to follow for graduating in this COVID-19 world, and that was later removed as other news got added in the news-board. I think its unacceptable that something like this can happen.

So, this is it, as far as the problem goes. Before introducing my solution I will now talk in general about an amazing technology called RSS.

RSS

RSS stands for Really Simple Syndication, and it is a format used by content producers on the web to notify users when new content is created and is available to be read and or seen, depending on the nature of the content. It is essentially a notification mechanism, comparable to the one you have in your smartphone, but for web content instead of messages and general apps notification.

The way RSS works is quite simple and can be described as follows:

  1. The Web creator, say a blogger, writes and posts a new blog post in his site. To actualy notify all the users, he writes some metadata about his new post inside a particular .xml file, called an rss feed.

  2. Each user can choose to “subscribe” to the rss feed of the blogger, and when new blog posts are posted online, each user can use an rss client which automatically checks all rss feeds the user has subscribed to in order to understand if something new has been posted and tell the user accordingly.

With this RSS-based workflow the user no longer has to go to the blogger site everytime to check if something new has been posted: he or she is immediately notified by the rss client when that happens. In other words, this mechanism makes more efficient the act of consuming content on the internet.

Another advantage of using RSS is that, being extremely simple, it doesn’t do anything fancy with your feed of content. It doesn’t follow the same trends of all the big companies such as Instagram, Facebook, Youtube, Twitter and so on. It doesn’t deploy AI techniques to make your feed more “interesting”, more “personal” and more “useful” (and also more confusing, definitely more confusing). It just gets you what you want, how you want it, as soon as its ready. Simple and efficient, what more could you ask for?


The syntax used by the .xml file that specifies the RSS feed is quite simple. To give an example I will show a few entries in the RSS feed of this very blog, which is available at the following link.

The general structure is as follows: in the first portion of the file there are a bunch of info regarding the particular channel, which represents the source of the content. These are formatted as follows

<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
  <channel>
    <title>Leonardo Tamiano's Cyberspace</title>
    <link>https://leonardotamiano.xyz/</link>
    <description>
      Recent content on Leonardo Tamiano's Cyberspace
    </description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 03 Sep 2020 00:00:00 +0200</lastBuildDate>
    <atom:link href="https://leonardotamiano.xyz/index.xml"
	       rel="self"
	       type="application/rss+xml"/>

    <!-- Items representing content updates go here -->

  </channel>
</rss>

As we can see the channel contains basic informations such as the title of the blog, the link, a general description, and so on.

Then, inside the channel attribute, there are the items, each of which represents a single piece of content that is to be consumed by clients. The items have the following syntax

<item>
  <title>How to manage e-mails in Emacs with Mu4e</title>
  <link>https://leonardotamiano.xyz/posts/mu4e-setup/</link>
  <pubDate>Tue, 28 Jul 2020 00:00:00 +0200</pubDate>
  <guid>https://leonardotamiano.xyz/posts/mu4e-setup/</guid>
  <description>In the past months I&rsquo;ve spent quite a bunch of
  time trying to optimize the way I interact with technology. One of
  the most important changes I made in my personal workflow is the way
  I manage e-mails. Before this change I always used the browser as my
  go-to email client. I did not like this solution for various
  reasons, such as: browsers are slow; basic text editing was a pain,
  especially considering I&rsquo;m very comfortable with Emacs default
  keybinds (yes, I may be crazy);</description>
</item>

In here we find once again basic information, but this time regarding single blog posts, or in general single pieces of content.

The site https://validator.w3.org/feed/docs/rss2.html contains the specification for RSS 2.0, which includes all the possible tags that can be put in the RSS feed, each of which with its own meaning and how it ought to be used.

My Solution

Having been aware of the simplicity of RSS, when I thought about optimizing the news-board for my degree course I immediately thought about using it. The idea I had is as basic as it gets in computer science: scrape the data from the official site, which is badly managed, and wrap it with xml tags to form a well-mainted rss feed. If I managed to do this I could solve both problems at the same time: I could be notified when something new was posted, and I had no worries that some news would be deleted, since I stored all the news in my rss feed without deleting the old ones (because, once again, why the hell would you delete them?).

To actually code the solution I choose, of course, python3, and after a full morning of work I managed to do it. The final result of the process are two RSS feed, which are available on this site itself at the following URLs

That first feed contains the news posted on the news-board on the bachelor portion of the site, while the other contains the news posted on the master portion of the site.

Both of these feeds can be used with any RSS-clients by simply using the URL to subscribe to the RSS feed. I, for example, use Emacs as an RSS feed when I’m at home, using my Desktop, or outside using my laptopt. I’ve already written a bunch of articles detailing the way I use Emacs, so go look at that if you are interested. When I’m instead using my iPhone I use another RSS reader, called NetNewsWire. There’s nothing special about it, but its simple and easy to use (although I’ve not yet managed to get attachments to download with this)

Screens taken from NetNewsWire, iOS apps to read RSS feeds

This final part of the blog post is not necessary to read, since in it I will describe in more detail how I coded the solution. So, as a final note (for those not interested in the final section), I hope it was an interesting read, and made you realize just a bit the power given by the simplicity of RSS!


The solution I found can be broken down into 4 functions, which I will briefly describe. Before starting though I will admit that the following code is probably not the most beautiful, but it does the job its supposed to do, so for now I’m satisfied.

As far as libraries goes, I used the following

import time                   # to get current date
import schedule               # to schedule execution
import datetime               # to parse data
import requests               # to do GET/POST HTTP requests
from bs4 import BeautifulSoup # to scrape HTML

The four functions I mentioned are the following ones

  • update_rss(rss_file): This is the main function which downloads the HTML page containting all the news and tries to add each news to the RSS feed as a new entry.

    def update_rss():
        r = requests.get(BACHELOR_URL)
        soup = BeautifulSoup(r.text, 'html.parser')
        entries = soup.find_all("table")
        for i in range(0, len(entries)):
            add_rss_entry(BACHELOR_RSS_FILE, entries[len(entries) - i - 1])
    
        # ----------------------
    
        r = requests.get(MASTER_URL)
        soup = BeautifulSoup(r.text, 'html.parser')
        entries = soup.find_all("table")
        for i in range(0, len(entries)):
            add_rss_entry(MASTER_RSS_FILE, entries[len(entries) - i - 1])
    
  • add_rss_entry(filename, entry): This function generates the RSS entry for a particular announcement/news, and only adds it if its “new”, i.e. if it has not been already added.

    def add_rss_entry(filename, entry):
        rss_entry = generate_rss_entry(entry)
    
        rss_file = open(filename, 'r', encoding='utf-8')
        contents = rss_file.readlines()
        rss_file.close()
    
        # NOTE: only write if entry is not already present
        if "".join(contents).find(rss_entry) == -1:
            contents.insert(8, rss_entry)
    
            rss_file = open(filename, 'w', encoding='utf-8')
            rss_file.writelines(contents)
            rss_file.close()
    
  • generate_rss_entry(entry): This function creates a new RSS entry using the data extracted from the HTML markup that contains a single announcement/news

    def generate_rss_entry(entry):
        title, date, author, description, enclosure = extract_data(entry)
    
        rss_entry = (
            "<item>\n"
            f" <title>{title}</title>\n"
            f" <author>{author}</author>\n"
            f" <pubDate>{date}</pubDate>\n"
            f' <enclosure url="{enclosure}" type="application/pdf" />\n'
            f" <description>\n<![CDATA[{description}\n]]></description>\n"
            "</item>\n\n")
    
        return rss_entry
    
  • extract_data(entry): Finally, this function is the one that deals with the scraping of the data from the website. This is the most critical function, as it relies on the the particular way the data is structured. I have to say that I was lucky that the basic structure found in every post is the same, otherwise this work would’ve been much, much harder.

    def extract_data(entry):
        header = str(entry.th)
    
        # get title
        title = header[header.find("\"/>")+4:header.find("<br/>")]
    
        # get date
        date = header[header.find("gray;\">        ") + len("gray;\">        "):header.find("inviato")].split(" ")
        dmy = date[0].split("-")
        hm = date[1].split(":")
        day, month, year = int(dmy[0]), int(dmy[1]), int(dmy[2])
        hour, minute = int(hm[0]), int(hm[1])
        date = datetime.datetime(year, month, day, hour, minute).strftime("%a, %d %b %Y %H:%M:%S %z")
    
        # get author
        author = header[header.find("inviato da ") + len("inviato da "): header.find("</span></th>")]
    
        # get description
        description = entry.select_one("tr:nth-of-type(2)").text.replace("\n", "<br />")
        enclosure = ""
    
        # -- deal with links
        links = entry.select_one("tr:nth-of-type(2)").find_all("a")
        for link in links:
            # in case of remote urls <a href="..."> text </a>
            if "http" in link.get("href"):
                description = description.replace(link.text, link.get("href"))
            else:
                enclosure = BASE_URL + link.get("href")
    
        return title, date, author, description, enclosure
    

To finish off, the main loop schedules the execution of the update_rss() function every morning at 08:00 A.M. so that each day the feeds gets updated with the most recent announcements.

if __name__ == "__main__":
    schedule.every().day.at("08:00").do(update_rss)

    while True:
        schedule.run_pending()
        time.sleep(1)