After years of studying Computer Science at university, I’ve come to realize a simple yet powerful fact: digital technology is all about processing information encoded in data. What we’re doing with technology, boiled down to its core, is building ever more powerful machines that generate, process, and manipulate ever increasing amounts of data, in order to drive activities and processes that, as their complexity increases, require ever more information.

Data are the raw contents of our computers’ memory. Data in and of itself is not that important: it is just a medium, a way to represent something else. It’s this “something else” that is important: information.

And how can one grasp what information is?

Well, to really answer such a question one could write more than an entire book – and, in fact, this has happened and will continue to happen – but in general I would say that information is anything your mind can possibly comprehend, and more. Information is all about problems, decisions, and the possible consequences of those decisions. Good information allows you to better understand whatever problem you are facing, in whatever context you are currently situated, in order to make a decision.

Information is encoded in data

↓

Information can be used to make decisions

A decision made without any supporting information is simply a shot in the dark, nothing more than a bet. To make good, strategic decisions one has to work on getting the right kind of information. How hard this is depends of course on the particular problem at hand: sometimes it’s easy, sometimes it’s almost impossible, and most of the time it’s something in between those two extremes.

For most of history, humans have faced the problem known as information scarcity. That is, even just a few hundred years ago, information was mainly encoded in books, and it was generally very hard and expensive to get your hands on good books containing useful information.

Since the rise of digital technology, which made the duplication and distribution of digital information extremely easy, we now face a different problem: we are flooded with an ever increasing amount of information, while our brains can only process a very limited amount of it. Every day there is a crazy amount of articles to read, TV series and movies to watch, books to study, and so on. You get the idea.

In this blog post I want to discuss a powerful technique called web scraping.

If done well, web scraping can help us deal with this ever increasing amount of information by carefully extracting, from all the mess present in the Web, only the kind of information we’re interested in. The power of this technique is that it can be automated, so, after some initial work, we can keep getting our information without any additional hassle.


HTML and the Web

The Web is probably one of the most tangible creations of the evolution of digital technology, since everyone uses it every day. Yet, considering its huge number of users, only a few of them understand even a little of how it works under the hood. This article is clearly not the place to discuss such huge and deep topics, but still, to understand web scraping it’s important to know how to think properly about the Web.

First of all, just what is the Web?

The Web is a network of documents

There are two key characteristics that any network possesses:

  1. What kind of elements it contains. These elements are called nodes.

  2. How the different nodes are connected to each other. These connections are called edges.

By combining different nodes together with different edges, we get the intuitive notion of a network:
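For readers who already know a bit of python, this notion of a network can be sketched as an adjacency list; the page names here are made up purely for illustration:

```python
# A tiny network: pages are the nodes, hyperlinks are the edges.
# The adjacency list maps each node to the nodes it links to.
web = {
    "index.html": ["about.html", "blog.html"],
    "about.html": ["index.html"],
    "blog.html": ["index.html", "about.html"],
}

for page, links in web.items():
    print(page, "->", ", ".join(links))
```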

Example of a network (source: wikipedia commons)

In the Web the majority of the nodes are html documents, and different nodes are connected to each other by links called hypertext references.


Whenever you go to google.com, your browser asks Google’s web servers to send back the page that contains the well known Google search form.

google search form

This page is a typical html page. But what is html?

HTML stands for HyperText Markup Language, and it’s the language used to define the structure of the documents found on the Web, that is, of html documents (or html pages). But html does not only define the structure of a document: it also defines its contents.

To be able to read html, the main thing to understand is that an html page is made up of various elements, and that each element is delimited by certain tags. Consider for example the following element, which in the html language represents a paragraph

<p> This is a paragraph! </p>

As we can see, the element is structured as follows: we have the opening tag <p>, which tells us we are entering a new paragraph; then we have the content of the paragraph itself; and then we have the closing tag </p>, which means we’re done with that paragraph and can move on to something else.

When the browser reads this html code, it interprets it and renders it on the page.

There are lots of different elements, each of which is used to characterize the purpose of a particular piece of data in a document.

The most important element in html is, without a doubt, the anchor element. This element is used to link different documents to each other. It is the element which creates the edges in the network of the Web we’ve already mentioned. The syntax of this element is the following

<a href="https://www.google.com"> Click here and go to google.com! </a>

As we can see, the tag for this element is the <a> tag. Notice also something new: within the opening tag of the element we have an attribute, in this case the href attribute, which contains the “link” our browser will go to when a user clicks on the text displayed within the tags.

Without going into too much detail about all the possible tags, what is important to realize is that thanks to these tags the html language allows one to define both the different elements of the document as well as how they are related to one another. That is, by putting an element within another element, we can create a hierarchical structure.

Consider the following

<body>
  <div id="content">
    <h1> Headline level 1 </h1>
  </div>
</body>

The body element is used to contain all the content of the document, while the div element is a generic element with no particular meaning, and it is used by front-end developers to create complex graphical layouts by combining html with css.


Observation: css stands for Cascading Style Sheets, and it’s the language used to define the style of an html document. This blog post is not about css, because when doing web scraping we’re not at all interested in this particular language.

However, just to show what it looks like, the following css snippet sets the font size of all text within the body element as well as the background color of the page.

body {
    font-size: 30px;
    background-color: black;
}

Returning now to the previous html snippet, what’s important to realize here is that the h1 element is inserted within the div element, which is itself inserted within the body element. It’s exactly this nesting of elements within other elements that defines the structure of an html document.

Consider the following example of a full html document.

<!DOCTYPE html>
<html>
<head>
  <title> Page Title </title>
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>

<body>
  <div id="content">

    <h1> Headline level 1 </h1>
    <p> This is a paragraph! </p>

    <div id="footer">
      <p class="author"><b>Author</b>: Leonardo Tamiano</p>
    </div>
  </div>
</body>

</html>

As we can see, the first line is the DOCTYPE declaration, which specifies that the document is an html document, as well as the particular version of html used (yes, there are multiple versions of html, each with its own peculiarities). Then we have the head element, which contains various metadata, that is, data that encodes information regarding the other data in the document, such as the title of the page, what kind of content there is in the page, how it is encoded, and so on.

We can visualize the structure of this document using a tree, which is a famous data structure used throughout computer science.

HTML document structure represented as a tree

As we can see, this is another sort of “network”, but here the nodes are the various elements of the html page, while the edges represent which element contains which other element.

That is, consider the body element. There is an arrow from html to body, and this arrow represents the fact that in the document the html element (the root element) contains the body element. All the other edges encode the same kind of information but for different elements.

Without going into too much detail, this structure is also called the Document Object Model, DOM for short. By using the programming language javascript one can modify the structure of the DOM to create dynamic pages that change depending on the user’s actions.
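The nesting that produces this tree can even be made visible with a few lines of python, using only the standard library’s built-in html parser (no scraping library involved yet):

```python
from html.parser import HTMLParser

# Print the nesting of an html snippet as an indented tree:
# each opening tag goes one level deeper, each closing tag
# comes back up one level.
class TreePrinter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

p = TreePrinter()
p.feed("<body><div id='content'><h1>Headline</h1></div></body>")
print("\n".join(p.lines))
# body
#   div
#     h1
```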


To recap briefly, when we download a web page, there are three main types of files we download (excluding all the images, sounds and other media types):

  • The .html files, which define the structure of the content and the content itself that we’re looking for.

  • The .css files, which define how the content is displayed on the screen to the user.

  • The .js files, which define how the structure of the content and the content itself changes depending on the user action.

When doing web scraping we’re typically only interested in the .html pages, since that’s where the data we want is found. Sometimes, however, the final html is obtained only after executing a lot of javascript code. So sometimes just downloading the html page is not enough: we also need to download the related js code and execute it to get the final html page.

Why would you do web scraping?

Now that we’ve briefly introduced what the Web is and how HTML is structured, before going into the technical details of how to do web scraping, I will mention a couple of use cases where knowing about web scraping could come in handy.


USE CASE #1: Display useful information in a different way

This use case is very real for me, because just recently I’ve had to implement it.

Currently I’m enrolled at the University of Tor Vergata, studying Computer Science. As with everything in life, there are some good things and some bad things about this university. What bugs me the most, however, is that the official site of the course is just plain ugly and not very efficient to use.

Official site for Computer Science @ Tor Vergata

So, I asked myself: what can I do about it?

And that’s where web scraping came in handy, because what I did was:

  1. I wrote a scraper which downloads all the web pages of the official site and extracts from them all the useful data that I need, such as exam and lecture dates, latest news, teachers’ contact data, course data, and so on.

  2. Then, working with a university friend of mine, we built (although I have to say it was he who did the heavy lifting on the design and the style) a new web site which shows the same information in a different style, one which in our opinion is much cleaner and easier to use.

The scraper can be found here: scraper.py, while our version of the site looks like this

Our version of the site (informatica.tvstudents.xyz)

Now that we have the raw data we could offer a series of interesting services such as:

  • The ability to get notified via RSS when news is posted.

  • A log which shows the recent changes on the underlying data. This could be useful to know when a professor has uploaded new material.

  • An emailing service which alerts students when exam dates are set.

These are just a few of the endless services one can offer once the raw data is available. And to get the raw data, well, web scraping was necessary.

For those interested in the project just mentioned, you should check out the related github repo: PinkPeachProject.


USE CASE #2: Show train tickets when they reach a certain price threshold

While the previous use case was very real, this one is not yet real enough for my taste. Still, I wanted to mention it anyway, because in the following weeks I might be working on something like this as a side project.

Recently a friend of mine moved to Milano to study at Polimi for his master’s degree in Computer Engineering. This means that in the following years from time to time I’ll be taking trains to go and meet up with him.

Now, since train tickets cost money, and since ticket prices change from time to time, I would like to write some python code that would periodically check the prices of the trains that go from Rome to Milan and back, and alert me when it finds tickets below a certain price threshold.

I’m not sure how practical this system would be, as I’m also not sure how much variation there is in train ticket prices (because if the variation is minimal, then the effort is probably not worth it). Still, it could be a cool idea to automate. If you ever implement something like this, or if you know something else that already does this, feel free to contact me :D
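Even without the scraping part written, the overall shape of such a watcher is easy to sketch. Everything below is hypothetical: fetch_min_price is just a placeholder for the actual scraping logic, and the threshold is an invented number.

```python
import time

THRESHOLD = 30.0  # euros: alert below this price (made-up value)

def fetch_min_price():
    """Hypothetical placeholder: the real version would scrape the
    ticket site (requests + BeautifulSoup) and return the cheapest
    Rome-Milan fare found, as a float."""
    raise NotImplementedError

def should_alert(price, threshold=THRESHOLD):
    # A deal is any fare strictly below the threshold.
    return price < threshold

def watch(interval_hours=6):
    # Check periodically and print whenever a good deal appears.
    while True:
        price = fetch_min_price()
        if should_alert(price):
            print(f"Deal found: {price:.2f} EUR")
        time.sleep(interval_hours * 3600)
```

The watch loop runs forever; in practice one would probably run the check under cron or a systemd timer instead.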

How would you do web scraping?

Finally, after a lot of words, we can start to get our hands dirty.

I mean, if I did a good job, at this point you understand a little better why web scraping can be very useful. Let us now tackle the remaining question: how do you actually do it?

As with everything in programming, our first choice is the programming language we want to use. In this blog post we’ll use python3, because it’s a very flexible and powerful language.

But a language can be used to do many things, and each single thing can be done in many different ways. This is also true for web scraping. There are various libraries that allow you to do web scraping in python. The one we’ll use is called BeautifulSoup.

To download the library one can use pip, the package manager for python

pip install beautifulsoup4

Since each library has its own naming conventions and ways of doing things, I always suggest reading the official documentation for a complete overview of what the library can do. Here I’ll just give you enough pieces of code to get started.

If you’re Italian, at this point I suggest you watch the video I made on web scraping on my youtube channel:

Of course, if you don’t understand Italian this might be hard for you, so just keep reading for the English version.


The html code we’ll scrape is the following, the same as the one shown before.

<!DOCTYPE html>
<html>
<head>
  <title> Page Title </title>
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>

<body>
  <div id="content">

    <h1> Headline level 1 </h1>
    <p> This is a paragraph! </p>

    <div id="footer">
      <p class="author"><b>Author</b>: Leonardo Tamiano</p>
    </div>
  </div>
</body>

</html>

For now let us assume we have saved this page in a file named web_page_example.html. Later we’ll show how we could download it from the web if we didn’t already have it saved.

Since scraping is all about extracting useful information from the raw html data, to start off let’s try to extract from the html file the name of the author, that is, the string “Leonardo Tamiano”.

The usage of the library will be broken down in three steps:

  • Step 1 – Parsing the HTML file.

    The first thing we’ll always need to do is import the beautifulsoup library

    #!/usr/bin/env python3
    
    from bs4 import BeautifulSoup
    

    after that, we have to read the html contents of the page we want to scrape. Since in this case we assume we already have the html document in a file, we can do a simple open

    with open("./web_page_example.html", "r") as f:
        text = f.read()
    

    Finally, we have to parse the HTML file

    soup = BeautifulSoup(text, 'html.parser')
    

    this call essentially stores in the soup variable the main data structure which the library then uses to navigate the html document and extract data from it.

  • Step 2 – Navigating the HTML file.

    After we have parsed the file we can query the structure to navigate it.

    Remember that tree I showed before? Well, now we’ll navigate it. That is, starting from the root of the tree – the highest node, the one above every other node – we’ll go to different nodes depending on our needs.

    For example, we might be interested in going over all the p elements, which can be done with

    # -- find all tags of the form <p> ... </p>
    paragraphs = soup.find_all("p")
    

    this method returns a list containing all the elements that matched the query. Every element in the returned list can then be used to search for other elements, depending on the document structure (that is, the tree I showed before).

    In our particular case, paragraphs will be a list of two elements: the first is the one corresponding to the paragraph “This is a paragraph!”, while the second is the one containing the information about the author.

    While from the first paragraph we cannot go any deeper into the tree (such end nodes are called leaves), from the second paragraph we can go into the b element, because in the tree the b element is connected to the second paragraph.

    If no element is found, then an empty list [] is returned.

    if paragraphs:
        print("[INFO] - Found at least one paragraph :)")
    else:
        print("[INFO] - No paragraph was found :(")
    

    Notice, however, that in our particular case we are only interested in one paragraph: the paragraph element which has the class author. We can add filters to our query by passing a dictionary, as follows

    # -- find the first tag of the form <p class="author"> ... </p>
    author_p = soup.find("p", {"class": "author"})
    

    This time the method is find, not find_all, and thus it returns only the first element that matches the query. If nothing is found, None is returned. This means we can check whether we found something with

    if author_p:
        print("[INFO] - A paragraph of class 'author' was found :)")
    else:
        print("[INFO] - NO paragraph of class 'author' was found :(")
    

    Besides the find method, we can also explore the html document with the dot (.) syntax. For example, if we write author_p.b we get the first element with the <b> tag inside the p element with class="author". This gets us the same result as author_p.find("b").

  • Step 3 – Extracting data from the HTML file.

    Once we have obtained the element we’re interested in, to extract the data we can do

    # -- this gets us: "Author: Leonardo Tamiano"
    full_text = author_p.text
    # -- a bit of string processing isolates the name: "Leonardo Tamiano"
    name = full_text.split(": ")[1]
    

    this, however, only extracts the textual data. If we want to extract both the text and the html tags, we have to use

    # -- this gets us: "<b>Author</b>: Leonardo Tamiano"
    content = author_p.decode_contents()
    

This may not seem like much, and in fact it isn’t, but using this library mainly boils down to this: making calls to find and find_all, and in general understanding how to move within the structure created by the library. Once we reach the interesting elements, we can then call text or decode_contents() to extract the data.

Meanwhile, all of Python’s facilities for string and list manipulation can be used for further processing.
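Putting the three steps together, the whole pipeline fits in a dozen lines. This sketch inlines the relevant part of the html so it runs on its own, instead of reading it from web_page_example.html:

```python
#!/usr/bin/env python3
from bs4 import BeautifulSoup

# The page from before, inlined so the script is self-contained.
html = """
<body>
  <div id="content">
    <h1> Headline level 1 </h1>
    <p> This is a paragraph! </p>
    <div id="footer">
      <p class="author"><b>Author</b>: Leonardo Tamiano</p>
    </div>
  </div>
</body>
"""

# Step 1 -- parse the html
soup = BeautifulSoup(html, "html.parser")

# Step 2 -- navigate to the element we care about
author_p = soup.find("p", {"class": "author"})

# Step 3 -- extract the data
name = author_p.text.split(": ")[1]
print(name)  # -> Leonardo Tamiano
```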


In the snippets of code shown so far, the html data was already saved in a file. How can we actually download it from the web? This step is critical, and later we will come back to it when discussing the services offered by ScraperAPI.

Let us assume now that the previous html code is no longer saved in a file, but is served by a web server. To try this on your own machine, you can simply host a web server locally with the following command. Just be sure that the folder from which you execute this command contains the web_page_example.html file we want to download.

python3 -m http.server 1337

Now, if we go to the url http://localhost:1337/web_page_example.html, our browser downloads the html page shown before.

To download a web page automatically we need a different python library called requests. As always, the first thing to do is to download it with pip

pip install requests

Once we have downloaded it, we can use it as follows

import requests

URL = "http://localhost:1337/web_page_example.html"
r = requests.get(URL)

if r.status_code != 200:
    print("Could not download page!")
    exit()

text = r.text

we can then give the text variable to the BeautifulSoup library as shown before.


So, that’s it for this brief showcase of the BeautifulSoup library. If you want to see a real-life example of using the library, I would once again suggest reading the scraper I wrote for that university project: scraper.py.

For example, consider the get_teachers_list() function, which downloads the contact data of the various teachers. Since the code is short, I will simply paste it here and discuss it a little.

The code is mainly divided into three parts:

  • In the first part we compute the proper URL, and then we use the requests library to download the web page containing the contact data of the teachers.

    def get_teachers_list(self):
        """
        Download the information regarding the professors.
        """
    
        # -- compute URL
        cdl_param = self.__cdl_param()
        URL_PARAMS = f"/f0?fid=30&srv=4&cdl={cdl_param}"
        URL = self.BASE_URL + URL_PARAMS
    
        r = requests.get(URL)
        if r.status_code != 200:
            print(f"[(WARNING) {self.degree}]: Could not download teachers data")
            exit()
    
        print(f"[{self.degree}]: Downloading teachers data")
    
  • Then we open a file in which we will save all the data we extract from the web page.

        # -- compute current school year in the form 19-20, 20-21
        scholar_year = get_current_school_year()
    
        file_path = f"{self.DATA_ROOT}/{scholar_year}/{self.directories['basic_dirs']['teachers']}/{self.directories['basic_dirs']['teachers']}.csv"
        with open(file_path, "w+") as out:
            # -- first row with metadata
            out.write("nome,qualifica,studio,telefono,mail,homepage,insegnamenti\n")
    
  • Finally, we give the downloaded html page to the BeautifulSoup library, search for a table element, and go over all the rows (tr elements) of the table. From each row’s columns (td elements) we extract the various pieces of data, and we write all of it to the underlying file

            soup = BeautifulSoup(r.text, 'html.parser')
            table = soup.find("table")
            rows = table.find_all("tr")
    
            for row in rows[1:]:
                cols = row.find_all("td")
    
                # -- extract data from columns
                nome = cols[0].a.decode_contents().strip()
                qualifica = cols[1].decode_contents().strip()
                studio = cols[2].decode_contents().strip()
                telefono = cols[3].decode_contents().strip()
    
                mail = cols[4].a.img['title'].strip() if cols[4].find("a") else ""
                homepage = cols[5].a['href'].strip() if cols[5].find("a") else ""
                # cv = ""
                insegnamenti = ""
                for ins in cols[7].find_all("a"):
                    insegnamenti += ins.decode_contents().strip() + "-"
                insegnamenti = insegnamenti[:-1]
    
                # -- write to file
                out.write(f"{nome},{qualifica},{studio},{telefono},{mail},{homepage},{insegnamenti}\n")
    

Thus, after this has finished executing, instead of having the teachers’ data embedded in an html document with lots of other tags, we will simply have a .csv file containing only the data we’re interested in.

Graphically,

  • This is the view we get from the web browser.

    Web browser view
  • This instead is the underlying html code of that page, which contains the data we’re interested in but is very hard to read.

    HTML code view
  • Finally, this is the content of the .csv file, opened in Emacs (it would look the same in any other text editor), after the data has been scraped from the html document.

    CSV (plaintext) view
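Once the data is in .csv form, any further processing is trivial; for instance, python’s standard csv module can read it straight back. The sample row below is invented for illustration, not real scraped data:

```python
import csv
import io

# A made-up sample of what the scraped file could look like,
# using the same header row the scraper writes.
sample = io.StringIO(
    "nome,qualifica,studio,telefono,mail,homepage,insegnamenti\n"
    "Mario Rossi,Professore,A101,0612345678,rossi@example.com,,Algoritmi\n"
)

# DictReader maps the header row onto each data row.
rows = list(csv.DictReader(sample))
print(rows[0]["nome"])  # -> Mario Rossi
```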

I hope this little example also gets across another point:

The complexity of web scraping depends mainly on how the web page we want to scrape is structured

If a page is well structured, and if we can find the id and class attribute values of the elements we want to scrape, then scraping the data becomes easy. If the page is a mess, however, it can be quite hard to make sense of the structure and extract the data we need.


So, this is it! I hope it was an interesting read, and if you want to give some feedback, well, feel free to write an email or a comment down below.

Have fun scraping and programming!