Scraping Web Pages with Python

December 23, 2014

Now and then I have a particularly annoying problem to deal with: emailing a large number of students given only their names. For instance, during group advising I need to email the students who expressed interest in a class to tell them that they can register for it.

The annoying part is that there's no way to work out a student's email address from their name alone. John Smith might be jsmith4 or jsmith7, for instance. Instead, I have to use the directory page at http://www.umw.edu/directory/ to look up each email. For more than a couple of students, this is too tedious to deal with.

As programmers, we can automate repetitive tasks like this one. To automate a task that uses a web site, we first look at how the site really works. If I search for John Smith in the directory and hit enter, my browser goes to this page: http://www.umw.edu/directory/?adeq=john+smith&ad_group%5B%5D=&search-choice=people

"John" and "Smith" got encoded in the URL. This tells the directory web page who we are searching for (this is called an HTTP "GET" request). This is really nice, because we can build new search URLs without having to actually go to the search box at all. The next step would be to write a program to generate the URLs for us. Something like this in Python:


first = input("Enter first name: ")
last = input("Enter last name: ")
url = "http://www.umw.edu/directory/?adeq=%s+%s&ad_group%%5B%%5D=&search-choice=people" % (first, last)

If we were to run this program, it would ask for the person we are searching for, then build a link that will search for them. This is only half helpful, however. The next step would be to access the link in our program, and automatically find the email address.
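One caveat with plain string formatting is that a name containing a space, an apostrophe, or an accented letter would not be escaped. As a minimal sketch, Python's standard urllib.parse.urlencode can build the same query string and do the escaping for us. The name build_search_url is made up for this example, and the parameter names are simply copied from the URL above:

```python
from urllib.parse import urlencode

# base address of the directory search page (taken from the URL above)
BASE = "http://www.umw.edu/directory/"

# hypothetical helper: build a search URL with every value escaped
def build_search_url(first, last):
    params = {
        "adeq": "%s %s" % (first, last),  # the space becomes "+"
        "ad_group[]": "",                 # "[]" becomes "%5B%5D"
        "search-choice": "people",
    }
    return BASE + "?" + urlencode(params)

print(build_search_url("john", "smith"))
```

For "john" and "smith" this produces the same URL the browser went to, so it can stand in for the format string wherever names might contain unusual characters.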

This can be done using Pycurl, a Python interface to the wonderful curl utility for transferring data across networks over a variety of protocols. Pycurl lets us make an HTTP request to the URL we constructed above from inside a program. To do this, we have to create a "callback" function to which Pycurl will send the results. A callback function is one that we write but that is called by code we did not write; here, Pycurl calls the function we specify each time it receives more data. We will make it a method of an "Address" class, which will store the HTTP response and pull the email out of it:


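import codecs
import pycurl

# a class for storing the email address
class Address:
    # contents will store the result of the HTTP request
    def __init__(self):
        self.contents = ""
        # an incremental decoder copes with a character split across two chunks
        self.decoder = codecs.getincrementaldecoder("utf-8")()

    # callback function used by curl when it gets more data
    def callback(self, buff):
        # curl hands us bytes, which we decode into a string using UTF-8
        self.contents += self.decoder.decode(buff)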

# a function to lookup a student email by first and last name
def lookup(first, last):
    # create an address and a pycurl object
    a = Address()
    c = pycurl.Curl()

    # build a URL for the directory for the student in question
    url = "http://www.umw.edu/directory/?adeq=%s+%s&ad_group%%5B%%5D=&search-choice=people" % (first, last)

    # setup Pycurl to make our HTTP request, and send results to the address
    c.setopt(c.URL, url)
    c.setopt(c.WRITEFUNCTION, a.callback)

    # download the page
    c.perform()
    c.close()

    # get the address (this function defined below)
    return a.get()

So now we can build a search URL dynamically and use it to send a request to the server, exactly like a web browser would. The response is the same HTML the browser would receive if you clicked the link.
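To see the callback idea without touching the network, here is a small sketch. Collector and feed_in_chunks are made-up names: feed_in_chunks plays the part of the library, deciding when to call our function and with how many bytes. The sketch uses an incremental decoder from the standard codecs module, because calling decode() on each chunk separately can fail when a multi-byte character straddles two chunks:

```python
import codecs

# hypothetical stand-in for a class that collects a response
class Collector:
    def __init__(self):
        self.contents = ""
        # holds on to a partial character until the rest of it arrives
        self.decoder = codecs.getincrementaldecoder("utf-8")()

    # called once per chunk of bytes
    def callback(self, buff):
        self.contents += self.decoder.decode(buff)

# hypothetical stand-in for the library: it decides when to call our function
def feed_in_chunks(data, on_chunk, size):
    for start in range(0, len(data), size):
        on_chunk(data[start:start + size])

page = "<p>café</p>".encode("utf-8")  # "é" takes two bytes in UTF-8
c = Collector()
feed_in_chunks(page, c.callback, 7)    # the split falls in the middle of "é"
print(c.contents)
```

We never call callback ourselves; we only hand it over, which is exactly the arrangement with WRITEFUNCTION.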

The last step is to search the text for an email address. Luckily this is not too hard, since email links in HTML begin with the string "mailto:", which is unlikely to appear anywhere else. We can add another method to the Address class to pull the email address out of the HTML page:


# pull the email address out of the downloaded page
# (this goes inside the Address class, indented like the other methods)
def get(self):
    # split the HTML text into a list based on new lines
    lines = self.contents.splitlines()

    # for each line of text
    for line in lines:

        # search for the mailto link
        idx = line.find("mailto:")

        # if it is found
        if idx != -1:
            # find the end which will be the end quote after the mailto
            end = line[idx:].find("\"")

            # return the text between the end of the mailto: (7 characters) and the end
            return line[idx:][7:end]

    # if no link found, return UNKNOWN as the email address
    return "UNKNOWN"
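
The same extraction can also be sketched with a regular expression, which avoids the index arithmetic. This is an alternative rather than the method used above; first_email is a made-up name, and the sample row is invented for illustration and is not the directory's real markup:

```python
import re

# "mailto:" followed by everything up to a double quote, "?", ">" or whitespace
MAILTO = re.compile(r'mailto:([^"?>\s]+)')

# hypothetical helper: return the first mailto address found in the text
def first_email(html):
    match = MAILTO.search(html)
    return match.group(1) if match else "UNKNOWN"

# made-up sample row with a made-up address
sample = '<td><a href="mailto:jsmith4@example.edu">John Smith</a></td>'
print(first_email(sample))
print(first_email("<p>No results found</p>"))
```

Stopping at "?" also drops any subject line that a mailto link might carry after the address.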

Now we have code to build a URL, download the page, and search it for the first email address that appears. We can now totally automate searching the UMW directory page. Of course, we wouldn't want to enter the names with input() one by one, which would hardly be easier than using the search page. Instead, we can roll through them from a file:


import fileinput
import time

# for all lines in input (either file argument or stdin if none)
for line in fileinput.input():
    # read the first and last name from the line
    first, last = tuple(line.split())

    # lookup the email address and print it
    email = lookup(first, last)
    print("%s %s %s" % (first, last, email))

    # wait for a second so as not to hammer the server too fast
    time.sleep(1)

The program would run faster without the sleep call at the end, but I don't know if IT would be happy with me requesting 50 pages from their web server inside of a second. I try to stay on their good side.
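The loop above expects exactly two words on every line, so a blank line or a middle name would stop it with a ValueError. A slightly more forgiving parser might look like the sketch below. parse_name is a made-up helper, and dropping middle names is only an assumption about what the directory search handles best:

```python
# hypothetical helper: turn one line of the file into (first, last), or None
def parse_name(line):
    parts = line.split()
    if len(parts) < 2:
        return None             # blank line or a single word: skip it
    return parts[0], parts[-1]  # keep first and last, ignore middle names

lines = ["John Smith\n", "\n", "Mary Ann Jones\n"]
for line in lines:
    print(parse_name(line))
```

In the main loop, a None result would simply mean moving on to the next line without making a request.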

I doubt that this particular program will be useful to anyone else, but the general technique of automating the querying of a web page is something that can be really helpful - this isn't the first time I've written a script like this. As programmers, we should be loath to submit to a repetitive task!