Introduction
Part 1 and Part 2 of this series showed that the web is a rich source of open data in formats such as CSV, JSON and XML. But how do we get data from these sources into a program? How do we deal with the formatting and extract what we need from the data? This final part of the series shows how these goals can be achieved using the Python programming language.
Note that Python 3 is used here, rather than the older (and increasingly obsolete) Python 2. Note also that other languages besides Python can do these things, with comparable ease in some cases. Python is used here because it is a good choice for such tasks and because it also happens to be my favourite programming language!
Reading Data From The Web
This can be done using the urlopen function from the urllib.request module in Python's standard library (see the official module documentation for full details). To make a basic HTTP GET request, which is typically what you'll need to access data, this function requires just one argument: the URL of the resource being accessed. It returns an HTTPResponse object from which we can read the data.
Calling the read method of the HTTPResponse object returns the data as a Python bytes object - essentially a string of bytes. No assumptions are made about the nature of the data. Thus, if you were expecting a text-based format such as CSV, XML or JSON, you must do the translation of bytes to text yourself by calling the decode method on the string of bytes. The decode method accepts an encoding scheme as an optional argument, defaulting to UTF-8 if one isn't specified. This default will be suitable in most cases.
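For illustration, here is a minimal sketch of both forms of the call (the variable name raw_bytes and the Latin-1 encoding are purely illustrative):

# Decode assuming the default UTF-8 encoding
text = raw_bytes.decode()
# Decode naming the encoding explicitly (hypothetical Latin-1 feed)
text = raw_bytes.decode("latin-1")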
All of this leads to code like the following example, which acquires CSV data for earthquakes from the USGS Earthquake Hazards Program website discussed in Part 2:
from urllib.request import urlopen
# Construct feed URL for M4.5+ quakes in the past 7 days
base = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/"
feed_url = base + "4.5_week.csv"
# Open URL and read text from it
source = urlopen(feed_url)
text = source.read().decode()
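As a side note, the HTTPResponse object returned by urlopen can also be used as a context manager, which closes the connection automatically when the block ends. A sketch equivalent to the last two lines above:

# Equivalent to the above, but the connection is closed automatically
with urlopen(feed_url) as source:
    text = source.read().decode()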
Handling CSV
You now have a program that reads one of the earthquake CSV data feeds into the program as a single string of text. Imagine that you wish to process this dataset in order to find the mean and standard deviation of earthquake depths. An examination of the data feed will tell you that the fourth column holds the depth values you need.
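If you would rather confirm the column layout from within a program, a quick check is to print the feed's header row. This sketch assumes the text variable produced by the earlier snippet:

# Print the header row; "depth" should appear as the fourth field
print(text.splitlines()[0])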
Extraction of the depth values can be done using the csv module from Python's standard library (see the official module documentation for full details). If you are running Python 3.4 or newer, the standard library also has a module called statistics that makes computation of the mean and standard deviation trivial.
One approach is to create a reader object that can scan the lines in the text string containing all of the data. Calling the splitlines method on the string will return these lines in a list, which is then used to create the reader object. Iterating over the reader object will give us first the column headings (if present), then each record from the dataset. The record will be a list of string values, since the reader object doesn't know how the data should be interpreted. Depths will therefore need to be converted into floating-point values before being collected into a list. This list of values can then be passed to functions from the statistics module that compute mean and standard deviation.
Suitable code to do all this is shown below. (The code to read the data has been omitted but you should imagine it to be at the location of the ...)
import csv
import statistics
...
# Create reader for the dataset
reader = csv.reader(text.splitlines())
# Read column headings (which are not used here)
headings = next(reader)
# Fetch each record and collect depths into a list
depths = []
for record in reader:
    # Depth is fourth value, at index 3
    # Value is a string and must be converted to a float
    depth = float(record[3])
    depths.append(depth)
# Compute mean & standard deviation of depths
mean = statistics.mean(depths)
stdev = statistics.stdev(depths, mean)
print("Mean =", mean)
print("Std dev =", stdev)
An alternative and slightly more user-friendly approach is to use a DictReader object. Unlike the normal reader object provided by the csv module, which gives you each record as a list of strings, a DictReader object will give you each record as a dictionary, in which the keys are the column headings. Here's an example generated from the earthquake data feed:
{'depth': '11.4', 'dmin': '2.379', 'time': '2014-04-24T03:10:12.880Z', 'updated': '2014-04-24T09:31:17.990Z', 'net': 'us', 'id': 'usb000px6r', 'nst': '', 'rms': '1.18', 'mag': '6.6', 'magType': 'mww', 'place': '94km S of Port Hardy, Canada', 'latitude': '49.8459', 'type': 'earthquake', 'longitude': '-127.444', 'gap': '41'}
If DictReader is used then the code needed to build a list of earthquake depths will change to something like this:
reader = csv.DictReader(text.splitlines())
depths = []
for record in reader:
    depth = float(record["depth"])
    depths.append(depth)
Note how "depth" is used to look up the depth value instead of an integer index. This makes the code a little easier to understand.
Handling JSON
This can be done using the json module from Python's standard library (see the official module documentation for full details). Let us consider how this module can be used to find artists who have been played more than ten times in the past week on BBC radio stations, using the JSON data feed that was mentioned in Part 2.
The first step is to read data from the feed into a single string of text, as discussed above. This string can then be passed to the loads function from the json module, which deserializes the JSON dataset, returning it as a dictionary. Here is the code that you need:
import json
from urllib.request import urlopen
feed_url = "http://www.bbc.co.uk/programmes/music/artists/charts.json"
# Open URL and read text from it
source = urlopen(feed_url)
text = source.read().decode()
# Deserialize the JSON data contained in the text
data = json.loads(text)
The dictionary will have the following format. (Note: this is real data, but records for only the first three artists are shown here.)
{
    "artists_chart" : {
        "artists" : [
            {
                "plays" : 17,
                "name" : "Drake",
                "previous_plays" : 15,
                "gid" : "9fff2f8a-21e6-47de-a2b8-7f449929d43f"
            },
            {
                "plays" : 16,
                "name" : "Nas",
                "previous_plays" : 4,
                "gid" : "cfbc0924-0035-4d6c-8197-f024653af823"
            },
            {
                "plays" : 15,
                "name" : "David Bowie",
                "previous_plays" : 12,
                "gid" : "5441c29d-3602-4898-b1a1-b77fa23b8e50"
            }
        ],
        "period" : "Past 7 days",
        "end" : "2014-04-24",
        "start" : "2014-04-17"
    }
}
You can see from this example that keys of "artists_chart" and "artists" are required in order to access the list of artist details. Each element of this list is itself a dictionary in which artist name and play count can be accessed using keys called "name" and "plays", respectively. This leads us to the following code:
...
artists = data["artists_chart"]["artists"]
for artist in artists:
    if artist["plays"] > 10:
        print(artist["name"], artist["plays"])
Simplifying Things With Requests
If you are willing and able to install third-party Python packages on your system, Kenneth Reitz's excellent Requests library can be used to simplify things considerably. Requests has a much cleaner API for issuing HTTP GET requests like those used in the preceding examples. It also greatly simplifies POST requests, file uploading and authentication. It even has built-in JSON deserialization capabilities.
Using Requests, the first six lines of code in the JSON example can be replaced with four simpler lines:
import requests
feed_url = "http://www.bbc.co.uk/programmes/music/artists/charts.json"
response = requests.get(feed_url)
data = response.json()
...
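Requests also makes it straightforward to check that a request actually succeeded before using the data. For example, calling raise_for_status on the response will raise an exception if the server reported an error:

response = requests.get(feed_url)
# Raise an exception if the request failed (4xx or 5xx status code)
response.raise_for_status()
data = response.json()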
That's All Folks!
The source code for this article's examples is available in a Bitbucket repository.
I hope you've found this series of articles useful; feel free to get in touch if you have questions or comments!