<?xml version="1.0" encoding="UTF-8"?>
<rss version='2.0' xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Nick Efford</title>
    <description></description>
    <link>https://nickefford.silvrback.com/feed</link>
    <atom:link href="https://nickefford.silvrback.com/feed" rel="self" type="application/rss+xml"/>
    <category domain="nickefford.silvrback.com">Content Management/Blog</category>
    <language>en-us</language>
      <pubDate>Thu, 24 Apr 2014 02:40:57 -1200</pubDate>
    <managingEditor>nick.efford@gmail.com (Nick Efford)</managingEditor>
      <item>
        <guid>https://nickefford.silvrback.com/programming-the-world-part-3#2660</guid>
          <pubDate>Thu, 24 Apr 2014 02:40:57 -1200</pubDate>
        <link>https://nickefford.silvrback.com/programming-the-world-part-3</link>
        <title>Programming The World (Part 3)</title>
        <description>Data Acquisition &amp; Processing</description>
        <content:encoded><![CDATA[<h1 id="introduction">Introduction</h1>

<p><a href="https://nickefford.silvrback.com/programming-the-world-part-1">Part 1</a> and <a href="https://nickefford.silvrback.com/programming-the-world-part-2">Part 2</a> of this series showed that the web is a rich source of open data in formats such as CSV, JSON and XML. But how do we get data from these sources into a program? How do we deal with the formatting and extract what we need from the data? This final part of the series shows how these goals can be achieved using the <a href="http://www.python.org">Python</a> programming language.</p>

<p>Note that Python 3 is used here, rather than the older (and <a href="https://plus.google.com/+CoreyGoldberg/posts/ZM3Tcswhaii">increasingly obsolete</a>) Python 2. Note also that other languages besides Python can do these things, with comparable ease in some cases. Python is used here because it is a good choice for such tasks and because it also happens to be my favourite programming language!</p>

<h1 id="reading-data-from-the-web">Reading Data From The Web</h1>

<p>This can be done using the <code>urlopen</code> function from the <code>urllib.request</code> module in Python&#39;s standard library (see the <a href="https://docs.python.org/3/library/urllib.request.html">official module documentation</a> for full details). To make a basic HTTP GET request, which is typically what you&#39;ll need to access data, this function requires just one argument: the URL of the resource being accessed. It returns an <code>HTTPResponse</code> object from which we can read the data.</p>

<p>Calling the <code>read</code> method of the <code>HTTPResponse</code> object returns the data as a Python <code>bytes</code> object - essentially a string of bytes. No assumptions are made about the nature of the data. Thus, if you are expecting a text-based format such as CSV, XML or JSON, you must translate the bytes to text yourself by calling the <code>decode</code> method on the string of bytes. The <code>decode</code> method accepts an encoding scheme as an optional argument, defaulting to <a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8</a> if one isn&#39;t specified. This default will be suitable in most cases.</p>

<p>All of this leads to code like the following example, which acquires CSV data for earthquakes from the <a href="http://earthquake.usgs.gov/">USGS Earthquake Hazards Program website</a> discussed in <a href="https://nickefford.silvrback.com/programming-the-world-part-2">Part 2</a>:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">urllib.request</span> <span class="k">import</span> <span class="n">urlopen</span>

<span class="c1"># Construct feed URL for M4.5+ quakes in the past 7 days</span>

<span class="n">base</span> <span class="o">=</span> <span class="s2">&quot;http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/&quot;</span>
<span class="n">feed_url</span> <span class="o">=</span> <span class="n">base</span> <span class="o">+</span> <span class="s2">&quot;4.5_week.csv&quot;</span>

<span class="c1"># Open URL and read text from it</span>

<span class="n">source</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="n">feed_url</span><span class="p">)</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">source</span><span class="o">.</span><span class="n">read</span><span class="p">()</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span>
</pre></div>
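<p>If the server declares a character set other than UTF-8, decoding with the default could fail or garble the text. A slightly more defensive sketch (the helper names here are my own invention) reads the charset declared in the response&#39;s Content-Type header via <code>get_content_charset</code>, falling back to UTF-8 only when none is given:</p>

```python
from urllib.request import urlopen

def decode_payload(raw, charset=None):
    # Translate raw bytes into text, defaulting to UTF-8 when the
    # server declared no charset
    return raw.decode(charset or "utf-8")

def read_text(url):
    # The response headers may name the charset the server actually used
    with urlopen(url) as source:
        return decode_payload(source.read(),
                              source.headers.get_content_charset())
```

<p>With this in place, <code>text = read_text(feed_url)</code> replaces the last two lines of the example above.</p>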
<h1 id="handling-csv">Handling CSV</h1>

<p>You now have a program that reads one of the earthquake CSV data feeds into a single string of text. Imagine that you wish to process this dataset in order to find the mean and standard deviation of earthquake depths. An examination of the data feed will tell you that the fourth column holds the depth values you need.</p>

<p><img alt="Earthquake CSV data imported into LibreOffice Calc" src="https://silvrback.s3.amazonaws.com/uploads/e341be0d-d166-406c-afb7-a6c93b2c7df2/quakecsv_large.png" /></p>

<p>Extraction of the depth values can be done using the <code>csv</code> module from Python&#39;s standard library (see the <a href="https://docs.python.org/3/library/csv.html">official module documentation</a> for full details). If you are running Python 3.4 or newer, the standard library also has a module called <code>statistics</code> that makes computation of the mean and standard deviation trivial.</p>

<p>One approach is to create a reader object that can scan the lines in the text string containing all of the data. Calling the <code>splitlines</code> method on the string will return these lines in a list, which is then used to create the reader object. Iterating over the reader object will give us first the column headings (if present), then each record from the dataset. The record will be a list of string values, since the reader object doesn&#39;t know how the data should be interpreted. Depths will therefore need to be converted into floating-point values before being collected into a list. This list of values can then be passed to functions from the <code>statistics</code> module that compute mean and standard deviation.</p>

<p>Suitable code to do all this is shown below. (The code that reads the data has been omitted; imagine it at the location of the <code>...</code>.)</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">csv</span>
<span class="kn">import</span> <span class="nn">statistics</span>

<span class="o">...</span>

<span class="c1"># Create reader for the dataset</span>

<span class="n">reader</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">reader</span><span class="p">(</span><span class="n">text</span><span class="o">.</span><span class="n">splitlines</span><span class="p">())</span>

<span class="c1"># Read column headings (which are not used here)</span>

<span class="n">headings</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">reader</span><span class="p">)</span>

<span class="c1"># Fetch each record and collect depths into a list</span>

<span class="n">depths</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">record</span> <span class="ow">in</span> <span class="n">reader</span><span class="p">:</span>
    <span class="c1"># Depth is fourth value, at index 3</span>
    <span class="c1"># Value is a string and must be converted to a float</span>
    <span class="n">depth</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">record</span><span class="p">[</span><span class="mi">3</span><span class="p">])</span>
    <span class="n">depths</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">depth</span><span class="p">)</span>

<span class="c1"># Compute mean &amp; standard deviation of depths</span>

<span class="n">mean</span> <span class="o">=</span> <span class="n">statistics</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">depths</span><span class="p">)</span>
<span class="n">stdev</span> <span class="o">=</span> <span class="n">statistics</span><span class="o">.</span><span class="n">stdev</span><span class="p">(</span><span class="n">depths</span><span class="p">,</span> <span class="n">mean</span><span class="p">)</span>

<span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Mean    =&quot;</span><span class="p">,</span> <span class="n">mean</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Std dev =&quot;</span><span class="p">,</span> <span class="n">stdev</span><span class="p">)</span>
</pre></div>
<p>An alternative and slightly more user-friendly approach is to use a <code>DictReader</code> object. Unlike the normal reader object provided by the <code>csv</code> module, which gives you each record as a list of strings, a <code>DictReader</code> object will give you each record as a dictionary, in which the keys are the column headings. Here&#39;s an example generated from the earthquake data feed:</p>
<div class="highlight"><pre><span></span><span class="p">{</span><span class="s1">&#39;depth&#39;</span><span class="p">:</span> <span class="s1">&#39;11.4&#39;</span><span class="p">,</span> <span class="s1">&#39;dmin&#39;</span><span class="p">:</span> <span class="s1">&#39;2.379&#39;</span><span class="p">,</span> <span class="s1">&#39;time&#39;</span><span class="p">:</span> <span class="s1">&#39;2014-04-24T03:10:12.880Z&#39;</span><span class="p">,</span> <span class="s1">&#39;updated&#39;</span><span class="p">:</span> <span class="s1">&#39;2014-04-24T09:31:17.990Z&#39;</span><span class="p">,</span> <span class="s1">&#39;net&#39;</span><span class="p">:</span> <span class="s1">&#39;us&#39;</span><span class="p">,</span> <span class="s1">&#39;id&#39;</span><span class="p">:</span> <span class="s1">&#39;usb000px6r&#39;</span><span class="p">,</span> <span class="s1">&#39;nst&#39;</span><span class="p">:</span> <span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="s1">&#39;rms&#39;</span><span class="p">:</span> <span class="s1">&#39;1.18&#39;</span><span class="p">,</span> <span class="s1">&#39;mag&#39;</span><span class="p">:</span> <span class="s1">&#39;6.6&#39;</span><span class="p">,</span> <span class="s1">&#39;magType&#39;</span><span class="p">:</span> <span class="s1">&#39;mww&#39;</span><span class="p">,</span> <span class="s1">&#39;place&#39;</span><span class="p">:</span> <span class="s1">&#39;94km S of Port Hardy, Canada&#39;</span><span class="p">,</span> <span class="s1">&#39;latitude&#39;</span><span class="p">:</span> <span class="s1">&#39;49.8459&#39;</span><span class="p">,</span> <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;earthquake&#39;</span><span class="p">,</span> <span class="s1">&#39;longitude&#39;</span><span class="p">:</span> <span class="s1">&#39;-127.444&#39;</span><span class="p">,</span> <span class="s1">&#39;gap&#39;</span><span class="p">:</span> <span 
class="s1">&#39;41&#39;</span><span class="p">}</span>
</pre></div>
<p>If <code>DictReader</code> is used then the code needed to build a list of earthquake depths will change to something like this:</p>
<div class="highlight"><pre><span></span><span class="n">reader</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictReader</span><span class="p">(</span><span class="n">text</span><span class="o">.</span><span class="n">splitlines</span><span class="p">())</span>

<span class="n">depths</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">record</span> <span class="ow">in</span> <span class="n">reader</span><span class="p">:</span>
    <span class="n">depth</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">record</span><span class="p">[</span><span class="s2">&quot;depth&quot;</span><span class="p">])</span>
    <span class="n">depths</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">depth</span><span class="p">)</span>
</pre></div>
<p>Note how &quot;depth&quot; is used to look up the depth value instead of an integer index. This makes the code a little easier to understand.</p>
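<p>One wrinkle worth guarding against: as the sample record above shows (see the empty <code>nst</code> value), fields in the feed can be empty strings, and <code>float(&quot;&quot;)</code> raises a <code>ValueError</code>. The sketch below uses a made-up three-row stand-in for the feed text, with a deliberately blank depth in the second row, and simply skips records whose depth field is empty:</p>

```python
import csv
import statistics

# A made-up three-row stand-in for the real feed text; the depth field
# of the second record is deliberately empty
text = """time,latitude,longitude,depth,mag
2014-04-24T03:10:12.880Z,49.8459,-127.444,11.4,6.6
2014-04-23T01:00:00.000Z,-11.451,162.0692,,7.4
2014-04-22T12:34:56.000Z,35.0,-118.0,8.2,4.7"""

reader = csv.DictReader(text.splitlines())

# Keep only records whose depth field is non-empty
depths = [float(record["depth"]) for record in reader if record["depth"]]

print(statistics.mean(depths))   # mean of 11.4 and 8.2
```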

<h1 id="handling-json">Handling JSON</h1>

<p>This can be done using the <code>json</code> module from Python&#39;s standard library (see the <a href="https://docs.python.org/3/library/json.html">official module documentation</a> for full details). Let us consider how this module can be used to find artists who have been played more than ten times in the past week on BBC radio stations, using the JSON data feed that was mentioned in <a href="https://nickefford.silvrback.com/programming-the-world-part-2">Part 2</a>.</p>

<p>The first step is to read data from the feed into a single string of text, as discussed above. This string can then be passed to the <code>loads</code> function from the <code>json</code> module, which <a href="http://en.wikipedia.org/wiki/Serialization">deserializes</a> the JSON dataset, returning it as a dictionary. Here is the code that you need:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">urllib.request</span> <span class="k">import</span> <span class="n">urlopen</span>

<span class="n">feed_url</span> <span class="o">=</span> <span class="s2">&quot;http://www.bbc.co.uk/programmes/music/artists/charts.json&quot;</span>

<span class="c1"># Open URL and read text from it</span>

<span class="n">source</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="n">feed_url</span><span class="p">)</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">source</span><span class="o">.</span><span class="n">read</span><span class="p">()</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span>

<span class="c1"># Deserialize the JSON data contained in the text</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</pre></div>
<p>The dictionary will have the following format. (Note: this is real data, but records for only the first three artists are shown here.)</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
  <span class="s2">&quot;artists_chart&quot;</span> <span class="p">:</span> <span class="p">{</span>
    <span class="s2">&quot;artists&quot;</span> <span class="p">:</span> <span class="p">[</span>
      <span class="p">{</span>
        <span class="s2">&quot;plays&quot;</span> <span class="p">:</span> <span class="mi">17</span><span class="p">,</span>
        <span class="s2">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;Drake&quot;</span><span class="p">,</span>
        <span class="s2">&quot;previous_plays&quot;</span> <span class="p">:</span> <span class="mi">15</span><span class="p">,</span>
        <span class="s2">&quot;gid&quot;</span> <span class="p">:</span> <span class="s2">&quot;9fff2f8a-21e6-47de-a2b8-7f449929d43f&quot;</span>
      <span class="p">},</span>
      <span class="p">{</span>
        <span class="s2">&quot;plays&quot;</span> <span class="p">:</span> <span class="mi">16</span><span class="p">,</span>
        <span class="s2">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;Nas&quot;</span><span class="p">,</span>
        <span class="s2">&quot;previous_plays&quot;</span> <span class="p">:</span> <span class="mi">4</span><span class="p">,</span>
        <span class="s2">&quot;gid&quot;</span> <span class="p">:</span> <span class="s2">&quot;cfbc0924-0035-4d6c-8197-f024653af823&quot;</span>
      <span class="p">},</span>
      <span class="p">{</span>
        <span class="s2">&quot;plays&quot;</span> <span class="p">:</span> <span class="mi">15</span><span class="p">,</span>
        <span class="s2">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;David Bowie&quot;</span><span class="p">,</span>
        <span class="s2">&quot;previous_plays&quot;</span> <span class="p">:</span> <span class="mi">12</span><span class="p">,</span>
        <span class="s2">&quot;gid&quot;</span> <span class="p">:</span> <span class="s2">&quot;5441c29d-3602-4898-b1a1-b77fa23b8e50&quot;</span>
      <span class="p">},</span>

    <span class="p">],</span>
    <span class="s2">&quot;period&quot;</span> <span class="p">:</span> <span class="s2">&quot;Past 7 days&quot;</span><span class="p">,</span>
    <span class="s2">&quot;end&quot;</span> <span class="p">:</span> <span class="s2">&quot;2014-04-24&quot;</span><span class="p">,</span>
    <span class="s2">&quot;start&quot;</span> <span class="p">:</span> <span class="s2">&quot;2014-04-17&quot;</span>
  <span class="p">}</span>
<span class="p">}</span>
</pre></div>
<p>You can see from this example that keys of &quot;artists_chart&quot; and &quot;artists&quot; are required in order to access the list of artist details. Each element of this list is itself a dictionary in which artist name and play count can be accessed using keys called &quot;name&quot; and &quot;plays&quot;, respectively.  This leads us to the following code:</p>
<div class="highlight"><pre><span></span><span class="o">...</span>

<span class="n">artists</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s2">&quot;artists_chart&quot;</span><span class="p">][</span><span class="s2">&quot;artists&quot;</span><span class="p">]</span>

<span class="k">for</span> <span class="n">artist</span> <span class="ow">in</span> <span class="n">artists</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">artist</span><span class="p">[</span><span class="s2">&quot;plays&quot;</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">10</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="n">artist</span><span class="p">[</span><span class="s2">&quot;name&quot;</span><span class="p">],</span> <span class="n">artist</span><span class="p">[</span><span class="s2">&quot;plays&quot;</span><span class="p">])</span>
</pre></div>
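<p>With the data in this form, other queries are just as easy. For example, the artists can be ranked by play count instead of filtered against a threshold. (The <code>data</code> dictionary below is a cut-down stand-in for the deserialized feed, keeping only the three artists shown earlier.)</p>

```python
# A cut-down stand-in for the deserialized feed
data = {
    "artists_chart": {
        "artists": [
            {"plays": 17, "name": "Drake", "previous_plays": 15},
            {"plays": 16, "name": "Nas", "previous_plays": 4},
            {"plays": 15, "name": "David Bowie", "previous_plays": 12},
        ]
    }
}

artists = data["artists_chart"]["artists"]

# Sort by play count, most-played first
for artist in sorted(artists, key=lambda a: a["plays"], reverse=True):
    print(artist["name"], artist["plays"])
```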
<h1 id="simplifying-things-with-requests">Simplifying Things With Requests</h1>

<p>If you are willing and able to install third-party Python packages on your system, Kenneth Reitz&#39;s excellent <a href="http://docs.python-requests.org/en/latest/">Requests</a> library can be used to simplify things considerably. Requests has a much cleaner API for issuing HTTP GET requests like those used in the preceding examples. It also greatly simplifies POST requests, file uploading and authentication. It even has built-in JSON deserialization capabilities.</p>

<p>Using Requests, the first six lines of code in the JSON example can be replaced with four simpler lines:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">requests</span>

<span class="n">feed_url</span> <span class="o">=</span> <span class="s2">&quot;http://www.bbc.co.uk/programmes/music/artists/charts.json&quot;</span>

<span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">feed_url</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">json</span><span class="p">()</span>

<span class="o">...</span>
</pre></div>
<h1 id="thats-all-folks">That&#39;s All Folks!</h1>

<p>The source code for this article&#39;s examples is available in a <a href="https://bitbucket.org/pythoneer/progworld3">Bitbucket repository</a>.</p>

<p>I hope you&#39;ve found this series of articles useful; feel free to get in touch if you have questions or comments!</p>
]]></content:encoded>
      </item>
      <item>
        <guid>https://nickefford.silvrback.com/programming-the-world-part-2#2659</guid>
          <pubDate>Mon, 14 Apr 2014 00:14:57 -1200</pubDate>
        <link>https://nickefford.silvrback.com/programming-the-world-part-2</link>
        <title>Programming The World (Part 2)</title>
        <description>Data From Other Sources</description>
        <content:encoded><![CDATA[<h1 id="introduction">Introduction</h1>

<p><a href="https://nickefford.silvrback.com/programming-the-world-part-1">Part 1</a> of this series looked at how devices in the &#39;Internet of Things&#39; can sense their surroundings and make sensor measurements available over the web in formats such as CSV, XML and JSON. The same formats are used to publish data from a variety of other sources. This article gives a few examples of these other sources.</p>

<h1 id="data-from-the-bbc">Data From The BBC</h1>

<p>The <a href="http://www.bbc.co.uk">British Broadcasting Corporation</a> is a public body established by Royal Charter. As such it has a number of obligations, among them the expectation that it will <a href="http://www.bbc.co.uk/aboutthebbc/insidethebbc/whoweare/publicpurposes/communication.html">deliver to the public the benefit of emerging communication technologies and services</a>. Part and parcel of this is a commitment to <a href="http://en.wikipedia.org/wiki/Linked_data">linked data</a> - which, in practice, means that the BBC is attempting to provide machine-readable data on its radio and TV programmes via the web.</p>

<p>Visit <a href="http://www.bbc.co.uk/programmes/developers">http://www.bbc.co.uk/programmes/developers</a> and you will see details of the BBC&#39;s approach. This page describes the addressing scheme that the BBC has devised for publishing programme data and provides links to a couple of examples: an <a href="http://www.bbc.co.uk/radio1/programmes/schedules/england.xml">XML data feed giving the schedule for Radio 1 in England</a> and a <a href="http://www.bbc.co.uk/tv/programmes/genres/drama/scifiandfantasy/schedules/upcoming.json">JSON data feed giving upcoming Sci-Fi programmes on TV</a>. Try both of these out now. The screenshot below shows a portion of XML data from the first of them, as displayed by the Chrome browser.</p>

<p><img alt="Screenshot of XML representing the BBC Radio 1 schedule" src="https://silvrback.s3.amazonaws.com/uploads/a89523cf-14de-44d4-a66b-7384350263c3/bbc2_large.png" /></p>

<p>Another interesting BBC data feed is this one, which breaks down airplay by artist across the BBC&#39;s radio stations:</p>

<p><a href="http://www.bbc.co.uk/programmes/music/artists/charts.json">http://www.bbc.co.uk/programmes/music/artists/charts.json</a></p>

<p>(The same data can be retrieved as XML simply by replacing <code>json</code> with <code>xml</code> in the URL above.)</p>

<h1 id="earth-environmental-data">Earth &amp; Environmental Data</h1>

<h3 id="the-met-office">The Met Office</h3>

<p><a href="http://www.metoffice.gov.uk">The Met Office</a> is the UK&#39;s national weather forecasting service.  It is currently beta-testing a service called <a href="http://www.metoffice.gov.uk/datapoint/">DataPoint</a>, which it describes thus:</p>

<blockquote>
<p>DataPoint is a way of accessing freely available Met Office data feeds in a format that is suitable for application developers. It is aimed at professionals, the scientific community and student or amateur developers, in fact anyone looking to re-use Met Office data within their own innovative applications.</p>
</blockquote>

<p>DataPoint offers a wide range of useful meteorological data. For example, it provides a five-day forecast of temperature, wind speed &amp; direction, precipitation and other variables for specific locations in the UK, either as a visualisation like that shown below or as raw data in XML or JSON formats.</p>

<p><img alt="Met Office weather forecast" src="https://silvrback.s3.amazonaws.com/uploads/0d4e1c90-573b-448c-98b2-0592d5e56db8/metoffice_large.png" /></p>

<p><img alt="Map overlays from Met Office DataPoint" class="sb_float" src="https://silvrback.s3.amazonaws.com/uploads/1e059073-e6a8-4fb2-8b79-194567e43430/combined2_large.png" /></p>

<p>DataPoint also provides map layers in the <a href="http://en.wikipedia.org/wiki/Portable_Network_Graphics">PNG image format</a> for both weather forecasts and actual observations. Forecast layers show cloud cover, rainfall, temperature and pressure as isobars. Observation layers show rainfall, lightning storms, and satellite images in the visible and IR regions of the spectrum. Layer retrieval is a two-stage process in which you must first request details of all the available layers, in either XML or JSON formats. This information can then be used to construct the specific URL of the desired layer. The example shown here is a composite of a visible-spectrum satellite image with layers showing forecasted rainfall and pressure.</p>

<p>One important thing to note about DataPoint is that users must register with the service in order to obtain an <strong>API key</strong>. This is a unique string of characters that identifies you as a legitimate user of the service.  It is used for authentication purposes and to track your usage of the service. All requests made to DataPoint must include your API key.</p>

<p>API keys are actually a fairly common requirement for use of web services. ThingSpeak, discussed in <a href="https://nickefford.silvrback.com/programming-the-world-part-1">Part 1</a>, requires one. Many services will provide an API key for free but will limit the number of times that you can invoke the service free of charge; for example, <a href="https://developer.forecast.io/">forecast.io</a> will allow you to make up to 1,000 API calls per day for free but will charge you $1 per 10,000 calls thereafter.</p>
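<p>In practice, supplying an API key usually amounts to adding a query parameter to the request URL. The sketch below builds such a URL with <code>urlencode</code> from Python&#39;s standard library; the location ID and the <code>res</code> and <code>key</code> parameters are illustrative placeholders modelled on DataPoint&#39;s documented URL pattern, so check the DataPoint documentation for the exact scheme:</p>

```python
from urllib.parse import urlencode

# Illustrative values only: the location ID (3840) and the "res" and
# "key" parameters are placeholders modelled on DataPoint's URL pattern
base = "http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/json/3840"
params = {"res": "3hourly", "key": "YOUR-API-KEY"}

url = base + "?" + urlencode(params)
print(url)
```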

<h3 id="usgs-earthquake-hazards-program">USGS Earthquake Hazards Program</h3>

<p>The <a href="http://earthquake.usgs.gov">United States Geological Survey&#39;s Earthquake Hazards Program</a> is one of my favourite data source examples. Their website provides comprehensive <a href="http://earthquake.usgs.gov/earthquakes/feed/v1.0/">real-time feeds</a> of seismological data in a variety of different formats, as the screenshot below illustrates.</p>

<p><img alt="USGS earthquake data feeds" src="https://silvrback.s3.amazonaws.com/uploads/06418fdd-8a2c-4089-b7a1-e91828f96efd/usgs_large.png" /></p>

<p>The <a href="http://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php"><em>Spreadsheet Applications</em></a> link on this page takes you to another page containing various CSV data feeds. The <a href="http://earthquake.usgs.gov/earthquakes/feed/v1.0/atom.php"><em>Atom Syndication</em></a> and <a href="http://earthquake.usgs.gov/earthquakes/feed/v1.0/quakeml.php"><em>QuakeML</em></a> links are for two different XML-based formats, the former being for consumption by <a href="http://en.wikipedia.org/wiki/RSS_reader">RSS readers</a> and the latter for professional geoscientists. The <a href="http://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php"><em>Programmatic Access</em></a> link is for a JSON-based format called GeoJSON.</p>

<p><img alt="Links for earthquake data feeds" class="sb_float" src="https://silvrback.s3.amazonaws.com/uploads/fe58d819-8c47-4b65-9779-1d6f0e9f8e33/feeds_large.png" /></p>

<p>For each format, feeds are grouped by time, covering the past hour, past day, past 7 days and past 30 days. In each of these groups there is an &#39;all earthquakes&#39; feed plus separate feeds for different levels of severity, covering &#39;significant&#39; earthquakes and those with magnitudes of 4.5 or more, 2.5 or more, and 1.0 or more.</p>

<p>The quantity of data that you obtain from these feeds will very much depend on which one you choose; for example, the feed for significant earthquakes occurring in the past hour will be empty most of the time, whereas the feed for all earthquakes from the last 30 days will typically give you many thousands of events each time that you access it.</p>

<p>A GeoJSON feed is a list of seismic events, each of which is represented as shown below. A <a href="http://earthquake.usgs.gov/earthquakes/feed/v1.0/glossary.php">glossary</a> explains what the various data fields mean. (A few of them have been omitted in the interests of clarity.) This particular example is for the magnitude 7.4 quake that occurred near the Solomon Islands on 13 April 2014.</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
  <span class="nt">&quot;type&quot;</span><span class="p">:</span> <span class="s2">&quot;Feature&quot;</span><span class="p">,</span>
  <span class="nt">&quot;properties&quot;</span><span class="p">:</span> <span class="p">{</span>
    <span class="nt">&quot;mag&quot;</span><span class="p">:</span> <span class="mf">7.4</span><span class="p">,</span>
    <span class="nt">&quot;place&quot;</span><span class="p">:</span> <span class="s2">&quot;111km S of Kirakira, Solomon Islands&quot;</span><span class="p">,</span>
    <span class="nt">&quot;time&quot;</span><span class="p">:</span> <span class="mi">1397392578710</span><span class="p">,</span>
    <span class="nt">&quot;updated&quot;</span><span class="p">:</span> <span class="mi">1397421536312</span><span class="p">,</span>
    <span class="nt">&quot;tz&quot;</span><span class="p">:</span> <span class="mi">660</span><span class="p">,</span>
    <span class="nt">&quot;felt&quot;</span><span class="p">:</span> <span class="kc">null</span><span class="p">,</span>
    <span class="nt">&quot;cdi&quot;</span><span class="p">:</span> <span class="kc">null</span><span class="p">,</span>
    <span class="nt">&quot;mmi&quot;</span><span class="p">:</span> <span class="mf">7.51</span><span class="p">,</span>
    <span class="nt">&quot;alert&quot;</span><span class="p">:</span> <span class="s2">&quot;green&quot;</span><span class="p">,</span>
    <span class="nt">&quot;status&quot;</span><span class="p">:</span> <span class="s2">&quot;reviewed&quot;</span><span class="p">,</span>
    <span class="nt">&quot;tsunami&quot;</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
    <span class="nt">&quot;sig&quot;</span><span class="p">:</span> <span class="mi">842</span><span class="p">,</span>
    <span class="nt">&quot;net&quot;</span><span class="p">:</span> <span class="s2">&quot;us&quot;</span><span class="p">,</span>
    <span class="nt">&quot;nst&quot;</span><span class="p">:</span> <span class="kc">null</span><span class="p">,</span>
    <span class="nt">&quot;dmin&quot;</span><span class="p">:</span> <span class="mf">2.89</span><span class="p">,</span>
    <span class="nt">&quot;rms&quot;</span><span class="p">:</span> <span class="mf">1.06</span><span class="p">,</span>
    <span class="nt">&quot;gap&quot;</span><span class="p">:</span> <span class="mi">17</span><span class="p">,</span>
    <span class="nt">&quot;magType&quot;</span><span class="p">:</span> <span class="s2">&quot;mww&quot;</span><span class="p">,</span>
    <span class="nt">&quot;type&quot;</span><span class="p">:</span> <span class="s2">&quot;earthquake&quot;</span><span class="p">,</span>
    <span class="nt">&quot;title&quot;</span><span class="p">:</span> <span class="s2">&quot;M 7.4 - 111km S of Kirakira, Solomon Islands&quot;</span>
  <span class="p">},</span>
  <span class="nt">&quot;geometry&quot;</span><span class="p">:</span> <span class="p">{</span>
    <span class="nt">&quot;type&quot;</span><span class="p">:</span> <span class="s2">&quot;Point&quot;</span><span class="p">,</span>
    <span class="nt">&quot;coordinates&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mf">162.0692</span><span class="p">,</span> <span class="mf">-11.451</span><span class="p">,</span> <span class="mi">35</span><span class="p">]</span>
  <span class="p">},</span>
  <span class="nt">&quot;id&quot;</span><span class="p">:</span> <span class="s2">&quot;usc000piqj&quot;</span>
<span class="p">}</span>
</pre></div>
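<p>A record like this is straightforward to pick apart with Python&#39;s standard <code>json</code> module. A minimal sketch, using a hand-trimmed fragment of the record above (the full record carries many more properties than are shown here):</p>

```python
import json

# A hand-trimmed fragment of the USGS GeoJSON record shown above
record = """{
  "properties": {
    "alert": "green",
    "tsunami": 1,
    "type": "earthquake",
    "title": "M 7.4 - 111km S of Kirakira, Solomon Islands"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [162.0692, -11.451, 35]
  },
  "id": "usc000piqj"
}"""

quake = json.loads(record)

# GeoJSON point coordinates are given as [longitude, latitude, depth]
lon, lat, depth = quake["geometry"]["coordinates"]
print(quake["properties"]["title"])
print(lon, lat, depth)
```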
<h1 id="open-data-initiatives">Open Data Initiatives</h1>

<p><a href="http://data.gov.uk">Data.gov.uk</a> is at the heart of the UK government&#39;s Transparency agenda and currently (April 2014) makes almost 14,000 datasets available to the public. You can search for a dataset by keyword or conduct geographic searches based on postcode, latitude &amp; longitude or a rectangular region dragged out on a map. You can also drill down via menus that classify datasets by licence, theme or format. CSV and XML are well represented, but there are comparatively few JSON datasets currently available. Note that &#39;open&#39; does not necessarily imply &#39;easily machine-readable&#39;; some of the datasets are provided only as Excel spreadsheets, PDF files or Microsoft Word documents, for example - formats that can be much harder to process using software.</p>

<p>The screenshot below shows the most popular health-related CSV datasets available from the site.  You can view this page yourself by visiting</p>

<p><a href="http://data.gov.uk/data/search?theme-primary=Health&res_format=CSV">http://data.gov.uk/data/search?theme-primary=Health&amp;res_format=CSV</a></p>

<p><img alt="CSV health data from data.gov.uk" src="https://silvrback.s3.amazonaws.com/uploads/98d5c8ba-5d23-42cf-876a-6588bc4b7ce0/gov_large.png" /></p>

<p>Open data is having an impact at the local level, too. A good example is the recently established <a href="http://leedsdatamill.org">Leeds Data Mill</a>, promoted as &quot;a place for organisations to share their open data to change the way we live, work and play in the city&quot;.</p>

<p>Leeds Data Mill&#39;s small but growing collection of datasets includes <a href="http://leedsdatamill.org/dataset/council-car-parks">locations and number of available spaces in council car parks</a>, <a href="http://www.leedsdatamill.org/dataset/roadworks">details of completed and live roadworks in the city</a> and <a href="http://leedsdatamill.org/dataset/leeds-city-centre-footfall-data">footfall data for eight locations in the city centre</a>.</p>

<p><img alt="Leeds roadworks data from Leeds Data Mill" src="https://silvrback.s3.amazonaws.com/uploads/abaad5cd-a661-4bcc-937a-8a1b4ef2209f/leeds_large.png" /></p>

<p><em>Continued in <a href="https://nickefford.silvrback.com/programming-the-world-part-3">Part 3</a>...</em></p>
]]></content:encoded>
      </item>
      <item>
        <guid>https://nickefford.silvrback.com/programming-the-world-part-1#2644</guid>
          <pubDate>Mon, 07 Apr 2014 03:36:53 -1200</pubDate>
        <link>https://nickefford.silvrback.com/programming-the-world-part-1</link>
        <title>Programming The World (Part 1)</title>
        <description>Data From Devices</description>
        <content:encoded><![CDATA[<h1 id="introduction">Introduction</h1>

<p>Our world is becoming increasingly programmable, due to a number of emerging trends. One trend is that increasing quantities of useful public data are being made available over the web in machine-readable forms. Another is that many of the devices around us are becoming &#39;smart&#39; and connected, capable of feeding real-time information on their surroundings into the web and (to a more limited extent) of reacting in response to commands issued via the web. Then there&#39;s the fact that many of us these days carry smartphones: powerful computers with a near-permanent (depending on service provider) Internet connection. We therefore don&#39;t have to be sat in front of a PC to interact with this brave new world of data and devices.</p>

<p>This article, the first in a three-part series, looks at how data from devices becomes web-accessible and considers the different data formats that are commonly used. <a href="https://nickefford.silvrback.com/programming-the-world-part-2">Part 2</a> surveys some open data sources. <a href="https://nickefford.silvrback.com/programming-the-world-part-3">Part 3</a> explores how we can write programs in <a href="http://www.python.org">Python</a> to acquire and process data from these sources.</p>

<p><em>Note: these three articles are aimed at people who have some experience of Python programming but who don&#39;t have much familiarity with data sources or data formats. The articles are based on material originally delivered as a workshop for IT teachers, with the aim of showing them some interesting projects that their students might do once they have learned a bit of Python.</em></p>

<h1 id="the-internet-of-things">The &#39;Internet of Things&#39;</h1>

<p>Advances in networking technology and falling hardware costs have resulted in a proliferation of small devices that sense their environment and make these measurements available over the web. One such device is the Kickstarter-funded <a href="http://supermechanical.com/twine">Twine</a>.</p>

<p><img alt="Twine" class="sb_float" src="https://silvrback.s3.amazonaws.com/uploads/4e1be365-b482-4738-ba3a-7195c88bb87c/twine_medium.jpg" /></p>

<p>A Twine contains internal sensors for temperature, orientation and acceleration. You can also connect external moisture sensors and magnetic reed switches produced by Twine&#39;s manufacturer, or sensors of your own design via a special &#39;breakout board&#39;.</p>

<p>A Twine is programmed with rules based on data from its sensors and uses a Wi-Fi connection to issue notifications via email, SMS or HTTP when these rules trigger. This Wi-Fi connection is also used to update the device with new or modified rules, which are programmed in a visual manner via a straightforward web-based interface.</p>

<p><img alt="Twine rule development" src="https://silvrback.s3.amazonaws.com/uploads/7a6bf34c-3ff1-4903-935a-4285f66a9931/rule_large.png" /></p>

<p>Notice in the screenshot above how this particular Twine has been programmed to send data to a web API hosted at thingspeak.com. <a href="http://thingspeak.com">ThingSpeak</a> promotes itself as an &quot;open application platform designed to enable meaningful connections between things and people&quot;. Once you&#39;ve registered with ThingSpeak, you can set up public or private <strong>channels</strong> for your devices, through which data are made available for visualisation or downloading.</p>

<p>Why not try this out now? Head over to <a href="https://thingspeak.com/channels/public">https://thingspeak.com/channels/public</a> to see a listing of some of ThingSpeak&#39;s public channels. Click on the link to one of these channels to see the data feed visualised, then click on the <em>Developer Info</em> tab at the top-right to see the formats in which you can download data from this channel.</p>

<p><img alt="Data for a ThingSpeak channel" src="https://silvrback.s3.amazonaws.com/uploads/304c356c-1985-4a7c-a29b-6c0fe1301ebe/thingspeak_large.png" /></p>

<p>Notice the links to three different formats: JSON, XML, CSV. Try clicking on these links to view the data. (Depending on your browser, you might see the data displayed in the browser window or it might be treated as a downloaded file; if the latter, just open the file in a text editor to view the data.)</p>

<h1 id="data-formats">Data Formats</h1>

<h3 id="csv">CSV</h3>

<p>&#39;Comma-Separated Values&#39; format is the simplest of the three formats offered by ThingSpeak, best suited to data that are tabular in nature. One big reason for its popularity is that spreadsheet applications such as Excel or <a href="https://www.libreoffice.org/discover/calc/">LibreOffice Calc</a> can open CSV files.</p>

<p>The first few lines of the CSV data for the ThingSpeak feed in the screenshot above look like this:</p>
<div class="highlight"><pre><span></span>created_at,entry_id,field1
2014-04-03 11:47:02 UTC,9005,14.375
2014-04-03 12:02:07 UTC,9006,13.75
2014-04-03 12:17:11 UTC,9007,13
2014-04-03 12:32:18 UTC,9008,13.125
</pre></div>
<p>This is a dataset with three columns, representing a timestamp, a unique identifier for the measurement and the measurement itself (a temperature in this case). A comma is used to separate the values in each column. (If the value itself contains a comma, this must be protected in some way - e.g., by enclosing the entire value in quotes.) The first line contains the column headings.</p>

<p>For very uniform data where all the records have the same structure, CSV is a good choice, not least because it has a very good <strong>data-to-markup ratio</strong>. In this example, the markup consists of the first line and then only two commas on each subsequent line. Most of the text is useful data.</p>
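<p>Data in this format can be read with Python&#39;s standard <code>csv</code> module, which takes care of splitting each line and of any quoted values. A minimal sketch, parsing the feed above from a string (in practice the text would come from a downloaded file):</p>

```python
import csv
import io

# The first few lines of the ThingSpeak CSV feed shown above
text = """created_at,entry_id,field1
2014-04-03 11:47:02 UTC,9005,14.375
2014-04-03 12:02:07 UTC,9006,13.75
2014-04-03 12:17:11 UTC,9007,13
2014-04-03 12:32:18 UTC,9008,13.125
"""

# DictReader uses the first line as the column headings
rows = list(csv.DictReader(io.StringIO(text)))
temperatures = [float(row["field1"]) for row in rows]
print(temperatures)  # [14.375, 13.75, 13.0, 13.125]

# A value containing a comma must be quoted; the csv module copes with that too
row = next(csv.reader(io.StringIO('"Leeds, UK",123')))
print(row)  # ['Leeds, UK', '123']
```

<p>Because <code>DictReader</code> maps each row onto the column headings, the code stays readable even if the order of the columns changes.</p>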

<h3 id="xml">XML</h3>

<p>Extensible Markup Language (XML) is very flexible because it allows you to define your own <strong>elements</strong> that describe the data. Most (though not all) elements enclose data within a <strong>start tag</strong> and <strong>end tag</strong> - for example, <code>&lt;name&gt;</code> and <code>&lt;/name&gt;</code>. Attributes can also be associated with an element if required, using a &#39;key=value&#39; format - for example, <code>&lt;latitude type=&quot;decimal&quot;&gt;...&lt;/latitude&gt;</code>.</p>

<p>The XML data for the ThingSpeak feed in the screenshot above looks like this:</p>
<div class="highlight"><pre><span></span><span class="cp">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;</span>
<span class="nt">&lt;channel&gt;</span>
  <span class="nt">&lt;id</span> <span class="na">type=</span><span class="s">&quot;integer&quot;</span><span class="nt">&gt;</span>135<span class="nt">&lt;/id&gt;</span>
  <span class="nt">&lt;name&gt;</span>Thermometer<span class="nt">&lt;/name&gt;</span>
  <span class="nt">&lt;description&gt;</span>
    Wireless outdoor thermometer
    (Electric Imp, TI TMP102 sensor, 4 x AA Energizer L91).
  <span class="nt">&lt;/description&gt;</span>
  <span class="nt">&lt;latitude</span> <span class="na">type=</span><span class="s">&quot;decimal&quot;</span><span class="nt">&gt;</span>55.652072<span class="nt">&lt;/latitude&gt;</span>
  <span class="nt">&lt;longitude</span> <span class="na">type=</span><span class="s">&quot;decimal&quot;</span><span class="nt">&gt;</span>12.546301<span class="nt">&lt;/longitude&gt;</span>
  <span class="nt">&lt;field1&gt;</span>Temperature<span class="nt">&lt;/field1&gt;</span>
  <span class="nt">&lt;created-at</span> <span class="na">type=</span><span class="s">&quot;dateTime&quot;</span><span class="nt">&gt;</span>2011-02-23T22:43:37Z<span class="nt">&lt;/created-at&gt;</span>
  <span class="nt">&lt;updated-at</span> <span class="na">type=</span><span class="s">&quot;dateTime&quot;</span><span class="nt">&gt;</span>2014-04-04T11:22:55Z<span class="nt">&lt;/updated-at&gt;</span>
  <span class="nt">&lt;elevation&gt;</span>20m<span class="nt">&lt;/elevation&gt;</span>
  <span class="nt">&lt;last-entry-id</span> <span class="na">type=</span><span class="s">&quot;integer&quot;</span><span class="nt">&gt;</span>9092<span class="nt">&lt;/last-entry-id&gt;</span>
  <span class="nt">&lt;feeds</span> <span class="na">type=</span><span class="s">&quot;array&quot;</span><span class="nt">&gt;</span>
    <span class="nt">&lt;feed&gt;</span>
      <span class="nt">&lt;created-at</span> <span class="na">type=</span><span class="s">&quot;dateTime&quot;</span><span class="nt">&gt;</span>2014-04-03T11:47:02Z<span class="nt">&lt;/created-at&gt;</span>
      <span class="nt">&lt;entry-id</span> <span class="na">type=</span><span class="s">&quot;integer&quot;</span><span class="nt">&gt;</span>9005<span class="nt">&lt;/entry-id&gt;</span>
      <span class="nt">&lt;field1&gt;</span>14.375<span class="nt">&lt;/field1&gt;</span>
      <span class="nt">&lt;id</span> <span class="na">type=</span><span class="s">&quot;integer&quot;</span> <span class="na">nil=</span><span class="s">&quot;true&quot;</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;/feed&gt;</span>
    <span class="nt">&lt;feed&gt;</span>
      <span class="nt">&lt;created-at</span> <span class="na">type=</span><span class="s">&quot;dateTime&quot;</span><span class="nt">&gt;</span>2014-04-03T12:02:07Z<span class="nt">&lt;/created-at&gt;</span>
      <span class="nt">&lt;entry-id</span> <span class="na">type=</span><span class="s">&quot;integer&quot;</span><span class="nt">&gt;</span>9006<span class="nt">&lt;/entry-id&gt;</span>
      <span class="nt">&lt;field1&gt;</span>13.75<span class="nt">&lt;/field1&gt;</span>
      <span class="nt">&lt;id</span> <span class="na">type=</span><span class="s">&quot;integer&quot;</span> <span class="na">nil=</span><span class="s">&quot;true&quot;</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;/feed&gt;</span>

  <span class="nt">&lt;/feeds&gt;</span>
<span class="nt">&lt;/channel&gt;</span>
</pre></div>
<p>This lengthy example includes just two of the measurements from the data feed! To be fair, this is partly because XML&#39;s flexibility means that other information can be included besides the measurements themselves - for example, a description of the sensor and its latitude, longitude and elevation. However, even if you ignore all this extra information, the part dealing with the measurements themselves still occupies <em>over five times</em> as much space as the equivalent CSV text!</p>

<p>This inherent verbosity is one of the main drawbacks of using XML. Another is that processing XML with a program is more difficult than processing CSV. Fortunately, there are libraries of code for all common programming languages that will parse XML for you. In many cases, these libraries are a standard part of the language.</p>
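<p>In Python, for example, the standard library&#39;s <code>xml.etree.ElementTree</code> module will do the parsing. A minimal sketch, using a shortened version of the feed above (XML declaration omitted, as <code>fromstring</code> expects plain text):</p>

```python
import xml.etree.ElementTree as ET

# A shortened version of the ThingSpeak XML feed shown above
xml_text = """<channel>
  <name>Thermometer</name>
  <feeds type="array">
    <feed>
      <created-at type="dateTime">2014-04-03T11:47:02Z</created-at>
      <entry-id type="integer">9005</entry-id>
      <field1>14.375</field1>
    </feed>
    <feed>
      <created-at type="dateTime">2014-04-03T12:02:07Z</created-at>
      <entry-id type="integer">9006</entry-id>
      <field1>13.75</field1>
    </feed>
  </feeds>
</channel>"""

root = ET.fromstring(xml_text)
print(root.find("name").text)          # Thermometer
print(root.find("feeds").get("type"))  # array

# Walk every <feed> element, pulling out the enclosed text
for feed in root.iter("feed"):
    when = feed.find("created-at").text
    temp = float(feed.find("field1").text)
    print(when, temp)
```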

<h3 id="json">JSON</h3>

<p>JavaScript Object Notation (JSON) is a less formal alternative to XML, providing similar flexibility and descriptive capabilities but with reduced verbosity and a much improved data-to-markup ratio.</p>

<p>The JSON data for the ThingSpeak feed looks like this:</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
  <span class="nt">&quot;channel&quot;</span><span class="p">:</span> <span class="p">{</span>
    <span class="nt">&quot;id&quot;</span><span class="p">:</span> <span class="mi">135</span><span class="p">,</span>
    <span class="nt">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;Thermometer&quot;</span><span class="p">,</span>
    <span class="nt">&quot;description&quot;</span><span class="p">:</span> <span class="s2">&quot;Wireless outdoor thermometer (Electric Imp, TI TMP102 sensor, 4 x AA Energizer L91).&quot;</span><span class="p">,</span>
    <span class="nt">&quot;latitude&quot;</span><span class="p">:</span> <span class="s2">&quot;55.652072&quot;</span><span class="p">,</span>
    <span class="nt">&quot;longitude&quot;</span><span class="p">:</span> <span class="s2">&quot;12.546301&quot;</span><span class="p">,</span>
    <span class="nt">&quot;field1&quot;</span><span class="p">:</span> <span class="s2">&quot;Temperature&quot;</span><span class="p">,</span>
    <span class="nt">&quot;created_at&quot;</span><span class="p">:</span> <span class="s2">&quot;2011-02-23T22:43:37Z&quot;</span><span class="p">,</span>
    <span class="nt">&quot;updated_at&quot;</span><span class="p">:</span> <span class="s2">&quot;2014-04-04T11:22:55Z&quot;</span><span class="p">,</span>
    <span class="nt">&quot;elevation&quot;</span><span class="p">:</span> <span class="s2">&quot;20m&quot;</span><span class="p">,</span>
    <span class="nt">&quot;last_entry_id&quot;</span><span class="p">:</span> <span class="mi">9092</span>
  <span class="p">},</span>
  <span class="nt">&quot;feeds&quot;</span><span class="p">:</span> <span class="p">[</span>
    <span class="p">{</span>
      <span class="nt">&quot;created_at&quot;</span><span class="p">:</span> <span class="s2">&quot;2014-04-03T11:47:02Z&quot;</span><span class="p">,</span>
      <span class="nt">&quot;entry_id&quot;</span><span class="p">:</span> <span class="mi">9005</span><span class="p">,</span>
      <span class="nt">&quot;field1&quot;</span><span class="p">:</span> <span class="s2">&quot;14.375&quot;</span>
    <span class="p">},</span>
    <span class="p">{</span>
      <span class="nt">&quot;created_at&quot;</span><span class="p">:</span> <span class="s2">&quot;2014-04-03T12:02:07Z&quot;</span><span class="p">,</span>
      <span class="nt">&quot;entry_id&quot;</span><span class="p">:</span> <span class="mi">9006</span><span class="p">,</span>
      <span class="nt">&quot;field1&quot;</span><span class="p">:</span> <span class="s2">&quot;13.75&quot;</span>
    <span class="p">}</span>

  <span class="p">]</span>
<span class="p">}</span>
</pre></div>
<p>The use of name-value pairs rather than start and end tags helps to reduce the storage requirements considerably. The temperature measurements in this data feed occupy half the space of those in the XML data feed.</p>
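<p>Processing the feed in Python is equally direct: the standard <code>json</code> module turns the whole thing into nested dictionaries and lists. A minimal sketch, using a shortened version of the data above:</p>

```python
import json

# A shortened version of the ThingSpeak JSON feed shown above
text = """{
  "channel": {
    "id": 135,
    "name": "Thermometer",
    "field1": "Temperature"
  },
  "feeds": [
    {"created_at": "2014-04-03T11:47:02Z", "entry_id": 9005, "field1": "14.375"},
    {"created_at": "2014-04-03T12:02:07Z", "entry_id": 9006, "field1": "13.75"}
  ]
}"""

data = json.loads(text)
print(data["channel"]["name"])  # Thermometer

# Note that the field1 values arrive as strings and need converting to numbers
temperatures = [float(feed["field1"]) for feed in data["feeds"]]
print(temperatures)             # [14.375, 13.75]
```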

<p><em>Continued in <a href="https://nickefford.silvrback.com/programming-the-world-part-2">Part 2</a>...</em></p>
]]></content:encoded>
      </item>
  </channel>
</rss>