Our world is becoming increasingly programmable, due to a number of emerging trends. One trend is that increasing quantities of useful public data are being made available over the web in machine-readable forms. Another is that many of the devices around us are becoming 'smart' and connected, capable of feeding real-time information on their surroundings into the web and (to a more limited extent) of reacting in response to commands issued via the web. Then there's the fact that many of us these days carry smartphones: powerful computers with a near-permanent (depending on service provider) Internet connection. We therefore don't have to be sat in front of a PC to interact with this brave new world of data and devices.
This article, the first in a three-part series, looks at how data from devices becomes web-accessible and considers the different data formats that are commonly used. Part 2 surveys some open data sources. Part 3 explores how we can write programs in Python to acquire and process data from these sources.
Note: these three articles are aimed at people who have some experience of Python programming but who don't have much familiarity with data sources or data formats. The articles are based on material originally delivered as a workshop for IT teachers, with the aim of showing them some interesting projects that their students might do once they have learned a bit of Python.
The 'Internet of Things'
Advances in networking technology and falling hardware costs have resulted in a proliferation of small devices that sense their environment and make these measurements available over the web. One such device is the Kickstarter-funded Twine.
A Twine contains internal sensors for temperature, orientation and acceleration. You can also connect external moisture sensors and magnetic reed switches produced by Twine's manufacturer, or sensors of your own design via a special 'breakout board'.
A Twine is programmed with rules based on data from its sensors and uses a Wi-Fi connection to issue notifications via email, SMS or HTTP when these rules trigger. This Wi-Fi connection is also used to update the device with new or modified rules, which are programmed in a visual manner via a straightforward web-based interface.
Notice in the screenshot above how this particular Twine has been programmed to send data to a web API hosted at thingspeak.com. ThingSpeak promotes itself as an "open application platform designed to enable meaningful connections between things and people". Once you've registered with ThingSpeak, you can set up public or private channels for your devices, through which data are made available for visualisation or downloading.
Why not try this out now? Head over to https://thingspeak.com/channels/public to see a listing of some of ThingSpeak's public channels. Click on the link to one of these channels to see the data feed visualised, then click on the Developer Info tab at the top-right to see the formats in which you can download data from this channel.
Notice the links to three different formats: JSON, XML, CSV. Try clicking on these links to view the data. (Depending on your browser, you might see the data displayed in the browser window or it might be treated as a downloaded file; if the latter, just open the file in a text editor to view the data.)
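The same three links can also be built in a program. The sketch below assumes that a public channel's feed lives at a URL of the form shown in FEED_URL; that pattern, the helper names and the channel number used in the note afterwards are assumptions for illustration, so check them against the links on the channel's Developer Info tab.

```python
from urllib.request import urlopen

# Assumed URL pattern for a public channel's feed. Compare it with the links
# shown on the channel's Developer Info tab, since the real pattern may differ.
FEED_URL = "https://api.thingspeak.com/channels/{channel_id}/feeds.{fmt}"


def feed_url(channel_id, fmt):
    """Build the feed URL for a channel in one of the three formats."""
    if fmt not in ("json", "xml", "csv"):
        raise ValueError("format must be 'json', 'xml' or 'csv'")
    return FEED_URL.format(channel_id=channel_id, fmt=fmt)


def fetch_feed(channel_id, fmt):
    """Download a feed and return it as text (needs a network connection)."""
    with urlopen(feed_url(channel_id, fmt)) as response:
        return response.read().decode("utf-8")
```

Calling fetch_feed(12345, "csv") would then return the CSV text for channel 12345, a made-up channel number that you would replace with a real public one.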
Comma-Separated Values (CSV) is the simplest of the three formats offered by ThingSpeak and is best suited to data that are tabular in nature. One big reason for its popularity is that spreadsheet applications such as Excel or LibreOffice Calc can open CSV files.
The first few lines of the CSV data for the ThingSpeak feed in the screenshot above look like this:
2014-04-03 11:47:02 UTC,9005,14.375
2014-04-03 12:02:07 UTC,9006,13.75
2014-04-03 12:17:11 UTC,9007,13
2014-04-03 12:32:18 UTC,9008,13.125
This is a dataset with three columns, representing a timestamp, a unique identifier for the measurement and the measurement itself (a temperature in this case). A comma is used to separate the values in each column. (If a value itself contains a comma, it must be protected in some way - e.g., by enclosing the entire value in quotes.) In the downloaded file, the first line contains the column headings; it is not shown in the excerpt above.
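As a minimal sketch of how such data might be read, Python's standard csv module can split each line into its three values. The header line used here is an assumption for illustration (the real column names may differ), and the quoted place name at the end is made up to show how a comma inside a value is protected.

```python
import csv
import io

# The sample rows from above, preceded by an assumed header line. The column
# names are illustrative, not necessarily those that ThingSpeak uses.
text = """timestamp,id,temperature
2014-04-03 11:47:02 UTC,9005,14.375
2014-04-03 12:02:07 UTC,9006,13.75
2014-04-03 12:17:11 UTC,9007,13
2014-04-03 12:32:18 UTC,9008,13.125
"""

reader = csv.reader(io.StringIO(text))
header = next(reader)  # the first line holds the column headings

# Every value arrives as a string, so convert the numeric columns explicitly.
rows = [(stamp, int(ident), float(temp)) for stamp, ident, temp in reader]

for stamp, ident, temp in rows:
    print(stamp, ident, temp)

# A value that contains a comma stays in one piece if it is quoted.
quoted = next(csv.reader(['"Leeds, UK",9005,14.375']))
print(quoted)
```

Note that the csv module does no type conversion of its own, which is why the identifier and temperature are passed through int() and float().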
For very uniform data where all the records have the same structure, CSV is a good choice, not least because it has a very good data-to-markup ratio. In this example, the markup consists of the first line and then only two commas on each subsequent line. Most of the text is useful data.
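To put a rough number on that ratio, the sketch below counts the separator commas in the four sample rows and compares them with the total number of characters. It is a simplification: the header line and the line endings are ignored, so the true proportion of markup is slightly higher.

```python
rows = [
    "2014-04-03 11:47:02 UTC,9005,14.375",
    "2014-04-03 12:02:07 UTC,9006,13.75",
    "2014-04-03 12:17:11 UTC,9007,13",
    "2014-04-03 12:32:18 UTC,9008,13.125",
]

# In these rows every comma is a separator, so commas are the only markup.
markup = sum(row.count(",") for row in rows)
total = sum(len(row) for row in rows)
data = total - markup

print(f"{data} data characters, {markup} markup characters")
print(f"data fraction: {data / total:.0%}")
```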
Extensible Markup Language (XML) is very flexible because it allows you to define your own elements that describe the data. Most (though not all) elements enclose data between a start tag and an end tag - for example, <name> and </name>. Attributes can also be associated with an element if required, using a 'key="value"' format - for example, type="integer".
The XML data for the ThingSpeak feed in the screenshot above looks like this:
<?xml version="1.0" encoding="UTF-8"?>
Wireless outdoor thermometer
(Electric Imp, TI TMP102 sensor, 4 x AA Energizer L91).
<id type="integer" nil="true"/>
<id type="integer" nil="true"/>
This lengthy example includes just two of the measurements from the data feed! To be fair, this is partly because XML's flexibility means that other information can be included besides the measurements themselves - for example, a description of the sensor and its latitude, longitude and elevation. However, even if you ignore all this extra information, the part dealing with the measurements themselves still occupies over five times as much space as the equivalent CSV text!
This inherent verbosity is one of the main drawbacks of using XML. Another is that processing XML with a program is more difficult than processing CSV. Fortunately, there are libraries of code for all common programming languages that will parse XML for you. In many cases, these libraries are a standard part of the language.
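In Python, one such standard library module is xml.etree.ElementTree. The sketch below parses a small XML document in the general shape of a channel feed; the element names and values are assumptions made for illustration, not a copy of ThingSpeak's real output.

```python
import xml.etree.ElementTree as ET

# Illustrative XML only: the element names are assumed for this sketch.
text = """<?xml version="1.0" encoding="UTF-8"?>
<channel>
  <name>Outdoor thermometer</name>
  <feeds>
    <feed>
      <entry-id type="integer">9005</entry-id>
      <field1>14.375</field1>
    </feed>
    <feed>
      <entry-id type="integer">9006</entry-id>
      <field1>13.75</field1>
    </feed>
  </feeds>
</channel>
"""

# Parse from bytes so that the encoding declaration is honoured.
root = ET.fromstring(text.encode("utf-8"))
print(root.findtext("name"))

measurements = []
for feed in root.iter("feed"):
    entry = feed.find("entry-id")
    # Element text is always a string; attributes are read with get().
    measurements.append((entry.get("type"), int(entry.text),
                         float(feed.findtext("field1"))))

print(measurements)
```

Compared with the CSV case, the program has to know the names of the elements it wants and how they are nested, which is part of what makes XML harder to process.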
The JSON data for the ThingSpeak feed looks like this:
"description": "Wireless outdoor thermometer (Electric Imp, TI TMP102 sensor, 4 x AA Energizer L91).",
The use of name-value pairs rather than start and end tags helps to reduce the storage requirements considerably. The temperature measurements in this data feed occupy half the space of those in the XML data feed.
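JSON is also straightforward to read in Python, because the standard json module turns it directly into dictionaries and lists. In the sketch below, only the description text comes from the feed above; the surrounding structure and the other key names are assumptions made for illustration.

```python
import json

# Illustrative JSON: apart from "description", the key names and layout are
# assumed for this sketch rather than copied from ThingSpeak's output.
text = """
{
  "channel": {
    "description": "Wireless outdoor thermometer (Electric Imp, TI TMP102 sensor, 4 x AA Energizer L91)."
  },
  "feeds": [
    {"created_at": "2014-04-03T11:47:02Z", "entry_id": 9005, "field1": "14.375"},
    {"created_at": "2014-04-03T12:02:07Z", "entry_id": 9006, "field1": "13.75"}
  ]
}
"""

data = json.loads(text)  # objects become dicts, arrays become lists

print(data["channel"]["description"])

# The temperatures are stored as strings here, so convert them to numbers.
temps = [float(feed["field1"]) for feed in data["feeds"]]
print(temps)
```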
Continued in Part 2...