Online Data

Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.


Lesson Objectives

  • Distinguish three types of online data
  • Break down how web services use HTTP
  • Learn Python tools for data acquisition

Specific Achievements

  • Programatically acquire data embedded in a web page
  • Request data through a REST API
  • Use the census package to acquire data
  • Use SQLite for caching

Why script data acquistion?

  • Too time intensive to acquire manually
  • Integrate updated or new data
  • Reproducibility
  • There’s an API between you and the data

Top of Section


Acquiring Online Data

Data is available on the web in many different forms. How difficult is it to acquire that data to run analyses? It depends which of three approaches the data source requires:

  • Web scraping
  • Web service or API
  • API wrapper

Web Scraping 🙁

If a web browser can read HTML and JavaScript and display a human readable page, why can’t you write a program (a “bot”) to read HTML and JavaScript and store the data?

Web Service or API 😉

An Application Programming Interface (API, as opposed to GUI) that is compatible with passing data around the internet using HTTP (Hyper-text Transfer Protocol). This is not the fastest protocol for moving large datasets, but it is universal (it underpins web browsers, after all).

API Wrapper 😂

Major data providers can justify writing a package, specific to your language of choice (e.g. Python or R), that facilitates accessing the data they provide through a web service. Sadly, not all do so.

Top of Section


Requests

That “http” at the beginning of the URL for a possible data source is a protocol—an understanding between a client and a server about how to communicate. The client does not have to be a web browser, so long as it knows the protocol. After all, servers exist to serve.

The requests package provides a simple interface to issuing HTTP requests and handling the response. Here’s an example using an XKCD comic.

import requests

response = requests.get('https://xkcd.com/869')
response
<Response [200]>

The response is still binary. It takes a browser-like parser to translate the raw content into an HTML document. BeautifulSoup does a fair job, while making no attempt to “render” a human-readable page.

from bs4 import BeautifulSoup

doc = BeautifulSoup(response.text, 'lxml')
'\n'.join(doc.prettify().splitlines()[0:10])
'<!DOCTYPE html>\n<html>\n <head>\n  <link href="/s/7d94e0.css" rel="stylesheet" title="Default" type="text/css"/>\n  <title>\n   xkcd: Server Attention Span\n  </title>\n  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n  <link href="/s/919f27.ico" rel="shortcut icon" type="image/x-icon"/>\n  <link href="/s/919f27.ico" rel="icon" type="image/x-icon"/>'

Searching the document for desired content is the hard part. This search uses a CSS query to find the image below a section of the document with attribute id = comic.

img = doc.select('#comic > img')
img
[<img alt="Server Attention Span" src="//imgs.xkcd.com/comics/server_attention_span.png" title="They have to keep the adjacent rack units empty. Otherwise, half the entries in their /var/log/syslog are just 'SERVER BELOW TRYING TO START CONVERSATION *AGAIN*.' and 'WISH THEY'D STOP GIVING HIM SO MUCH COFFEE IT SPLATTERS EVERYWHERE.'"/>]

It is possible to query by CSS for a single element and extract attributes such as the image title.

img = doc.select_one('#comic > img')
img['title']
"They have to keep the adjacent rack units empty. Otherwise, half the entries in their /var/log/syslog are just 'SERVER BELOW TRYING TO START CONVERSATION *AGAIN*.' and 'WISH THEY'D STOP GIVING HIM SO MUCH COFFEE IT SPLATTERS EVERYWHERE.'"

Was that so bad?

Pages designed for humans are increasingly harder to parse programmatically.

  • Servers provide different responses based on client “metadata”
  • JavaScript often needs to be executed by the client
  • The HTML <table> is drifting into obscurity (mostly for the better)

HTML Tables

Sites with easily accessible html tables nowadays may be specifically intended to be parsed programmatically, rather than browsed by a human reader. The US Census provides some documentation for their data services in a massive table:

https://api.census.gov/data/2017/acs/acs5/variables.html

> import pandas as pd
+ 
+ vars = (
+   pd
+   .read_html('https://api.census.gov/data/2017/acs/acs5/variables.html')
+   .pop()
+ )
+ vars.head()
          Name            Label  ...   Group Unnamed: 8
0       AIANHH        Geography  ...     NaN        NaN
1       AIHHTL        Geography  ...     NaN        NaN
2        AIRES        Geography  ...     NaN        NaN
3         ANRC        Geography  ...     NaN        NaN
4  B00001_001E  Estimate!!Total  ...  B00001        NaN

[5 rows x 9 columns]

We can use our data manipulation tools to search this unwieldy documentation for variables of interest.

idx = (
  vars['Label']
  .str
  .contains('Median household income')
)
vars.loc[idx, ['Name', 'Label']]
               Name                                              Label
11214   B19013_001E  Estimate!!Median household income in the past ...
11215  B19013A_001E  Estimate!!Median household income in the past ...
11216  B19013B_001E  Estimate!!Median household income in the past ...
11217  B19013C_001E  Estimate!!Median household income in the past ...
11218  B19013D_001E  Estimate!!Median household income in the past ...
11219  B19013E_001E  Estimate!!Median household income in the past ...
11220  B19013F_001E  Estimate!!Median household income in the past ...
11221  B19013G_001E  Estimate!!Median household income in the past ...
11222  B19013H_001E  Estimate!!Median household income in the past ...
11223  B19013I_001E  Estimate!!Median household income in the past ...
11932   B19049_001E  Estimate!!Median household income in the past ...
11933   B19049_002E  Estimate!!Median household income in the past ...
11934   B19049_003E  Estimate!!Median household income in the past ...
11935   B19049_004E  Estimate!!Median household income in the past ...
11936   B19049_005E  Estimate!!Median household income in the past ...
19332   B25099_001E           Estimate!!Median household income!!Total
19333   B25099_002E  Estimate!!Median household income!!Total!!Medi...
19334   B25099_003E  Estimate!!Median household income!!Total!!Medi...
19643   B25119_001E  Estimate!!Median household income in the past ...
19644   B25119_002E  Estimate!!Median household income in the past ...
19645   B25119_003E  Estimate!!Median household income in the past ...

Top of Section


Web Services

The US Census Bureau provides access to its vast stores of demographic data over the Web via their API at https://api.census.gov.

The I in GUI is for interface—it’s the same in API, where buttons and drop-down menus are replaced by functions and object attributes.

Instead of interfacing with a user, this kind of interface is suitable for use in programming another software application. In the case of the Census, the main component of the application is some relational database management system. There are several GUIs designed for humans to query the Census database; the Census API is meant for communication between your program (i.e. script) and their application.

You’ll often see the acronym “REST API.” In this context, REST stands for Representational state transfer. This refers to a set of standards that help ensure that the Web service works well with any computer system it may interact with.

Inspect this URL in your browser.

In a web service, the already universal system for transferring data over the internet, known as HTTP, is half of the interface. All you really need is documentation for how to construct the URL in a standards-compliant way that the service will recognize.

Section Description
https:// scheme
api.census.gov authority, or simply domain if there’s no user authentication
/data/2015/acs5 path to a resource within a hierarchy
? beginning of the query component of a URL
get=NAME first query parameter
& query parameter separator
for=county second query parameter
& query parameter separator
in=state:* third query parameter
# beginning of the fragment component of a URL
irrelephant a document section, it isn’t even sent to the server
path = 'https://api.census.gov/data/2017/acs/acs5'
query = {
  'get': 'NAME,B19013_001E',
  'for': 'tract:*',
  'in': 'state:24',
}
response = requests.get(path, params=query)
response
<Response [200]>

Response Header

The response from the API is a bunch of 0s and 1s, but part of the HTTP protocol is to include a “header” with information about how to decode the body of the response.

Most REST APIs return as the “content” either:

  1. Javascript Object Notation (JSON)
    • a UTF-8 encoded string of key-value pairs, where values may be lists
    • e.g. {'a':24, 'b': ['x', 'y', 'z']}
  2. eXtensible Markup Language (XML)
    • a nested <tag></tag> hierarchy serving the same purpose

The header from Census says the content type is JSON.

response.headers['Content-Type']
'application/json;charset=utf-8'

Response Content

Use a JSON reader to extract a Python object. To read it into a pandas DataFrame, use pandas’ read_json.

data = pd.read_json(response.content)
data.head()
                                                0            1  ...       3       4
0                                            NAME  B19013_001E  ...  county   tract
1  Census Tract 105.01, Wicomico County, Maryland        68652  ...     045  010501
2  Census Tract 5010.02, Carroll County, Maryland        75069  ...     013  501002
3  Census Tract 5077.04, Carroll County, Maryland        88306  ...     013  507704
4  Census Tract 5061.02, Carroll County, Maryland        84810  ...     013  506102

[5 rows x 5 columns]

API Keys & Limits

Most servers request good behavior, others enforce it.

  • Size of single query
  • Rate of queries (calls per second, or per day)
  • User credentials specified by an API key

From the Census FAQ What Are the Query Limits?:

You can include up to 50 variables in a single API query and can make up to 500 queries per IP address per day… Please keep in mind that all queries from a business or organization having multiple employees might employ a proxy service or firewall. This will make all of the users of that business or organization appear to have the same IP address.

Top of Section


Specialized Packages

The third tier of access to online data is much preferred, if it exists: a dedicated package in your programming language’s repository (PyPI or CRAN).

  • Additional guidance on query parameters
  • Returns data in native formats
  • Handles all “encoding” problems

The census package is a user-contributed suite of tools that streamline access to the API.

from census import Census

key = None
c = Census(key, year=2017)
c.acs5
<census.core.ACS5Client object at 0x7efe9cb75b00>

Compared to using the API directly via the requests package:

Pros

  • More concise code, quicker development
  • Package documentation (if present) is usually more user-friendly than API documentaion.
  • May allow seamless update if API changes

Cons

  • No guarantee of updates
  • Possibly limited in scope

Query the Census ACS5 (American Community Survey) for the variable B19001_001E (median annual household income, in dollars) and each entity’s NAME.

The American Community Survey (ACS) is a yearly survey that provides detailed population and housing information at fine geographic scale across the United States. Much of the census package is dedicated to accessing the ACS data. ACS5 refers to a five-year average of the annual surveys.

variables = ('NAME', 'B19013_001E')

This code pulls the variables NAME and B19001_001E from all census tracts and all counties in the state with ID 24 (Maryland). The census package converts the JSON string into a Python dictionary. (No need to check headers.)

response = c.acs5.state_county_tract(
    variables,
    state_fips='24',
    county_fips=Census.ALL,
    tract=Census.ALL,
)
response[0]
{'state': '24', 'county': '045', 'tract': '010501', 'NAME': 'Census Tract 105.01, Wicomico County, Maryland', 'B19013_001E': 68652.0}

The Pandas DataFrame() constructor will accept the list of dictionaries as the sole argument, taking column names from “keys”. This code also removes values less than zero.

df = (
  pd
  .DataFrame(response)
  .query("B19013_001E >= 0")
)

The seaborn package provides some nice, quick visualizations. Here we create boxplots showing the income distribution among census tracts within each county in Maryland.

import seaborn as sns

sns.boxplot(
  data = df,
  x = 'county',
  y = 'B19013_001E',
)

Top of Section


Paging & Stashing

A common strategy that web service providers take to balance their load is to limit the number of records a single API request can return. The user ends up having to flip through “pages” with the API, handling the response content at each iteration. Options for stashing data are:

  1. Store it all in memory, write to file at the end.
  2. Append each response to a file, writing frequently.
  3. Offload these decisions to database management software.

The data.gov API provides a case in point. Data.gov is a service provided by the U.S. federal government to make data available from across many government agencies. It hosts a catalog of raw data and of many other APIs from across government. Among the APIs catalogued by data.gov is the FoodData Central API. The U.S. Department of Agriculture maintains a data system of nutrition information for thousands of foods. We might be interested in the relative nutrient content of different fruits.

To repeat the exercise below at home, request an API key at https://api.data.gov/signup/, and store it in a file named api_key.py in your working directory. The file should contain the single line API_KEY = your many digit key.

Load the API_KEY variable by importing it from the file you saved it in.

> from api_key import API_KEY

Run an API query for all foods with "fruit" in their name.

import requests

api = 'https://api.nal.usda.gov/fdc/v1/'
path = 'foods/search'
query = {
    'api_key':API_KEY,
    'query':'fruit',
}
response = (
    requests
    .get(api + path, params=query)
)
doc = response.json()

Extract data from the returned JSON object, which gets mapped to a Python dictionary called doc. To inspect the return, we can list the dictionary keys.

> list(doc.keys())
['foods', 'foodSearchCriteria', 'totalHits', 'currentPage', 'totalPages']

We can print the value associated with the key totalHits to see how many foods matched our search term, "fruit".

> doc['totalHits']
18801

The purported claimed number of results is much larger than the length of the foods array contained in this response. The query returned only the first page, with 50 items.

> len(doc['foods'])
50

The following commands prepare Python to connect to a database-in-a-file, and create empty tables in the database if they do not already exist (meaning that it is safe to re-run after you have populated the database).

Step 1: Boilerplate

The SQLAlchemy package has a lot of features, and requires you to be very precise about how to get started.

from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

Step 2: Table Definition

Define the tables that are going to live in the database using Python classes. For each class, its attributes will map to columns in a table. Then create a session engine.

from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, Text, Numeric

Base = declarative_base()

class Food(Base):
    __tablename__ = 'food'
    
    id = Column(Integer, primary_key=True)
    name = Column(Text)
    sugar = Column(Numeric)
    
engine = create_engine('sqlite:///fruits.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

For each fruit, we’ll store its name and the amount of sugar (grams of sugar per 100 grams of fruit) found in the API response.

> fruit = doc['foods'].pop()
+ fruit['description']
'Fruit peel, candied'

Extract the names and values of the first ten nutrients for the first item returned by the query.

> [ (nutrient['nutrientName'], nutrient['value']) for nutrient in fruit['foodNutrients'][:9] ]
[('Protein', 0.34), ('Total lipid (fat)', 0.07), ('Carbohydrate, by difference', 82.74), ('Energy', 322.0), ('Alcohol, ethyl', 0.0), ('Water', 16.7), ('Caffeine', 0.0), ('Theobromine', 0.0), ('Sugars, total including NLEA', 80.68)]

Step 3: Connect (and Initialize)

from schema import Session, Food

session = Session()
engine = session.bind

You could inspect the fruit database now using any sqlite3 client: you would find one empty “food” table with fields “id”, “name”, and “sugar”.

Add a new pageSize parameter to request 100 documents per page.

query['pageSize'] = 100

In each request, advance the query parameter pageNumber by one. The first record retrieved will be pageNumber * pageSize. Insert the fruits (the key:value pairs stored in values) in bulk to the database with engine.execute().

In each iteration of the loop, we use a list comprehension to extract the value corresponding to the amount of sugar from each of the foods in the page of results returned by the query.

table = Food.metadata.tables['food']
for i in range(0, 10):
    
    # advance page and query
    query['pageNumber'] = i 
    response = requests.get(api + path, params=query)
    page = response.json()
    fruits = page['foods']
    
    # save page with session engine
    values = [{'name': fruit['description'],
               'sugar': next(iter([ nutrient['value'] for nutrient in fruit['foodNutrients'] if nutrient['nutrientName'][0:5] == 'Sugar' ]), None) } for fruit in fruits]
               
    insert = table.insert().values(values)
    engine.execute(insert)
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9b686da0>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9b492f60>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9db07160>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9dda33c8>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9db07cc0>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9dda7630>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9da90cc0>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9ddb05f8>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9da9ad68>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9ddb2278>

View the records in the database by reading everything we have so far back into a DataFrame.

import pandas as pd

df = pd.read_sql_table('food', engine)

Don’t forget to disconnect from your database!

engine.dispose()

Top of Section


Takeaway

  • Web scraping is hard and unreliable, but sometimes there is no other option.

  • Web services are the most common resource.

  • Search PyPI for an API you plan to use.

Web services do not always have great documentation—what parameters are acceptable or necessary may not be clear. Some may even be poorly documented on purpose if the API wasn’t designed for public use! Even if you plan to acquire data using the “raw” web service, try a search for a relevant package on Python. The package documentation could help.

For more resources, and to discover more APIs, visit:

  • DataOne
    A network of data repositories making data more accessible and usable.
  • SODA Developers
    An open data API to access open data resources from many organizations and governments.

  • Public APIs
    A repository listing free APIs for use in your research and projects.

Top of Section


Exercises

Exercise 1

Identify the name of the census variable in the table of ACS variables whose label includes “COUNT OF THE POPULATION”. Next, use the Census API to collect the data for this variable, for every county in the U.S. state with FIPS code ‘24’, into a pandas DataFrame.

Exercise 2

Request an API key for data.gov, which will enable you to access the FoodData Central API. Use the API to collect 3 “pages” of food results matching a search term of your choice. Modify schema.py to save the names and sugar contents of the foods into a new SQLite file.

Top of Section


If you need to catch-up before a section of code will work, just squish it's 🍅 to copy code above it into your clipboard. Then paste into your interpreter's console, run, and you'll be ready to start in on that section. Code copied by both 🍅 and 📋 will also appear below, where you can edit first, and then copy, paste, and run again.

# Nothing here yet!