Pipelines for Online Data


Note: This lesson is in beta status! It may have issues that have not been addressed.

Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.

Objectives for this lesson

  • Access data from a web service
  • Start a data “analysis” pipeline
  • Meet rasterio, for geospatial arrays

Specific achievements

  • Script aquistion from a NASA data portal
  • Get just the data you need with OPeNDAP
  • Mask rasters with shapefiles in Python

Top of Section

Drought Monitoring

The NOAA Climate Prediction Center provides forecasts to assist USAIDs food security programs.

In the report dated March 29 – April 4, 2018, moisture deficits shown for regions of southern Africa and Madagascar are inferred from regularly updated land surface models. The NASA program that maintains those products, the Famine Early Warning Systems Network Land Data Assimilation System (FLDAS), also distributes them through the EARTHDATA portal.

The latest report suggests trouble in the Limpopo basin.

Credit: River Awareness Kit


The NASA EARTHDATA site provides web access to the Earth Observing System Data and Information System (EOSDIS), which distributes products from multiple missions to users.

  1. Register for an Earthdata Login.
  2. Use the following in a terminal to create a “~/.netrc” file:
> echo "machine urs.earthdata.nasa.gov" > ~/.netrc
> echo "login ... " >> ~/.netrc
> echo "password ..." >> ~/.netrc
> chmod 600 ~/.netrc

Search for “FLDAS” datasets. Can you find a link to the data? The more comprehensive the data portal, the more opaque the interface. The data archivers (the DAACs) each provide their own portals for the data they archive, with more specific access instructions application.

Each DAAC requires it’s own “authorization” using your EARTHDATA login, described on the User Registration help page.


Narrow down the FLDAS datasets within the GES DISC.

Filter to:

  • 0.1 x 0.1 degree
  • Monthly

Select the Anomaly data for Southern Africa modeled with the Noah LSM.

The GES DISC portal provides large temporal and spatial extents. We have established the narrower task of investigating evidence of drought in the Limpopo basin.

  • Use the GUI to subset the resource (good once).
  • Script data aquistion for a reproducible and reusable pipeline.

OPeNDAP Server/Client

The GES DISC provides access through a web service called OPeNDAP. The OPenDAP server is listening for requests at a URL, just like a website server.

dap = 'https://hydro1.gesdisc.eosdis.nasa.gov/opendap/hyrax/FLDAS/'
resource = 'FLDAS_NOAH01_C_SA_MA.001/2013/FLDAS_NOAH01_C_SA_MA.ANOM201301.001.nc'
url = dap + resource

Some portals, GES DISC is one, require authentication to access the data, whether you use the GUI or request it programatically.

from netrc import netrc
username, _, password = netrc().hosts['urs.earthdata.nasa.gov']

The OPeNDAP server’s response is not a website. We need a OPeNDAP client to handle the response.

from pydap.client import open_url
from pydap.cas.urs import setup_session

session = setup_session(username, password, check_url = url)
dataset = open_url(url, session = session)
> dataset.keys()

Your client is ready to request variables from this dataset. All OPeNDAP data follows a standard data model (very similar to netCDF4 model) and is independent of how GES DISC stores the data.

varname = 'SoilMoi10_40cm_tavg'
variable = dataset[varname]

The data array is transmitted along with the dimensions ordered as (bands, rows, columns). In this call, we get all bands (there’s only one), the first two rows, and the first three columns.

> variable[:, :2, :3].data

The variable.shape is not huge, so get all the data. Our current objective is to figure out how to script the aquisition of a subset, and we don’t want to harass the server while we hack about.

> var = variable[:, :, :]
> var[:, :2, :3].data

The dataset and variable objects provide all the values and critical metadata.

data = var.array.data
dims = {k: v.data for k, v in var.maps.items()}
nodata = dataset.attributes['NC_GLOBAL']['missing_value']
> import matplotlib.pyplot as plt
> plt.imshow(data[0, :, :])

Data Alignment

The var.array.data attribute is a numpy array, allowing for some needed adjustments.

import numpy as np

data = np.flip(data, 1).astype('float32')
nodata = data.dtype.type(nodata)

The rasterio package provides an interface to GDAL/OGR and PROJ4 for georeferencing the numpy’s numerical arrays.

from rasterio import open as raster

meta = {
    'driver': 'GTiff',
    'dtype': data.dtype,
    'count': data.shape[0],
    'height': data.shape[1],
    'width': data.shape[2],
    'nodata': nodata,
with raster(varname + '.tif', 'w', **meta) as r:
    r.write(data[0, :, :], 1)
from rasterio.plot import show

with raster(varname + '.tif') as r:
    show(r.read(1, masked = True))

The warnings arise from the lack of a CRS and extent, or more generally an affine transform.

from rasterio.crs import CRS
from rasterio.transform import from_origin

crs = CRS.from_epsg(4326) # a guess!
attr = dataset.attributes['NC_GLOBAL']
transform = from_origin(
    dims['lon'][0].item(), # west
    dims['lat'][-1].item(), # north (flip!)
    attr['DX'], # xsize
    attr['DY']) # ysize

Update the metadata to include a CRS and the transform, then write to file.

    'crs': crs,
    'transform': transform,
with raster(varname + '.tif', 'w', **meta) as r:
    r.write(data[0, :, :], 1)

Now rasterio recognizes use the georeferencing when showing the image.

with raster(varname + '.tif') as r:
    show((r, 1))

More importantly, we can overlay a shapefile in the same CRS to mask the Limpopo basin.

import geopandas as gpd
fig, ax = plt.subplots()

basin = gpd.read_file(
with raster(varname + '.tif') as r:
    show((r, 1), ax = ax)
basin.plot(ax = ax,
    color='none', edgecolor = 'black')
> fig

Subsetted Requests

The OPeNDAP server allows us to subset the request for data, as opposed to requesting data and then subsetting. Create such a request by subsetting variable. Recall that variable is a connection to the server, not a numpy array.

But what is the subset? It is the window of the raster matching the extent of the polygon. We also want to mask out any remaining pixels outisde the basin. The mask() utility will accomplish both.

from rasterio.mask import mask
from shapely.geometry import mapping

feature = [mapping(g) for g in basin['geometry']]
with raster(varname + '.tif') as r:
    masked, transform = mask(r, feature)
    meta = r.meta.copy()

Write the masked data to a new file, using the original metadata.

with raster(varname + '_basin.tif', 'w', **meta) as r:

The file we just wrote will be the source of the critical information we need for our pipeline: both the window around the basin and the explit mask around the basin.

from rasterio.windows import get_data_window

with raster(varname + '_basin.tif') as r:
    var = r.read(1, masked = True)

y, x = get_data_window(var)
basin = var.mask[slice(*y), slice(*x)]

Slicing the mask with y and x reduce the mask to same shape as the data subset we are going to request.

> show(basin)

Statistics calculated on a masked variable refer only to the pixesl within the basin.

> type(var), var.mean()

While the x and y variables provide the slices we want to request, don’t forget to acconut for flipping the y-axis.

y = variable.shape[1] - y[1], variable.shape[1] - y[0]
var = variable[:, slice(*y), slice(*x)]

Loop though Time

Finally, we have everything we need to build a time series of basin means.

  1. a session to contact the server
  2. a time-varying resource to request
  3. a window to request with
  4. a basin mask the same size as the window
from pandas import Series
basin_ts = Series(index = [[],[]])

name = 'FLDAS_NOAH01_C_SA_MA'
resource = '{0}.001/{1}/{0}.ANOM{1}{2:02d}.001.nc'    

yr = 2016
mo = 0
while True:

    # connect to resource
    # for year and month
    url = dap + resource.format(name, yr, mo + 1)
        dataset = open_url(url,
            session = session)
    # request data subset
    # and store masked mean
    variable = dataset[varname]
    var = variable[:, slice(*y), slice(*x)]
    data = np.flip(var.array.data, 1)
    data = np.ma.array(
        mask = basin)
    basin_ts[yr, mo + 1] = data.mean()
    # increment month and year
    mo = (mo + 1) % 12
    yr = yr + 1 if mo == 0 else yr


Plot a simple bar chart to see how soil moisture anomolies have varied between months over the past couple years.

> basin_ts.plot.bar(color = 'b')

Top of Section

If you need to catch-up before a section of code will work, just squish it's 🍅 to copy code above it into your clipboard. Then paste into your interpreter's console, run, and you'll be ready to start in on that section. Code copied by both 🍅 and 📋 will also appear below, where you can edit first, and then copy, paste, and run again.

# Nothing here yet!