Vector Operations in R

Lead Poisoning in Syracuse

Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.


Introduction

Vector data is a way to represent features on Earth’s surface, which can be points (e.g. the locations of ice cream shops), polygons (e.g. the boundaries of countries), or lines (e.g. streets or rivers). This lesson is a whirlwind tour through a set of tools in R that we can use to manipulate vector data and do data analysis on them to uncover spatial patterns.

The example dataset in this lesson is a set of measurements of soil lead levels in the city of Syracuse, New York, collected as part of a study investigating the relationship between background lead levels in the environment and lead levels in children’s blood. As a geospatial analyst and EPA consultant for the city of Syracuse, your task is to investigate the relationship between metal concentration (in particular lead) and population. In particular, research suggests higher concentration of metals in minorities.

Lesson Objectives

  • Dive into scripting vector data operations
  • Manipulate spatial feature collections like data frames
  • Address autocorrelation in point data
  • Understand the basics of spatial regression

Specific Achievements

  • Read CSVs and shapefiles
  • Intersect and dissolve geometries
  • Smooth point data with Kriging
  • Include autocorrelation in a regression on polygon data

Top of Section


Simple Features

In this section, we will perform multiple operations to explore the datasets, including soil lead concentration data (points) and census data (polygons representing block groups and census tracts). We will manipulate the datasets provided to join information using common table keys, spatially aggregate the block group data to the census tract level, and visualize vector spatial data with plot and ggplot commands for sf objects.

A standardized set of geometric shapes are the essence of vector data. The sf package puts sets of shapes in a tabular structure so that we can manipulate and analyze them.

library(sf)

lead <- read.csv('data/SYR_soil_PB.csv')

We read in a data frame with point coordinates called lead from an external CSV file, then create an sf object from the data frame. We use st_as_sf() with the coords argument to specify which columns represent the x and y coordinates of each point. The CRS must be specified with the argument crs via EPSG integer or proj4 string.

lead <- st_as_sf(lead,
  coords = c('x', 'y'),
  crs = 32618)

The lead data frame now has a “simple feature column” called geometry, which st_as_sf creates from a CRS and a geometry type.

Each element of the simple feature column is a simple feature geometry, here created from the "x" and "y" elements of a given feature. In this case, st_as_sf creates a point geometry because we supplied a single x and y coordinate for each feature, but vector features can be points, lines, or polygons.

> lead
Simple feature collection with 3149 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
CRS:           EPSG:32618
First 10 features:
   ID      ppm                 geometry
1   0 3.890648 POINT (408164.3 4762321)
2   1 4.899391 POINT (405914.9 4767394)
3   2 4.434912   POINT (405724 4767706)
4   3 5.285548 POINT (406702.8 4769201)
5   4 5.295919 POINT (405392.3 4765598)
6   5 4.681277 POINT (405644.1 4762037)
7   6 3.364148 POINT (409183.1 4763057)
8   7 4.096946 POINT (407945.4 4770014)
9   8 4.689880 POINT (406341.4 4762603)
10  9 5.422257 POINT (404638.1 4767636)
geometry description
st_point a single point
st_linestring sequence of points connected by straight, non-self-intersecting line pieces
st_polygon one [or more] sequences of points in a closed, non-self-intersecting ring [with holes]
st_multi* sequence of *, either point, linestring, or polygon

Now that our data frame is a simple feature object, calling functions like print and plot will use methods introduced by the sf package.

For example, the print method we just used automatically shows the CRS and truncates the number of records displayed.

Using the plot method, the data are easily displayed as a map.

Here, the points are colored by lead concentration (ppm).

plot(lead['ppm'])

For ggplot2 figures, use geom_sf to draw maps. In the aes mapping for feature collections, the x and y variables are automatically assigned to the geometry column, while other attributes can be assigned to visual elements like fill or color.

library(ggplot2)

ggplot(data = lead,
       mapping = aes(color = ppm)) +
  geom_sf()

Feature Collections

More complicated geometries are usually not stored in CSV files, but they are usually still read as tabular data. We will see that the similarity of feature collections to non-spatial tabular data goes quite far; the usual data manipulations done on tabular data work just as well on sf objects. Here, we read the boundaries of the Census block groups in the city of Syracuse from a shapefile.

blockgroups <- st_read('data/bg_00')
Reading layer `bg_00' from data source `/nfs/public-data/training/bg_00' using driver `ESRI Shapefile'
Simple feature collection with 147 features and 3 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 401938.3 ymin: 4759734 xmax: 412486.4 ymax: 4771049
CRS:           32618

The geometry column contains projected UTM coordinates of the polygon vertices.

Also note the table dimensions show that there are 147 features in the collection.

> blockgroups
Simple feature collection with 147 features and 3 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 401938.3 ymin: 4759734 xmax: 412486.4 ymax: 4771049
CRS:           32618
First 10 features:
        BKG_KEY Shape_Leng Shape_Area                       geometry
1  360670001001  13520.233  6135183.6 POLYGON ((403476.4 4767682,...
2  360670003002   2547.130   301840.0 POLYGON ((406271.7 4770188,...
3  360670003001   2678.046   250998.4 POLYGON ((406730.3 4770235,...
4  360670002001   3391.920   656275.6 POLYGON ((406436.1 4770029,...
5  360670004004   2224.179   301085.7 POLYGON ((407469 4770033, 4...
6  360670004001   3263.257   606494.9 POLYGON ((408398.6 4769564,...
7  360670004003   2878.404   274532.3 POLYGON ((407477.4 4769773,...
8  360670004002   3605.653   331034.9 POLYGON ((407486 4769507, 4...
9  360670010001   2950.688   376786.4 POLYGON ((410704.4 4769103,...
10 360670010003   2868.260   265835.7 POLYGON ((409318.3 4769203,...

Simple feature collections are data frames.

> class(blockgroups)
[1] "sf"         "data.frame"

The blockgroups object is a data.frame, but it also has the class attribute of sf. This additional class extends the data.frame class in ways useful for feature collection. For instance, the geometry column becomes “sticky” in most table operations, like subsetting. This means that the geometry column is retained even if it is not explicitly named.

> blockgroups[1:5, 'BKG_KEY']
Simple feature collection with 5 features and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 402304.2 ymin: 4767421 xmax: 407469 ymax: 4771049
CRS:           32618
       BKG_KEY                       geometry
1 360670001001 POLYGON ((403476.4 4767682,...
2 360670003002 POLYGON ((406271.7 4770188,...
3 360670003001 POLYGON ((406730.3 4770235,...
4 360670002001 POLYGON ((406436.1 4770029,...
5 360670004004 POLYGON ((407469 4770033, 4...

Table Operations

We can plot the polygon features using ggplot2 or do common data wrangling operations:

  • plot
  • merge or join
  • “split-apply-combine” or group-by and summarize
> ggplot(blockgroups,
+        aes(fill = Shape_Area)) +
+   geom_sf()

Merging with a regular data frame is done by normal merging on non-spatial columns found in both tables.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
census <- read.csv('data/SYR_census.csv')
census <- mutate(census, 
     BKG_KEY = as.character(BKG_KEY)
)

Here, we read in a data frame called census with demographic characteristics of the Syracuse block groups. As usual, there’s the difficulty that CSV files do not include metadata on data types, which have to be set manually. We do this here by changing the BKG_KEY column to character type using as.character().

Merge tables on a unique identifier (our primary key is "BKG_KEY"), but let the sf object come first or its special attributes get lost.

The inner_join() function from dplyr joins two data frames on a common key specified with the by argument, keeping only the rows from each data frame with matching keys in the other data frame and discarding the rest.

census_blockgroups <- inner_join(
  blockgroups, census,
  by = c('BKG_KEY'))
> class(census_blockgroups)
[1] "sf"         "data.frame"

The census data is now easily visualized as a map.

ggplot(census_blockgroups,
       aes(fill = POP2000)) +
  geom_sf()

Feature collections also cooperate with the common “split-apply-combine” sequence of steps in data manipulation.

  • split – group the features by some factor
  • apply – apply a function that summarizes each subset, including their geometries
  • combine – return the results as columns of a new feature collection
census_tracts <- census_blockgroups %>%
  group_by(TRACT) %>%
  summarise(
    POP2000 = sum(POP2000),
    perc_hispa = sum(HISPANIC) / POP2000)

Here, we use the dplyr function group_by to specify that we are doing an operation on all block groups within a Census tract. We can use summarise() to calculate as many summary statistics as we want, each one separated by a comma. Here POP2000 is the sum of all block group populations in each tract and perc_hispa is the sum of the Hispanic population divided by the total population (because now we have redefined POP2000 to be the population total). The summarise() operation automatically combines the block group polygons from each tract!

Read in the census tracts from a separate shapefile to confirm that the boundaries were dissolved as expected.

tracts <- st_read('data/ct_00')
Reading layer `ct_00' from data source `/nfs/public-data/training/ct_00' using driver `ESRI Shapefile'
Simple feature collection with 57 features and 7 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 401938.3 ymin: 4759734 xmax: 412486.4 ymax: 4771049
CRS:           32618
ggplot(census_tracts,
       aes(fill = POP2000)) +
  geom_sf() +
  geom_sf(data = tracts,
          color = 'red', fill = NA)

By default, the sticky geometries are summarized with st_union. The alternative st_combine does not dissolve internal boundaries. Check ?summarise.sf for more details.

Top of Section


Spatial Join

To summarize average lead concentration by census tracts, we need to spatially join the lead data with the census tract boundaries.

The data about lead contamination in soils is at points; the census information is for polygons. Combine the information from these tables by determining which polygon contains each point.

ggplot(census_tracts,
       aes(fill = POP2000)) +
  geom_sf() +
  geom_sf(data = lead, color = 'red',
          fill = NA, size = 0.1)

In the previous section, we performed a table join using a non-spatial data type. More specifically, we performed an equality join: records were merged wherever the join variable in the “left table” equalled the variable in the “right table”. Spatial joins operate on the “geometry” columns and require expanding beyond equality-based matches. Several different kinds of “intersections” can be specified that denote a successful match.


Image by Kraus / CC BY

Before doing any spatial join, it is essential that both tables share a common CRS.

> st_crs(lead)
Coordinate Reference System:
  User input: EPSG:32618 
  wkt:
PROJCS["WGS 84 / UTM zone 18N",
    GEOGCS["WGS 84",
        DATUM["WGS_1984",
            SPHEROID["WGS 84",6378137,298.257223563,
                AUTHORITY["EPSG","7030"]],
            AUTHORITY["EPSG","6326"]],
        PRIMEM["Greenwich",0,
            AUTHORITY["EPSG","8901"]],
        UNIT["degree",0.0174532925199433,
            AUTHORITY["EPSG","9122"]],
        AUTHORITY["EPSG","4326"]],
    PROJECTION["Transverse_Mercator"],
    PARAMETER["latitude_of_origin",0],
    PARAMETER["central_meridian",-75],
    PARAMETER["scale_factor",0.9996],
    PARAMETER["false_easting",500000],
    PARAMETER["false_northing",0],
    UNIT["metre",1,
        AUTHORITY["EPSG","9001"]],
    AXIS["Easting",EAST],
    AXIS["Northing",NORTH],
    AUTHORITY["EPSG","32618"]]
> st_crs(census_tracts)
Coordinate Reference System:
  User input: 32618 
  wkt:
PROJCS["WGS_1984_UTM_Zone_18N",
    GEOGCS["GCS_WGS_1984",
        DATUM["WGS_1984",
            SPHEROID["WGS_84",6378137.0,298.257223563]],
        PRIMEM["Greenwich",0.0],
        UNIT["Degree",0.0174532925199433],
        AUTHORITY["EPSG","4326"]],
    PROJECTION["Transverse_Mercator"],
    PARAMETER["False_Easting",500000.0],
    PARAMETER["False_Northing",0.0],
    PARAMETER["Central_Meridian",-75.0],
    PARAMETER["Scale_Factor",0.9996],
    PARAMETER["Latitude_Of_Origin",0.0],
    UNIT["Meter",1.0],
    AUTHORITY["EPSG","32618"]]

The st_join function performs a left join using the geometries of the two simple feature collections.

> st_join(lead, census_tracts)
Simple feature collection with 3149 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
CRS:           EPSG:32618
First 10 features:
   ID      ppm TRACT POP2000 perc_hispa                 geometry
1   0 3.890648  6102    2154 0.01578459 POINT (408164.3 4762321)
2   1 4.899391  3200    2444 0.10515548 POINT (405914.9 4767394)
3   2 4.434912   100     393 0.03053435   POINT (405724 4767706)
4   3 5.285548   700    1630 0.03190184 POINT (406702.8 4769201)
5   4 5.295919  3900    4405 0.23541430 POINT (405392.3 4765598)
6   5 4.681277  6000    3774 0.04027557 POINT (405644.1 4762037)
7   6 3.364148  5602    2212 0.08318264 POINT (409183.1 4763057)
8   7 4.096946   400    3630 0.01019284 POINT (407945.4 4770014)
9   8 4.689880  6000    3774 0.04027557 POINT (406341.4 4762603)
10  9 5.422257  2100    1734 0.09169550 POINT (404638.1 4767636)
  • Only the “left” geometry is preserved in the output
  • Matching defaults to st_intersects, but permits any predicate function (use join argument to specify)
> st_join(lead, census_tracts,
+   join = st_contains)
Simple feature collection with 3149 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
CRS:           EPSG:32618
First 10 features:
   ID      ppm TRACT POP2000 perc_hispa                 geometry
1   0 3.890648    NA      NA         NA POINT (408164.3 4762321)
2   1 4.899391    NA      NA         NA POINT (405914.9 4767394)
3   2 4.434912    NA      NA         NA   POINT (405724 4767706)
4   3 5.285548    NA      NA         NA POINT (406702.8 4769201)
5   4 5.295919    NA      NA         NA POINT (405392.3 4765598)
6   5 4.681277    NA      NA         NA POINT (405644.1 4762037)
7   6 3.364148    NA      NA         NA POINT (409183.1 4763057)
8   7 4.096946    NA      NA         NA POINT (407945.4 4770014)
9   8 4.689880    NA      NA         NA POINT (406341.4 4762603)
10  9 5.422257    NA      NA         NA POINT (404638.1 4767636)

This command would match on polygons within (contained by) each point, which is the wrong way around. Stick with the default.

The population data is at the coarser scale, so the lead concentration ought to be averaged within a census tract. Once each lead measurement is joined to TRACT, the spatial data can be dropped using st_drop_geometry().

lead_tracts <- lead %>%
    st_join(census_tracts) %>%
    st_drop_geometry()

Two more steps are needed: group_by() the data by TRACT and summarise() to average the lead concentrations within each TRACT.

lead_tracts <- lead %>%
    st_join(census_tracts) %>%
    st_drop_geometry() %>%
    group_by(TRACT) %>%
    summarise(avg_ppm = mean(ppm))

To visualize the average lead concentration from soil samples within each census tract, merge the data frame to the sf object on the TRACT column.

Again, we use inner_join to keep all rows with matching keys between census_tracts and lead_tracts.

census_lead_tracts <- census_tracts %>%
  inner_join(lead_tracts)

Make a plot of avg_ppm for each census tract.

ggplot(census_lead_tracts,
       aes(fill = avg_ppm)) +
  geom_sf() +
  scale_fill_gradientn(
    colors = heat.colors(7))

A problem with this approach to aggregation is that it ignores autocorrelation. Points close to each other tend to have similar values and shouldn’t be given equal weight in averages within a polygon.

Take a closer look at tract 5800 (around 43.025N, 76.152W), and notice that several low values are nearly stacked on top of each other.

library(mapview)

mapview(lead['ppm'],
        map.types = 'OpenStreetMap',
        viewer.suppress = TRUE) +
  mapview(census_tracts,
          label = census_tracts$TRACT,
          legend = FALSE)