Classification

Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.


Lesson Objectives

Specific Achievements

Why learn Python? While much of the academic research community has dived into R for an open source toolbox for data science, there are plenty of reasons to learn Python too. Some find they can learn to write scripts more quickly in Python, others find its object orientation a real boon. This lesson works towards a simple “machine learning” problem, which has long been a speciality of Python’s Scikit-learn package.

Machine Learning

Machine Learning is another take on regression (for continuous response) and classification (for categorical response).

This lesson will step through a non-probabilistic classifier, because it’s at far end of the spectrum relative to generalized linear regression. Some classification methods have probabilistic interpretations: logistic regression is actually a classifier, whether or not you follow through to choosing the most likely outcome or are satisfied with estimating its probability. Others optimize the classifier based on other abstract quantities: SVMs maximize the distance between the “support vectors”.

Jupyter

Sign into JupyterHub and open up worksheet.ipynb. This worksheet is an Jupyter Notebook document: it is divided into “cells” that are run independently but access the same Python interpreter. Use the Notebook to write and annotate code.

After opening worksheet.ipynb, right click anywhere in your notebook and choose “Create Console for Notebook”. Drag-and-drop the tabs into whatever arrangement you like.

Top of Section


Variables

Variable assignment attaches the label left of an = to the return value of the expression on its right.

a = 'xyz'
a
'xyz'

Colloquially, you might say the new variable a equals 'xyz', but Python makes it easy to “go deeper”. There can be only one string 'xyz', so the Python interpreter makes a into another label for the same 'xyz', which we can verify by id().

The “in-memory” location of a returned by id()

id(a)
140241398229848

… is equal to that of xyz itself:

id('xyz')
140241398229848

The idiom to test this “sameness” is typical of the Python language: it uses plain English when words will suffice.

a is 'xyz'
True

Equal but not the Same

The id() function helps demonstrate that “equal” is not the “same”.

b = [1, 2, 3]
id(b)
140241398331720
id([1, 2, 3])
140241398330440

Even though b == [1, 2, 3] returns True, these are not the same object:

b is [1, 2, 3]
False

Side-effects

The reason to be aware of what b is has to do with “side-effects”, an very import part of Python programming. A side-effect occurs when an expression generates some ripples other than its return value.

b.pop()
b
[1, 2]

Python is an object-oriented language from the ground up—everything is an “object” with some state to be more or less aware of. And side-effects don’t touch the label, they effect what the label is assigned to (i.e. what it is).

Question
Re-check the “in-memory” location—is it the same b?
Answer
Yes! The list got shorter but it is the same list.

Side-effects trip up Python programmers when an object has multiple labels, which is not so unusual:

c = b
b.pop()
c
[1]

The assignment to c does not create a new list, so the side-effect of popping off the tail of b ripples into c.

A common mistake for those coming to Python from R, is to write b = b.append(4), which overwrites b with the value None that happens to be returned by the append() method.

Top of Section


Data Types

The basic data types are

'int' integers
'float' real numbers
'str' character strings
'bool' True or False
'NoneType' None

Any object can be queried with type()

type('x')
str

Operators

Python supports the usual (or not) arithmetic operators for numeric types:

+ addition
- subtraction
* multiplication
/ floating-point division
** exponent
% modulus
// floor division

One or both of these might be a surprise:

5 ** 2
25
2 // 3

Some operators have natural extensions to non-numeric types:

a * 2
'xyzxyz'

Comparison operators are symbols or plain english:

== equal
!= non-equal
>, < greater, lesser
>=, <= greater or equal, lesser or equal
and logical and
or logical or
not logical negation
in logical membership

Top of Section


Data Structures

The built-in structures for holding multiple values are:

Tuple

The simplest kind of sequence, a tuple is declared with comma-separated values, optionally inside (...). The tuple is a common type of return value for functions with multiple outputs.

T = 'x', 3, True
type(T)
tuple

To declare a one-tuple without (...), a trailing “,” is required.

T = 'cat',
type(T)
tuple

List

The more common kind of sequence in Python is the list, which is declared with comma-separated values inside [...]. Unlike a tuple, a list is mutable.

L = [3.14, 'xyz', T]
type(L)
list

Subsetting Tuples and Lists

Subsetting elements from a tuple or list is performed with square brackets in both cases, and selects elements using their integer position starting from zero—their “index”.

L[0]
3.14

Negative indices are allowed, and refer to the reverse ordering: -1 is the last item in the list, -2 the second-to-last item, and so on.

L[-1]
('cat',)

The syntax L[i:j] selects a sub-list starting with the element at index i and ending with the element at index j - 1.

L[0:2]
[3.14, 'xyz']

A blank space before or after the “:” indicates the start or end of the list, respectively. For example, the previous example could have been written L[:2].

A potentially useful trick to remember the list subsetting rules in Python is to picture the indices as “dividers” between list elements.

 0      1       2          3 
 | 3.14 | 'xyz' | ('cat',) |
-3     -2      -1

Positive indices are written at the top and negative indices at the bottom. L[i] returns the element to the right of i whereas L[i:j] returns elements between i and j.

Set

The third and last “sequence” data structure is a set, useful for operations like “union” and “difference”. Declare a set with comma-separated values inside {...} or by casting another sequence with set().

S1 = set(L)
S2 = {3.14, 'z'}
S1.difference(S2)
{('cat',), 'xyz'}

Python is a rather principled language: a set is technically unordered, so its elements do not have an index. You cannot subset a set using [...].

Dictionary

Lists are useful when you need to access elements by their position in a sequence. In contrast, a dictionary is needed to find values based on arbitrary identifiers, called “keys”.

Construct a dictionary with comma-separated key: value pairs within {...}.

user = {
  'First Name': 'J.',
  'Last Name': 'Doe',
  'Email': 'j.doe@gmail.com',
}
type(user)
dict

Individual values are accessed using square brackets, as for lists, but the key must be used rather than an index.

user['Email']
'j.doe@gmail.com'

To add a single new element to the dictionary, define a new key:value pair by assigning a value to a novel key in the dictionary.

user['Age'] = 42

Dictionary keys are unique. Assigning a value to an existing key overwrites its previous value. You would replace the current Gmail address with user['Email'] = doedoe1337@aol.com.

Top of Section


Flow Control

The syntax for “if … then” tests and for-loops involves plain English and few special characters—you will easily ingest it. Whitespace actually serves as the most important special character, although you may not think of it that way, and that is our main motivation for looking at flow control.

For-loops

A for loop takes any “iterable” object and executes a block of code once for each element in the object.

squares = []
for i in range(1, 5):
    j = i ** 2
    squares.append(j)
len(squares)

The range(i, j) function creates an object that iterates from i up through j - 1; just like in the case of list slices, the range is not inclusive of the upper bound.

Indentation

Note the pattern of the block above:

Compared with other programming languages in which code indentation only serves to enhance readability, Python uses indentation to define “code blocks”. In a sense, indentation serves two purposes in Python, enhancing readiblity and defining complete statements, which is kinda how we use indentation too.

Nesting Indentation

Each level of indentation indicates blocks within blocks. Nesting a conditional within a for-loop is a common case.

The following example creates a contact list (as a list of dictionaries), then performs a loop over all contacts. Within the loop, a conditional statement (if) checks if the name is ‘Alice’. If so, the interpreter prints her email address; otherwise it prints an empty string.

users = [
    {'Name':'Alice', 'Email':'alice@email.com'},
    {'Name':'Bob', 'Email': 'bob@email.com'},
    ]
for u in users:
    if u['Name'] == 'Alice':
        print(u['Email'])
    else:
        print('')
alice@email.com

Top of Section


Methods

The period is a special character in Python that accesses an object’s attributes and methods. In either the Jupyter Notebook or Console, typing an object’s name followed by . and then pressing the TAB key brings up suggestions.

squares.index(4)

We call this index() function a method of lists (recall that squares is of type 'list'). A useful feature of having methods attached to objects is that their documentation is attached too.

help(squares.index)
Help on built-in function index:

index(...) method of builtins.list instance
    L.index(value, [start, [stop]]) -> integer -- return first index of value.
    Raises ValueError if the value is not present.

A major differnce between Python and R has to do with the process for making functions behave differently for different objects. In Python, a function is attached to an object as a “method”, while in R it is common for a “dispatcher” to examine the attributes of a function call’s arguments and chooses the particular function to use.

Dictionary’s have an update() method, for merging the contents of a second dictionary into the one calling update().

user.update({
  'Nickname': 'Jamie',
  'Age': 24,
})
user

{'Age': 24,
 'Email': 'j.doe@gmail.com',
 'First Name': 'J.',
 'Last Name': 'Doe',
 'Nickname': 'Jamie'}

Note a couple “Pythonic” style choices in the above:

  1. Leave a space after the : when declaring key: value pairs
  2. Trailing null arguments are syntactically correct, even advantageous
  3. White space within (...) has no meaning and can improve readiability

The “style-guide” for Python scripting, PEP 8, is first authored by the inventor of Python and adhered too by many. It’s sometimes nice to know their is a preferred way of writing code, but don’t let it be a burden.

Top of Section


Pandas

If you have used the statistical programming language R, you are familiar with “data frames”, two-dimensional data structures where each column can hold a different type of data, as in a spreadsheet.

The analagous data frame package for Python is pandas, which provides a DataFrame object type with methods to subset, filter reshape and aggregate tabular data.

After importing pandas, we call its read_csv function to load the County Business Patterns (CBP) dataset.

import pandas as pd
cbp = pd.read_csv('data/cbp15co.csv')
cbp.dtypes

FIPSTATE     int64
FIPSCTY      int64
NAICS       object
EMPFLAG     object
EMP_NF      object
EMP          int64
QP1_NF      object
QP1          int64
AP_NF       object
AP           int64
EST          int64
N1_4         int64
N5_9         int64
N10_19       int64
N20_49       int64
N50_99       int64
N100_249     int64
N250_499     int64
N500_999     int64
N1000        int64
N1000_1      int64
N1000_2      int64
N1000_3      int64
N1000_4      int64
CENSTATE     int64
CENCTY       int64
dtype: object

As is common with CSV files, the inferred data types were not quite right.

cbp = pd.read_csv(
    'data/cbp15co.csv',
    dtype={'FIPSTATE':'str', 'FIPSCTY':'str'},
    )
cbp.dtypes

FIPSTATE    object
FIPSCTY     object
NAICS       object
EMPFLAG     object
EMP_NF      object
EMP          int64
QP1_NF      object
QP1          int64
AP_NF       object
AP           int64
EST          int64
N1_4         int64
N5_9         int64
N10_19       int64
N20_49       int64
N50_99       int64
N100_249     int64
N250_499     int64
N500_999     int64
N1000        int64
N1000_1      int64
N1000_2      int64
N1000_3      int64
N1000_4      int64
CENSTATE     int64
CENCTY       int64
dtype: object

There are many ways to slice a DataFrame. To select a subset of rows and/or columns by name, use the loc attribute and [ for indexing.

cbp.loc[:, ['FIPSTATE', 'NAICS']]

        FIPSTATE   NAICS
0             01  ------
1             01  11----
2             01  113///
3             01  1133//
4             01  11331/
5             01  113310
6             01  115///
7             01  1151//
8             01  11511/
9             01  115112
10            01  21----
11            01  212///
12            01  2123//
13            01  21231/
14            01  212319
15            01  21232/
16            01  212321
17            01  22----
18            01  221///
19            01  2211//
20            01  22111/
21            01  221112
22            01  22112/
23            01  221122
24            01  2213//
25            01  22131/
26            01  221310
27            01  23----
28            01  236///
29            01  2361//
...          ...     ...
2126571       56  611///
2126572       56  6113//
2126573       56  61131/
2126574       56  611310
2126575       56  6117//
2126576       56  61171/
2126577       56  611710
2126578       56  62----
2126579       56  621///
2126580       56  6213//
2126581       56  62134/
2126582       56  621340
2126583       56  6214//
2126584       56  62142/
2126585       56  621420
2126586       56  6216//
2126587       56  62161/
2126588       56  621610
2126589       56  6219//
2126590       56  62199/
2126591       56  621999
2126592       56  81----
2126593       56  812///
2126594       56  8129//
2126595       56  81299/
2126596       56  812990
2126597       56  813///
2126598       56  8133//
2126599       56  81331/
2126600       56  813312

[2126601 rows x 2 columns]

As with lists, : by itself indicates all the rows (or columns). Unlike lists, the loc attribute returns both endpoints of a slice.

cbp.loc[2:4, 'FIPSTATE':'NAICS']

  FIPSTATE FIPSCTY   NAICS
2       01     001  113///
3       01     001  1133//
4       01     001  11331/

Use the iloc attribute of a DataFrame to get rows and/or columns by position, which behaves identically to list indexing.

cbp.iloc[2:4, 0:3]

  FIPSTATE FIPSCTY   NAICS
2       01     001  113///
3       01     001  1133//

The default indexing for a DataFrame, without using the loc or iloc attributes, is by column name.

cbp[['NAICS', 'AP']].head()

    NAICS      AP
0  ------  321433
1  11----    3566
2  113///    3551
3  1133//    3551
4  11331/    3551

The loc attribute also allows logical indexing, i.e. the use of a boolean array of appropriate length for the selected dimension. The subset of cbp with sector level NAICS codes can indexed with string matching.

logical_idx = cbp['NAICS'].str.match('[0-9]{2}----')
cbp = cbp.loc[logical_idx]
cbp.head()

   FIPSTATE FIPSCTY   NAICS EMPFLAG EMP_NF  EMP QP1_NF    QP1 AP_NF     AP   ...    N100_249  N250_499  N500_999  N1000  N1000_1  N1000_2  N1000_3  N1000_4  CENSTATE  CENCTY
1        01     001  11----     NaN      H   70      H    790     H   3566   ...           0         0         0      0        0        0        0        0        63       1
10       01     001  21----     NaN      H   82      H    713     H   3294   ...           0         0         0      0        0        0        0        0        63       1
17       01     001  22----     NaN      H  196      H   4793     H  18611   ...           0         0         0      0        0        0        0        0        63       1
27       01     001  23----     NaN      G  372      G   2891     G  13801   ...           0         0         0      0        0        0        0        0        63       1
93       01     001  31----     NaN      H  971      H  15386     H  64263   ...           0         0         1      0        0        0        0        0        63       1

[5 rows x 26 columns]

Creation of new variables, our old favorite the full FIPS identifier, is done by assignment to new column names, using another str method.

cbp['FIPS'] = cbp['FIPSTATE'].str.cat(cbp['FIPSCTY'])
cbp.head()

   FIPSTATE FIPSCTY   NAICS EMPFLAG EMP_NF  EMP QP1_NF    QP1 AP_NF     AP  ...    N250_499  N500_999  N1000  N1000_1  N1000_2  N1000_3  N1000_4  CENSTATE  CENCTY   FIPS
1        01     001  11----     NaN      H   70      H    790     H   3566  ...           0         0      0        0        0        0        0        63       1  01001
10       01     001  21----     NaN      H   82      H    713     H   3294  ...           0         0      0        0        0        0        0        63       1  01001
17       01     001  22----     NaN      H  196      H   4793     H  18611  ...           0         0      0        0        0        0        0        63       1  01001
27       01     001  23----     NaN      G  372      G   2891     G  13801  ...           0         0      0        0        0        0        0        63       1  01001
93       01     001  31----     NaN      H  971      H  15386     H  64263  ...           0         1      0        0        0        0        0        63       1  01001

[5 rows x 27 columns]

Index

The DataFrame object includes the notion of “primary keys” for a table in its Index construct. One or more columns that uniquely identify rows can be set aside from the remaining columns.

cbp = cbp.set_index(['FIPS', 'NAICS'])
cbp.head()

             FIPSTATE FIPSCTY EMPFLAG EMP_NF  EMP QP1_NF    QP1 AP_NF     AP  EST   ...    N100_249  N250_499  N500_999  N1000  N1000_1  N1000_2  N1000_3  N1000_4  CENSTATE  CENCTY
FIPS  NAICS                                                                         ...                                                                                             
01001 11----       01     001     NaN      H   70      H    790     H   3566    7   ...           0         0         0      0        0        0        0        0        63       1
      21----       01     001     NaN      H   82      H    713     H   3294    3   ...           0         0         0      0        0        0        0        0        63       1
      22----       01     001     NaN      H  196      H   4793     H  18611    9   ...           0         0         0      0        0        0        0        0        63       1
      23----       01     001     NaN      G  372      G   2891     G  13801   75   ...           0         0         0      0        0        0        0        0        63       1
      31----       01     001     NaN      H  971      H  15386     H  64263   24   ...           0         0         1      0        0        0        0        0        63       1

[5 rows x 25 columns]

Any hierarchical index can be “unstacked”, and all columns appropriately spread into a “wide” table layout.

cbp = cbp[['EMP', 'AP']].unstack(fill_value=0)

The last transformation kept only two variables for each FIPS and NAICS combination: the number of employees and the annual payroll (x $1000). By now, it may be obvious we are slowly working towards some goal!

The number of employees in just two sectors will serve as the set of variables (i.e. columns) by which we attempt to classify each county as “Metro” (i.e. urban) or not (i.e. rural). The first code is for Agriculture, Forestry, Fishing, and Hunting. The second is Accommodation and Food Services.

employment = cbp['EMP']
employment = employment.loc[:, ['11----', '72----']]
employment.head()

NAICS  11----  72----
FIPS                 
01001      70    2091
01003      14   12081
01005     180     594
01007      70     306
01009      10     679

Top of Section


Classification

The CBP dataset provides economic attributes of the three thousand and some US counties. The employment numbers are continuous attributes, and we expect those are somehow different between urban and rural counties.

We need a classifier suitable for:

  1. many continuous “predictors”
  2. a binary “response”

Continuous attributes paried with a binary class, which we hope to predict given a model and a set of training data, could be a problem for logistic regression. An alternative method, and a foundation of machine learning, is the Support Vector Machine.


Image by Alisneaky / CC BY-SA

The name arises from the main optimization task: choosing a subset of the training data attributes to be the “support vectors” (vector is another way of referring to a point in attribute space). The support vectors sit near (“support”) a barrier that more or less divides attribute space into regions dominated by a single class. The points sitting on the dashed lines above are the support vectors.

The USDA ERS classifies the counties of the US with a set of 9 Rural-Urban Continuum Codes. Joining these to the CBP attributes, and creating a binary “Metropolitan” class based on the the codes for urban-influenced areas, completes our dataset.

rural_urban = pd.read_csv(
    'data/ruralurbancodes2013.csv',
    dtype={'FIPS':'str'},
    ).set_index('FIPS')
rural_urban['Metro'] = rural_urban['RUCC_2013'] < 4

By default, the join will take place on the index, which serves as a primary key, for each table. It is a one-to-one join.

employment_rural_urban = employment.join(
    rural_urban['Metro'],
    how='inner',
    )

Only the single variable “Metro” is taken from the rural_urban table for use in the classification problem.

Model evaluation for SVMs and other machine learning methods lacking a probabilistic interpretion is based on subsetting data into “training” and “validation” sets.

import numpy as np

train = employment_rural_urban.sample(
    frac=0.5,
    random_state = np.random.seed(345))

Import the learning machine from sklearn, and pass the training data separately as the attributes (as X below) and class (as y).

from sklearn import svm
ml = svm.LinearSVC()

X = train.drop('Metro', axis=1).values[:, :2]
X = np.log(1 + X)
y = train['Metro'].values.astype(int)

ml.fit(X, y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

Why np.log(1 + X)? Remember, “whatever it takes” is the machine learning mantra. In this case the high skew of the employment numbers had attributes widely dispersed in attribute space, and the fit was worse (and harder to visualize). The offset with a positive number is necessary to keep counties with 0 employees in a sector.

Classifier Evaluation

The “confusion matrix” shows the largest numbers are on the diagonal—that is the number of metro and Non-metro data points correctly separated by the chosen support vectors.

from sklearn import metrics

metrics.confusion_matrix(y, ml.predict(X), (True, False))

array([[355, 238],
       [109, 869]])

A quick visualization using the mlxtend package shows the challeng of separating this attribute space!

from mlxtend.plotting import plot_decision_regions

plot_decision_regions(X, y, clf=ml, legend=2)
<matplotlib.axes._subplots.AxesSubplot at 0x7f8c7cd58358>

plot of ../images/svm_figure6_1.png

Kernel Method

The LinearSVM is rarely used in practice despite its speed—rare is the case when a hyperplane cleanly separates the attributes. The more general SVC machine accepts multiple kernel options that provide great flexibility in the shape of the barrier (no longer a hyperplane).

ml = svm.SVC(kernel = 'rbf', C=1, gamma='auto')
ml.fit(X, y)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

The improvement in the confusion matrix is slight—we have not tuned the gamma value, which is automatically chosen based on the size of the dataset.

metrics.confusion_matrix(y, ml.predict(X), (True, False))

array([[322, 271],
       [ 35, 943]])

More important is to understand the nature of the separation, which can “wrap around” the attribute space as necessary.

plot_decision_regions(X, y, clf=ml, legend=2)
<matplotlib.axes._subplots.AxesSubplot at 0x7f8c7d430240>

plot of ../images/svm_figure9_1.png

Top of Section


Summary

Python is said to have a gentle learning curve, but all new languages take practice.

Key concepts covered in this lesson include:

Additional critical packages for data science:

Top of Section


Exercises

Exercise 1

Explore the use of in to test membership in a list. Create a list of multiple integers, and use in to test membership of some other numbers in your list.

View solution

Exercise 2

Based on what we have learned so far about lists and dictionaries, think up a data structure suitable for an address book of names and emails. Now create it! Enter the name and email address for yourself and your neighbor in a new variable called addr.

View solution

Exercise 3

Write a for-loop that prints all even numbers between 1 and 9. Use the modulo operator (%) to check for evenness: if i is even, then i % 2 returns 0, because % gives the remainder after division of the first number by the second.

View solution

Exercise 4

Tune the gamma parameter for the Support Vector Machine from above that used the rbf kernel method. It can take any postivie value. Can you improve the confusion matrix? How does gamma appear to affect the regions seen with plot_decision_regions? Does the consuion matrix look okay out-of-sample (i.e. making predictions for the counties not used for training)?

Solutions

Solution 1

answers = [2, 15, 42, 19]
42 in answers
True

Return

Solution 2

addr = [
  {'name': 'Alice', 'email': 'alice@gmail.com'},
  {'name': 'Bob', 'email': 'bob59@aol.com'},
]

Return

Solution 3

for i in range(1, 9):
  if i % 2 == 0:
    print(i)
2
4
6
8

Return

Top of Section


If you need to catch-up before a section of code will work, just squish it's 🍅 to copy code above it into your clipboard. Then paste into your interpreter's console, run, and you'll be ready to start in on that section. Code copied by both 🍅 and 📋 will also appear below, where you can edit first, and then copy, paste, and run again.

# Nothing here yet!