# Extracting and Mining Texts for Structured Data

Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.

## Structured Data

Structured data is a collection of multiple observations, each composed of one or more variables. Most analyses begin with structured data, the kind of tables you can view as a spreadsheet.

The key to structure is that information is packaged into well-defined variables, e.g. the columns of a tidy data frame. Typically, someone invested a lot of effort to get information into that useful structure.
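A minimal sketch in R of what such a structure looks like, with made-up survey values:

```r
# A minimal sketch of structured data: each row is an observation and
# each column is a well-defined variable (values are made up)
surveys <- data.frame(
  site = c("A", "A", "B"),
  species = c("M. musculus", "P. leucopus", "M. musculus"),
  mass_g = c(19.2, 22.1, 17.8)
)
surveys
```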

## Well-defined variables

Figure from Cox, M. 2015. Ecology & Society 20(1): 63.

## Variable classification

A variable should fit within one of four categories, notwithstanding the additional specification of 'data types' used when measuring a given variable.

| Category | Definition |
|----------|------------|
| Interval (or Numeric) | Values separated by meaningful intervals |
| Ordered | Ordered values without "distance" between them |
| Categorical | Finite set of distinct, un-ordered values |
| Qualitative | Unlimited, discrete, and un-ordered possibilities |

What we call quantitative data is actually any one of the first three.
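As a sketch of how the four categories might map onto R data types, consider a hypothetical fisheries survey (all column names and values are made up for illustration):

```r
# A hypothetical survey record illustrating all four categories
fish <- data.frame(
  length_cm = c(12.4, 31.0, 22.5),                   # interval (numeric)
  age_class = factor(c("juvenile", "adult", "adult"),
                     levels = c("juvenile", "adult"),
                     ordered = TRUE),                 # ordered
  species = factor(c("bass", "carp", "bass")),        # categorical
  notes = c("near shore", "deep pool", "under log")   # qualitative
)
str(fish)
```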

Question
What is one example of each of the three types of quantitative data (interval, ordered, and categorical) a biological survey might produce?
For example, a fisheries survey might record size, age class (juvenile, adult), and species.

Question
What is an example of qualitative data the same biological survey might collect?
Surveys often collect descriptive data, e.g. a description of the micro-habitat where an organism was found.

## Unstructured Data

Information that has not been carved up into variables is unstructured "data", though some say that's a misnomer. Any field researcher knows the feeling of looking raw information in the face and puzzling over how to collect data from it.

Photo by trinisands / CC BY-SA and by Archives New Zealand / CC BY

Suppose you want to collect data on how businesses fail, so you download half a million e-mails from Enron executives that preceded the energy company's collapse in 2001.

```
Message-ID: <16986095.1075852351708.JavaMail.evans@thyme>
Date: Mon, 3 Sep 2001 12:24:09 -0700 (PDT)
From: greg.whalley@enron.com
To: kenneth.lay@enron.com, j..kean@enron.com
Subject: FW: Management Committee Offsite

I'm sorry I haven't been more involved is setting this up, but I think the agenda looks kind of soft.  At a minimum, I would like to turn the schedule around and hit the hard subjects like Q3, risk management, and ...
```

Structuring the data for analysis does not mean you quantify everything, although certainly some information can be quantified. Rather, turning unstructured information into structured data is a process of identifying concepts, defining variables, and assigning their values (i.e. taking measurements) from the textual, audio, or video content.

Possible examples of variables of different classes to associate with the Enron e-mails:

| Category | Example |
|----------|---------|
| Interval (or Numeric) | timestamp, e-mail length, occurrences of a given topic |
| Ordered | sender's position in the company, position in process-tracing sequence of events |
| Categorical | sender's department in the company, sender-recipient network connections |
| Qualitative | message topics, sentiment |
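As a rough sketch of where this process ends up, each e-mail could become one row in a data frame. The variable names and coded values below are hypothetical, based on the message shown above:

```r
# A hypothetical structured record for the message shown above;
# department and topic are made-up coded values
emails <- data.frame(
  timestamp = as.POSIXct("2001-09-03 12:24:09", tz = "US/Pacific"),
  sender = "greg.whalley@enron.com",
  department = factor("executive"),  # categorical (made-up value)
  topic = "meeting agenda"           # qualitative (made-up value)
)
emails
```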
Question
What distinguishes qualitative data from unstructured information?
It is the measurement of a variable that relates to a well-defined concept. It is qualitative, i.e. categorical, un-ordered, and taking any value.

Processing of texts, surveys, recordings, etc. into variables (whether qualitative or not) is often described as qualitative data analysis.

## Help from a computer

- Scraping
  - Process digitized information (websites, texts, images, recordings) into structured data.
  - e.g. capture sender, date, and greeting from a batch of e-mails as variables in a data frame.
- Text mining
  - Processing text on the way to producing qual/quant data (i.e. this overlaps with scraping).
  - e.g. a bag-of-words matrix (see the sketch after the next paragraph)
- Coding
  - Annotating a document collection with shared themes, sometimes called Computer Assisted Qualitative Data Analysis (CAQDA).
  - e.g. manually labelling sections of each e-mail with [relational] codes/themes
- Topic modeling
  - An algorithmic approach to coding extensive document collections.
  - e.g. latent Dirichlet allocation (LDA)

These are different ways of performing "feature engineering", which requires both domain knowledge and programming skill. The feature engineer faces the dual challenges of linking concepts to variables and of creating structured data about these variables from a source of raw information.
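For example, the bag-of-words matrix mentioned above can be sketched with the tm package (one common choice, not necessarily the only one); here two toy sentences stand in for a real e-mail corpus:

```r
# A minimal bag-of-words sketch using the tm package; the two toy
# "documents" stand in for a real e-mail corpus
library(tm)

docs <- VCorpus(VectorSource(c(
  "I think the agenda looks kind of soft.",
  "Turn the schedule around and hit the hard subjects."
)))
dtm <- DocumentTermMatrix(docs, control = list(
  removePunctuation = TRUE,
  stopwords = TRUE
))
inspect(dtm)  # rows are documents, columns are term counts
```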


## Scraping

Comic by Randall Munroe / CC BY-NC

Regular expressions ("RegEx") are a flexible and fast tool for matching patterns in text.

| Pattern | String with match |
|---------|-------------------|
| `Subject:.*` | Subject: Re: TPS Reports |
| `\$[0-9,]+` | The ransom of $1,000,000 to Dr. Evil. |
| `\b\S+@\S+\b` | E-mail info@sesync.org or tweet @SESYNC for details! |
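In R, the base functions regexpr() and regmatches() can apply these patterns; a quick sketch using the example strings above (note that backslashes are doubled inside R string literals):

```r
# Extract the first match of each pattern from the example strings
strings <- c(
  "Subject: Re: TPS Reports",
  "The ransom of $1,000,000 to Dr. Evil.",
  "E-mail info@sesync.org or tweet @SESYNC for details!"
)

regmatches(strings[1], regexpr("Subject:.*", strings[1]))
regmatches(strings[2], regexpr("\\$[0-9,]+", strings[2]))
regmatches(strings[3], regexpr("\\b\\S+@\\S+\\b", strings[3]))
```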


## Latent Dirichlet allocation

The LDA algorithm is conceptually similar to dimensionality reduction techniques for numerical data, such as PCA. However, LDA requires you to determine the number of "topics" in a corpus beforehand, while PCA allows you to choose the number of principal components afterward based on their loadings.

```r
library(topicmodels)

# Fit a 4-topic LDA model to the document-term matrix built earlier
# in the lesson; fixing the seed makes the result reproducible.
seed <- 12345
fit <- LDA(dense_dtm, k = 4, control = list(seed = seed))
```
```r
terms(fit, 20)
```

```
      Topic 1    Topic 2     Topic 3   Topic 4
 [1,] "will"     "thank"     "will"    "can"
 [2,] "get"      "lynn"      "thank"   "meet"
 [3,] "ani"      "pleas"     "know"    "thank"
 [4,] "look"     "let"       "can"     "know"
 [5,] "let"      "get"       "pleas"   "work"
 [6,] "know"     "like"      "want"    "will"
 [7,] "need"     "agreement" "need"    "question"
 [8,] "think"    "master"    "like"    "week"
 [9,] "price"    "parti"     "ani"     "pleas"
[10,] "email"    "just"      "group"   "lynn"
[11,] "just"     "need"      "work"    "enron"
[12,] "time"     "call"      "lynn"    "trade"
[13,] "question" "back"      "dont"    "use"
[14,] "market"   "execut"    "just"    "send"
[15,] "send"     "receiv"    "see"     "get"
[16,] "lynn"     "want"      "talk"    "may"
[17,] "enron"    "servic"    "michell" "hope"
[18,] "call"     "take"      "time"    "veri"
[19,] "new"      "offic"     "make"    "schedul"
[20,] "one"      "enron"     "day"     "agreement"
```

The topic “weights” can be assigned back to the documents for use in future analyses.

```r
# Assign the per-document topic weights back to the documents and
# give the four topics descriptive labels
topics <- posterior(fit, dense_dtm)$topics
topics <- as.data.frame(topics)
colnames(topics) <- c('accounts', 'meeting', 'call', 'legal')
```
```
                            accounts   meeting      call     legal
10001529.1075861306591.txt 0.2470315 0.2497333 0.2549814 0.2482538
10016327.1075853078441.txt 0.2505420 0.2562433 0.2522828 0.2409319
10025954.1075852266012.txt 0.2509041 0.2477392 0.2492562 0.2521004
10029353.1075861906556.txt 0.2473255 0.2525143 0.2506372 0.2495230
10042065.1075862047981.txt 0.2483039 0.2513652 0.2464095 0.2539213
10050267.1075853166280.txt 0.2439128 0.2486203 0.2573323 0.2501346
```
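One possible next step, not part of the lesson's own code: tag each e-mail with its highest-weight topic.

```r
# Label each document with the topic that carries the most weight,
# then count how many documents fall under each label
topics$dominant <- colnames(topics)[max.col(topics)]
table(topics$dominant)
```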


## Content analysis

RQDA is a GUI tool (like NVivo or Atlas.ti) to assist manual coding of text.
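RQDA runs as an R package; assuming it is installed (it depends on RGtk2), launching its coding interface is a single call:

```r
# Launch the RQDA graphical interface for coding documents
library(RQDA)
RQDA()
```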

