Extracting and Mining Texts for Structured Data

Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.


Structured Data

Structured data is a collection of multiple observations, each composed of one or more variables. Analyses typically begin with structured data, the kind of tables you can view as a spreadsheet.

The key to structure is that information is packaged into well-defined variables, e.g. the columns of a tidy data frame. Typically, it took someone a lot of effort to get information into a useful structure.

Well-defined variables


Cox, M. 2015. Ecology & Society 20(1):63.

Variable classification

A variable should fit within one of four categories, notwithstanding the additional specification of ‘data types’ to use when measuring a given variable.

Category               Definition
Interval (or Numeric)  Values separated by meaningful intervals
Ordered                Ordered values without “distance” between them
Categorical            Finite set of distinct, un-ordered values
Qualitative            Unlimited, discrete, and un-ordered possibilities

What we call quantitative data is actually any one of the first three.

Question
What is one example of each of the three types of quantitative data (interval, ordered, and categorical) a biological survey might produce?
Answer
For example, a fisheries survey might record size, age class (juvenile, adult), and species.
Question
What is an example of qualitative data the same biological survey might collect?
Answer
Surveys often collect descriptive data, e.g. description of micro-habitat where an organism was found.
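
In R, these four categories map onto familiar column types. A minimal sketch of the fisheries survey example above (all names and values hypothetical):

survey <- data.frame(
  size = c(12.5, 8.3, 21.0),                        # interval/numeric
  stage = factor(c('juvenile', 'adult', 'adult'),
                 levels = c('juvenile', 'adult'),
                 ordered = TRUE),                   # ordered
  species = factor(c('trout', 'salmon', 'trout')),  # categorical
  microhabitat = c('under cobble', 'pool margin',
                   'beneath undercut bank'),        # qualitative description
  stringsAsFactors = FALSE
)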

Unstructured Data

Information that has not been carved up into variables is unstructured “data” – though some say that’s a misnomer. Any field researcher knows the feeling of staring raw information in the face while puzzling over how to collect data.


Photo by trinisands / CC BY-SA and by Archives New Zealand / CC BY

Suppose you want to collect data on how businesses fail, so you download half a million e-mails from Enron executives that preceded the energy company’s collapse in 2001.

Message-ID: <16986095.1075852351708.JavaMail.evans@thyme>
Date: Mon, 3 Sep 2001 12:24:09 -0700 (PDT)
From: greg.whalley@enron.com
To: kenneth.lay@enron.com, j..kean@enron.com
Subject: FW: Management Committee Offsite

I'm sorry I haven't been more involved is setting this up, but I think the agenda looks kond of soft.  At a minimum, I would like to turn the schedule around and hit the hard subjects like Q3, risk management, and ...

Structuring the data for analysis does not mean you quantify everything, although certainly some information can be quantified. Rather, turning unstructured information into structured data is a process of identifying concepts, defining variables, and assigning their values (i.e. taking measurements) from the textual, audio, or video content.

Possible examples for variables of different classes to associate with the Enron e-mails.

Category               Example
Interval (or Numeric)  timestamp, e-mail length, occurrences of a given topic
Ordered                sender’s position in the company, position in process-tracing sequence of events
Categorical            sender’s department in the company, sender-recipient network connections
Qualitative            message topics, sentiment
Question
What distinguishes qualitative data from unstructured information?
Answer
It is the measurement of a variable that relates to a well-defined concept.
It is qualitative, i.e. categorical, un-ordered, and able to take any value.

Processing of texts, surveys, recordings, etc. into variables (whether qualitative or not) is often described as qualitative data analysis.

Help from a computer

These are different ways of performing “feature engineering”, which requires both domain knowledge and programming skill. The feature engineer faces the dual challenges of linking concepts to variables and of creating structured data about these variables from a source of raw information.



Scraping

Comic by Randall Munroe / CC BY-NC

Regular expressions (RegEx) are a very flexible, and very fast, notation for matching patterns in text.

Pattern        String with match
Subject:.*     Subject: Re: TPS Reports
\$[0-9,]+      The ransom of $1,000,000 to Dr. Evil.
\b\S+@\S+\b    E-mail info@sesync.org or tweet @SESYNC for details!

Note that “\” must be escaped in R strings, so the second pattern would be scripted as "\\$[0-9,]+".
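
You can try these patterns in R with the stringr package; a quick sketch using the example strings from the table above (note the doubled backslashes):

library(stringr)

str_extract("Subject: Re: TPS Reports", "Subject:.*")                          # "Subject: Re: TPS Reports"
str_extract("The ransom of $1,000,000 to Dr. Evil.", "\\$[0-9,]+")             # "$1,000,000"
str_extract("E-mail info@sesync.org or tweet @SESYNC for details!",
            "\\b\\S+@\\S+\\b")                                                 # "info@sesync.org"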

Continuing with the Enron e-mails theme, begin by bringing the documents into an analysis with the tm package.

library(tm)
library(SnowballC)

# read every file in data/enron into a corpus, one document per file
docs <- VCorpus(DirSource("data/enron"))
> meta(docs[[1]])
  author       : character(0)
  datetimestamp: 2018-08-28 20:57:09
  description  : character(0)
  heading      : character(0)
  id           : 10001529.1075861306591.txt
  language     : en
  origin       : character(0)
> content(docs[[1]])
 [1] "Message-ID: <10001529.1075861306591.JavaMail.evans@thyme>"                                        
 [2] "Date: Wed, 7 Nov 2001 13:58:24 -0800 (PST)"                                                       
 [3] "From: dutch.quigley@enron.com"                                                                    
 [4] "To: frthis@aol.com"                                                                               
 [5] "Subject: RE: seeing as mark won't answer my e-mails...."                                          
 [6] "Mime-Version: 1.0"                                                                                
 [7] "Content-Type: text/plain; charset=us-ascii"                                                       
 [8] "Content-Transfer-Encoding: 7bit"                                                                  
 [9] "X-From: Quigley, Dutch </O=ENRON/OU=NA/CN=RECIPIENTS/CN=DQUIGLE>"                                 
[10] "X-To: 'Frthis@aol.com@ENRON'"                                                                     
[11] "X-cc: "                                                                                           
[12] "X-bcc: "                                                                                          
[13] "X-Folder: \\DQUIGLE (Non-Privileged)\\Quigley, Dutch\\Sent Items"                                 
[14] "X-Origin: Quigley-D"                                                                              
[15] "X-FileName: DQUIGLE (Non-Privileged).pst"                                                         
[16] ""                                                                                                 
[17] "yes please on the directions"                                                                     
[18] ""                                                                                                 
[19] ""                                                                                                 
[20] " -----Original Message-----"                                                                      
[21] "From: \tFrthis@aol.com@ENRON  "                                                                   
[22] "Sent:\tWednesday, November 07, 2001 3:57 PM"                                                      
[23] "To:\tsiva66@mail.ev1.net; MarkM@cajunusa.com; Wolphguy@aol.com; martier@cpchem.com; klyn@pdq.net" 
[24] "Cc:\tRs1119@aol.com; Quigley, Dutch; john_riches@msn.com; jramirez@othon.com; bwdunlavy@yahoo.com"
[25] "Subject:\tRe: seeing as mark won't answer my e-mails...."                                         
[26] ""                                                                                                 
[27] "Kingwood Cove it is! "                                                                            
[28] "Sunday "                                                                                          
[29] "Tee Time(s):  8:06 and 8:12 "                                                                     
[30] "Cost - $33 (includes cart) - that will be be $66 for Mr. 2700 Huevos. "                           
[31] "ernie "                                                                                           
[32] "Anyone need directions?"                                                                          

The regex pattern ^From: .* matches any whole line that begins with “From: ”. Parentheses cause parts of the match to be captured for substitution or extraction.

library(stringr)

txt <- content(docs[[1]])[1:16]
str_match(txt, '^From: (.*)')
      [,1]                            [,2]                     
 [1,] NA                              NA                       
 [2,] NA                              NA                       
 [3,] "From: dutch.quigley@enron.com" "dutch.quigley@enron.com"
 [4,] NA                              NA                       
 [5,] NA                              NA                       
 [6,] NA                              NA                       
 [7,] NA                              NA                       
 [8,] NA                              NA                       
 [9,] NA                              NA                       
[10,] NA                              NA                       
[11,] NA                              NA                       
[12,] NA                              NA                       
[13,] NA                              NA                       
[14,] NA                              NA                       
[15,] NA                              NA                       
[16,] NA                              NA                       

Extract structured data

The meta object for each e-mail was sparsely populated, but some of those variables can be extracted from the content.

for (i in seq(docs)) {
  txt <- content(docs[[i]])
  match <- str_match(txt, '^From: (.*)')
  row <- !is.na(match[ , 1])              # the line bearing the "From:" header
  from <- match[row, 2]                   # the captured e-mail address
  meta(docs[[i]], "author") <- from[[1]]
}
> meta(docs[[1]])
  author       : dutch.quigley@enron.com
  datetimestamp: 2018-08-28 20:57:09
  description  : character(0)
  heading      : character(0)
  id           : 10001529.1075861306591.txt
  language     : en
  origin       : character(0)
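
The same match-and-capture recipe fills other empty slots in the metadata. For example, a sketch (assuming each message has one "Subject: " header line, as these do) that copies the subject into the heading slot:

for (i in seq(docs)) {
  txt <- content(docs[[i]])
  match <- str_match(txt, '^Subject: (.*)')
  row <- !is.na(match[ , 1])
  # keep the first match, in case a quoted reply repeats the header
  meta(docs[[i]], "heading") <- match[row, 2][[1]]
}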



Extracting relational data

Relational data are tables that establish relationships between entities recorded in other tables. Suppose we have a table with a record for each unique address in the Enron e-mails; a second table with a record for each pair of addresses that exchanged a message makes the data relational.

> doc <- docs[[2]]
> content(doc)[1:6]
[1] "Message-ID: <10016327.1075853078441.JavaMail.evans@thyme>"               
[2] "Date: Mon, 20 Aug 2001 16:14:45 -0700 (PDT)"                             
[3] "From: lynn.blair@enron.com"                                              
[4] "To: ronnie.brickman@enron.com, randy.howard@enron.com"                   
[5] "Subject: RE: Liquids in Region"                                          
[6] "Cc: ld.stephens@enron.com, team.sublette@enron.com, w.miller@enron.com, "

The “To:” field is slightly harder to extract, because it may list multiple recipients and continue across multiple lines.

match <- str_match(content(doc), '^Subject:')
subject <- which(!is.na(match))
# the "To:" field spans line 4 up to (not including) the "Subject:" line
to <- paste(content(doc)[4:(subject[1] - 1)], collapse='')
to_list <- str_extract_all(to, '\\b\\S+@\\S+\\b')
> to_list
[[1]]
[1] "ronnie.brickman@enron.com" "randy.howard@enron.com"   

Embed the above lines in a for loop to build an edge list for the network of e-mail senders and recipients.

edgelist <- NULL
for (i in seq(docs)) {
  doc <- docs[[i]]
  from <- meta(doc, 'author')
  subject <- which(!is.na(str_match(content(doc), '^Subject:')))
  to <- paste(content(doc)[4:(subject[1] - 1)], collapse='')
  to_list <- str_extract_all(to, '\\b\\S+@\\S+\\b')
  # one row per sender-recipient pair in this message
  edges <- t(rbind(from, to_list[[1]]))
  edgelist <- rbind(edgelist, edges)
}
> dim(edgelist)
[1] 10431     2

The network package provides convenient tools for working with relational data.

library(network)

g <- network(edgelist)
plot(g)
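
A couple of quick checks on the size of the resulting graph, where vertices are unique addresses and edges are sender-recipient pairs:

network.size(g)       # number of unique addresses (vertices)
network.edgecount(g)  # number of sender-recipient pairs (edges)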

Question
Is a network qualitative or quantitative data?
Answer
It certainly doesn’t fall into line with traditional statistical methods, but the variables involved are categorical. Methods for fitting models on networks (e.g. ERGMs) are an active research area.

Scraping online

Scraping websites for data that, like the addresses in the Enron e-mails, are already stored as well-defined variables is a similar process. The structured data are in there; you just have to extract them. The httr package can assist with downloading web page content into R as a navigable HTML document.
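
A minimal sketch with httr and the xml2 package (the URL is hypothetical; substitute a real page):

library(httr)
library(xml2)

response <- GET('https://example.com/staff-directory.html')
page <- content(response)  # HTML responses parse into a navigable xml_document
# XPath then extracts well-defined variables, e.g. the text of every table cell
xml_text(xml_find_all(page, '//td'))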



Text mining

Developing measurements of quantitative variables from unstructured information is another component of the “field-work” in research projects that rely on texts for empirical observations.

Isolate unstructured information

Assuming the structured data in the Enron e-mail headers have been captured, strip the content down to the unstructured message.

for (i in seq(docs)) {
  lines <- content(docs[[i]])
  # the message body begins after the last header line, "X-FileName:"
  header_last <- str_match(lines, '^X-FileName:')
  header_last <- which(!is.na(header_last))
  message_begin <- header_last[[1]] + 1
  # and ends before any quoted "Original Message" that follows
  repeat_first <- str_match(lines, '--Original Message--')
  repeat_first <- which(!is.na(repeat_first))
  message_end <- c(repeat_first - 1, length(lines))[[1]]
  content(docs[[i]]) <- lines[message_begin:message_end]
}
> content(docs[[2]])
[1] ""                                                                                             
[2] "\tRonnie, I just got back from vacation and wanted to follow up on the discussion below."     
[3] "\tHave you heard back from Jerry?  Do you need me to try calling Delaine again?  Thanks. Lynn"
[4] ""                                                                                             

Functions for cleaning strings

These are some of the functions listed by getTransformations.

clean_docs <- docs
clean_docs <- tm_map(clean_docs, removePunctuation)
clean_docs <- tm_map(clean_docs, removeNumbers)
clean_docs <- tm_map(clean_docs, stripWhitespace)
> content(clean_docs[[2]])
[1] ""                                                                                       
[2] " Ronnie I just got back from vacation and wanted to follow up on the discussion below"  
[3] " Have you heard back from Jerry Do you need me to try calling Delaine again Thanks Lynn"
[4] ""                                                                                       

Additional transformations using base R functions can be applied within a content_transformer wrapper.

clean_docs <- tm_map(clean_docs, content_transformer(tolower))
> content(clean_docs[[2]])
[1] ""                                                                                       
[2] " ronnie i just got back from vacation and wanted to follow up on the discussion below"  
[3] " have you heard back from jerry do you need me to try calling delaine again thanks lynn"
[4] ""                                                                                       

Customize document preparation with your own functions. The function must be wrapped in content_transformer if designed to accept and return strings rather than PlainTextDocuments.

collapse <- function(x) {
  paste(x, collapse = '')
}
clean_docs <- tm_map(clean_docs, content_transformer(collapse))  
> content(clean_docs[[2]])
[1] " ronnie i just got back from vacation and wanted to follow up on the discussion below have you heard back from jerry do you need me to try calling delaine again thanks lynn"

Stopwords and stems

Stopwords are the throwaway words that don’t inform content, and lists for different languages are compiled within tm. Before removing them though, also “stem” the current words to remove plurals and other nuisances.

clean_docs <- tm_map(clean_docs, stemDocument)
clean_docs <- tm_map(clean_docs, removeWords, stopwords("english"))
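
To preview what stemming does, SnowballC’s wordStem function can be applied directly to individual words; a quick illustration:

wordStem(c('fishing', 'fished', 'fisheries'))  # "fish" "fish" "fisheri"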

Create a bag-of-words matrix

dtm <- DocumentTermMatrix(clean_docs)
> as.matrix(dtm[1:6, 1:6])
                            Terms
Docs                         aaa aaron abacus abandon abb abba
  10001529.1075861306591.txt   0     0      0       0   0    0
  10016327.1075853078441.txt   0     0      0       0   0    0
  10025954.1075852266012.txt   0     0      0       0   0    0
  10029353.1075861906556.txt   0     0      0       0   0    0
  10042065.1075862047981.txt   0     0      0       0   0    0
  10050267.1075853166280.txt   0     0      0       0   0    0

Outlying documents, those much longer or shorter than a typical message, may reduce the density of the matrix of term occurrences in each document.

# distribution of message lengths, on a log scale
char <- sapply(clean_docs, function(x) nchar(content(x)))
hist(log10(char))

inlier <- function(x) {
  n <- nchar(content(x))
  n < 10^3 & n > 10  # keep messages between 10 and 1,000 characters
}
clean_docs <- tm_filter(clean_docs, inlier)
dtm <- DocumentTermMatrix(clean_docs)
dense_dtm <- removeSparseTerms(dtm, 0.999)  # drop terms missing from over 99.9% of documents
dense_dtm <- dense_dtm[rowSums(as.matrix(dense_dtm)) > 0, ]  # drop now-empty documents
> as.matrix(dense_dtm[1:6, 1:6])
                            Terms
Docs                         abil abl abov absolut accept access
  10001529.1075861306591.txt    0   0    0       0      0      0
  10016327.1075853078441.txt    0   0    0       0      0      0
  10025954.1075852266012.txt    0   0    0       0      0      0
  10029353.1075861906556.txt    0   0    0       0      0      0
  10042065.1075862047981.txt    0   0    0       0      0      0
  10050267.1075853166280.txt    0   0    0       0      0      0

Term correlations

The findAssocs function checks columns of the document-term matrix for correlations.

assoc <- findAssocs(dense_dtm, 'ken', 0.2)
> assoc
$ken
  lay court 
 0.45  0.21 
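
The strongest association is no surprise: “ken” goes with “lay”, as in Kenneth Lay, the Enron CEO addressed in the first e-mail above.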

Latent Dirichlet allocation

The LDA algorithm is conceptually similar to dimensionality reduction techniques for numerical data, such as PCA. However, LDA requires you to determine the number of “topics” in a corpus beforehand, while PCA allows you to choose the number of principal components afterward, based on their loadings.

library(topicmodels)

seed <- 12345
fit <- LDA(dense_dtm, k = 4, control = list(seed = seed))
> terms(fit, 20)
      Topic 1    Topic 2     Topic 3   Topic 4    
 [1,] "will"     "thank"     "will"    "can"      
 [2,] "get"      "lynn"      "thank"   "meet"     
 [3,] "ani"      "pleas"     "know"    "thank"    
 [4,] "look"     "let"       "can"     "know"     
 [5,] "let"      "get"       "pleas"   "work"     
 [6,] "know"     "like"      "want"    "will"     
 [7,] "need"     "agreement" "need"    "question" 
 [8,] "think"    "master"    "like"    "week"     
 [9,] "price"    "parti"     "ani"     "pleas"    
[10,] "email"    "just"      "group"   "lynn"     
[11,] "just"     "need"      "work"    "enron"    
[12,] "time"     "call"      "lynn"    "trade"    
[13,] "question" "back"      "dont"    "use"      
[14,] "market"   "execut"    "just"    "send"     
[15,] "send"     "receiv"    "see"     "get"      
[16,] "lynn"     "want"      "talk"    "may"      
[17,] "enron"    "servic"    "michell" "hope"     
[18,] "call"     "take"      "time"    "veri"     
[19,] "new"      "offic"     "make"    "schedul"  
[20,] "one"      "enron"     "day"     "agreement"

The topic “weights” can be assigned back to the documents for use in future analyses.

topics <- posterior(fit, dense_dtm)$topics
topics <- as.data.frame(topics)
# labels interpreted by eye from the top terms in each topic
colnames(topics) <- c('accounts', 'meeting', 'call', 'legal')
> head(topics)
                            accounts   meeting      call     legal
10001529.1075861306591.txt 0.2470315 0.2497333 0.2549814 0.2482538
10016327.1075853078441.txt 0.2505420 0.2562433 0.2522828 0.2409319
10025954.1075852266012.txt 0.2509041 0.2477392 0.2492562 0.2521004
10029353.1075861906556.txt 0.2473255 0.2525143 0.2506372 0.2495230
10042065.1075862047981.txt 0.2483039 0.2513652 0.2464095 0.2539213
10050267.1075853166280.txt 0.2439128 0.2486203 0.2573323 0.2501346
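
The weights also support simple derived variables; for example, a sketch labeling each message with its highest-weight topic:

# assign each document its dominant topic label
dominant <- colnames(topics)[max.col(as.matrix(topics))]
table(dominant)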



Content analysis

RQDA
A GUI tool (like NVivo, Atlas.ti) to assist manual coding of text.


