# Basic R

Lesson 1 with Ian Carroll

## Lesson Objectives

• Meet the R “Console”, “Editor”, and “Environment” within RStudio
• Understand that R “packages” extend it to the “bleeding-edge”
• Join a vast user community within statistics and ecology
• Learn what “free software” does for reproducible research

## Specific Achievements

• Use R interactively for data exploration
• Create an R script for non-interactive data crunching
• Perform general purpose programming operations

## What is R?

• Language: a vocabulary and a syntax (with lots of punctuation!)
• Interpreter: software that evaluates statements in the R language

Top of Section

## Console

The interpreter accepts commands interactively through the console.

Basic math, as you would type it on a calculator, is usually a valid command in the R language:

> 1 + 2

[1] 3

> 4^2

[1] 16

Question
Why is the output prefixed by [1]?
That’s the index, or position in a vector, of the first result.

A command giving a vector of results shows this clearly:

> seq(1, 100)

  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
[19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
[37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
[55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
[73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
[91]  91  92  93  94  95  96  97  98  99 100


The interpreter understands more than arithmatic operations. That last command told it to use (or “call”) the function seq().

Most of “learning R” involves getting to know a whole lot of functions, the effect of each function’s arguments (e.g. the input values 1 and 100), and what each function returns (e.g. the output vector).

## R as Calculator

A good place to begin learning R is with its built-in mathematical functionality.

## Arithmatic Operators

Try +, -, *, /, and ^ (for raising to a power).

> 5/3

[1] 1.666667


## Logical Tests

Test equality with == and inequality with <=, <, !=, >, or >=.

> 1/2 == 0.5

[1] TRUE


## Math Functions

Common mathematical functions like sin, log, and sqrt, exist along side some universal constants.

> sin(2 * pi)

[1] -2.449294e-16


## Generic Functions

Functions do more than math! Functions like ‘rep’, ‘sort’, and ‘range’ are pre-packaged instructions for processing user input.

> rep(2, 5)

[1] 2 2 2 2 2


## Parentheses

Sandwiching something with ( and ) has two possible meanings.

Group sub-expressions by parentheses where needed.

> (1 + 2) / 3

[1] 1


Call functions by typing their name and comma-separated arguments between parentheses.

> logb(2, 2)

[1] 1


Top of Section

## Environment

In the RStudio IDE, the environment tab displays any variables added to R’s vocabulary in the current session. In a brand new session, the R interpreter already recognizes many things, despite the environment being “empty”.

With an “empty” environment, the interpreter still recognizes:

• any number
• any string of characters
• nearly universal operators (e.g. + and /)
• operators specific to R (e.g. $ and %*%) • functions in “base R” To reference a number or function just type it in as above. To referece a string of characters, surround them in quotation marks. > 'ab.cd'  [1] "ab.cd"  Without quotation marks, the interpreter checks for things in the environment named ab.cd and doesn’t find anything: > ab.cd  Error in eval(expr, envir, enclos): object 'ab.cd' not found  Question Is it better to use ' or "? Answer Neither one is better. You will often encounter stylistic choices like this, so if you don’t have a personal preference try to mimic existing styles. ## Assignment You can expand the vocabulary known to the R interpreter by creating a new variable. Using the symbol <- is referred to as assignment: the output of any command to the right of <- gets the name given on its left. > x <- seq(0, 100)  You’ll notice that nothing prints to the console, because we assigned the output to a variable. We can print the value of x by evaluating it without assignment. > x   [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 [19] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 [37] 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 [55] 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 [73] 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 [91] 90 91 92 93 94 95 96 97 98 99 100  Assigning values to new variables (to the left of a <-) is the only time you can reference something previously unknown to the interpreter. All other commands must reference things already in the interpreter’s vocabulary. Once assigned to a variable, a value becomes known to R and you can refer to it in other commands. > plot(x, sin(x * 2 * pi / 100))  The environment is dynamic, but under your control! • Variables do not persist between sessions (unless loaded from .Rdata 😢) • Variables only change their value on re-assignment Top of Section ## Editor The console is for evaluating commands you don’t intend to keep or reuse. It’s useful for testing commands and poking around. The environment represents the state of a current session. The editor reads and writes files–it is where you head to compose your R scripts. R scripts are simple text files that contain code you intend to run again and again; code to process data, perform analyses, produce visualizations, and even generate reports. The editor and console work together in the RStudio IDE, which gives you multiple ways to send parts of the script you are editing to the console for immediate evaluation. Alternatively you can “source” the entire script or run it from a shell with Rscript. Open up “worksheet-1.R” in the editor, and follow along by replacing the ... placeholders with the code here. Then evalute just this line (Ctrl+Enter on Windows, ⌘+Enter on macOS) vals <- seq(1, 100)  Our call to the function seq could have been much more explicit. We could give the arguments by the names that seq is expecting. vals <- seq(from = 1, to = 100)  Run that code by moving your cursor anywhere within those two lines and clicking “Run”, or by using the keyboard shortcut Ctrl-Return or ⌘ Return. Question What’s an advantage of naming arguments? Answer One advantage is that you can put them in any order. A related advantage is that you can then skip some arguments, which is fine to do if each skipped argument has a default value. A third advantage is code readability, which you should always be conscious of while writing in the editor. ## Readability Code readability in the editor cuts both ways: sometimes verbosity is useful, sometimes it is cumbersome. The seq() function has an alternative form available when only the from and to arguments are needed. > 1:100   [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 [91] 91 92 93 94 95 96 97 98 99 100  The : operator should be used whenever possible because it replaces a common, cumbersome function call with an brief, intuitive syntax. Likewise, the assign function duplicates the functionallity of the <- symbol, but is never used when the simpler operator will suffice. ## Function documentation How would you get to know these properties and the names of a function’s arguments? > ?seq  How would you even know what function to call? > ??sequence  Top of Section ## Load Data We will use the function read.csv() to load data from a Comma Separated Value file. The essential argument for the function to read in data is the path to the file, other optional arguments adjust how the file is read. Additional file types can be read in using read.table(); in fact, read.csv() is a simple wrapper for the read.table() function having set some default values for some of the optional arguments (e.g. sep = ","). Use the assignment operator “<-“ to read data into a variable for subsequent operations. Type read.csv( into the editor and then press tab to see what arguments this function takes. Hovering over each item in the list will show a description of that argument from the help documentation about that function. Specify the values to use for an argument using the syntax name = value. storm <- read.csv('data/StormEvents.csv')  Question How does read.csv determine the field names? Answer The read.csv command assumes the first row in the file contains column names. Look at ?read.csv to see the default header = TRUE argument. What exactly that means is described down in the “Arguments” section. After reading in the “StormEvents.csv” file, you can explore what types of data are in each column with the str function, short for “structure”. > str(storm)  'data.frame': 100 obs. of 42 variables:$ BEGIN_YEARMONTH  : int  200604 200601 200601 200601 200601 200601 200601 200601 200601 200601 ...
$BEGIN_DAY : int 7 1 1 1 1 30 30 28 28 28 ...$ BEGIN_TIME       : int  1515 0 0 0 0 500 500 800 1400 800 ...
$END_YEARMONTH : int 200604 200601 200601 200601 200601 200601 200601 200601 200601 200601 ...$ END_DAY          : int  7 31 31 31 31 31 31 29 29 29 ...
$END_TIME : int 1515 2359 2359 2359 2359 1400 1400 1300 500 1600 ...$ EPISODE_ID       : int  207534 202408 202409 202409 202409 202394 202394 202395 202396 202397 ...
$EVENT_ID : int 5501658 5482479 5482480 5482481 5482482 5482324 5482325 5482326 5482327 5482328 ...$ STATE            : Factor w/ 19 levels "ALASKA","ARKANSAS",..: 6 3 17 17 17 3 3 3 3 3 ...
$STATE_FIPS : int 18 8 49 49 49 8 8 8 8 8 ...$ YEAR             : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
$MONTH_NAME : Factor w/ 4 levels "April","February",..: 1 3 3 3 3 3 3 3 3 3 ...$ EVENT_TYPE       : Factor w/ 14 levels "Avalanche","Coastal Flood",..: 11 4 4 4 4 14 14 13 14 14 ...
$CZ_TYPE : Factor w/ 2 levels "C","Z": 1 2 2 2 2 2 2 2 2 2 ...$ CZ_FIPS          : int  51 12 23 29 24 4 13 13 5 4 ...
$CZ_NAME : Factor w/ 63 levels "ALAMANCE","ANIMAS RIVER BASIN",..: 27 63 20 9 19 21 23 23 60 21 ...$ WFO              : Factor w/ 13 levels "AFC","DTX","GJT",..: 9 3 3 3 3 3 3 3 3 3 ...
$BEGIN_DATE_TIME : Factor w/ 53 levels "01-JAN-06 00:00:00",..: 14 1 1 1 1 52 52 49 50 49 ...$ CZ_TIMEZONE      : Factor w/ 6 levels "AST","CST","CST-6",..: 2 5 5 5 5 5 5 5 5 5 ...
$END_DATE_TIME : Factor w/ 55 levels "01-JAN-06 09:20:00",..: 13 55 55 55 55 54 54 51 49 52 ...$ INJURIES_DIRECT  : int  0 0 0 0 0 0 0 0 0 0 ...
$INJURIES_INDIRECT: int 0 0 0 0 0 0 0 0 0 0 ...$ DEATHS_DIRECT    : int  0 0 0 0 0 0 0 0 0 0 ...
$DEATHS_INDIRECT : int 0 0 0 0 0 0 0 0 0 0 ...$ DAMAGE_PROPERTY  : Factor w/ 6 levels "0","0.00K","200K",..: 5 NA NA NA NA NA NA NA NA NA ...
$DAMAGE_CROPS : Factor w/ 3 levels "0","0.00K","UNKNOWN": NA NA NA NA NA NA NA NA NA NA ...$ SOURCE           : Factor w/ 14 levels "AMATEUR RADIO",..: 7 8 8 8 8 2 2 2 14 2 ...
$MAGNITUDE : num 61 NA NA NA NA NA NA NA NA NA ...$ MAGNITUDE_TYPE   : Factor w/ 2 levels "EG","MG": 1 NA NA NA NA NA NA NA NA NA ...
$BEGIN_RANGE : int 4 NA NA NA NA NA NA NA NA NA ...$ BEGIN_AZIMUTH    : Factor w/ 8 levels "E","N","NE","NW",..: 1 NA NA NA NA NA NA NA NA NA ...
$BEGIN_LOCATION : Factor w/ 20 levels "ATLANTA","COLUMBUS",..: 13 NA NA NA NA NA NA NA NA NA ...$ END_RANGE        : int  NA NA NA NA NA NA NA NA NA NA ...
$END_AZIMUTH : Factor w/ 7 levels "E","N","NE","NW",..: NA NA NA NA NA NA NA NA NA NA ...$ END_LOCATION     : Factor w/ 19 levels "ATLANTA","COLUMBUS",..: 13 NA NA NA NA NA NA NA NA NA ...
$BEGIN_LAT : num 38.4 NA NA NA NA ...$ BEGIN_LON        : num  -87.5 NA NA NA NA ...
$END_LAT : num 38.3 NA NA NA NA ...$ END_LON          : num  -87.3 NA NA NA NA ...
$EPISODE_NARRATIVE: Factor w/ 37 levels "3 CHICKEN HOUSES DESTROYED.",..: NA 36 35 35 35 7 7 37 15 8 ...$ EVENT_NARRATIVE  : Factor w/ 6 levels "1.3 inches of rain fell in 30 minutes.",..: 2 NA NA NA NA NA NA NA NA NA ...

> df$ed  [1] college highschool college middle middle Levels: middle highschool college  ## Subsets Multiple parts of a data structure are similarly accessed using single square brackets: [ and ]. What goes between the brackets, to specify the positions or names of the desired subset, may be of multiple forms. Parts Result positives elements at given positions negatives given positions omitted logicals elements where the corresponding position is TRUE nothing all the elements days <- c( 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday') weekdays <- days[2:6] weekend <- days[c(1, 7)]  > weekdays  [1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"  > weekend  [1] "Sunday" "Saturday"  A logical test applied to a single column produces a vector of TRUE and FALSE values that’s the right length for subsetting the data. > df[df$ed == 'college', ]

       ed    in
1 college 32000
3 college 89000


Top of Section

## Functions

Functions package up a batch of commands. There are several reasons to develop functions in R for data analysis:

• reuse
• modularity
• consistency

Writing functions to use multiple times within a project prevents you from duplicating code, a real time-saver when you want to update what the function does. If you see blocks of similar lines of code through your project, those are usually candidates for being moved into functions.

## Anatomy of a function

Like all programming languages, R has keywords that are reserved for import activities, like creating functions. Keywords are usually very intuitive, the one we need is function.

function(...) {
...
return(...)
}


Three components:

• arguments: control how you can call the function
• body: the code inside the function
• return value: controls what output the function gives

We’ll make a function to extract the first row of its argument, which we give a name to use inside the function:

function(a) {
result <- a[1, ]
return(result)
}


Note that a doesn’t exist until we call the function, which merely contains the instructions for how any a will be handled.

Finally, we need to give the function a name so we can use it like we used c() and seq() above.

first <- function(a) {
result <- a[1, ]
return(result)
}

> first(df)

       ed    in
1 college 32000

Question
Can you explain the result of entering first(income) into the console?
The function caused an error, which prompted the interpreter to print a helpful error message. Never ignore an error message. (It’s okay to ignore a “warning”.)

Top of Section

## Flow Control

A generic term for causing the interpreter to repeat or skip certain lines, using concepts like “for loops” and conditionals.

The R interpreter’s “focus” flows through a script (or any section of code you run) line by line. Without additional instruction, every line is processed from the top to bottom. Flow control refers mostly to the two main ways of directing the interpreter’s focus, via loops and conditions.

Flow control happens within blocks of code isolated between curly braces { and }, known as “statements”.

if (...) {
...
} else {
...
}


The keyword if must be followed by a logical test which determines, at runtime, what to do next. The R interpreter goes to the first statement if the logical value is TRUE and to the second statement if it’s FALSE.

An if/else conditional would allow the first function to avoid the error thrown by calling first(counts).

first <- function(dat) {
if (is.vector(dat)) {
result <- dat[[1]]
} else {
result <- dat[1, ]
}
return(result)
}

> first(df)

       ed    in
1 college 32000

> first(income)

[1] 32000


Top of Section

## Distributions and Statistics

Since it was designed by statisticians, R can easily draw random numbers from probability distributions and calculate probabilities.

To generate random numbers from a normal distribution, use the function rnorm()

rnorm(n = 10)

 [1] -0.2370723 -1.4292213  0.8261260  1.6549967  0.7993259  1.0656807
[7] -0.3397285 -0.2609470 -0.2047979 -0.1227364

Function Returns Notes
rnorm() Draw random numbers from normal distribution Specify n, mean, sd
dnorm() Probability density at a given number
pnorm() Cumulative probability up to a given number left-tailed by default
qnorm() The quantile given a cumulative probability opposite of pnorm

Statistical distributions and their functions. See Table 14.1 in R for Everyone by Jared Lander for a full table.

Distribution Functions
Normal *norm
Binomial *binom
Poisson *pois
Gamma *gamma
Exponential *exp
Uniform *unif
Logistic *logis

R has built in functions for handling many statistical tests.

x <- rnorm(n = 100, mean = 15, sd = 7)
y <- rbinom(n = 100, size = 20, prob = .85)


The samples above are drawn from different distributions with different means. The T-Test should easilly distinguish them, although it does not check assumptions!

> t.test(x, y)


Welch Two Sample t-test

data:  x and y
t = -3.0895, df = 107.92, p-value = 0.00255
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.3562439 -0.7327582
sample estimates:
mean of x mean of y
15.2155   17.2600


Shapiro’s test of normality provides one routine for verifying assumptions.

> shapiro.test(y)


Shapiro-Wilk normality test

data:  y
W = 0.94755, p-value = 0.0005739


Top of Section

## Review

In this introduction to R, we touched on several key parts of scripting for data analysis.

• RStudio panes
• variable assignment
• data structures
• subsetting data
• functions
• flow control
• probabilities

## Special characters in R

Perhaps more than most languages, an R script can appear like a jumble of archaic symbols. Here is a little table of characters to recognize as having special meaning.

Symbol Meaning
? get help
# comment
: sequence
::, ::: access namespaces (advanced)
<- assignment
$, [ ], [[ ]] subsetting % % infix operators, e.g. %*% { } statements . @ slot (advanced) The . in R has no fixed meaning and is often used as _ might be used to separate words in a variable name. Top of Section ## Exercises ### Exercise 1 Use the quadratic formula to find $x$ that satisfies the equation $1.5 x^2 + 0.3 x - 2.9 = 0$. ### Exercise 2 By default, all character data is read in to a data.frame as factors. Use the read.csv() argument stringsAsFactors to suppress this behavior, then subsequently modify the STATE column in storm to make it a factor. Remember that columns of a data.frame are identified to the R interpreter with the $ operator, e.g. storm$STATE. ### Exercise 3 Use the typeof function to inspect the data type of income, and do the same for another variable to which you assign a list of numbers. Why are they different? Use c to combine income with the new variable you just created and inspect the result with typeof. Does c always create vectors? ### Exercise 4 Create a data frame with two columns, one called “species” with four strings and another called “abund” with four numbers. Store your data frame as a variable called data. ### Exercise 5 1. Get weekdays using negative integers. 2. Get M-W-F using a vector of postitions generated by seq() that uses the by argument (don’t forget to ?seq for help). ### Exercise 6 The keywords else and if can be combined to allow flow control among more than two statements, as below. Expand the first function once again to differentiate between dat provided as a matrix and as a data.frame. It’s up to you what the “first” element of a matrix should be! if (...) { ... } else if { ... } else { ... }  ## Solutions ### Solution 1 (-0.3 + sqrt(0.3 ^ 2 - 4 * 1.5 * -2.9)) / (2 * 1.5)  [1] 1.294035  ### Solution 2 storm <- read.csv( 'data/StormEvents.csv', stringsAsFactors = FALSE) storm$STATE <- factor(storm\$STATE)


### Solution 3

x <- list(3, 4, 5, 7)
typeof(x)

[1] "list"


The variable x has a data type of list, so R does not restrict its elements to a particular type as it does for vectors.

typeof(c(income, x))

[1] "list"


The result of combining a list and vector is a list, because the list is the more flexible data structure.

### Solution 4

species <- c('ape', 'bat', 'cat', 'dog')
abund <- 1:4
data <- data.frame(species, abund)


### Solution 5

days[c(-1, -7)]

[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"

days[seq(2, 7, 2)]

[1] "Monday"    "Wednesday" "Friday"


### Solution 6

first <- function(dat) {
if (is.vector(dat)) {
result <- dat[[1]]
} else if (is.matrix(dat)) {
result <- dat[[1, 1]]
} else {
result <- dat[1, ]
}
return(result)
}

m <- matrix(1:9, nrow = 3, ncol = 3)
first(m)

[1] 1


Top of Section

If you need to catch-up before a section of code will work, just squish it's 🍅 to copy code above it into your clipboard. Then paste into your interpreter's console, run, and you'll be ready to start in on that section. Code copied by both 🍅 and 📋 will also appear below, where you can edit first, and then copy, paste, and run again.

# Nothing here yet!