# Bas(e)ic R

Lesson 2 with Kate Weiss

## Why R?

• High-level programming language good for interactive statistical analysis.
• General purpose programming language for scripting entire data-processing workflows.
• Large selection of “add-on” packages that extend the capabilities of “base R”.
• Large user community especially within statistics and ecology.
• Open source.

## The Console

The interpreter accepts R commands interactively through the console. Basic math, as you would type it on a calculator, is usually a valid command in the R language:

``````1 + 2
``````
``````[1] 3
``````
``````5/3
``````
``````[1] 1.666667
``````
``````4^2
``````
``````[1] 16
``````
Question
Why is the output prefixed by `[1]`?
That’s the index, or position in a vector, of the first result.

A command giving a vector of results shows this clearly:

``````seq(1, 100)
``````
``````  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
[18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
[35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
[52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
[69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
[86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
``````

Rather than basic math, we used a function called `seq()` in this command. Most of “learning R” involves getting to know a whole lot of functions: what their arguments (a.k.a. inputs) are and what they return (a.k.a. output).

One thing you’ll regularly want to do is store output of a function to a variable. Using the symbol `<-` is referred to as assignment: we assign to a variable on the left of the `<-` symbol the output of what is on its right.

``````x <- seq(1, 100)
``````

You’ll notice that nothing prints to the console, because we assigned the output to a variable. We can print `x` by evaluating it by itself.

``````x
``````
``````  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
[18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
[35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
[52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
[69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
[86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
``````

Commands must always reference things known to the interpreter. When you start an R session, there are many things already known, including

• any number
• any string of characters
• functions in `base R`

To reference a number or function you just type it in as above, but to reference a string of alpha-numeric characters you must surround it with quotation marks.

``````'ab#45'
``````
``````[1] "ab#45"
``````
Question
Is it better to use `'` or `"`?
There is no answer! You will often encounter stylistic choices like this, so if you don’t have a personal preference try to mimic existing styles.

Without quotes, the interpreter looks for something named `ab` (everything after the `#` is treated as a comment) and doesn’t find anything:

``````ab#45
``````
``````Error in eval(expr, envir, enclos): object 'ab' not found
``````

Anything you assign to a variable becomes known to R, so you can reference it in any command.

``````y <- 'ab#45'
typeof(y)
``````
``````[1] "character"
``````

## The Editor

The console is for evaluating commands you don’t intend to keep or reuse. It’s useful for testing commands and poking around.

The editor is where you compose scripts that will process data, perform analyses, code up visualizations, and even write reports.

These work together in RStudio, which has multiple ways to send parts of the script you are editing to the console for immediate evaluation. Alternatively you can “source” the entire script.

Open up “lesson-2.R” in the editor, and follow along by replacing the `...` placeholders with the code here. Then evaluate just this line (Ctrl R on Windows, ⌘ R on Mac OS).

``````vals <- seq(1, 100)
``````

The elements of this statement, from right to left are:

• `)` is the closing paren of a function call
• `1` and `100` are both arguments, or parameters, to the function
• `(` is the opening paren of the function call
• `seq` is the name of the function
• ` <- ` is an operator that assigns the result of the expression on the right to the name on the left
• `vals` is the name of a variable
Question
Why call `vals` a “variable” and `seq` a “function”?
It is true they are both names of objects known to R, and could be called variables. But `seq` has the important distinguishing feature of being callable, which is indicated in documentation by writing the function name with empty parens, as in `seq()`.

Our call to the function `seq` could have been much more explicit. We could give the arguments by the names that `seq` is expecting.

``````vals <- seq(from = 1,
            to = 100)
``````

Run this code either line-by-line, or highlight the section to run (optionally with keyboard shortcut Ctrl-Return or ⌘ Return).

Question
What’s an advantage of naming arguments?
One advantage is that you can put them in any order. A related advantage is that you can then skip some arguments, which is fine to do if each skipped argument has a default value.
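
As a small illustration of both advantages, the sketch below calls `seq()` with its arguments named and reordered, and then supplies `by` while leaving the remaining optional arguments at their defaults. The variable names are only for this example.

```r
# Positional call: arguments are matched by their order (from, to).
positional <- seq(1, 10)

# Named call: the same arguments given in the opposite order.
reordered <- seq(to = 10, from = 1)

# Both calls produce the same sequence.
identical(positional, reordered)  # TRUE

# Naming `by` lets us set the step size and skip the other optional arguments.
by_quarter <- seq(from = 0, to = 1, by = 0.25)
by_quarter  # 0.00 0.25 0.50 0.75 1.00
```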

How would you get to know the names of a function’s arguments?

``````?seq
``````

How would you even know what function to call?

``````??sequence
``````

Going the other direction, a shorthand form of the `seq()` function is:

``````1:100
``````
``````  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
[18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
[35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
[52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
[69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
[86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
``````

This shorthand is most commonly used while accessing parts of objects, as we’ll see below.
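
A quick check, sketched here with shorter sequences, shows that `:` gives the same values as `seq()` and also counts backwards when the first number is larger. The variable names are illustrative.

```r
forward  <- 1:5   # 1 2 3 4 5
backward <- 5:1   # 5 4 3 2 1

# The shorthand matches the equivalent call to seq().
all(forward == seq(1, 5))           # TRUE

# Counting down is the same as reversing the forward sequence.
identical(backward, rev(forward))   # TRUE
```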

Do save the “lesson-2.R” file.

But I mean really save your work, by committing it to your project and syncing up to a GitHub repository.

1. Go to the `git` tab in RStudio
2. Select `commit` to open the “Review Changes” window
3. Select the file(s) you want to commit.
4. Enter a descriptive message about the commit.
5. Commit!
6. Push!

## Data types and structures

| Type | Example |
|------|---------|
| integer | -4, 0, 999 |
| double | 3.1, -4, Inf, NaN |
| character | “a”, “4”, “👏” |
| logical | TRUE, FALSE |
| missing | NA |
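
One way to check the type of a value is the `typeof()` function used earlier. The sketch below is only illustrative; note that a number typed without the `L` suffix is stored as a double, even when it looks like a whole number.

```r
typeof(999L)  # "integer": the L suffix asks for an integer
typeof(999)   # "double": plain numbers are doubles by default
typeof(Inf)   # "double"
typeof("4")   # "character": quotes make it a string, not a number
typeof(TRUE)  # "logical"
typeof(NA)    # "logical": a bare NA is a logical missing value

# Missing values are detected with is.na() rather than with ==.
is.na(NA)     # TRUE
```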

Data structures, or more generally “objects”, are built from one or more of these data types and other objects.

## Vectors

Vectors are the basic data structure in R. They are a collection of data that are all of the same type. Create a vector by combining elements together using the function `c()`. Use the operator `:` for a sequence of numbers (forwards or backwards), otherwise separate elements with commas.

``````counts <- c(4, 3, 7, 5)
``````

All elements of a vector must be the same type, so when you attempt to combine different types they will be coerced to the most flexible type.

``````c(1, 2, "c")
``````
``````[1] "1" "2" "c"
``````
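
To see the order in which coercion happens, the sketch below combines pairs of types and asks for the type of the result. The variable names are only for illustration.

```r
# logical + integer -> integer (TRUE becomes 1)
lgl_int <- c(TRUE, 2L)
typeof(lgl_int)  # "integer"

# integer + double -> double
int_dbl <- c(2L, 3.5)
typeof(int_dbl)  # "double"

# double + character -> character
dbl_chr <- c(3.5, "c")
typeof(dbl_chr)  # "character"
```

In short, logical is the least flexible type and character the most flexible.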

## Lists

Lists are like vectors but their elements can be of any data type or structure, including another list! You construct lists by using `list()` instead of `c()`.

Compare the results of `list()` and `c()`

``````x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))
``````
Question
What’s different about the structure of the variables `x` and `y`? Use the function `str()` to investigate.
The list `x` contains two elements, a list and a vector. For `y`, the call to `c()` flattened everything into a single list of four elements, because a list is the most flexible type.
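
The difference can be checked directly. This sketch repeats the two assignments and compares the lengths of the results before printing their structure.

```r
x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))

length(x)   # 2: a list and a numeric vector
length(y)   # 4: every value became its own list element
is.list(y)  # TRUE: combining with a list yields a list

# str() prints the nesting of each object.
str(x)
str(y)
```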

## Factors

A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the `class()`, “factor”, which makes them behave differently from regular integer vectors, and the `levels()`, which define the set of allowed values.

Use `factor()` to create a vector with factors, or `as.factor()` to convert an existing vector to factors.

``````education <- factor(
  c("college", "highschool", "college", "middle"),
  levels = c("middle", "highschool", "college"),
  ordered = TRUE)
``````
``````str(education)
``````
`````` Ord.factor w/ 3 levels "middle"<"highschool"<..: 3 2 3 1
``````
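
To look underneath the factor, the sketch below rebuilds `education` and inspects its levels, the integer codes stored for each element, and its class. Because the factor is ordered, its elements can also be compared.

```r
education <- factor(
  c("college", "highschool", "college", "middle"),
  levels = c("middle", "highschool", "college"),
  ordered = TRUE)

levels(education)      # "middle" "highschool" "college"
as.integer(education)  # 3 2 3 1: positions within the levels
class(education)       # "ordered" "factor"

# Ordered factors support comparisons based on level order.
education[1] > education[2]  # TRUE: college ranks above highschool
```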

## Multi-dimensional data structures

Data can be stored in several types of data structures depending on its complexity.

| Dimensions | Homogeneous | Heterogeneous |
|------------|-------------|---------------|
| 1d* | `c()` | `list()` |
| 2d | `matrix()` | `data.frame()` |
| nd | `array()` | |

Of these, the data frame is far and away the most used.

## Data frames

Data frames are 2-dimensional and can contain heterogeneous data like numbers in one column and a factor in another.

It is the data structure most similar to a spreadsheet, with two key differences:

• Data frames are collections of equal-length vectors.
• As vectors, the columns are homogeneous and cannot hold values of the wrong type.

Creating a data frame from scratch can be done by combining vectors with the `data.frame()` function.

``````df <- data.frame(education, counts)
``````
``````df
``````
``````   education counts
1    college      4
2 highschool      3
3    college      7
4     middle      5
``````

Some functions to get to know your data frame are:

| Function | Output |
|----------|--------|
| `dim()` | dimensions |
| `nrow()` | number of rows |
| `ncol()` | number of columns |
| `names()` | (column) names |
| `str()` | structure |
| `summary()` | summary info |
| `head()` | shows beginning rows |

``````names(df)
``````
``````[1] "education" "counts"
``````
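
A few more of these functions are sketched below on the same data frame, which is rebuilt here so the example runs on its own.

```r
education <- factor(
  c("college", "highschool", "college", "middle"),
  levels = c("middle", "highschool", "college"),
  ordered = TRUE)
counts <- c(4, 3, 7, 5)
df <- data.frame(education, counts)

dim(df)   # 4 2: rows first, then columns
nrow(df)  # 4
ncol(df)  # 2

# head() accepts the number of rows to show.
head(df, n = 2)
```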

## Exercise 1

Create a data frame with two columns, one called “species” and another called “count”. Store your data frame as a variable called `data`. You can do this with or without populating the data frame with values.

## Parts of an Object

Parts of objects are always accessible, either by their name or by their position, using square brackets: `[` and `]`.

## Position

``````counts[1]
``````
``````[1] 4
``````
``````counts[3]
``````
``````[1] 7
``````

## Names

Parts of an object can usually also have a name. The names can be given when you are creating a vector or afterwards using the `names()` function.

``````df['education']
``````
``````   education
1    college
2 highschool
3    college
4     middle
``````
``````names(df) <- c("ed", "ct")
``````
``````df['ed']
``````
``````          ed
1    college
2 highschool
3    college
4     middle
``````
Question
This use of `<-` with `names(x)` on the left is a little odd. What’s going on?
We are overwriting an existing variable, but one that is accessed through the output of the function on the left rather than the global environment.
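
Names work the same way on a plain vector. In the sketch below the names (`elm`, `oak` and so on) are made up for illustration; they are given first when the vector is created and then replaced with `names()`.

```r
# Names supplied while creating the vector (hypothetical labels).
tree_counts <- c(elm = 4, oak = 3, ash = 7, fir = 5)
tree_counts["ash"]     # the element named "ash", with its name attached

# Names replaced afterwards using names() on the left of `<-`.
names(tree_counts) <- c("w", "x", "y", "z")
tree_counts[["y"]]     # 7: double brackets return the bare value
```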

In a multi-dimensional structure such as a data frame, you separate the parts requested along each dimension with a comma.

``````df[3, "ed"]
``````
``````[1] college
Levels: middle < highschool < college
``````

It’s fine to mix names and indices when selecting parts of an object.

## Subsetting ranges

There are multiple ways to simultaneously extract multiple parts of an object.

| Use in brackets | Subset instructions |
|-----------------|---------------------|
| positive integers | elements at the specified positions |
| negative integers | omit elements at the specified positions |
| logical vectors | select elements where the corresponding value is TRUE |
| nothing | return the original vector (all) |

``````days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
weekdays <- days[2:6]
weekend <- days[c(1, 7)]
``````
``````weekdays
``````
``````[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"
``````
``````weekend
``````
``````[1] "Sunday"   "Saturday"
``````
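
The remaining rows of the table can be sketched with the `counts` vector from earlier, recreated here so the example stands alone.

```r
counts <- c(4, 3, 7, 5)

# Negative integer: drop the first element.
counts[-1]    # 3 7 5

# Logical vector: a comparison produces one TRUE/FALSE per element...
big <- counts > 4
big           # FALSE FALSE TRUE TRUE

# ...which then selects only the elements where it is TRUE.
counts[big]   # 7 5

# Nothing in the brackets: everything is returned.
counts[]      # 4 3 7 5
```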

## Exercise 2

Get weekdays using negative integers. Get M-W-F using a call to `seq()` to specify the position (don’t forget to `?seq`).

## Creating functions

Writing functions to use multiple times within a project can prevent you from duplicating code. If you see blocks of similar lines of code through your project, those are usually candidates for being moved into functions.

## Anatomy of a function

Writing functions is also a great way to understand the terminology and workings of R. Like all programming languages, R has keywords that are reserved for important activities, like creating functions. Keywords are usually very intuitive; the one we need is `function`.

``````function(...) {
  ...
  return(...)
}
``````

Three components:

• arguments: control how you can call the function
• body: the code inside the function
• return value: controls what output the function gives

We’ll make a function to extract the element in the first row and first column of its argument, for which we can choose an arbitrary name:

``````function(x) {
  result <- x[1, 1]
  return(result)
}
``````

Note that `x` doesn’t exist until we call the function, which gives the recipe for how `x` will be handled.

Finally, we need to give the function a name so we can use it like we used `c()` and `seq()` above.

``````first <- function(x) {
  result <- x[1, 1]
  return(result)
}
``````
``````first(df)
``````
``````[1] college
Levels: middle < highschool < college
``````
Question
Can you explain the result of entering `first(counts)` into the console?
The function caused an error, which prompted the interpreter to print a helpful error message. Never ignore an error message.
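
If you want to look at the message without stopping a script, one option is to capture it with `tryCatch()`. The sketch below assumes an English-language R session; the error arises because `counts` has only one dimension, so `x[1, 1]` has one index too many.

```r
first <- function(x) {
  result <- x[1, 1]
  return(result)
}
counts <- c(4, 3, 7, 5)

# Capture the error message instead of halting.
msg <- tryCatch(
  first(counts),
  error = function(e) conditionMessage(e))
msg  # "incorrect number of dimensions"
```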

## Exercise 3

Subset the data frame by column name and row position to obtain the following output.

``````[1] highschool college
Levels: middle < highschool < college
``````

## Distributions and Statistics

Since it is designed for statistics, R can easily draw random numbers from statistical distributions and calculate distribution values.

To generate random numbers from a normal distribution, use the function `rnorm()`

``````ten_random_values <- rnorm(n = 10)
``````

| Function | Returns | Notes |
|----------|---------|-------|
| `rnorm` | Draw random numbers from normal distribution | Specify `n`, `mean`, `sd` |
| `dnorm` | Estimate probability density at a specific number | |
| `pnorm` | Cumulative probability that a given number or smaller occurs | left-tailed by default |
| `qnorm` | Returns quantile given a cumulative probability | opposite of `pnorm` |
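
A few calls on the standard normal distribution (mean 0, sd 1, the defaults) sketch how the density, cumulative probability and quantile functions relate to one another.

```r
# Density at the mean: the peak of the bell curve, 1 / sqrt(2 * pi).
dnorm(0)                         # about 0.3989

# Half of the distribution lies at or below the mean.
pnorm(0)                         # 0.5

# qnorm() inverts pnorm(): which value has half the distribution below it?
qnorm(0.5)                       # 0
round_trip <- qnorm(pnorm(1.3))  # 1.3, up to floating point error

# The right tail is available through lower.tail = FALSE.
upper <- pnorm(1.96, lower.tail = FALSE)
```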

Statistical distributions and their functions. See Table 14.1 in R for Everyone by Jared Lander for a full table.

| Distribution | Random Number |
|--------------|---------------|
| Normal | `rnorm` |
| Binomial | `rbinom` |
| Poisson | `rpois` |
| Gamma | `rgamma` |
| Exponential | `rexp` |
| Uniform | `runif` |
| Logistic | `rlogis` |
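
The other random number functions follow the same pattern as `rnorm()`. The sketch below draws from three of them; the seed value and the distribution parameters are arbitrary choices for illustration. Setting a seed with `set.seed()` is one way to make the draws repeatable.

```r
set.seed(42)  # arbitrary seed so the draws can be repeated
u <- runif(n = 5)                          # uniform on [0, 1]
k <- rbinom(n = 5, size = 10, prob = 0.5)  # successes out of 10 trials
p <- rpois(n = 5, lambda = 3)              # counts with mean 3

# Resetting the same seed reproduces the same draws.
set.seed(42)
u_again <- runif(n = 5)
identical(u, u_again)  # TRUE
```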

R has built-in functions for handling many statistical tests.

``````x <- rnorm(n = 100, mean = 25, sd = 7)
y <- rbinom(n = 100, size = 50, prob = .85)
``````
``````t.test(x, y)
``````
``````
Welch Two Sample t-test

data:  x and y
t = -25.67, df = 129.85, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-19.13697 -16.39828
sample estimates:
mean of x mean of y
24.54237  42.31000
``````

Linear regression with the `lm()` function uses a formula notation to specify relationships between variables (e.g. `y ~ x`).

``````fit <- lm(y ~ x)
``````
``````summary(fit)
``````
``````
Call:
lm(formula = y ~ x)

Residuals:
Min      1Q  Median      3Q     Max
-8.3766 -1.3968  0.3701  1.6185  5.8984

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.92549    1.02204  42.000   <2e-16 ***
x           -0.02508    0.04030  -0.622    0.535
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.577 on 98 degrees of freedom
Multiple R-squared:  0.003936,	Adjusted R-squared:  -0.006228
F-statistic: 0.3873 on 1 and 98 DF,  p-value: 0.5352
``````
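
When the variables live in a data frame, the formula can refer to its columns through the `data` argument. The sketch below uses made-up data with an exact linear relationship (`y = 2 * x + 1`), so the fitted coefficients are easy to verify with `coef()`.

```r
# Hypothetical data: y is exactly 2 * x + 1.
d <- data.frame(x = 1:10)
d$y <- 2 * d$x + 1

# Column names in the formula are looked up in `d`.
exact_fit <- lm(y ~ x, data = d)

# Extract the estimates: intercept close to 1, slope close to 2.
coef(exact_fit)
```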

## Exercise 4

Create a data frame from scratch that has three columns and 5 rows. In column “size” place a sequence from 1 to 5. For column “year”, create a factor with three levels representing the past three years. In column “prop”, place 5 random samples from a uniform distribution. Show the summary of a linear model following the formula “prop ~ size + year”.

## Flow control

Because R is a general purpose programming language, you can write R scripts to take care of non-computational tasks.

“Flow control” is the generic term for letting variables whose values are determined at run time dictate how the code evaluates. It covers things like “for loops” and “if/else” statements.
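
As a minimal sketch of both constructs, the loop below visits each element of the `counts` vector from earlier and uses `if`/`else` to label it. The threshold of 4 and the labels are arbitrary choices for illustration.

```r
counts <- c(4, 3, 7, 5)
labels <- character(length(counts))  # empty strings to fill in

# seq_along() gives the positions 1, 2, ..., length(counts).
for (i in seq_along(counts)) {
  if (counts[i] > 4) {
    labels[i] <- "high"
  } else {
    labels[i] <- "low"
  }
}
labels  # "low" "low" "high" "high"
```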

## Install missing packages

The last thing we’ll do before taking a break is let R check for any packages you’ll need today that aren’t installed. We’ll learn how to use flow control along the way.

First, acquire the list of any missing packages.

``````requirements <- c('tidyr',
                  'dplyr',
                  'RSQLite',
                  'sp',
                  'rgdal',
                  'rgeos',
                  'raster',
                  'shiny',
                  'leaflet',
                  'ggplot2')

missing <- setdiff(requirements,
                   rownames(installed.packages()))
``````
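
The function `setdiff()` returns the elements of its first argument that do not appear in its second. A toy sketch with made-up package names shows the idea without touching your library.

```r
# Hypothetical names, not real packages.
wanted    <- c("pkgA", "pkgB", "pkgC")
available <- c("pkgB", "pkgD")

# Elements of `wanted` that are absent from `available`.
still_needed <- setdiff(wanted, available)
still_needed               # "pkgA" "pkgC"

# Zero length would mean nothing is missing.
length(still_needed) == 0  # FALSE
```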

Check, from the console, whether any packages are missing:

``````length(missing) == 0
``````
``````[1] TRUE
``````

Your result will be `TRUE` or `FALSE`, depending on whether you installed all the packages already. We can let the script decide what to do with this information.

The keyword `if` is part of the R language’s syntax for flow control. The statement in the body (between `{` and `}`) only evaluates if the argument (between `(` and `)`) evaluates to TRUE.

``````if (length(missing) != 0) {
  install.packages(missing, dependencies = TRUE)
}
``````

## Reminder on important symbols

| Symbol | Meaning |
|--------|---------|
| `?` | get help |
| `c()` | combine |
| `#` | comment |
| `:` | sequence |
| `<-` | assignment |
| `[ ]` | selection |

## Exercise solutions

### Solution 1

``````species <- c()
count <- c()
data <- data.frame(species, count)
``````
``````str(data)
``````
``````'data.frame':	0 obs. of  0 variables
``````
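
Note that the empty vectors above leave the data frame with no columns at all. If you want the empty data frame to keep its two named columns, one possible variant is to start from typed, zero-length vectors. The column types chosen here (character and integer) are an assumption, not part of the exercise.

```r
# Zero-length vectors still carry a type, so the columns survive.
data <- data.frame(
  species = character(),
  count = integer(),
  stringsAsFactors = FALSE)

names(data)  # "species" "count"
dim(data)    # 0 2: no rows, two columns
```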

### Solution 2

``````sol2a <- days[c(-1, -7)]
sol2b <- days[seq(2, 7, 2)]
``````
``````sol2a
``````
``````[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"
``````
``````sol2b
``````
``````[1] "Monday"    "Wednesday" "Friday"
``````

### Solution 3

``````sol3 <- df[2:3, 'ed']
``````
``````sol3
``````
``````[1] highschool college
Levels: middle < highschool < college
``````

### Solution 4

``````df <- data.frame(
  size = 1:5,
  year = factor(
    c(2014, 2014, 2013, 2015, 2015),
    levels = c(2013, 2014, 2015),
    ordered = TRUE),
  prop = runif(n = 5))
fit <- lm(prop ~ size + year, data = df)
``````
``````summary(fit)
``````
``````
Call:
lm(formula = prop ~ size + year, data = df)

Residuals:
1       2       3       4       5
-0.0832  0.0832  0.0000  0.0832 -0.0832

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.4460     0.5053   0.883    0.540
size          0.0276     0.1664   0.166    0.895
year.L        0.3974     0.2278   1.744    0.331
year.Q       -0.3197     0.3311  -0.966    0.511

Residual standard error: 0.1664 on 1 degrees of freedom
Multiple R-squared:  0.9171,	Adjusted R-squared:  0.6685
F-statistic: 3.689 on 3 and 1 DF,  p-value: 0.3614
``````
