Intro to R
R is a language and environment well suited to analyzing data and creating rich graphics. To get started, make sure you have R installed on your computer. The latest version of R is available from the Comprehensive R Archive Network (CRAN).
When you start R, an interpreter window is launched. You can type commands into it, or cut and paste them from a document.
Use R as a Calculator
The easiest way to get started is to use R as a calculator. Type some numerical expressions into R and see what happens (or cut and paste the code below). Lines that start with "#" are comments and are not evaluated.
# how many seconds in a day?
24*60*60
# what is 2 to the power of 16?
2^16
# what's the square root of pi? (22/7 is a rough approximation of pi)
sqrt(22/7)
# generate a series of numbers using the colon character
1:5
# what's the average value of the series of numbers above?
mean(1:5)
The basic elements of R are variables and functions. Look, you've already used two functions, sqrt() and mean()! (Bonus points: type the word pi on the command line, and you'll see that it's a predefined variable.)
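As a small sketch of the bonus point above, the predefined variable pi can stand in for the 22/7 approximation used in the calculator example. The names approx_root and exact_root are made up for this illustration.

```r
# square root using the rough approximation 22/7
approx_root <- sqrt(22/7)
# square root using R's predefined variable pi
exact_root <- sqrt(pi)
# the two results differ only slightly
approx_root - exact_root
```

Both sqrt() calls are ordinary function calls. The only change is whether the argument is an expression or a variable.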
Variables
Variables store data so that we can operate on it. R has four basic kinds of variables: vector, matrix, data frame, and list. We assign data to a variable with the assignment operator "<-". An equals sign (=, common in other languages) also works, but the convention in R is to use the little arrow.
Let's start with a vector. It can hold one or more values.
# create x and assign it the value of 2
x <- 2
# make other variables
y <- 3.14
# you can name variables using letters, numbers, periods, underscores
temperature <- 37
temperature2 <- 98.6
# but the name cannot start with a number! (this line produces an error if uncommented)
# 2temperature <- 98.6
# we can assign character strings too, but they have to be enclosed in quotes
genotype <- "wt"
Now that you've created them, typing the name of any of the variables above returns its value.
To see all the objects you've created so far, use the ls() function. Type it on the command line, and the objects in your environment will be listed.
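A minimal sketch of ls() in action follows. The object names sample_count and buffer are invented for this example, and the exact listing you see will depend on what else you have created in your session.

```r
# create a couple of objects (names are made up for this example)
sample_count <- 12
buffer <- "TE"
# list the objects in the current environment
ls()
# check whether a particular name is among them
"sample_count" %in% ls()
```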
To assign more than one value to a variable, we often have to use a concatenation function: c()
# the concentration of DNA in my std curve samples
dna <- c(5, 10, 20, 40, 80, 160, 320)
Now if you type the name of this variable, all the values will be returned. Individual values can be accessed using square brackets. The elements of a vector are numbered beginning with 1 (not 0, as in some other languages).
# what concentration was the 3rd sample?
dna[3]
# we can use multiple subscripts
dna[1:5]
# I just need the odd samples
dna[c(1,3,5,7)]
# it's hard to evaluate numbers by looking at them
# let's draw a bar plot to see how they compare
barplot(dna)
Sometimes we don't know how long a vector is. Let's use the length() function to find out.
length(dna)
If we use a vector in an expression, the expression gets evaluated for each element.
# I need to divide my concentrations by 2
dna/2
# I can assign the results to a new vector
dna_dilution <- dna/2
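The same element-by-element behavior applies when an expression combines two vectors of equal length, or when a vector is passed to a function such as log2(). The sketch below assumes a hypothetical volume vector, which is not part of the original data.

```r
# concentrations from the standard curve, as defined earlier
dna <- c(5, 10, 20, 40, 80, 160, 320)
# hypothetical volumes, one per sample (invented for this example)
volume <- c(10, 10, 20, 20, 50, 50, 50)
# element-by-element multiplication gives one result per sample
total_dna <- dna * volume
total_dna
# functions such as log2() are also applied to every element
log2(dna / 5)
```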
Vectors can have multiple values, but they must all be of the same type.
numeric vector | 23.2 | 45.8 | 63.7
character vector | "red" | "green" | "blue"
boolean vector | TRUE | FALSE | TRUE
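To check the type of a vector, one option is the class() function. The sketch below rebuilds the three example vectors under made-up names (sizes, shades, flags) and then shows what happens when types are mixed: R converts every element to a common type rather than raising an error.

```r
# one vector of each type from the examples above
sizes <- c(23.2, 45.8, 63.7)
shades <- c("red", "green", "blue")
flags <- c(TRUE, FALSE, TRUE)
# class() reports the type of each vector
class(sizes)   # "numeric"
class(shades)  # "character"
class(flags)   # "logical" (R's name for boolean)
# mixing types in c() converts every element to a common type
mixed <- c(23.2, "red", TRUE)
class(mixed)   # "character"
```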
Boolean vectors are an interesting concept. They allow us to test things and apply logic.
# which DNA concentrations are greater than 50?
dna > 50
# we can capture the result, and use that as an "index vector" to return the values
# that meet the criteria
iv <- dna > 50
# see which values met our criteria
dna[iv]
# we can also use a shortcut, and use the logical expression directly
dna[dna > 50]
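A few more things can be done with the result of a logical test. The sketch below is one possible follow-up, not part of the original walkthrough: it counts the matches, finds their positions, and combines two conditions.

```r
dna <- c(5, 10, 20, 40, 80, 160, 320)
# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
sum(dna > 50)
# which() returns the positions of the matching elements
which(dna > 50)
# combine conditions with & (and) or | (or)
dna[dna > 10 & dna < 100]
```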
Other variables: Matrix, Dataframe, List
The other main types of variables are matrices, data frames, and lists. Matrices and data frames are two-dimensional tables. Matrices have to be all one data type, whereas data frames can have columns of different types (e.g., a column of gene names and a column of expression values).
To create a matrix, we call a function, set some dimensions, and fill it with data. The cells of the matrix can be accessed using square brackets: matrix[row,column]
# create a 5 x 5 matrix
mm <- matrix(1:25, nrow=5, ncol=5)
# look at the result
mm
# access the 3rd element of the 4th row
mm[4,3]
# return the 3rd row (thus all columns)
mm[3,]
# return the 5th column (thus all rows)
mm[,5]
# re-assign a particular cell with a new number
mm[4,3] <- 125
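One detail worth seeing for yourself is the order in which matrix() fills the cells. By default the data runs down the columns. The sketch below checks this and shows the byrow argument as an alternative (mm_byrow is a name made up for this example).

```r
mm <- matrix(1:25, nrow = 5, ncol = 5)
# dim() returns the number of rows and columns
dim(mm)
# the first column holds 1 to 5, because filling runs down the columns
mm[, 1]
# byrow = TRUE fills across the rows instead
mm_byrow <- matrix(1:25, nrow = 5, ncol = 5, byrow = TRUE)
mm_byrow[1, ]
```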
Data frames are similar except the data types can be mixed. We can create a data frame using a function.
mydata <- data.frame(fruit=c("apples", "pears", "peaches"), basket=c(1,3,2), kitchen=c(5,3,6))
# let's summarize the fruit count
summary(mydata)
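Once a data frame exists, its columns behave like the vectors covered earlier. The following sketch shows a few common ways to work with them. The column name total is made up for this illustration.

```r
mydata <- data.frame(fruit = c("apples", "pears", "peaches"),
                     basket = c(1, 3, 2),
                     kitchen = c(5, 3, 6))
# pull out one column by name with $
mydata$basket
# add a new column (the name "total" is made up here)
mydata$total <- mydata$basket + mydata$kitchen
# rows and columns use the same [row, column] indexing as a matrix
mydata[2, ]
# keep only the rows that meet a condition
mydata[mydata$total > 6, ]
```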
Lists are a little bit odd. A list is an ordered collection of objects. Each object can be a different type.
mylist <- list(driver="fred", mileage=c(2200, 1150, 5000), fruitsupply=mydata)
The list above contains a single-element character vector in the first position, a three-element numeric vector in the second position, and the data frame we created above in the third position. I think of lists like clotheslines, in that you can hang whatever you want at a given position.
Access the elements of a list using double square brackets.
# what's in the first position of the list?
mylist[[1]]
# what's the second mileage number?
mylist[[2]][2]
# since the list has "named" positions, we could also access
# the information in a special way
mylist$mileage[2]
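The difference between single and double square brackets is easy to miss, so here is a short sketch of it. To keep the example self-contained, it rebuilds the list without the data frame in the third position.

```r
mylist <- list(driver = "fred", mileage = c(2200, 1150, 5000))
# names() shows the labels attached to each position
names(mylist)
# single brackets return a smaller list
class(mylist[2])
# double brackets return the object stored at that position
class(mylist[[2]])
# so functions for numbers work on the double-bracket result
mean(mylist[[2]])
```

In clothesline terms, single brackets give you a shorter clothesline, while double brackets give you the item hanging on it.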
Functions
plot
fit a line
generate some points
generate some series