Midnight Researcher Notes: Introduction to R

R is an open-source programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians for developing statistical software and for data analysis. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is now developed by the R Core Team.
You can download R from the Comprehensive R Archive Network, CRAN (http://cran.r-project.org), and install it on your computer. You may also want to explore RStudio (http://rstudio.org), a very nice front end to R that works on all major platforms.
The R system is divided into two conceptual parts:

- The “base” R system that you download and install from CRAN. This part is required to run R, and it contains the most fundamental functions.
- Everything else, which can be downloaded as separate packages from CRAN.

Data Types

R has five basic types of objects: character, numeric (real numbers), integer, complex, logical.
- Numbers in R are numeric by default. If you want an integer, you need to specify the L suffix.
  - Ex: Entering 1 gives you a numeric object; entering 1L explicitly gives you an integer.
- The special number Inf represents infinity; e.g. 1 / Inf is 0
- The value NaN represents an undefined value; e.g. 0 / 0 is NaN
A vector is a container whose elements must all be of the same type. Empty vectors can be created with the vector() function.
A list can contain objects of different types.
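
The basic types above can be checked at the prompt with class(). This is a minimal sketch; the variable name mixed is only illustrative.

```r
# Numbers are numeric unless the L suffix is used.
class(1)        # "numeric"
class(1L)       # "integer"
class("a")      # "character"
class(TRUE)     # "logical"
class(1 + 2i)   # "complex"

# Special values
1 / Inf         # 0
0 / 0           # NaN

# A list, unlike a vector, may hold elements of different types.
mixed <- list(1, "a", TRUE)
length(mixed)   # 3
```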

Attributes

R objects can have attributes: names or dimnames, dimensions (e.g. matrices, arrays), class, length, or any other user-defined attributes/metadata. The attributes of an object can be accessed using the attributes() function.
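
As a small sketch of how attributes show up in practice (the attribute name "note" is a made-up example of user-defined metadata):

```r
m <- matrix(1:6, nrow = 2)
attributes(m)        # a list with $dim equal to 2 3

x <- c(a = 1, b = 2)
attributes(x)        # a list with $names equal to "a" "b"

# attr() sets or reads a single attribute, including user-defined ones.
attr(x, "note") <- "user-defined metadata"
attributes(x)        # now contains both $names and $note
```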

Getting Started with the R Prompt

On Windows, go to Start > Programs > R > R. This opens the R prompt, which we will use to test basic statements. When you enter an expression at the R prompt and press Enter, R evaluates that expression and displays the result (if there is one).
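
For instance, a few arithmetic expressions typed at the prompt might look like this (the expressions are only illustrative):

```r
1 + 2          # [1] 3
1 + 2 * 3      # [1] 7, multiplication is evaluated before addition
(1 + 2) * 3    # [1] 9, parentheses override precedence
```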

The usual rules of operator precedence apply, so 1 + 2 * 3 evaluates to 7, not 9. Notice the “[1]” that accompanies each returned value. In R, any number that you enter in the console is interpreted as a vector, an ordered collection of values. The “[1]” means that the index of the first item displayed in the row is 1. For a single number, there is only one element in the vector.
The <- symbol is the assignment operator. When a complete expression is entered at the prompt, it is evaluated and the result is returned. The result may or may not be auto-printed; an assignment, for example, prints nothing. The # character starts a comment, and anything to the right of it is ignored.
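
A minimal sketch of assignment, auto-printing, and comments (the variable name x is arbitrary):

```r
x <- 5       # assignment: nothing is printed
x            # auto-printing: [1] 5
print(x)     # explicit printing: [1] 5
```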

You can also assign in the other direction with the -> operator, which stores the value on its left in the variable on its right.
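
A short illustration of rightward assignment (the variable name y is arbitrary):

```r
10 -> y      # the value on the left goes into the variable on the right
y            # [1] 10
```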

The = operator also assigns the value on the right-hand side to the variable on the left-hand side, while == tests two values for equality.
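
The difference between the two operators can be seen in a quick sketch (z is an arbitrary name):

```r
z = 3        # assignment
z == 3       # [1] TRUE
z == 4       # [1] FALSE
```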

The : operator can be used to create a vector holding an integer sequence.
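
A sequence long enough to wrap in the console makes the bracketed indices easier to see. The exact line breaks depend on the console width, so no output is shown here.

```r
x <- 1:30
x            # printed over several rows on a narrow console
class(x)     # "integer"
length(x)    # 30
```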

The numbers in the brackets on the left-hand side of the results indicate the index of the first element shown in each row.
The c() function can be used to create vectors of objects.
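
A few vectors built with c(), one per basic type (the variable names are arbitrary):

```r
x <- c(0.5, 0.6)          # numeric
y <- c(TRUE, FALSE)       # logical
z <- c("a", "b", "c")     # character
w <- c(1 + 0i, 2 + 4i)    # complex
```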

You can use the vector() function to specify the type and length of a vector, and it will create an empty vector for you (filled with the type’s default value, such as 0 for numeric).
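
A minimal sketch of vector() with an explicit type and length:

```r
x <- vector("numeric", length = 5)
x            # [1] 0 0 0 0 0

y <- vector("logical", length = 2)
y            # [1] FALSE FALSE
```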

Variables can be used when creating vectors; their values take the place of their names.
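
For example, with two illustrative variables a and b:

```r
a <- 1
b <- 2
v <- c(a, b, 3)
v            # [1] 1 2 3
```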

What happens when you mix objects of different types in a vector? Implicit coercion occurs so that every element in the vector is of the same class (the least common denominator).
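
A quick sketch of implicit coercion in mixed vectors:

```r
x <- c(1.7, "a")     # character: "1.7" "a"
y <- c(TRUE, 2)      # numeric:   1 2
z <- c("a", TRUE)    # character: "a" "TRUE"
class(x)             # "character"
class(y)             # "numeric"
```
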
You can refer to a specific member using its location:
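
A minimal sketch using an illustrative vector v (indices in R start at 1):

```r
v <- c(10, 20, 30, 40, 50)
v[2]         # [1] 20
```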

For your reference: [ ] always returns an object of the same class as the original and can be used to select more than one element (retrieving matrix elements is an exception to this rule; it returns a vector).

Or you can refer to several members using a range of locations or a logical expression.
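
Selecting several members at once might look like this (v is an illustrative vector):

```r
v <- c(10, 20, 30, 40, 50)
v[2:4]       # by a range of locations: 20 30 40
v[v > 25]    # by a logical expression: 30 40 50
```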

Or you can retrieve specific members by passing their indices as an integer vector (they are returned in the order requested, not in their order in the original vector).
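
A sketch showing that the result follows the order of the requested indices (v is an illustrative vector):

```r
v <- c(10, 20, 30, 40, 50)
v[c(5, 1, 3)]    # [1] 50 10 30
```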

When you perform an operation on two vectors, R will match the elements of the two vectors pairwise and return a vector. This is called a vectorized operation.
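
Element-wise arithmetic and comparison on two illustrative vectors of equal length:

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)
x + y        # [1] 11 22 33 44
x * y        # [1]  10  40  90 160
x > 2        # [1] FALSE FALSE  TRUE  TRUE
```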

If the two vectors aren’t the same size, R will repeat the smaller sequence multiple times:
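
A sketch of this recycling behaviour, first with lengths that divide evenly and then with lengths that do not (the names r1 and r2 are arbitrary):

```r
# Length 6 and length 2: the shorter vector is repeated three times.
r1 <- c(1, 2, 3, 4, 5, 6) + c(10, 20)
r1           # [1] 11 22 13 24 15 26

# Length 5 and length 2: still recycled, but R emits a warning
# because 5 is not a multiple of 2.
r2 <- c(1, 2, 3, 4, 5) + c(10, 20)
r2           # [1] 11 22 13 24 15
```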

Note that R issues a warning if the length of the longer vector isn’t a multiple of the length of the shorter one.

Explicit casting

Objects can be explicitly coerced from one class to another using the as.* functions, where available, such as as.numeric(), as.logical(), and as.character().
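
A brief sketch of explicit coercion starting from an integer vector:

```r
x <- 0:3
class(x)           # "integer"
as.numeric(x)      # [1] 0 1 2 3
as.logical(x)      # [1] FALSE  TRUE  TRUE  TRUE
as.character(x)    # [1] "0" "1" "2" "3"
```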

Nonsensical coercion results in NAs.
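
For example, letters cannot be turned into numbers, so each element becomes NA and R also emits a warning:

```r
x <- c("a", "b", "c")
y <- as.numeric(x)   # warning: NAs introduced by coercion
y                    # [1] NA NA NA
```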

Arrays

An array is a multidimensional vector. Vectors and arrays are stored the same way internally, but an array may be displayed and accessed differently. An array object is just a vector that’s associated with a dimension attribute. The dimension attribute is itself an integer vector with one entry per dimension (for a two-dimensional array, nrow and ncol). Items can be referenced by their indices.
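
A minimal sketch of a two-dimensional array built with array() (the name a is arbitrary). Values fill the array column by column.

```r
a <- array(1:12, dim = c(3, 4))
a
dim(a)       # [1] 3 4
a[2, 3]      # [1] 8  (row 2, column 3)
```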

You can refer to part of the array by specifying separate indices for each dimension, separated by commas:
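
A sketch that selects rows 1 to 2 and columns 2 to 3 of an illustrative 3 x 4 array:

```r
a <- array(1:12, dim = c(3, 4))
sub <- a[1:2, 2:3]
sub
#      [,1] [,2]
# [1,]    4    7
# [2,]    5    8
```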

To get all values in one dimension, simply omit the indices for that dimension:
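
A sketch of whole-row and whole-column selection on an illustrative 3 x 4 array:

```r
a <- array(1:12, dim = c(3, 4))
a[1, ]       # first row:     [1]  1  4  7 10
a[, 2]       # second column: [1] 4 5 6
```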

Arrays can also have three (or more) dimensions.
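
A minimal sketch of a three-dimensional array made of two 3 x 3 slices (the name b is arbitrary):

```r
b <- array(1:18, dim = c(3, 3, 2))
dim(b)         # [1] 3 3 2
b[1, 1, 2]     # [1] 10  (first element of the second slice)
b[, , 1]       # the first 3 x 3 slice
```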

Matrices

A matrix is just a two-dimensional array.

Matrices are constructed column-wise, so entries can be thought of starting in the upper left corner and running down the columns.
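
The column-wise fill order can be seen in a small sketch with matrix():

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
```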

Because of this, matrices can be created directly from vectors by adding a dimension attribute to a vector.
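
A sketch of turning a plain vector into a matrix by setting its dim attribute:

```r
m <- 1:10
dim(m) <- c(2, 5)    # 2 rows, 5 columns
m
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    3    5    7    9
# [2,]    2    4    6    8   10
```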

Matrices can also be created by column-binding with cbind() or row-binding with rbind(). cbind() places its arguments side by side as columns, while rbind() stacks them as rows.
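
A small sketch with two illustrative vectors x and y:

```r
x <- 1:3
y <- 10:12

cb <- cbind(x, y)
cb
#      x  y
# [1,] 1 10
# [2,] 2 11
# [3,] 3 12

rb <- rbind(x, y)
rb
#   [,1] [,2] [,3]
# x    1    2    3
# y   10   11   12
```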

Lists

Lists are a special type of vector that can contain elements of different data types (notice that a list prints differently from an ordinary vector).
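
A minimal sketch of a list holding four different types (the name e is arbitrary):

```r
e <- list(1, "a", TRUE, 1 + 4i)
e            # each element is printed under its own [[index]] header
length(e)    # [1] 4
```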

For your reference: [[ ]] is used to extract elements of a list or a data frame; it can only be used to extract a single element, and the type of the returned object doesn’t have to be a list or data frame. By default it doesn’t support partial name matching; the name passed has to be exact.

You can name each element in a list. Items in a list may be referred to by either location or name. $ is used to retrieve elements by name. It also supports partial name matching (passing part of the name, not all of it).
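
The difference between [[ ]] and $ can be sketched with a hypothetical named list (the list doc and its element names are made up for illustration):

```r
doc <- list(title = "notes", pages = 12)

doc[[1]]          # by location: "notes"
doc[["pages"]]    # by exact name: 12
doc$title         # by name with $: "notes"

doc$ti            # partial matching with $: "notes"
doc[["ti"]]       # exact matching with [[ ]]: NULL
```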

A list can even contain other lists.
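
A sketch of nesting, where an illustrative list e becomes one element of another list g (both names are arbitrary):

```r
e <- list(1, "a", TRUE)
g <- list(id = 1L, items = e)

length(g)         # [1] 2
g$items[[2]]      # "a", the second element of the inner list
```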

Factors

Factors are used to represent categorical data. Factors can be unordered or ordered. A factor is like an integer vector where each integer has a label: you create it from a vector of any type, and it is stored internally as integers. When a factor created from a vector of strings is printed, it shows the values followed by a Levels line that lists the categories of the factor.
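
A minimal sketch of a factor built from "yes" and "no" strings (the values are illustrative):

```r
x <- factor(c("yes", "yes", "no", "yes", "no"))
x            # prints the values, then "Levels: no yes"
levels(x)    # [1] "no"  "yes"
```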

Factors are treated specially by modeling functions like lm() and glm() which we will discuss later. Using factors with labels is better than using integers because factors are self-describing; having a variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.
We can call the table() function on a factor, and it will give us the frequency of each level (category).
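
A sketch of table() on an illustrative yes/no factor:

```r
x <- factor(c("yes", "yes", "no", "yes", "no"))
counts <- table(x)
counts       # no: 2, yes: 3
```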

We can also call the unclass() function on the factor to strip out the factor class and show how it is stored underneath.
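
A sketch of what lies underneath an illustrative yes/no factor: integer codes plus a levels attribute.

```r
x <- factor(c("yes", "yes", "no", "yes", "no"))
unclass(x)
# [1] 2 2 1 2 1
# attr(,"levels")
# [1] "no"  "yes"
```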

The order of the levels can be set using the levels argument to factor(). This can be important in linear modeling because the first level in the factor is used as the baseline level. If you don’t assign levels explicitly, they are ordered alphabetically (which is why, for a factor built from “yes” and “no” values, “no” comes before “yes”).
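
A sketch of setting the level order explicitly so that "yes" becomes the first (baseline) level:

```r
x <- factor(c("yes", "yes", "no", "yes", "no"),
            levels = c("yes", "no"))
levels(x)            # [1] "yes" "no"
as.integer(x)        # [1] 1 1 2 1 2
```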

Missing Values

Missing values are denoted by NA, and the results of undefined mathematical operations by NaN. is.na() and is.nan() are used to test whether objects are NA or NaN, respectively. NA values have a class too, so there are integer NA, character NA, etc. A NaN value is also NA, but the converse is not true.
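
A sketch showing that NaN counts as NA, while NA does not count as NaN (the vectors are illustrative):

```r
x <- c(1, 2, NA, 10, 3)
is.na(x)     # [1] FALSE FALSE  TRUE FALSE FALSE
is.nan(x)    # [1] FALSE FALSE FALSE FALSE FALSE

y <- c(1, 2, NaN, NA, 4)
is.na(y)     # [1] FALSE FALSE  TRUE  TRUE FALSE
is.nan(y)    # [1] FALSE FALSE  TRUE FALSE FALSE
```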

Data Frames

A data frame is a list that contains multiple named vectors that are the same length. A data frame is a lot like a spreadsheet or a database table. Each vector represents a column in the table. Unlike matrices, and much like database tables, data frames can store different types of objects in each column. Data frames have a special attribute called row.names which represents rows’ names, which could be useful for annotating data. Data frames are usually created by calling read.table() or read.csv() (which we will discuss later when we come to reading data) or data.frame().

Consider a data frame with two columns, foo and bar, where foo is an integer sequence and bar is a vector of TRUEs and FALSEs. nrow() returns the number of rows and ncol() returns the number of columns. If you don’t specify row names, they are assigned automatically as 1, 2, 3, 4 (printed to the left of each row).
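
The data frame described above can be sketched as follows. The particular TRUE/FALSE values are an assumption, since only the column names and types are given.

```r
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))
x
nrow(x)          # [1] 4
ncol(x)          # [1] 2
row.names(x)     # [1] "1" "2" "3" "4"
```
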
You can refer to columns by name.
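
Three common ways to pick a column by name, sketched on an illustrative data frame:

```r
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))
x$foo          # [1] 1 2 3 4
x[["bar"]]     # [1]  TRUE  TRUE FALSE FALSE
x[, "foo"]     # [1] 1 2 3 4
```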

You can retrieve a specific cell in the data frame by specifying the column name and an expression that filters the rows of that column. For example, you might want to get the blood pressure of a patient named Mike.
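
A sketch of that lookup. The data frame patients, its column names (name, bp), and all of its values are hypothetical, invented only to illustrate the filtering expression.

```r
# Hypothetical patient data; names and readings are illustrative only.
patients <- data.frame(
  name = c("Anna", "Mike", "Sara"),
  bp   = c(118, 125, 110),
  stringsAsFactors = FALSE
)

# Take the bp column, then keep only the rows where name is "Mike".
mike_bp <- patients$bp[patients$name == "Mike"]
mike_bp      # [1] 125
```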

Data frames can be converted to a matrix by calling data.matrix(), which coerces the data so that all of it is of the same type; for example, TRUEs and FALSEs are converted to 1s and 0s.
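
A sketch of the conversion on an illustrative data frame with an integer column and a logical column:

```r
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))
m <- data.matrix(x)
m            # the bar column now holds 1 1 0 0
is.matrix(m) # [1] TRUE
```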