Midnight Researcher Notes: Statistics

Showing posts with label Statistics. Show all posts

Friday, January 11, 2013

Introduction to R – Removing missing values

A common task in preparing you data is to remove missing values(NAs). One way to do that is to retrieve a vector of the missing values in your data, and then filter them out from your structure.

If you want to filter out the missing values from more than one vector , you can use complete.cases() to get the indices that contains good data in both vectors.

Here good will contain TRUE only for indices that hold good data in both vectors; FALSE otherwise. complete.cases() works with data frames also. In that case it removes any row that contains NA in any column.The returned data frame will not contain any NAs.
Stay tuned for more R notes.

Wednesday, January 9, 2013

Introduction to R – Getting Started

R is an open source programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians for developing statistical software and data analysis. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and now, R is developed by the R Development Core Team.
To download R and install it on your computer, you can get it at the Comprehensive R Archive Network (http://cran.r-project.org). One option that you may want to explore is RStudio (http://rstudio.org) which is a very nice front-end to R and works on all platforms.
The R System is divided into 2 conceptual parts:

The “base” R system that you download and install from CRAN. This part is required to run R, and it contains the most fundamental functions.
Everything else as can be downloaded as a separate package from CRAN.

Data Types

R has five basic types of objects: character, numeric (real numbers), integer, complex, logical.
- Numbers in R are numeric by default. If you want an integer, you need to specify the L suffix.
  - Ex: Entering 1 gives you a numeric object; entering 1L explicitly gives you an integer.
- Special number Inf represents infinity; e.g. 1 / Inf is 0
- The value NaN represents an undefined value; e.g. 0 / 0 is NaN
A vector is a container that can contain objects of the same types only. Empty vectors can be created with the vector() function.
A list can contain objects of different types.

Attributes

R objects can have attributes: names or dimnames, dimensions (e.g. matrices, arrays), class, length, or any other user-defined attributes/metadata. attributes of an object can be access using the attributes() function.

Getting Started with R prompt

Go to Start > Programs > R > R . This will open the R prompt which we will use to test basic statements. When you enter an expression into the R prompt and press Enter, R will evaluate that expression and display the results (if there are any).

as you can see the rules of precedence are applied here. Notice the weird “[1]” that accompanies each returned value. In R, any number that you enter in the console is interpreted as a vector. A vector is an ordered collection of numbers. The “[1]” means that the index of the first item displayed in the row is 1. In each of these cases, there is also only one element in the vector.
The <- symbol is the assignment operator. When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated expression is returned. The result may/not be auto-printed. The [1] indicates the index of the element in the vector. The # character indicates a comment and anything right to it is ignored.

you can also assign an object on the left to a variable on the right

= means assign the value of the right hand side to the variable on the left hand side. == tests variables for equality

The : operator can be used to create integer sequence vector.

The numbers in the brackets on the left-hand side of the results indicate the index of the first element shown in each row.
THe c() function can be used to create vectors of objects.

You could use the vector() function to specify the vector type and length and it will create an empty vector for you.

Variables can be used in creating vectors, their values will replace their names

What about mixing different objects in a vector, implicit casting occurs so that every element in the vector is of the same class (least common dominator).
You can refer to specific member using its location,

For you reference: [ ] always returns an object of the same class as the original; can be used to select more than one element (retrieving matrix elements is exception to this rule, it returns a vector)

or members using location or expression.

or specific members using their indices as integer vector (they will be retrieved in the order of reference, not by their order in the original vector)

When you perform an operation on two vectors, R will match the elements of the two vectors pairwise and return a vector. This is called vectorized operation.

If the two vectors aren’t the same size, R will repeat the smaller sequence multiple times:

Note the warning if the second sequence isn’t a multiple of the first.

Explicit casting

Objects can be explicitly casted from one class to another using the as.* functions, if available. like :

Nonsensical casting results in NAs

Arrays

An array is a multidimensional vector. Vectors and arrays are stored the same way internally, but an array may be displayed differently and accessed differently. An array object is just a vector that’s associated with a dimension attribute. The dimension attribute itself is an integer vector of length 2 (nrow, ncol). Items can be referenced by its indices.

you can refer to part of the array by specifying separate indices for each dimension, separated by commas:

to get all values in one dimension, simply omit the indices for that dimension:

three dimensional arrays

Matrices

A Matrix is just a two-dimensional array.

Matrices are constructed column-wise, so entries can be thought of starting in the upper left corner and running down the columns.

based on the above fact, matrices can be created directly from vectors by adding a dimension attribute to a vector

Matrices can be created by column-binding cbind() or row-binding rbind(). the example explains it all

Lists

Lists are a special type of vector that can contain elements of different data types (notice that its printing is different)

For your reference: [[ ]] is used to extract elements of a list or a data frame; it can only be used to extract a single element and the type of the returned object doesn’t have to be a list or data frame. Doesn’t support partial name matching, passed name have to be exact.

You can name each element in a list. Items in a list may be referred by either location or name. $ is used to retrieve elements by name. It also supports partial name matching (passing part of the name, not all of it)

A list can even contain other lists (we will refer to previous list e):

Factors

Factors are used to represent categorical data. Factors can be unordered or ordered. Its like an integer vector where each integer has a label, so you create a vector of any type that is treated internally by integers. The following example creates a factor from a vector of strings. When it prints, it prints the values and it has an attribute Levels that represents data categories of the factor elements.

Factors are treated specially by modeling functions like lm() and glm() which we will discuss later. Using factors with labels is better than using integers because factors are self-describing; having a variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.
We can call table() function on factor c and it will give us the frequency table of each level (category)

We can also call unclass() function on the factor to strip out the factor categories and show us how it is stored underneath.

The order of the level can be set using the levels argument to factor(). This can be important in linear modeling because the first level is used as the baseline level (the first level in the factor). If you didn’t assign levels explicitly, it will be assigned alphabetically (that’s why “no” came before “yes” in the previous example).

Missing Values

Missing values are denoted by NA or NaN for undefined mathematical operations. is.na() and is.nan() are used to test objects if they are NA or NaN, respectively. NA values have a class also, so there are integer NA, character NA, etc. A NaN value is also NA but the converse is not true.

Data Frames

A data frame is a list that contains multiple named vectors that are the same length. A data frame is a lot like a spreadsheet or a database table. Each vector represents a column in the table. Unlike matrices, and much like database tables, data frames can store different types of objects in each column. Data frames have a special attribute called row.names which represents rows’ names, which could be useful for annotating data. Data frames are usually created by calling read.table() or read.csv() (which we will discuss later when we come to reading data) or data.frame().

Here we create a data frame of two columns foo and bar, foo is an integer sequence, bar is a vector of TRUEs and FALSEs. nrow() returns the number of rows, ncol() returns the number of columns. Since we didn’t specified row names, we got 1,2,3,4 automatically (they printed on the left of each row).
You can refer to columns by name

You can retrieve a specific cell in the data frame by specifying the column name and expression to filter rows in this column. If you want to get the blood pressure of patient Mike:

Data frames can be converted to a matrix by calling data.matrix() which will cast data to make it of the same type, below it casted TRUEs and FALSEs to 1s and 0s.

Names

R objects can have names, which is a very useful for writing readable code and self-describing objects.

Lists can also have names.

also matrices can have names

In this post we introduced R and its data types and data structures. In future posts we will get more deep into R.

Tuesday, January 8, 2013

Basic statistics using Microsoft Excel

Microsoft Excel is a powerful software that have been used my end users and professionals for many years for various purposes. In this post we going to introduce basic statistical features of Excel. This post is using Microsoft Excel 2013, but it is easy to map it to your Excel version.

Basic Formulas

Computations in Excel are carried out using formulas. A formula is a mathematical expression that is used to calculate a value. In Excel, a formula is entered into the active cell by typing an expression. In certain situations, Excel will automatically insert a formula for you.

Excel formulas are always entered beginning with an = sign and followed by the expression. Without an = sign, the expression will be treated as text. The expression will consist of cell references and/or numbers, and contain one or more arithmetic operators such as +, -, *, / or ^. The operator, ^, represents exponentiation. If an expression contains more than one arithmetic operator, the computations are performed in a certain sequence called the order of precedence. The rules in the order of precedence are as follows:

Exponentiation (^)
Multiplication (*) or Division (/), whichever comes first working from left to right
Addition (+) or subtraction (-), whichever comes first working from left to right

If there are any parentheses in the formula, Excel executes the inner most parenthesis first and then works outwards. Within the parenthesis, the order of precedence also holds. Assume you had the following formula in an Excel cell:

=8+(2*(6+5^2*4))

When executed, this formula will enter a value of 220 in that cell. Here is how this value is calculated. First, the inner most parenthesis (6+5^2*4) will be executed. Within this set of parenthesis, the exponentiation will be computed first (5^2=25) to reduce the expression in the inner parenthesis to (6+25*4). Next, the multiplication (25*4=100) is done giving (6+100). This is followed by the addition (6+100=106). This reduces the original formula to =8+(2*106). Next, Excel executes the parenthesis (2*106) giving us 212. Finally, the addition 8+212 is done for a total of 220. Note that if the expression is =8+(2*6+5^2*4), that is without the inner set of parenthesis, the value would be 120.

Ways to enter formulas (like A1+B1+C1 ) :

Enter the whole formula in the cell that will store the result ( =A1+B1+C1 ) and press Enter.
Enter = and then click on cell A1 then enter the + sign using the keyboard then click on cell B1 then enter the + sign using the keyboard then click on cell C1 and press Enter.
If the cells to be added are adjacent, you can identify these cells using a range that could be passed to functions. In our case enter =SUM(A1:C2)
Many formulas have shortcut command on the Excel ribbon. In our case AutoSum command (represented by the button,, in the Editing group on the Home ribbon tab) is used to sum adjacent horizontal or vertical cells.

In many cases you need to repeat the equation for many cells (e.g. to sum all the rows the in the sheet below)

To repeat the equation for for the rest of D column, you could:

Select D1, Copy, select D2, Past, and repeat the operation for the D3, D4, D5, D6
Hover the mouse over the left bottom corner of D1 till the curser became small black square (called Fill Handle), then click and drag it to the desired range (D2:D6). Once you release the mouse button, Excel will execute the formula for your range.

You wondering that the equation we copied was for A1+B1+C1, how it worked for the second row and the next rows ? The answer is that Excel automatically adjusted the locations of the cells when the equation was copied from cell D1 to cell D2 since the cells in the expression were all relative reference cells. A relative reference is a cell reference that changes or shifts when the cell is copied to another location. In contrast, an absolute reference is a cell reference that does not change when you copy that cell to another location. There is also a mixed reference for cells. A mixed reference only changes either the row reference or the column reference, depending on how the mixed reference is set up.

All cell references are relative reference cells unless designated otherwise. That is, a cell has to be designated an absolute reference or a mixed reference cell. To create an absolute reference, you have to preface the cell row and column designations with a dollar sign ($). For example, to make Cell A1 an absolute reference cell, it would have to referred as $A$1. Without these dollar prefixes, a cell is always considered a relative location cell. The relative reference works as follows: When you enter the formula, =A1+B1+C1, into Cell D1, the relative location of Cell A1 to Cell D1 is three cells to the left. Similarly, Cell B1 is two cells to the left of Cell D2, and Cell C1 is one cell left. When the equation is copied down to Cell D2, the first cell in this equation will still be the cell three cells to the left. However, the cell three cells to the left of D2 is now Cell A2. Similarly, the cell two cells to the left of D2 is Cell B2, and so on. That is why when the equation is copied down one cell, it makes the automatic adjustment for the new relative locations. Similarly, if the equation is copied down to Cell D3, the equation in Cell D3 would be copied as =A3+B3+C3. If there was an absolute reference cell in this equation, its reference would not change when the equation is copied to another location. The same processing happens when using the Fill Handle.

Basic Functions

Excel have many functions like MIN, MAX, AVERAGE which work on ranges.

Logical Functions

Logic is handled within Excel by logical functions. A logical function is a “logical” statement that determines whether a condition is true or false. Excel has several logical functions. The most commonly used of these is the IF function. The general form of the IF function is: =IF (logical-test, value-if-true, value-if-false)

Example =IF (A2>=5, “Greater Than 5”, “Less Than 5”)

This function works as follows: logical-test is an expression that is either true or false. If the expression is true, then the expression value-if-true is executed. If the expression is false, then the expression value-if-false is executed. The logical-test is constructed using comparison operators. A comparison operator compares two expressions according to the comparison operator. There are six comparison operators. They are:

Greater than >
Less than <
Equal to =
Greater than or equal to >=
Less than or equal to <=
Not equal to <>

The AND Function

The IF function example above only used one logical test. It can be easily extended to include multiple logical tests. When there are multiple conditions involved, the AND function is used within the IF function as follows: =IF (AND(logical-test 1, logical-test 2, …..), value-if-true, value-if-false)

Each logical-test is an expression that can true or false. If both logical-test 1 and logical-test 2 are true, the AND function returns the value true. Otherwise, it returns the value false. If the AND function returns the value true, the value-if-true expression is executed. If the AND function returns the value false, the value-if-false expression is executed.

Example =IF(AND(A2>=5,B2>=5), “Both A and B Greater than 5”, “Either A or B Less than 5”)

The OR Function

The OR function is also a multiple logical test. When there are multiple conditions involved, the OR function is used within the IF function as follows: =IF (OR (logical-test 1, logical-test 2, …..), value-if-true, value-if-false)

Just like the AND function, each logical-test is an expression that can be true or false. However, with the OR function, if either logical-test 1 or logical-test 2 is true, the OR function returns the value true. If both tests are false, it returns the value false. If the OR function returns the value true, the value-if-true expression is executed. If the OR function returns the value false, the value-if-false expression is executed.

Example =IF(OR(A2>=5,B2>=5), “Either A or B Greater than 5”, “Both A and B Less than 5”)

Lookup

Suppose you want to categorize your data or assign certain classes to your data (like assign class grades based on the total percentage). Excel provides VLOOKUP function. A VLOOKUP function is simply a table with the ranges built in. Excel then matches a data value against these ranges to determine a category or class (grade in our example). To use the VLOOKUP table, you have to first create the VLOOKUP table and then access this VLOOKUP table by using the VLOOKUP function within an Excel statement. The VLOOKUP table can be created anywhere in your sheet. like the one below

Now we need to use the VLOOKUP function to access the VLOOKUP table to assign grades to column B cells. The general form for VLOOKUP is =VLOOKUP(lookup-value, table-array range, col-index-number, [range lookup])

where:

lookup-value is the cell reference of the cell that contains the value to be looked up. In this case, the value will be the final percentage scores from the course. That is, the percentage scores in Column O.
table-array range is the range of the VLOOKUP table. Note that the headers are not included in this range. Also, the range of the table is always entered using absolute references.
col-index-number is the column number of the value of interest in the VLOOKUP table. For example, in this table, the value of interest is the grade which is in Column 2. Therefore, the col-index-number is 2. If the value of interest is the grade point, then the col-index-number will be 3.
[range lookup] is an optional argument with a value of true or false. A true value means that look-up value is being evaluated against ranges in the table-array. A false value means that the look-up value must exactly match the value in the table-array. Since this argument is optional, the default value is true.

With the total percentage for the first record in cell A2, the VLOOKUP statement in cell B2 is entered as =VLOOKUP(A2, $E$2:$F$11,2,TRUE)

This statement works as follows: The score in Cell A2 is checked against the first range. Excel forms the first range using the values in Rows 1 and 2 from Column 1 of the VLOOKUP table. These values are 0% and 70%. Therefore, Excel sets the range as 0-70%. Logically, the range is set up as >=0% and <70%. Thus, a score of exactly 70% is not part of this range. If the score from Cell A2 is in this range, a grade of F is returned and recorded in Cell B2. If it is not in this range, Excel forms the second range using the values in Rows 2 and 3 from Column 1. That is, 70% and 72%. As before, this range is set up as >=70% and <72%. If the score is in this range, a grade of C- is returned and recorded in Cell B2. If it is not in this range, Excel forms the third range using the values in Rows 3 and 4 of Column 1 of the VLOOKUP table. That is, 72% and 77%. This range is interpreted as >=72% and <77%. If the score is in this range, a grade of C is returned and recorded in Cell B2. If it is not in this range, Excel continues by forming the fourth range using the values in Rows 4 and 5 of Column 1, and then the fifth range and so on until it finds a range for the score in Cell A2. The last range formed in this table will be 92% to 97% using the values in Rows 9 and 10 of Column 1. If the score is not in this range, then it assigns a grade of A+, the last grade, by default.

To calculate the rest of the grades using this grade criterion, the VLOOKUP function in Cell B2 is copied down to Cell B24.

Basics Of Probability

Probability : a number between 0 and 1 that indicates how likely it is that some event, or set of events, will occur. A probability of 0 indicates it will never occur. A probability of 1 indicates it will always occur. P(event) = 0.2

Experiment : any situation capable of being replicated under essentially stable conditions. The replications may be feasible, such as flipping a coin over and over. Or, the experiment may only replicated in theory, such as investing $1,000 in a friend’s promising invention. In this case you could imagine repeating such an investment over and over many times, even though you intend to do it only once. When an experiment is defined it is important to carefully specify all possible sample outcomes. These are all the possible results of the experiment. For example, if a coin is tossed once, then the sample outcomes are {Head, Tail}. If the experiment is defined as tossing a coin twice, then one way of specifying the sample outcomes would be: {Head, Head}, {Head, Tail}, {Tail, Head}, and {Tail, Tail}. The outcomes of an experiment must be mutually exclusive. Mutually exclusive means there is no overlap between the outcomes. The outcomes of an experiment must also exhaustive. This means one of the defined sample outcomes must occur. There are two basic properties necessary for defining the probability of an outcome:

0 ≤ P(Outcome) ≤ 1
P(Outcomes) = 1

Probability Rules:

Rule for Complements: P() = 1 – P(A). represents all sample outcomes except A
Additive Rule for Mutually Exclusive Events: P(A or B) = P(A) + P(B)
Joint Probability for Two Independent Events: P(A and B) = P(A)*P(B). Independent means that the occurrence of A have ne influence on the occurrence of B.

Random Variables : The outcomes of an experiment are said to take place randomly, since they cannot, by definition, occur in any particular order or pattern. Such outcomes, called random variables. Once an experiment and its outcomes have been clearly stated, and the random variable of interest has been defined, then the probability of each value for the random variable can be specified. This specification is called a probability mass function. A probability mass function (p.m.f.) gives the probability for each outcome of a random variable.

Example : Let take a statistics course, receive a grade, letting random variable Y = grade point equivalent.

Outcome	Y	Probability P(Y)
F grade	0	P(Y=0)=0.05
D grade	1	P(Y=1)=0.10
C grade	2	P(Y=2)=0.35
B grade	3	P(Y=3)=0.30
A grade	4	P(Y=4)=0.20

Expected Value : An expected value is the average or mean of a random variable. The expected value of the random variable X is denoted by the symbol E[X], where E[X] = XP(X)

Standard Deviation : The standard deviation of a random variable X measures the average amount by which the values of the random variable X deviate from their expected value. This standard deviation is denoted by σ_X, where σ is the Greek letter sigma.

Discrete random variable: a variable whose values can be separated from one another and counted then the probability of each value for the random variable can be specified. This specification is called a probability mass function.

Continuous random variable: a variable whose values cannot be separated from one another, and are measured rather than counted. Continuous probability distributions are often called probability density functions because it defined over a range of X values (X represents the random variable). The probability that any specific value within this range is so small that it is assumed to be equal zero. Probability P(a < X < b) is the Shaded Area

Determining Normal Probabilities : For data that follow normal distribution this cumulative probability for a normal random variable X P(X<k) is calculated using the following Excel command: =NORM.DIST(k, mean-of-X , Standard-Deviation-of-X,1)

Example: Suppose that the average person’s IQ in US is 100 and the standard deviations for past IQs is 15, then the probability a randomly selected person in the U.S. will have an IQ between 110 and 120 is P(110 < X < 120) = NORMDIST(120,100,15,1)- NORMDIST(110,100,15,1) = 0.9088 - 0.7475 = 0.1613

Normal Inverse : If we want to reverse the previous process, get the X value giving its probability. This is done using the Excel command : =NORM.INV(Probability,mean-of-X , Standard-Deviation-of-X)

Histograms

Histograms allow you to summarize large amounts of data graphically by determining the frequencies over set ranges of specified size (called classes). The width of a class in Excel is called a bin width. The ranges of the classes are called class intervals. The graph of a frequency distribution is called a histogram and is the same as a column chart. To construct a histogram in Excel, we first have to determine the bin width. If you want equal bin widths like the ones below, enter the first and second values (4.20, 4.40), then select both cells and use the fill handle to fill more cells using the same incremental value used in the first two cells (0.20). In order to create histograms, we need to initialize the analysis toolpak addin that comes with Excel.

Initialize the Analysis Toolpak: Click on the Office button >> Excel Options in the lower right hand corner >> choose the Add-ins section on the left hand side >> click the Go button next to the Manage Excel Add-ins drop down box >> check the box next to Analysis Toolpak and Analysis Toolpak VBA >> press OK.

To create your histogram, go to the Data ribbon tab and click on the Data Analysis button in the Analysis group. In the dialog box that appears, choose Histogram and click Ok. In the resulting dialog table, enter the Input Range (A4:A26), the Bin Range (C2:C15), the worksheet name (histogram1), check the Chart Output option and click on OK. These steps will give a “basic” histogram (like the one below) which can be formatted later.

A common addition to histogram representation, is to convert the frequencies into percentages. For histogram1, calculate the total number of data items by entering a sum function in cell B17. In cell B17, type the formula: =SUM(B2:B16)

Then in cell C2, type the following formula and copy it down to cell C16: = B2/$B$17

Lastly, format C2:C16 to be percentage with one decimal place. From ribbon Home tab, Number group

In this post we introduced basic statistics operation that can be done easily in Excel. To explain these operations we reviewed some statistics background and general Excel knowledge. In the upcoming posts we will dig deeper using Excel and other specialized software like SAS and R language.