you need to know

- the range of the values in your data
- how these values are distributed in the range
- how values in different variables relate to each other

> x = 1:10 > range (x) [1] 1 10 > quantile (x) 0% 25% 50% 75% 100% 1.00 3.25 5.50 7.75 10.00

the greater the variability in the data, the greater will be your uncertainty and lower your ability to distinguish between competing hypotheses

two populations can have different means but the same variance

> x = c (1, 3, 5) > z = c (2, 4, 6) > mean (z) ; mean (x) [1] 3 [1] 4 > var (z) ; var (x) [1] 4 [1] 4

two populations can have the same mean but different variances

> x = c (1, 3, 5) > y = c (0, 3, 6) > mean (y) ; mean (x) [1] 3 [1] 3 > var (x) ; var (y) [1] 4 [1] 9

comparing means when the variances are different is an extremely bad idea

> x = scan () 1: 123 2: 567 890 34 5: 34.6 6: Read 5 items > x [1] 123.0 567.0 890.0 34.0 34.6 > y <- scan (what = " ") 1: old young bible 4: stout 5: Read 4 items > y [1] "old" "young" "bible" "stout"

- scan (filename, what, sep, n, nlines, fileEncoding)
- what: ‘logical’, ‘integer’, ‘numeric’, ‘complex’, ‘character’, ‘raw’, ‘list’

n: integer: the maximum number of data values to be read, defaulting to no limit

nlines: if positive, the maximum number of lines of data to be read

sep: by default ‘white-space’ delimited input fields - read.csv (filename, head, sep)

read.table (filename, head, sep) - the second argument indicates whether or not the first row is a set of labels

the third argument indicates sepatate sign between each number of each line - write.csv (obj, filename)

write.table (obj, filename) - save (obj, file=filename)
- load (filename)

file "a.dat":

1 2 3 4 5 6 7 8 9

> y = scan ("a.dat") Read 9 items > y [1] 1 2 3 4 5 6 7 8 9 > is.numeric(y) [1] TRUE

> y = rnorm (50, 7.0, 0.1) > write (file="filename.dat", y) > z = read.csv (file="bbb", header=0, sep=" ") > is.list(z) [1] TRUE

> z <- read.fwf ("http://www.ats.ucla.edu/stat/data/test_fixed.txt", width = c (8, 1, 3, 1, 1, 1))

you use the **width** argument to indicate the number of signs of each variable. in a fixed format file you do not have the names of the variables on the first line, and therefore they must be added after you have read in the data

> z V1 V2 V3 V4 V5 V6 1 general 0 70 4 1 1 2 vocati 1 121 4 2 1 3 general 0 86 4 3 1 4 vocati 0 141 4 3 1 5 academic 0 172 4 2 1 6 academic 0 113 4 2 1 7 general 0 50 3 2 1 8 academic 0 11 1 2 1 > s <- scan ("http://www.ats.ucla.edu/stat/data/names.txt", what = character ()) Read 6 items > s [1] "prgtyp" "gender" "id" "ses" "schtyp" "level"

number

string

logical T[RUE] | F[ALSE]

array array(data = NA, dim = length(data), dimnames = NULL)

can have one, two or more dimensions. it is a vector which is stored with additional attributes giving the dimensions (attribute ‘"dim"’) and optionally names for those dimensions (attribute ‘"dimnames"’)

list list(x1,x2,..,xn) list(tag1=x1,tag2=x2,..,tagn=xn)

vector

> x <- c (10, 20, 30, 40, 50, 60) > y <- c (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) > x[4] > y[3:5] > sum (x) > sum (sort (y)[1:3]) > prod (rev (x)) > rep (5.3, 17) > rep (1:6, 2)

- integer (length = 0)

as.integer (x, ...)

is.integer (x) - double (n)
- creates a double-precision vector of the specified length. the elements of the vector are all equal to ‘0’
- round (x)

ceiling (x)

floor (x) - log (x)

exp (x) - numeric (n)
- creates empty (zeroed) sample of size n
- as.numeric (x)
- convert to numeric (for example : from factor)
- is.numeric (x)
- check
- seq (from, to, by)

seq (from, to, lenght.out = ..)

seq (from, by, along x)

seq (from, by, along = 1:20) - the 'along' option allows you to map a sequence onto an existing vector (to ensure equal lengths) or if you know how many numbers you want but you can't be bothered to work out the final value of a series, you can do this

> numbers <- 30:1 > numbers [1] 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 [18] 13 12 11 10 9 8 7 6 5 4 3 2 1 > numbers[5] [1] 26 > numbers[c(5,11,3)] [1] 26 20 28 > indices <- c(5,11,3) > numbers[indices] [1] 26 20 28

matrix

> y <- c (3, 4, 7, 2, 8, 3, 4, 7, 1, 6, 7, 8, 9, 3, 7) > m <- matrix (y, nrow = 5) > n <- matrix (y, ncol = 3) > m [,1] [,2] [,3] [1,] 3 3 7 [2,] 4 4 8 [3,] 7 7 9 [4,] 2 1 3 [5,] 8 6 7 > n [,1] [,2] [,3] [1,] 3 3 7 [2,] 4 4 8 [3,] 7 7 9 [4,] 2 1 3 [5,] 8 6 7

factor

there is a way to tell R to treat the some column as a set of factors. you specify that a variable is a factor using the **factor** command. in the following example you convert column "x$month" (which can contain month's names) into a factor:

> x$month <- factor (x$month)

once a vector is converted into a set of factors then R treats it DIFFERENTLY - a set of factors have a DISCRETE SET of possible values, and it does not make sense to try to find averages or other NUMERICAL descriptions

> meteo$Month = factor (meteo$Month, ordered = T, levels = c ("Jan","Feb","Mar","Apr", "May","Jun","Jul","Aug", "Sep","Oct","Nov","Dec")) > is.factor (month) [1] TRUE > plot (meteo$Month, meteo$MeanTemp) # or > boxplot (meteo$MeanTemp ~ meteo$Month, col = "orange") > dev.off () > meteo$MeanTemp[meteo$Month == "Jan"]

dataframe

a dataframe is an object with rows and columns

the rows contain different observations/measurements from your experiment

the columns contain different variables

the values in the body of the dataframe can be numbers, but they could also be text; they could be calendar dates (like 23/5/04); or they could be logical variables

> d <- c (7, 4, 6, 8, 9, 1, 0, 3, 2, 5, 0) > r <- rank (d) > s <- sort (d) > o <- order (d) > v <- data.frame (d, r, s, o) > v d r s o 1 7 9.0 0 7 2 4 6.0 0 11 3 6 8.0 1 6 4 8 10.0 2 9 5 9 11.0 3 8 6 1 3.0 4 2 7 0 1.5 5 10 8 3 5.0 6 3 9 2 4.0 7 1 10 5 7.0 8 4 11 0 1.5 9 5

> w <- read.table ("some.txt", head = 1) > attach (w) > month <- factor (month) > h <- read.csv (file = "simple.csv", head = TRUE, sep = ", ")

> names(v1) <- c("x","y","sum")

**read.table** would fail if there were any spaces in any of the variable names in row 1 of the dataframe (the header row) or between any of the words within the same factor level

detach (objname)

you can get a quick summary of the data by calculating a frequency table. a frequency table is a table that represents the number of occurrences of every unique value in the variable

- lapply (x, function, ...)
- apply a function over a List or Vector

returns a list of the same length as x, each element of which is the result of applying FUN to the corresponding element of x - sapply (x, function, ..., simplify=TRUE, USE.NAMES=TRUE)
- apply a function over a List or Vector

is a user-friendly version and wrapper of lapply - vapply (x, function, function.VALUE, ..., USE.NAMES=TRUE)
- apply a function over a List or Vector

is similar to sapply, but has a pre-specified type of return value, so it can be safer - apply (x, margin, function, ...)
- Apply Functions Over Array Margins

Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix

MARGIN: a vector giving the subscripts which the function will be applied over

for a matrix ‘1’ indicates rows, ‘2’ indicates columns, ‘c(1, 2)’ indicates rows and columns.

Where x has named dimnames, it can be a character vector selecting dimension names - mapply (column, factor, function)
- Apply a Function to Multiple List or Vector Arguments

is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on - tapply (column, factor, function)
- apply a function to each (non-empty) group of values given by a unique combination of the levels of certain factors
- aggregate (column, list (title=factor ,...), function)
- by (column, list (title=factor,...), function)

> tapply (meteo$MeanTemp, meteo$Year, mean) > aggregate (meteo$MeanTemp, list (period = meteo$Year), mean) > by (meteo$MeanTemp, list (period = meteo$Year), mean) # exclude variables field1, field2, field3 m1 <- names (mydata) %in% c ("field1", "field2", "field3") n1 <- mydata[!m1] # exclude 3rd and 5th field n2 <- mydata[c (-3,-5)] # take first five observations n3 <- mydata[1:5,] # sql-style n4 <- mydata[ which (mydata$gender == 'F' & mydata$age > 65), ] # or attach (newdata) newdata <- mydata[ which (gender == 'F' & age > 65),] detach (newdata) # using subset function newdata <- subset (mydata, age > = 20 | age < 10, select = c (ID, Weight))

> p = read.csv ("temperature.csv") > p$Month = factor (p$Month, ordered = T, levels = c ("Jan","Feb","Mar","Apr", "May","Jun","Jul","Aug", "Sep","Oct","Nov","Dec")) > boxplot (p$Month, p$Temp, col="orange", main="raw data")

> mm = tapply (p$Temp, p$Month, median) > mm Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec -5.8 -5.8 -0.7 6.9 12.3 18.2 21.0 19.9 15.7 9.5 3.4 -2.8 > as.numeric (mm) [1] -5.8 -5.8 -0.7 6.9 12.3 18.2 21.0 19.9 15.7 9.5 3.4 -2.8 > boxplot (p$Month, p$Temp - as.numeric (mm), col="orange", main="deseasoned data") > dev.off()

- min (x)
- mean (x)
- median (x)
- max (x)
- var (x)
- sd (x)

> mean (x$Field_Name) ; median (x$Field_Name) [1] 0.7649074 [1] 0.72 > quantile (x$Field_Name) 0% 25% 50% 75% 100% 0.1300 0.4800 0.7200 1.0075 1.7600 > min (x$Field_Name) ; max (x$Field_Name) [1] 0.13 [1] 1.76 > var (x$Field_Name) ; sd (x$Field_Name) [1] 0.1429382 [1] 0.3780717

- summary (x)
- min, 25th quantile, mean, median, 75th quantile, max
- fivenum (x)
- min, lower-hinge, median, upper-hinge, max
- quantile (x)

quantile (x, grades)

> z = rnorm (1000) > mean (z) [1] -0.02373456 > quantile (z, c (.1, .3, .7, .9)) 10% 30% 70% 90% -1.3458127 -0.5073407 0.5294266 1.2001054

- sum (xs)
- prod (xs)
- factorial (n)
- choose (n, x)
- binomial coefficients n! / (x! * (n - x)!)
- mad (xs)

mad (xs, center, const) - const * median (abs (xi - center))

by default: center = median and const = 1.4826 - for asymptotically normal consistency - cov (xs, ys)
- covariance between two series

cov = mean (xs * ys) - mean (xs) * mean (ys)

- cor (xs, ys)
- correlation between two series

cor = cov²(xs,ys) / ( σ²xs * σ²ys)

> w1 <- read.table ("filename", header = T) > stripchart (w1$vals) > stripchart (w1$vals, method = "stack")

the histogram graphically shows the following:

- center (i.e., the location) of the data
- spread (i.e., the scale) of the data
- skewness of the data
- presence of outliers and
- presence of multiple modes in the data

> hist (w1$vals, col = 'grey', breaks = 12, xlim = c (0.9, 1.3))

box plots are an excellent tool for conveying location and variation information in data sets, particularly for detecting and illustrating location and variation changes between different groups of data

> boxplot (w1$vals, main = 'Main Title of the Plot', xlab = 'x axe label', horizontal = TRUE) > boxplot (e$MeanTemp~e$Month, col = "orange")

> plot (x, y, type = "l", pch = 3)

the default plotting character (pch = 1) is ο

if you want Δ, use pch = 2

if you want + (plus signs), use pch = 3

if you want x use pch = 4

if you want ♦ use pch = 5

to draw the regression line through the data, you employ the straight line drawing directive

> abline (intercept, slope)

> abline (lm (y~x))

to plot four graphics (two-in-rows) use the command:

> par (mfrow = c (2, 2))

to plot two different funcs on the same plot:

> plot (t, p1, ylim = c (-6,4), type = "l", col = "red" ) > par (new = TRUE) > plot (t, p2, ylim = c (-6,4), type = "l", col = "green" )

> p1 = matrix (p1) > p2 = matrix (p2) > matplot (t, cbind (p1, p2), pch = 19)

> plot (t, p1, ylim = c (-6,5)) > points (t, p2)

> plot (t, p1, ylim = c (-6,5)) > lines (t, p2)

- par ()
- to look at current graphical params
- par (col.lab="red")
- to set the parameter

> hist (mtcars$mpg, col.lab="red")

function | output to |
---|---|

pdf ("mygraph.pdf") | pdf file |

png ("mygraph.png") | png file |

jpeg ("mygraph.jpg") | jpeg file |

bmp ("mygraph.bmp") | bmp file |

postscript ("mygraph.ps") | postscript file |

dev.off () | returns output to the terminal |

so to save a jpg file called "plot01.jpg" containing a plot of x and y:

> jpeg ('plot01.jpg') > plot (x, y) > dev.off ()

- runif (n)
- generates n random numbers between 0 and 1 from a uniform distribution
- rnorm (n, mean = 0, sd = 1)
- n: number of observations

mean: vector of means

sd: vector of standard deviations - dnorm (x, mean = 0, sd = 1)
- given a set of values it returns the PDF at each point
- pnorm (q, mean = 0, sd = 1)
- given a number (or a list) it returns CDF
- rnbinom (k, meanval, sdval)
- generates k random numbers from a negative binomial distr whose mean is meanval and sd is sdval
- rbinom (n, size, prob)
- n: number of observations, size: number of trials, prob: probability of success on each trial
- dbinom(x, size, prob)
- pbinom(q, size, prob)
- rpois (n, lambda)
- x: vector of (non-negative integer) quantiles, lambda: vector of (non-negative) means
- dpois (x, lambda)
- returns PDF
- ppois(q, lambda)
- returns CDF
- are the values normally distributed or not ?
- are there outliers in the data ?
- if data were collected over a period of time, is there evidence for correlation?
- binom.test (x, n, p = 0.5)
- test of a simple null hypothesis about the probability of success in a Bernoulli experiment

x: number of successes (or a vector of length 2 giving the numbers of successes and failures, respectively)

n: number of trials (ignored if ‘x’ has length 2)

p: hypothesized probability of success - shapiro.test (x)
- test of normality
- qt (tailed-confidence, df)
- the function gives the quantiles of the t distribution

the first argument is the probability and

the second is the degrees of freedom of your population

> my <- numeric (1000) > for (i in 1:1000) { y <- rnbinom (30, 1, 0.2) ; my[i] <- mean (y) } > hist (my) > dev.off()

in order to be reasonably confident that your inferences are correct, you need to establish some facts about the distribution of the data:

many statistical tests are based on the assumption of normality. the assumption of normality often leads to tests that are simple and powerful compared to tests that do not make the normality assumption. unfortunately, many real data sets are in fact not approximately normal

to be sure that your data sampling x is close to normal distribution use **Shapiro test**

use Student' t-test when the means are independent and the errors are normally distributed

non-normality, outliers and serial correlation can all invalidate inferences made by Student' t-test. much better in such cases to use a non-parametric technique - Wilcoxon' signed-rank test. use Wilcoxon' rank sum test when the means are independent but errors are not normally distributed

neither the t-test nor the w-test can cope properly with situations where the variances are different, but the means are the same. this draws attention to a very general point: scientific importance and statistical significance are not the same thing

**t-values** associated with different levels of confidence available in R:

confidence intervals are always 2-tailed. thus, if you want to establish a 95% confidence interval you need to calculate (or look up) **t-value**

> qt (.025, 9) [1] -2.262157

or

> qt (.975, 9) [1] 2.262157

this says that

so, finally, you can write down the formula for the confidence interval of a mean based on a small sample:

> x = c(3, 5, 7) > qt (.975, 2) * var (x) [1] 17.21061 > qt (.025, 2) * var (x) [1] -17.21061

> t.test (A, B) Welch Two Sample t-test data: A and B t = -3.873, df = 18, p-value = 0.001115 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.0849115 -0.9150885 sample estimates: mean of x mean of y 3 5

you typically use 5% as the chance of rejecting the null hypothesis

when the covariance of A and B is positive, this is a great help because it reduces the variance of the difference, and should make it easier to detect significant differences between the means.

pairing is not always effective, because the correlation between A and B may be weak

> x <- c (20, 15, 10, 5, 20, 15, 10, 5, 20, 15, 10, 5, 20, 15, 10, 5) > y <- c (23, 16, 10, 4, 22, 15, 12, 7, 21, 16, 11, 5, 22, 14, 10, 6) > t.test (x, y) > t.test (x, y, paired = T)

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

78 | 24 | 64 | 45 | 64 | 52 | 30 | 50 | 64 | 50 | 78 | 22 | 84 | 40 | 90 | 72 |

78 | 24 | 62 | 48 | 68 | 56 | 25 | 44 | 56 | 40 | 68 | 36 | 68 | 20 | 58 | 32 |

> t = read.table ("marks.dat", header = T) > x = as.numeric (t[, 2]); x1 = x[1:16] > y = as.numeric (t[, 3]); y1 = y[1:16] > t.test (x1, y1, paired = T) Paired t-test data: x1 and y1 t = 2.2353, df = 15, p-value = 0.04103 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.3600356 15.1399644 sample estimates: mean of the differences 7.75

Wilcoxon non-parametric test is used if the errors looked to be non-normal

this non-parametric test can be more powerful than the t-test if the distribution is strongly skewed

the Kolmogorov-Smirnov test is used to decide if a sample comes from a population with a specific distribution. the test is based on the empirical cumulative distribution function (ECDF)

ks.test (x, 'pnorm')

if ‘y’ is numeric, a two-sample test of the null hypothesis that ‘x’ and ‘y’ were drawn from the **same _continuous_ distribution** is performed

‘y’ can be a character string naming a __continuous__ CDF, or such a function. in this case, a one-sample test is carried out of the null that the distribution function which generated ‘x’ is distribution ‘y’

the K-S test has several important limitations:

- applies to continuous distributions only
- more sensitive near the center of the distribution than at the tails
- the distribution must be fully specified

suppose you are interested in investigating the assosiation between x and y

isn't just enought to caclulate the correlation ρ between x and y?

perhaps for this dataset (ρ=0.83) it is enought

what about this dataset? ρ=0.5

and this? ρ=-0.8

one of the best methods to describe some data is by fitting a statistical model. the model parameters tell you much more about the relationship between x and y than correlation coefficient. in statical modeling you are inerested in estimating the unknown parameters from your data

models have parameters some of which are unknown. you are intrested in the inferring the unknown parameters from your data. parameter's estimation needs be done in formal way

- is there a trend in the data?
- what is the slope of the trend (positive or negative)?
- is the trend linear or curved?
- is there any pattern to the scatter around the trend?

> x [1] 1 2 3 4 5 6 7 8 9 10 > y [1] 0.9252341 0.8546417 3.1241370 6.0860490 6.2729448 8.7434938 [7] 5.1923348 9.1469443 4.9776600 7.9872857you begin with the simplest possible linear model; the straight line: y = a * x + b

b - interseption a - slope

- - the intercept a is greater than zero?
- - the slope b is negative?
- - the variance in y is constant?

> z = lm (x~y) > summary (z)

the first step is to fit a horizontal line though the data, using **abline (intersept,slope)**, showing the average value of y:

> plot (x, y) > abline (lm (x ~ y))

TODO

TODO

TODO

TODO

the likelihood function, when evaluated in certain point (args of the function), gives the probability of observing the data

the data is treated as fixed quantity and model's parameters treated as random variables

priors should be choosen before we see the data. if you know nothing about the parameter you should assign to it so-called **uninformative** prior

sayHello <- function () { print ('hello') } sayHello ()

how can you run this via command-line?

if you want the output to print to the terminal it is best to use Rscript

$> Rscript a.R

note that when using R CMD BATCH a.R that instead of redirecting output to standard out and displaying on the terminal a new file called a.Rout will be created.

>R CMD BATCH a.Rcheck the output

>cat a.Rout

if you really want to use the *./a.R* way of calling the script you could add an appropriate *#!* to the top of the script

you need to use them when you do something DIFFERENT to each element of an object

the **break** statement can be used to terminate any loop, possibly abnormally. this is __the only way__ to terminate **repeat** loops

#!/usr/local/bin/Rscript #binomial distribution args <- commandArgs (TRUE) z = as.numeric (args[1]) p = as.numeric (args[2]) g <- function (x) { choose (z, x) * p^x * (1-p)^(z-x) } x = 1:z plot (x, g (x), type = "l")

$> Rscript bintrial.r 100 0.9 $> display Rplots.pdf

#!/usr/local/bin/Rscript #erlang distribution args <- commandArgs(TRUE) k = as.numeric(args[1]) λ = as.numeric(args[2]) erlang <- function (t) { λ^k * t^(k - 1) * exp (- λ * t) / factorial (k - 1) } t = seq (from = 0, by = .1, to = 100) plot (t, erlang (t), type = "l")

$> Rscript erltrial.r 3 .07 $> display Rplots.pdf

> search () > library () > installed.packages () > list.of.packages <- c ("dlm", "hawkes") > new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])] > if (length (new.packages)) install.packages (new.packages)

and after that:

> require (dlm)