set.seed(seed)
is the recommended way to specify the random seed in R; seed is a single integer
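for example (a minimal sketch; any integer works as the seed):

> set.seed (123)
> rnorm (1)
[1] -0.5604756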
if you add two vectors of the same length, you get another vector of the same length, where each entry is the sum of the corresponding entries of the two input vectors
> x = c (3, 1, 9)
> y = c (2, 5, 6)
> x + y            # addition is entry-wise
[1] 5 6 15
we can index into vectors with square brackets []. for example, we can pull out the second entry of a vector:
> x = c (3, 1, 9)
> x[2]             # print the 2nd element
[1] 1
the next bit of code creates a vector of 100 NA values:

> s = rep (NA, 100)
NA is short for 'not available.' R propagates NA through calculations; for example, the mean of a vector containing an NA is itself NA. this is why it is good practice to fill a vector with NA before filling it with our actual data; if we make a mistake and accidentally leave an entry unfilled, R will let us know, because even the mean will come out NA
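a minimal sketch of this practice (the length and values are arbitrary):

> s = rep (NA, 5)
> s[1:4] = c (2, 4, 6, 8)   # forgot to fill s[5]
> mean (s)
[1] NA
> mean (s, na.rm = TRUE)    # NAs can be excluded explicitly
[1] 5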
to create a vector from 1 to 10 in increments of 1:

> seq (from = 1, to = 10, by = 1)
 [1]  1  2  3  4  5  6  7  8  9 10
> mean (x$Field_Name) ; median (x$Field_Name)
[1] 0.7649074
[1] 0.72
> quantile (x$Field_Name)
    0%    25%    50%    75%   100% 
0.1300 0.4800 0.7200 1.0075 1.7600 
> min (x$Field_Name) ; max (x$Field_Name)
[1] 0.13
[1] 1.76
> var (x$Field_Name) ; sd (x$Field_Name)
[1] 0.1429382
[1] 0.3780717
> z = rnorm (1000)
> mean (z)
[1] -0.02373456
> quantile (z, c (.1, .3, .7, .9))
       10%        30%        70%        90% 
-1.3458127 -0.5073407  0.5294266  1.2001054 
> x = 1:10
> range (x)
[1]  1 10
> quantile (x)
   0%   25%   50%   75%  100% 
 1.00  3.25  5.50  7.75 10.00 
the greater the variability in the data, the greater your uncertainty and the lower your ability to distinguish between competing hypotheses
two populations can have different means but the same variance
> x = c (1, 3, 5)
> z = c (2, 4, 6)
> mean (x) ; mean (z)
[1] 3
[1] 4
> var (x) ; var (z)
[1] 4
[1] 4
two populations can have the same mean but different variances
> x = c (1, 3, 5)
> y = c (0, 3, 6)
> mean (x) ; mean (y)
[1] 3
[1] 3
> var (x) ; var (y)
[1] 4
[1] 9
comparing means when the variances are different is an extremely bad idea
in order to be reasonably confident that your inferences are correct, you need to establish some facts about the distribution of the data
> round (rnorm (1), 2)
[1] 0.97
> choose (3, 2)
[1] 3
number
> numbers <- 30:1
> numbers
 [1] 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14
[18] 13 12 11 10  9  8  7  6  5  4  3  2  1
> numbers[5]
[1] 26
> numbers[c (5, 11, 3)]
[1] 26 20 28
> indices <- c (5, 11, 3)
> numbers[indices]
[1] 26 20 28
string
logical T[RUE] | F[ALSE]
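a minimal sketch (values are arbitrary); logicals coerce to 0/1 in arithmetic:

> b = c (TRUE, FALSE, TRUE, TRUE)
> sum (b)
[1] 3
> !b
[1] FALSE  TRUE FALSE FALSE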
array
list
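a minimal sketch (the element names are arbitrary); unlike a vector, a list can hold elements of different types:

> lst = list (name = "abc", nums = 1:3)
> lst$nums
[1] 1 2 3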
vector
> x <- c (10, 20, 30, 40, 50, 60)
> y <- c (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> x[4]
> y[3:5]
> sum (x)
> sum (sort (y)[1:3])
> prod (rev (x))
> rep (5.3, 17)
> rep (1:6, 2)
matrix
> y <- c (3, 4, 7, 2, 8, 3, 4, 7, 1, 6, 7, 8, 9, 3, 7)
> m <- matrix (y, nrow = 5)
> n <- matrix (y, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    3    3    7
[2,]    4    4    8
[3,]    7    7    9
[4,]    2    1    3
[5,]    8    6    7
> n
     [,1] [,2] [,3]
[1,]    3    3    7
[2,]    4    4    8
[3,]    7    7    9
[4,]    2    1    3
[5,]    8    6    7
arguments of complex ():
length.out   numeric; desired length of the output vector, inputs being recycled as needed
real         numeric vector
imaginary    numeric vector
modulus      numeric vector
argument     numeric vector
> cs1 = complex (2, c (1, 2, 3))
> cs1
[1] 1+0i 2+0i 3+0i
> cs2 = complex (7, c (1, 2, 3))
> cs2
[1] 1+0i 2+0i 3+0i 1+0i 2+0i 3+0i 1+0i
useful functions take z, an object of mode complex:
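for example, the standard accessors Re, Im, Mod, Arg and Conj (a minimal sketch; the value of z is arbitrary):

> z = complex (real = 3, imaginary = 4)
> Re (z) ; Im (z)
[1] 3
[1] 4
> Mod (z)   # sqrt (3^2 + 4^2)
[1] 5
> Arg (z)   # atan2 (4, 3), in radians
[1] 0.9272952
> Conj (z)
[1] 3-4i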
two workhorses: readline and print
> x = as.numeric (readline (prompt = "your num: "))
> print (x)
> x = scan ()
1: 123
2: 567 890 34
5: 34.6
6: 
Read 5 items
> x
[1] 123.0 567.0 890.0  34.0  34.6
> print (x)
[1] 123.0 567.0 890.0  34.0  34.6
> y <- scan (what = character ())
1: old young bible
4: stout
5: 
Read 4 items
> y
[1] "old"   "young" "bible" "stout"
> print (y)
[1] "old"   "young" "bible" "stout"
> fd = file ("aaa.dat", "w")
> close (fd)
"r" or "rt" reading in text mode "w" or "wt" writing in text mode "a" or "at" appending in text mode "rb" reading in binary mode "wb" writing in binary mode "ab" appending in binary mode "r+" or "r+b" reading and writing "w+" or "w+b" reading and writing, truncating file initially "a+" or "a+b" reading and appending
file "a.dat":
1 2 3 4 5 6 7 8 9
> y = scan ("a.dat") Read 9 items > y [1] 1 2 3 4 5 6 7 8 9 > is.numeric(y) [1] TRUE
> y = rnorm (50, 7.0, 0.1)
> write (file = "filename.dat", y)
> z = read.csv (file = "filename.dat", header = FALSE, sep = " ")
> is.list (z)
[1] TRUE
> z <- read.fwf ("http://www.ats.ucla.edu/stat/data/test_fixed.txt", widths = c (8, 1, 3, 1, 1, 1))
you use the widths argument to indicate the number of characters in each variable. in a fixed-format file the variable names do not appear on the first line, and therefore they must be added after you have read in the data
> z
        V1 V2  V3 V4 V5 V6
1  general  0  70  4  1  1
2   vocati  1 121  4  2  1
3  general  0  86  4  3  1
4   vocati  0 141  4  3  1
5 academic  0 172  4  2  1
6 academic  0 113  4  2  1
7  general  0  50  3  2  1
8 academic  0  11  1  2  1
> s <- scan ("http://www.ats.ucla.edu/stat/data/names.txt", what = character ())
Read 6 items
> s
[1] "prgtyp" "gender" "id"     "ses"    "schtyp" "level" 
in a binary data file, information is stored in groups of binary digits. each binary digit is a zero or one, and eight binary digits grouped together form a byte
in order to successfully read binary data, you must know how pieces of information have been parsed into binary
for example, if your data consists of integers, how many bytes should you interpret as representing one integer in your data?
or if your data contains both positive and negative numbers, how can you distinguish the two?
how many pieces of information do you expect to find in the binary data?
ideally, you know the answers to these questions before starting to read in the binary file
to get started, we establish a connection to a file and indicate that we will be using the connection to read in binary data. we do this with the file command, providing first the pathname and then the mode "rb" for "reading binary".
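a minimal sketch (the filename binary.dat is a placeholder; the examples below assume the connection is named to.read):

> to.read = file ("binary.dat", "rb")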
next, we use the readBin command to begin. if we think the file contains integers, we can start by reading in the first integer and hoping that its size does not require further specification. different platforms store binary data in different ways, and which end of a string of binary values represents the largest or smallest values is a difference that can yield very different results from the same set of binary values. this characteristic is called "endianness". the binary files in the examples on this page were written using a PC, which suggests they are little-endian. when reading in binary data that may or may not have been written on a different platform, indicating an endian can be crucial. for example, without adding endian = "little" to the command below while running R on a Mac, the command reads the first integer as 16777216
> readBin (to.read, integer(), endian = "little")
[1] 1
thus, it looks like the first integer in the file is 1. as we repeatedly use readBin commands, we will work our way through the binary file until we hit the end. we can read in multiple integers at once by adding an n = option to our command. if the n you specify is greater than the number of values remaining in the file, readBin will read and display as much as is available, so there is no danger of guessing too large an n. since we have already read in the first integer, this command will begin at the second:
> readBin (to.read, integer(), n = 4, endian = "little")
[1] 2 3 4 5
if you have additional information about what is in your file, you should incorporate that into the readBin command. for example, if you know that you wish to read in integers stored on 4 bytes each, you can indicate this with the size option:
> readBin (to.read, integer(), n = 2, size = 4, endian = "little")
[1] 6 7
similarly, if you know that your file contains characters, complex numbers, or some other type of information, you would adjust the readBin command accordingly, changing integer() to character() or complex()
since you will likely want to do more than just look at what is contained in the binary file, you will need some strategies for formatting data as you read it in
for example, suppose you are given a binary file with the following description:
- three numeric variables collected from 200 subjects,
- the three variable names appear first in the file,
- the numeric values are integers stored on two bytes each, and
- all of the values for the first variable are followed by all the values for the second, and then all of the values for the third (as if they had been read in as columns, not rows)
first, open a connection to the data
> newdata = file ("https://stats.idre.ucla.edu/stat/r/faq/bindata.dat", "rb")
next, let’s read in the variable names and save them to a vector in R
> varnames = readBin (newdata, character(), n = 3)
> varnames
[1] "read"  "write" "math" 
to read in the integer values, we can opt to read all 600 into one vector and then separate it out into the three variables. since the integers are stored on two bytes each, we set size = 2:

> datavals = readBin (newdata, integer(), size = 2, n = 600, endian = "little")
> readvals = datavals[1:200]
> writevals = datavals[201:400]
> mathvals = datavals[401:600]
or we can read in each variable's values with a separate readBin command:
> readvals = readBin (newdata, integer(), size = 2, n = 200, endian = "little")
> writevals = readBin (newdata, integer(), size = 2, n = 200, endian = "little")
> mathvals = readBin (newdata, integer(), size = 2, n = 200, endian = "little")
then, we can combine our three value vectors into one data frame with the variable names as our column names:
> rdata = cbind (readvals, writevals, mathvals)
> colnames (rdata) = varnames
> rdata[1:5,]
     read write math
[1,]   57    52   41
[2,]   68    59   53
[3,]   44    33   54
[4,]   63    44   47
[5,]   47    52   57
lastly, since we have finished reading data from the binary file, we can close the connection:
> close (newdata)
> xs = c (1, 2, 4, 5)
> ys = as.integer (xs)
> fd = file ("mybin.dat", "wb")
> writeBin (ys, fd)
> close (fd)
> fd = file ("mybin.dat", "rb")
> readBin (fd, integer(), 4)
[1] 1 2 4 5
> close (fd)
there is a way to tell R to treat a column as a set of factors. you specify that a variable is a factor using the factor command. in the following example you convert the column x$month (which contains month names) into a factor:
> x$month <- factor (x$month)
once a vector is converted into a factor, R treats it DIFFERENTLY - a factor has a DISCRETE SET of possible values, and it does not make sense to try to find averages or other NUMERICAL descriptions
> meteo$Month = factor (meteo$Month, ordered = T,
+                       levels = c ("Jan","Feb","Mar","Apr","May","Jun",
+                                   "Jul","Aug","Sep","Oct","Nov","Dec"))
> is.factor (meteo$Month)
[1] TRUE
> plot (meteo$Month, meteo$MeanTemp)
> # or
> boxplot (meteo$MeanTemp ~ meteo$Month, col = "orange")
> dev.off ()
> meteo$MeanTemp[meteo$Month == "Jan"]
dataframe
a dataframe is an object with rows and columns
the rows contain different observations/measurements from your experiment
the columns contain different variables
the values in the body of the dataframe can be numbers, but they could also be text; they could be calendar dates (like 23/5/04); or they could be logical variables
> d <- c (7, 4, 6, 8, 9, 1, 0, 3, 2, 5, 0)
> r <- rank (d)
> s <- sort (d)
> o <- order (d)
> v <- data.frame (d, r, s, o)
> v
    d    r s  o
1   7  9.0 0  7
2   4  6.0 0 11
3   6  8.0 1  6
4   8 10.0 2  9
5   9 11.0 3  8
6   1  3.0 4  2
7   0  1.5 5 10
8   3  5.0 6  3
9   2  4.0 7  1
10  5  7.0 8  4
11  0  1.5 9  5
> w <- read.table ("some.txt", header = TRUE)
> attach (w)
> month <- factor (month)
> h <- read.csv (file = "simple.csv", header = TRUE, sep = ",")
> names(v1) <- c("x","y","sum")
read.table would fail if there were any spaces in any of the variable names in row 1 of the dataframe (the header row) or between any of the words within the same factor level
you can get a quick summary of the data by calculating a frequency table. a frequency table is a table that represents the number of occurrences of every unique value in the variable
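for example, the table function counts the occurrences of each unique value (a minimal sketch; the grades vector is made up):

> grades = c ("A", "B", "B", "C", "A", "B")
> table (grades)
grades
A B C 
2 3 1 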
> tapply (meteo$MeanTemp, meteo$Year, mean)
> aggregate (meteo$MeanTemp, list (period = meteo$Year), mean)
> by (meteo$MeanTemp, list (period = meteo$Year), mean)

# exclude variables field1, field2, field3
m1 <- names (mydata) %in% c ("field1", "field2", "field3")
n1 <- mydata[!m1]

# exclude 3rd and 5th field
n2 <- mydata[c (-3, -5)]

# take first five observations
n3 <- mydata[1:5,]

# sql-style
n4 <- mydata[which (mydata$gender == 'F' & mydata$age > 65), ]

# or
attach (mydata)
newdata <- mydata[which (gender == 'F' & age > 65), ]
detach (mydata)

# using subset function
newdata <- subset (mydata, age >= 20 | age < 10, select = c (ID, Weight))
> p = read.csv ("temperature.csv")
> p$Month = factor (p$Month, ordered = T,
+                   levels = c ("Jan","Feb","Mar","Apr","May","Jun",
+                               "Jul","Aug","Sep","Oct","Nov","Dec"))
> boxplot (p$Temp ~ p$Month, col = "orange", main = "raw data")
> mm = tapply (p$Temp, p$Month, median)
> mm
 Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec 
-5.8 -5.8 -0.7  6.9 12.3 18.2 21.0 19.9 15.7  9.5  3.4 -2.8 
> as.numeric (mm)
 [1] -5.8 -5.8 -0.7  6.9 12.3 18.2 21.0 19.9 15.7  9.5  3.4 -2.8
> boxplot ((p$Temp - mm[p$Month]) ~ p$Month, col = "orange", main = "deseasoned data")
> dev.off ()
> x = seq (1, 7, 1)
> x
[1] 1 2 3 4 5 6 7
> y = rep (1, 7)
> y
[1] 1 1 1 1 1 1 1
> sum (y)
[1] 7
> sum (y[x > 3])
[1] 4
> sum (y[x == 2])
[1] 1
> w1 <- read.table ("filename", header = T)
> stripchart (w1$vals)
> stripchart (w1$vals, method = "stack")
the histogram graphically shows the center, spread, and skewness of the data, as well as the presence of outliers and of multiple modes
> hist (w1$vals, col = 'grey', breaks = 12, xlim = c (0.9, 1.3))
box plots are an excellent tool for conveying location and variation information in data sets, particularly for detecting and illustrating location and variation changes between different groups of data
> boxplot (w1$vals, main = 'Main Title of the Plot', xlab = 'x axis label', horizontal = TRUE)
> boxplot (e$MeanTemp ~ e$Month, col = "orange")
> plot (x, y, type = "l", pch = 3)
'type' possibilities: "p" for points, "l" for lines, "b" for both, "o" for overplotted points and lines, "h" for histogram-like vertical lines, "s" for stair steps, "n" for no plotting
the default plotting character (pch = 1) is an open circle ○
if you want △ (open triangle), use pch = 2
if you want + (plus sign), use pch = 3
if you want × (cross), use pch = 4
if you want ◇ (open diamond), use pch = 5
to draw the regression line through the data, you employ the straight line drawing directive
> abline (intercept, slope)
> abline (lm (y~x))
to lay out four plots on one page (two rows by two columns), use the command:
> par (mfrow = c (2, 2))
to plot two different functions on the same plot:
> plot (t, p1, ylim = c (-6, 4), type = "l", col = "red")
> par (new = TRUE)
> plot (t, p2, ylim = c (-6, 4), type = "l", col = "green")
> p1 = matrix (p1)
> p2 = matrix (p2)
> matplot (t, cbind (p1, p2), pch = 19)
> plot (t, p1, ylim = c (-6, 5))
> points (t, p2)
> plot (t, p1, ylim = c (-6, 5))
> lines (t, p2)
a second way to specify graphical parameters is by providing the optionname=value pairs directly to a plotting function
> hist (mtcars$mpg, col.lab="red")
function | output to
---|---
pdf ("mygraph.pdf") | pdf file
png ("mygraph.png") | png file
jpeg ("mygraph.jpg") | jpeg file
bmp ("mygraph.bmp") | bmp file
postscript ("mygraph.ps") | postscript file
dev.off () | returns output to the terminal
so to save a jpg file called "plot01.jpg" containing a plot of x and y:
> jpeg ('plot01.jpg')
> plot (x, y)
> dev.off ()
runif
generates a set number of random draws, n, from an interval with a specified lower bound, min, and upper bound, max (the r stands for random, and unif means uniform)
rnorm
can generate random values from a normal distribution
runif(n = 10, min = 0, max = 5)
generates 10 random draws from 0 to 5. but if you don't include the min and max arguments…
runif(n = 10)
## [1] 0.2836422 0.7407605 0.3658010 0.2006554 0.8325042 0.9261639 0.7242986 0.1317093
## [9] 0.2777294 0.5944356
…the function defaults to the standard uniform with min of 0 and max of 1
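combined with set.seed from earlier, the draws become reproducible (a minimal sketch; output shown for R's default Mersenne-Twister generator):

> set.seed (42)
> runif (n = 3)
[1] 0.9148060 0.9370754 0.2861395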
#normal
?rnorm()
#binomial/bernoulli
?rbinom()
#geometric
?rgeom()
#exponential
?rexp()
#beta
?rbeta()
#gamma
?rgamma()
#poisson
?rpois()
#uniform
?runif()
many statistical tests are based on the assumption of normality. the assumption of normality often leads to tests that are simple and powerful compared to tests that do not make the normality assumption. unfortunately, many real data sets are in fact not approximately normal
use Student's t-test when the samples are independent and the errors are normally distributed
non-normality, outliers and serial correlation can all invalidate inferences made by Student's t-test. in such cases it is much better to use a non-parametric technique - Wilcoxon's signed-rank test. use Wilcoxon's rank-sum test when the samples are independent but the errors are not normally distributed
neither the t-test nor the Wilcoxon test can cope properly with situations where the variances are different but the means are the same. this draws attention to a very general point: scientific importance and statistical significance are not the same thing
to check whether your sample x is close to a normal distribution, use the Shapiro-Wilk test
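a minimal sketch (the sample is simulated from a normal, so the test should not reject):

> x = rnorm (100)
> shapiro.test (x)   # a large p-value gives no evidence against normality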
t-values associated with different levels of confidence are available in R via qt. confidence intervals are always two-tailed; thus, if you want to establish a 95% confidence interval, you need to calculate (or look up) the 2.5% and 97.5% quantiles of the t distribution:
> qt (.025, 9)
[1] -2.262157
> qt (.975, 9)
[1] 2.262157
this says that, for 9 degrees of freedom, 2.5% of the t distribution lies below -2.262157 and 2.5% lies above 2.262157
so, finally, you can write down the confidence interval of a mean based on a small sample: mean(x) ± qt(.975, n-1) * sd(x)/sqrt(n). note that the margin of error uses the standard error sd(x)/sqrt(n), not the variance:

> x = c (3, 5, 7)
> mean (x) + qt (.975, 2) * sd (x) / sqrt (3)
[1] 9.968276
> mean (x) + qt (.025, 2) * sd (x) / sqrt (3)
[1] 0.03172443
there is a built-in function t.test
> t.test (A, B)

	Welch Two Sample t-test

data:  A and B
t = -3.873, df = 18, p-value = 0.001115
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.0849115 -0.9150885
sample estimates:
mean of x mean of y 
        3         5 
you typically use 5% as the significance level - the probability of rejecting the null hypothesis when it is in fact true
when the covariance of A and B is positive, this is a great help because it reduces the variance of the difference, and should make it easier to detect significant differences between the means.
pairing is not always effective, because the correlation between A and B may be weak
> x <- c (20, 15, 10, 5, 20, 15, 10, 5, 20, 15, 10, 5, 20, 15, 10, 5)
> y <- c (23, 16, 10, 4, 22, 15, 12, 7, 21, 16, 11, 5, 22, 14, 10, 6)
> t.test (x, y)
> t.test (x, y, paired = T)
student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
x | 78 | 24 | 64 | 45 | 64 | 52 | 30 | 50 | 64 | 50 | 78 | 22 | 84 | 40 | 90 | 72
y | 78 | 24 | 62 | 48 | 68 | 56 | 25 | 44 | 56 | 40 | 68 | 36 | 68 | 20 | 58 | 32
> marks = read.table ("marks.dat", header = T)
> x = as.numeric (marks[, 2]) ; x1 = x[1:16]
> y = as.numeric (marks[, 3]) ; y1 = y[1:16]
> t.test (x1, y1, paired = T)

	Paired t-test

data:  x1 and y1
t = 2.2353, df = 15, p-value = 0.04103
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  0.3600356 15.1399644
sample estimates:
mean of the differences 
                   7.75 
the Wilcoxon non-parametric test is used if the errors look non-normal
this non-parametric test can be more powerful than the t-test if the distribution is strongly skewed
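for example, reusing x1 and y1 from the paired t-test above:

> wilcox.test (x1, y1, paired = TRUE)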
the Kolmogorov-Smirnov test is used to decide if a sample comes from a population with a specific distribution. the test is based on the empirical cumulative distribution function (ECDF)
if ‘y’ is numeric, a two-sample test of the null hypothesis that ‘x’ and ‘y’ were drawn from the same _continuous_ distribution is performed
‘y’ can be a character string naming a continuous CDF, or such a function. in this case, a one-sample test is carried out of the null that the distribution function which generated ‘x’ is distribution ‘y’
the K-S test has several important limitations: it applies only to continuous distributions; it tends to be more sensitive near the center of the distribution than at the tails; and the distribution must be fully specified - if location, scale or shape parameters are estimated from the data, the critical values of the test are no longer valid
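a minimal sketch of both forms (the samples are simulated):

> x = rnorm (50)
> y = runif (30)
> ks.test (x, y)        # two-sample: were x and y drawn from the same distribution?
> ks.test (x, "pnorm")  # one-sample: does x follow the standard normal CDF?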
suppose you are interested in investigating the association between x and y
isn't it enough just to calculate the correlation ρ between x and y?
perhaps for this dataset (ρ = 0.83) it is enough
what about this dataset (ρ = 0.5)?
and this one (ρ = -0.8)?
one of the best ways to describe data is by fitting a statistical model. the model parameters tell you much more about the relationship between x and y than the correlation coefficient does. models have parameters, some of which are unknown; in statistical modeling you are interested in inferring these unknown parameters from your data, and parameter estimation must be done in a formal way
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> y
 [1] 0.9252341 0.8546417 3.1241370 6.0860490 6.2729448 8.7434938
 [7] 5.1923348 9.1469443 4.9776600 7.9872857
you begin with the simplest possible linear model, the straight line y = a * x + b, where a is the slope and b is the intercept
> z = lm (y ~ x)
> summary (z)
the first step is to fit a horizontal line through the data, using abline (intercept, slope), showing the average value of y:

> plot (x, y)
> abline (mean (y), 0)
TODO
TODO
TODO
TODO
the likelihood function, evaluated at a given point in parameter space (the arguments of the function), gives the probability of observing the data under those parameter values
the data are treated as a fixed quantity and the model's parameters are treated as the variables
priors should be chosen before we see the data. if you know nothing about a parameter, you should assign it a so-called uninformative prior
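a minimal sketch of evaluating a likelihood (simulated data; the normal log-likelihood is computed at two candidate values of the mean):

> x = rnorm (20, mean = 5)
> sum (dnorm (x, mean = 5, sd = 1, log = TRUE))   # log-likelihood at the true mean
> sum (dnorm (x, mean = 0, sd = 1, log = TRUE))   # much lower at a wrong mean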
sayHello <- function () {
  print ('hello')
}
sayHello ()
how can you run this via command-line? if you want the output to print to the terminal it is best to use Rscript
$> Rscript a.R
note that R CMD BATCH a.R does not send output to standard out; instead, it creates a new file called a.Rout
$> R CMD BATCH a.R
$> cat a.Rout
if you really want to use the ./a.R way of calling the script, you could add an appropriate #! to the top of the script (such as #!/usr/local/bin/Rscript, as in the scripts below) and make it executable
if you can use vectorized functions, loops should be a last resort
you need to use them when you do something DIFFERENT to each element of an object
the break statement can be used to terminate any loop, possibly abnormally; it is the only way to terminate repeat loops
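a small sketch of break terminating a repeat loop:

> i = 0
> repeat {
+   i = i + 1
+   if (i >= 3) break   # without break, repeat would loop forever
+ }
> i
[1] 3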
#!/usr/local/bin/Rscript
# binomial distribution
args <- commandArgs (TRUE)
z = as.numeric (args[1])
p = as.numeric (args[2])
g <- function (x) {
  choose (z, x) * p^x * (1-p)^(z-x)
}
x = 1:z
plot (x, g (x), type = "l")
$> Rscript bintrial.r 100 0.5
$> display Rplots.pdf
#!/usr/local/bin/Rscript
# Erlang distribution
args <- commandArgs (TRUE)
k = as.numeric (args[1])
lambda = as.numeric (args[2])
erlang <- function (t) {
  lambda^k * t^(k - 1) * exp (-lambda * t) / factorial (k - 1)
}
t = seq (from = 0, by = .1, to = 100)
plot (t, erlang (t), type = "l")
$> Rscript erltrial.r 3 .07
$> display Rplots.pdf
> search ()
> library ()
> installed.packages ()
> list.of.packages <- c ("dlm", "hawkes")
> new.packages <- list.of.packages[!(list.of.packages %in% installed.packages ()[,"Package"])]
> if (length (new.packages)) install.packages (new.packages)
and after that:
> require (dlm)