вероятностная мера ограничена снизу нулем, сверху единицей и аддитивна для взаимно-исключающих событий

**Marginal Probability** is just the probability of a single event occurring (not worrying about anything else). so, the marginal probability of rolling a 2 on a fair dice is simply 1/6

**Conditional Probability** is a step beyond marginal. while it still gives the probability of a single event occurring, it now establishes some condition, or requirement that some other event occurs. an easy example to visualize is the question “Given that you roll above a 2 on a fair die, what is the probability you roll a 5?” essentially, this parses the sample space down because it sets some outcome in addition to the event we are curious about. the sample space goes from {1,2,3,4,5,6} to {3,4,5,6}, because we are given that we rolled above 2. so, you would be able to probably piece out that the answer here is 1/4

**Joint Probability** is just the probability of multiple events occurring. so, while the probability of flipping Heads on a coin and rolling a 3 on a die are both marginal, the probability of both occurring is a joint probability, which in this case is equal to 1/12

**Bayes Rule** :

P(A|B) = P(B|A) * P(A) / P(B)

the best way to think of a random variable is as a sort of machine that randomly spits out numbers in some fashion

compare this to some function y(x). say you have a function y(x)=2*x. the behavior of the output of this function is very well defined and completely certain: whatever you plug in, you get twice back. unlike a function, the output of a random variable has uncertainty; we’re never as sure as we are with the function

this seems a little dicey thus far, having some random ‘machine’ that spits out numbers willy-nilly. thankfully, there are some properties of random variables that help us work out and nail down their nature

random variables have distributions. this is essentially the name or type of the random variable. while there are many different types of random variables you could imagine, there are a few very common ones: Binomial, Bernoulli, Poisson, Uniform and Normal distributions

the simplest example of a property of a random variable is it’s **Expectation**. a random variable’s expectation can be thought of as it’s average, and it is denoted as E(X) (which means, in english, the expectation of random variable X). The expectation of the amount of heads in one coin flip (which is a random variable) is .5

the general equation for expectation :

E(X) = Σ x_i * P(X = x_i)so, the expectation of random variable X equals the sum of all it’s values x_i times the probability that each value occurs. The best way to think of this is as a weighted average, where each value/outcome x_i is weighted by the probability that it occurs

and for variance :

Var(X) = Σ (x_i - μ)² * P(X=x_i)\]so, the variance is the sum of the distances from the mean squared of all the values weighted by the probability that each value/outcome occurs

we perform n trials, each with only two outcomes (usually success or failure) and with a probability of success p that stays constant from trial to trial, where all trials are independent. sounds a little tricky, but consider flipping a fair coin n times and hoping to get heads. this is the prototypical Binomial random variable. it meets all the requirements: there are a set of n trials, each trial has only two outcomes (heads, which is success here, and tails, which is failure here), the probability of success/heads is .5 for every trial, and each flip of the coin is independent

for a Binomial random variable X, we have:

E(X) = n*p Var(X) = n*p*(1-p)so, the expectation of successes in a Binomial random variable is the total number of trials times the probability of success on each trial (intuitive) and Variance is the number of trials times the probability of success times the complement of the probability of success (unfortunately not intuitive)

вы подбрасываете честную монетку p(H) = p(T) = .5

если вы делаете это n раз, тогда

E(#H) ≅ n/2 ± √n/2 E(H) = E(#H)/n ≅ 1/2 ± 1/(2*√n)и при увеличении n шанс устремляется к .5

если монета смещенная, тогда

E(#H) = p*n ± √ (p*(1-p)*n) E(H) = p ± √ (p*(1-p)/n)

for example T ~ Bin(30,.5), where T represents number of tails in 30 coin flips, we expect .5*30=15 tails on average, with the variance of these tails being 30*.5*(1-.5)=7.5 tails (remember, the standard deviation is the square root of the variance, so the standard deviation is 2.74)

we have the PDF of a Binomial. recall that a PDF gives the probability that a random variable X takes on any one value x: P(X=x) for all x. for a Binomial, we have:

P(X = x) = choose (n, x) * p^x * q^(n-x)

choose (n, x) = n! / [(n-x) * x!]this is called the

a simple example would be finding the probability that exactly 3 heads occur in the 5 flips. plugging in 3 for x, we would get:

P(X = 3) = choose (5,3) * .5^3 * .5^(5-3) = .3125

imagine the shape of a uniform distribution as a rectangle with area 1 (since total probability must add up to 1). the simplest example is the **Standard Uniform**, which is a uniform distribution from 0 to 1. if X is Standard Uniform, then abbreviation is X~Unif(0,1)

since the uniform distribution is a rectangle and the probability density is the same as the area in the rectangle, density problems with the uniform are really just exercises in geometry. for the general case of a random variable X~Unif(a,b):

E(X) = (a + b) / 2 Var(X) = (b - a)² / 12the expected value makes sense, since it is just the average of the endpoints. the variance makes no intuitive sense

the CDF of a Uniform random variable is:

(x - a) / (b - a)when x is between a and b

if X is normally distributed, we use abbreviation X~N(μ,σ²). this means that μ is the mean and σ² is the variance

let go through the **Standard Normal**
- N(0,1). that’s because its PDF is simple:

1/√2π * exp (-x²/2)

we would like to convert whatever Normal we get into a Standard Normal. this is a process called **standardization**. so, how do we get to a Standard Normal from any other Normal?

when you multiply every value in a data set by a constant, the mean of the data set is multiplied by that constant and the variance is multiplied by the square of the constant, and when you add every value in a data set by a constant, the mean is increased by that constant and the variance is unchanged

if we start from Z, how do we get to N? we know that if we multiply Z by some constant, then we will multiply the variance by the square of that constant. the variance of Z is currently 1, and we want to go from this variance of 1 to the variance of the general Normal, σ². what do we multiply be 1 to get to σ²? no tricks here, just σ². remember, though, when we multiply by a constant the variance increases by the square of that constant, so we want to actually multiply Z by σ, as this will multiply 1 by σ² to get to the variance of the general Normal, σ²

but we still have to get the same mean. we know that multiplying by a constant multiplies the mean by the same constant; however, the mean of the standard normal is 0, so multiplying Z by σ does not affect the mean. so, we just have to add some constant to the mean of Z, which is 0, to get to the mean of the general Normal, μ. what do you add to 0 to get μ? well, μ of course. again, variance is not affected by adding constants, so we are still at the desired variance of σ²

basically, what we have done is taken a general Normal random variable and converted it into a standard normal random variable, which is much easier to use

say that the weights W in a certain college are distributed normally with W~N(120,5²). if you were asked “What is the probability of weighing less than 150 pounds” how would you solve it?

this clearly requires the use of a CDF, since we have to find P(W≤150). however, it’s not easy to apply the CDF to this Normal, since the CDF is so complicated. instead, we can convert to a standardized normal, which is easier to work with

plugging in the mean, variance and value of 150 for N, we get z=(150-120)/5=6

so, what does that mean?

well, since the standard deviation of the standard normal is 1 (a very nice property) the z=6 essentially gives us how many standard deviations away from the mean we are in a standard normal. that is why it is so easy to read, since it immediately gives us a general clue of how extreme a point is. you couldn’t really wrap your head around how extreme 150 pounds was, but it’s much easier to think about being 6 standard deviations away from the mean

подбрасывая смещенную монету n раз вы получили серию из m Heads и k Tails. общее количество таких серий есть choose(n,m)=n!/(m!*k!). сколько информации I содержится в одной конкретной серии из этой коллекции?

I = log [n!/(m!*k!)] = log (n!) - log (m!) - log ((n-m)!) = 1/ln(2) * (ln n! - ln m! - ln ((n-m)!)) = 1/ln(2) * [n * ln (n) - n - m * ln (m) + m - (n - m) * ln (n - m) + n - m)] = n/ln(2) * [ln (n) - m/n * ln (m) - (1 - m/n) * ln (n-m)] ln (n) = (m/n + 1 - m/n) * ln (n) = n/ln(2) * [ -m/n * ln (m/n) - (1 - m/n) * ln (1 - m/n)] = n/ln(2) * [ -p * ln p - (1 - p) * ln (1 - p)] = n * [ -p * log p - (1 - p) * log (1 - p)]

и, в общем случае, при наличии k событий A_{1},A_{2},...,A_{k} с вероятностями p_{1},p_{2},...,p_{k} количество информации на один символ

k I/n = - Σ pэто называется "количество информации на символ"_{i}* log p_{i}i=1

и если все вероятности равны (например, в случае игральной кости), тогда

I = log k

analog info:

descrete info:

the Kraft inequality determines which sets of codeword lengths are possible for preﬁx-free codes. given a discrete memoryless source, we want to determine what set of codeword lengths can be used to minimize the expected length of a preﬁx-free code for that source

**Th: Kraft inequality for prefix-free codes**

every prefix-free code for an alphabet X={a_{1},..,a_{M}} with codeword lengths l(a_{j}) :1≤j≤M } satisfies

in the very early days of information theory, a number of heuristic algorithms were suggested for choosing codeword lengths l_{j} to approximate

− log pboth Claude Shannon and Robert Fano had suggested such heuristic algorithms by 1948. it was conjectured at that time that, since this was an integer optimization problem, its optimal solution would be quite diﬃcult_{j}

сырец на Окамле для генерации кода Фано

it was quite a surprise therefore when David Huffman came up with a very simple and straightforward algorithm for constructing optimal (in the sense of minimal L) preﬁx-free codes. Huffman developed the algorithm in 1950 as a term paper in Robert Fano’s information theory class at MIT. Huffman’s trick was to “think outside the box”. he ignored the Kraft inequality, and looked at the binary code tree to establish properties that an optimal prefix-free code should have. after discovering a few simple properties, he realized that they led to a simple recursive procedure for constructing an optimal code

сырец на Окамле для генерации кода Хаффмана

the noise should be modeled probabilistically. this is partly because the noise is a priori unknown, but can be expected to behave in statistically predictable ways. it is also because encoders and decoders are designed to operate successfully on a variety of different channels, all of which are subject to different noise waveforms

the noise is usually modeled as zero mean, since a mean can be trivially removed

in order to specify a random process {Z(t); t∈ℝ}, some kind of rule is required from which joint distribution functions can, at least in principle, be calculated. that is, for all positive integers n, and all choices of n epochs t_{1},t_{2},..,t_{n}, it must be possible in principle to find the joint distribution function,

Ffor all choices of the real numbers z_{Z(t1),..,Z(tn)}(z_{1},...,z_{n}) = Pr (Z(t_{1})≤ z_{1},...,Z(t_{n}) ≤ z_{n})

equivalently, if densities exist, it must be possible in principle to find the joint density

f_{Z(t1),...,Z(tn)}(z_{1},...,z_{n}) = ∂ⁿ F_{Z(t1),...,Z(tn)}(z_{1},...,z_{n}) ----------------------------- ∂z_{1}... ∂z_{n}

zero-mean Gaussian rv’s are important in modeling noise and other random phenomena for the following reasons:

- they serve as good approximations to the sum of many independent zero-mean rv’s (CLT)
- they have a number of extremal properties; they are, in several senses, the most random rv’s for a given variance
- they are easy to manipulate analytically, given a few simple properties
- they serve as common channel noise models, and in fact the literature often assumes that noise is modeled by zero-mean Gaussian rv’s without explicitly stating it

**Def**: a set of n of random variables, Z_{1},...,Z_{n} is zero-mean jointly Gaussian if there is a set of iid normal rv’s W_{1},...,W_{l} such that each Z_{k}, 1≤k≤ n, can be expressed as

Z = A * W

assume that the columns a_{1}, ..., a_{n} are linearly independent. this means that the columns must form a basis for ℝ^{n}, and thus that every vector z is some linear combination of these columns. the matrix A must then be invertible, i.e., there is a matrix A⁻ such that

A * A⁻ = A⁻ * A = Iwhere I is the n by n identity matrix

the matrix A maps the unit vectors of ℝ^{n} into the vectors a_{1},...,a_{n} and the matrix A⁻ maps a_{1},...,a_{n} back into the unit vectors

if the columns of A are not linearly independent, i.e., a is not invertible, then A maps the unit cube in ℝⁿ into a subspace of dimension less than n. in terms of fig. above, the unit cube would be mapped into a straight line segment. the area, in 2-dimensional space, of a straight line segment is 0, and more generally, the volume in n-space of a lower dimensional set of points is 0. in terms of the determinant, det A = 0 for any noninvertible matrix A

since w = A⁻ * z , we get the explicit formula for **pdf**
:

the above density relies on A being nonsingular. if A is singular, then at least one of its rows is a linear combination of the other rows, and thus, for some m, 1≤m≤n, Z*m is a linear combination of the other Z_{k} . the random vector Z is still jointly Gaussian, but the joint probability density does not exist (unless one wishes to view the density of Z_{m} as a unit impulse at a point speciﬁed by the sample values of the other variables)

it is important to understand that there is a large difference between rv’s being statistically dependent and linearly dependent:

- if they are linearly dependent, then one or more are deterministic functions of the others,
- whereas statistical dependence simply implies a probabilistic relationship

an important property of any jointly-Gaussian n-dimensional rv Z is the following: for any m by n real matrix B, the rv Y=B*Z is also jointly Gaussian. to see this, let Z=A*W where W is a normal rv. then Y=B*Z=B*(A*W)=(B*A)*W. since B*A is a real matrix, Y is jointly Gaussian. a useful application of this property arises when A is diagonal, so Z has arbitrary independent Gaussian components. this implies that Y=B*Z is jointly Gaussian whenever a rv Z has independent Gaussian components

**Def**
: {Z(t); t∈ℝ} is a zero-mean Gaussian process if, for all positive integers n and all finite sets of epochs t_{1},...,t_{n} , the set of random variables Z(t_{1}),...,Z(t_{n}) is a (zero-mean) jointly-Gaussian set of random variables

if the covariance, K_{Z}(t,τ)=E[Z(t)*Z(τ)], is known for each pair of epochs t, τ, then for any finite set of epochs t_{1},...,t_{n} , E[Z(t_{k})Z(t_{m})] is known for each pair (t_{k},t_{m}) in that set

the covariance of two zero-mean rv’s Z_{1}, Z_{2} is E[Z_{1}*Z_{2}]. for a rv Z=(Z_{1},...,Z_{n})^{T} the covariance between all pairs of random variables is very conveniently represented by the n by n covariance matrix, K_{Z}=E[Z*Z^{T}]

an n by n real matrix A for which A*A^{T}=I is called an **orthogonal** matrix or **orthonormal** matrix

for Z=A*W, where W is iid normal and A is orthogonal, K_{Z}=A*A^{T}=I. thus K_{Z}⁻ = I also. this means that A transforms W into a random vector Z with the same probability density, and thus the components of Z are still normal and iid. thus, for the 2-dimensional example above, the unit vectors e_{1}, e_{2} are mapped into orthonormal vectors a_{1}, a_{2}, so that the transformation simply rotates the points in the plane. although it is difficult to visualize such a transformation in higher dimensional space, it is still called a rotation, and has the property that

|A * w|² = w^{T}* A^{T}* A * w

the matrix K_{z} has n real orthonormal eigenvectors, q_{1},..,q_{n}, with corresponding nonnegative (but not necessarily distinct) real eigenvalues, λ_{1},..,λ_{n}

if the normalized covariance ρ=0, the axes of the ellipse are the horizontal and vertical axes of the plane; if σ1=σ2, the ellipse reduces to a circle, and otherwise the ellipse is elongated in the direction of the larger standard deviation. if ρ>0, the density in the first and third quadrants is increased at the expense of the second and fourth, and thus the ellipses are elongated in the first and third quadrants. this is reversed, of course, for ρ<0