you have two gamblers, named A and B, such that A starts with K dollars and B starts with N-K dollars (so that there are N total dollars in the game. they play repeated rounds; for each round, player A has probability p of winning the round, and player B has probability q=1-p of winning. the rounds are independent (i.e., the outcome of a previous round does not affect the outcome of the current round). if A wins a round, B gives a dollar to A; if B wins a round, A gives a dollar to B. they play until one player is out of money (has 0 dollars, so that the winning player has all of the money, or N dollars

consider tracking the amount that A has throughout the game. this value oscillates between 0 and N, each step moving up one with probability p and moving down one with probability q, until hitting 0 or N (i.e., A goes bankrupt or wins all of the money in the game). this process is called a 'random walk'

one might ask how long the average game takes, or how much variance there is in game length. perhaps the most enticing question is “What is the probability that player A wins?”

one more interesting result about the Gambler’s Ruin (which we won’t fully solve here) is that you can solve for the probability that gambler B wins, and this probability, plus the probability that A wins, is 1. that is, there is probability 1 that someone wins, meaning that the game will end; we are certain that it won’t continue forever. this may sound obvious, but it is not necessarily a given when we are working with random walks and, by extension, stochastic processes

Markov chains

Def: a Markov chain is an integer-time process, {X_n, n≥0} for which the sample values for each rv X_n, n≥1, lie in a countable set S and depend on the past only through the most recent rv X_n−1

Markov chains where the transition probabilities can vary with n are called a non-homogeneous

Def : a recurrent state is a state that is accessible from all states that are accessible from it
a transient state is a state that is not recurrent

Def : the period of a state i is the greatest common divisor (gcd) of those values of n for which P_iiⁿ > 0
if the period is 1 the state is aperiodic
if the period is 2 or more, the state is periodic

Def : a class C of states is a non-empty set of states such that each i ∈ C communicates with every other state j ∈ C and does not communicates with j ∉ C

Def : an ergodic class is a class that is both recurrent and aperiodic
a Markov chain consisting entirely of one ergodic class is called an ergodic chain

Ergodic Unichains

Def: a unichain is a finite-state Markov chain that contains a single recurrent class plus, perhaps, some transient states
an ergodic unichain is a unichain for which the recurrent class is ergodic

the matrix P of transition probabilities of a Markov chain is called a stochastic matrix; that is, a stochastic matrix is a square matrix of nonnegative terms in which the elements in each row sum to 1

the transition matrix of a unichain
the block of zeroes in the lower left corresponds to the absence of transitions from recurrent to transient states

the idea is that each transient state eventually has a transition (via P_TR) to a recurrent state, and the class of recurrent states lead to steady state as before

for every stochastic matrix we have P * e = e, where e is the column vector with all 1's

there is a probability row vector π such that π * P = π (i.e., a steady-state vector) for all Markov chains

Markov chains wih 2 states

if P₁₂ = P₂₁ = 0 (the chain has two recurrent classes),
then λ = 1 has multiplicity 2
otherwise λ = 1 has multiplicity 1

if P₁₂ = P₂₁ = 1 (the chain is periodic),
then λ2 = −1
otherwise |λ2|<1

notes for dimension n>2

we have seen that there is always one eigenvalue that is 1, with an accompanying steady-state vector π as a left eigenvector and the unit vector e=(1,..,1) as a right eigenvector. the other eigenvalues and eigenvectors can be complex, but it is almost self-evident from the fact that P is a stochastic matrix that |λ_k|≤1. the rate of convergence is the value of this second eigenvalue. and that's a pretty general result. Pⁿ converged like the second-largest eigenvalue

if you want to know how fast Pⁿ approaches the steady state vector π, all you have to do is find that second-largest eigenvalue, and that tells you how fast the convergence is

if a recurrent chain is periodic with period d, it turns out that there are d eigenvalues of magnitude 1, and these are uniformly spaced around the unit circle in the complex plane - Pⁿ does not converge in that case

if some of the eigenvalues of P are not distinct, the question arises as to how many linearly independent left (or right) eigenvectors exist for an eigenvalue of multiplicity m. perhaps the ugliest part of linear algebra is the fact that an eigenvalue of multiplicity m need not have m linearly independent eigenvectors

block matrices and eigenvectors

if π is a left eigenvector of P_R, then (0,...,0,π1,...,πr) is a left eigenvector of P, i.e.,

the left eigenvectors of P_T are more complicated but not very interesting for stochastic process behavior

if π is a left eigenvector of P_R, then (0,...,0,π,0....0) is a left eigenvector of P

if π' is a left eigenvector of P_R', then (0,...0,π') is a left eigenvector of P

the Jordan form case

there are cases in which one or more eigenvalues of P are repeated (as roots of det[P−λI]) but where the number of linearly independent right eigenvectors for a given eigenvalue is less than the multiplicity of that eigenvalue. in this case, there are not enough eigenvectors to span the space, so there is no m by m matrix whose columns are linearly independent eigenvectors. thus P can not be expressed as U*Λ*U⁻ where Λ is the diagonal matrix of the eigenvalues, repeated according to their multiplicity. the Jordan form is the cure for this unfortunate situation

the Jordan form for a given P is the following modification of the diagonal matrix of eigenvalues: we start with the diagonal matrix of eigenvalues, with the repeated eigenvalues as neighboring elements. then for each missing eigenvector for a given eigenvalue, a 1 is placed immediately to the right and above a neighboring pair of appearances of that eigenvalue, as seen by example below:

there is a theorem in linear algebra that says that an invertible matrix B exists and a Jordan form J exists such that P=B*J*B⁻

example of bad matrix

an example of a very simple Markov chain with m=3 but only two linearly independent eigenvectors is given below. these eigenvectors do not span m-space

$> octave -q bad_case.m 
A =

   0.50000   0.50000   0.00000
   0.00000   0.50000   0.50000
   0.00000   0.00000   1.00000

U =

   1.00000  -1.00000   0.57735
   0.00000   0.00000   0.57735
   0.00000   0.00000   0.57735

L =

Diagonal Matrix

   0.50000         0         0
         0   0.50000         0
         0         0   1.00000

warning: matrix singular to machine precision, rcond = 6.70951e-17
warning: called from
    bad_case at line 5 column 3
V =

   1.0000e+00   4.5036e+15  -4.5036e+15
   0.0000e+00   4.5036e+15  -4.5036e+15
   0.0000e+00   0.0000e+00   1.7321e+00

the eigenvalues are 1 and 1/2, with algebraic multiplicity 2 for λ=1/2. there is only one eigenvector (subject to a scaling constant) for the eigenvalue 1/2. Pⁿ approaches steady-state as n*(1/2)ⁿ. for all practical purposes, this is still the eigenvalue going down exponentially. so for all practical purposes, what you wind up with is the second-largest eigenvalue still determines how fast you get convergence

example

assume that one or more eigenvalues have multiplicity greater than 1, but that if an eigenvalue has multiplicity k, then it has k linearly independent eigenvectors

consider a Markov chain consisting of l ergodic sets of states. then each ergodic set will have an eigenvalue equal to 1 with a right eigenvector equal to 1 on the states of that set and 0 elsewhere

there will also be a ‘steady-state’ vector, nonzero only on that set of states

then Pⁿ will converge to a block diagonal matrix where for each ergodic set, the rows within that set are the same. so what happens in this situation?

anybody with any sense, faced with a Markov chain like this, would say "If we start here, we're going to stay here, if we start there, we're going to stay there. Let's just analyze this first. And then after we're done analyzing this, we'll analyze that. And then we'll put the whole thing together".

Markov chains with rewards

suppose that each state i of a Markov chain is associated with a given reward, r_i

letting the rv X_n be the state at time n, the (random) reward at time n is the rv R(X_n) that maps X_n=i into r_i for each i

if we start from state m=i and do n steps than our total reward v in this case will be

if the Markov chain is an ergodic unichain, then successive terms of this expression tend to a steady state gain per step

lenghth of walk from state i to state j

suppose, for some arbitrary unichain, that we want to find the expected number of steps, starting from a given state i until some given recurrent state, say 1, is first entered. assume i≠1

this can be viewed as a reward problem by assigning one unit of reward to each successive state until state 1 is entered

the modified Markov chain is now an ergodic unichain with a single recurrent state, i.e., state 1 is a trapping state

dynamical programming

now we consider a much more elaborate structure in which a decision maker can choose among various possible rewards coupled with transition probabilities

at each time m, a decision maker, given X_m=i, selects one of the k_i possible choices of P for his current state i. we assume that if decision k is selected at time m, the probability of entering state j at time m+1 is independent of earlier states and decisions

the set of rules used by the decision maker in selecting an alternative at each time is called a policy . we want to consider the expected aggregate reward over n steps as a function of the policy used by the decision maker. if for each state i, the policy uses the same decision, say ki , at each occurrence of i, then that policy corresponds to a homogeneous Markov chain

it turns out to simplify matters considerably if we include a ﬁnal reward u at time m+n. this ﬁnal reward u is considered as a ﬁxed vector, to be chosen as appropriate, rather than as part of the choice of policy

the optimized strategy, as a function of the number of steps n and the ﬁnal reward u, is called an optimal dynamic policy for that u. this policy is found from the dynamic programming algorithm, which is conceptually very simple

the dynamic programming algorithm is independent of the starting time m; the parameter n, usually referred to as stage n, is the number of decisions over which the aggregate gain is being optimized. this algorithm yields the optimal dynamic policy for any fixed final reward vector u and any given number of trials

along with the calculation of v(n,u) for each n, the algorithm also yields the optimal decision at each stage (under the assumption that the optimal policy is to be used for each lower numbered stage, i.e., for each later trial of the process)

the surprising simplicity of the algorithm is due to the Markov property. that is, v_i(n,u) is the aggregate present and future reward conditional on the present state

since it is conditioned on the present state, it is independent of the past i.e., how the process arrived at state i from previous transitions and choices

example

this figure shows an example of the situation in which the decision maker can choose between two possible decisions. he cannot choose the state - all that he can choose is the matrix P (one or another) for his current state. all started in the state 1

this figure illustrates the familiar tradeoff between instant gratification (alternative 2) and long term gratification (alternative 1)

we can now go on to next stage, using the results from the previous step. continuing this computation, one finds that

so the optimum dynamic policy is alternative 2 for the last decision to be made and alternative 1 for all decisions before the last

optimal stationary policies

in the last example, we saw that there was a ﬁnal transient (at stage 1) in which decision 1 was taken, and in all other stages, decision 2 was taken. thus, the optimal dynamic policy consisted of a long-term stationary policy, followed by a transient period (for a single stage in this case) over which a diﬀerent policy was used

it turns out that this ﬁnal transient can be avoided by choosing an appropriate ﬁnal reward vector u for the dynamic programming algorithm

if one has very good intuition, one would guess that the appropriate choice of final reward u is the relative-gain vector w associated with the long-term optimal policy

Renewal processes

Def : a renewal process is an arrival process in which the interarrival intervals are positive, independent and identically distributed (IID) random variables (rv’s)

renewal processes (since they are arrival processes) can be speciﬁed in three standard ways,

arrival epochs and the counting rv’s are related in each of the following equivalent ways

the reason for calling these processes renewal processes is that the process probabilistically starts over at each arrival epoch, Sn . that is, if the nth arrival occurs at Sn=τ, then, counting from Sn=τ, the jth subsequent arrival epoch is

example: visits to a given state for a Markov chain

suppose a recurrent finite-state Markov chain with transition matrix P starts in state i at time 0. then on the first return to state i, say at time n, the Markov chain, from time n on, is a probabilistic replica of the chain starting at time 0. that is, the state at time 1 is j with probability Pij , and, given a return to i at time n, the probability of state j at time n+1 is Pij. In the same way, for any m > 0

each subsequent return to state i at a given time n starts a new probabilistic replica of the Markov chain starting in state i at time 0. thus the sequence of entry times to state i can be viewed as the arrival epochs of a renewal process

the problem is that the time of the ﬁrst entry to state i after time 0 is a rv rather than a given number n

SLLN

for each sample point ω∈Ω, N(t,ω)/t is a nonnegative number for each t and {N(t,ω);t>0} is a sample path of the counting renewal process, taken from (0,t] for each t. thus lim t→∞ N(t,ω)/t, if it exists, is the time-average renewal rate over (0,∞) for the sample point ω

the strong law for renewal processes states that this limiting time-average renewal rate exists for a set of ω that has probability 1, and that this limiting value is 1/𝕏

why the slope N(t)/t in the ﬁgure approaches 1/𝕏 as t→∞ ? as t increases, we would guess that N(t) increases without bound, i.e., that for each arrival, another arrival occurs eventually. assuming this, with increasing t the value of the left side of

задачка на разорение

у Алисы и Боба есть монеты, на которых с вероятностью 51% выпадает орел и 49% решка. у них $100 начального капитала у каждого

они начинают раз в минуту каждый ставить $1 на результат броска и бросать свою монету, причем Алиса всегда ставит на орла (который выпадает чуть чаще), а Боб всегда ставит на решку (которая выпадает чуть реже)

играют они против банка, а не против друг друга

известно, что в результате игры и Алиса и Боб разорились

внимание, вопрос! кто из них с большей вероятностью разорился первым?
что изменится, если будет только одна общая монета?

для начала, поменяем условия задачи. пусть у Алисы будет честная монета: 50%. тогда она в итоге разорится с вероятностью ≅1 и expected hitting time N^2 (10000 бросков)

представим себе, что монета Боба дает орла с вероятностью не в 51% а в 100-ε%. тогда Боб разорится за N≅100 шагов и мы вполне резонно можем предположить, что его expected hitting time – монотонная функция от ε ∈ [0,0.5]

теперь начнем менять свойства монеты Алисы. пусть она только чуть-чуть кривая: 50+ε% орлов. процент путей, на которых Алиса никогда не разоряется, должен быть очень малым, а на тех что всё же разоряется expected hitting time должно быть для малых ε близко к тому же N^2
если же ε у монеты Алисы большое, то expected hitting time должно быть – как ни парадоксально – маленьким: потому что иначе Алиса быстро разбогатеет и тогда уже никогда не разорится
как и с Бобом, expected hitting time должно быть монотонной функцией от ε ∈ [0,0.5]

"Черт побери!" - получается, что в первом приближении вероятности разориться для Алисы и Боба на одном и том же шаге - равны. осталось только это доказать

вероятность Бобу разориться вообще - больше, но мы рассматриваем не все случаи, а только те, когда еще и Алиса разорится ибо это задано в условиях

для каждого пути, на котором разоряется Алиса {ОР…} есть симметричный путь {РО…} на котором разоряется Боб и, более того, никаких других путей той же длинны, на которых Боб разоряется - нет. поэтому среднее время до разорения одинаково

вероятность разориться после 100-го броска монетки (а раньше - никак):
у Алисы a = 0.49^100 = 5.715018094850436e-30
у Боба b = 0.51^100 = 1.046183829131434e-31
а после 102-го броска
у Алисы 0.51 * 0.49^101 = a * 0.49 * 0.51
у Боба 0.49 * 0.51^101 = b * 0.49 * 0.51
получаем две убывающие геометрические прогрессии с шагом q = 0.49 * 0.51
суммы у этих прогрессий разные, но, поскольку мы знаем, что вероятность разорения у обоих реализовалась, мы должны отнормировать вероятности на сумму каждой из прогрессий, которая равна a/(1-q) для Алисы и b/(1-q) для Боба. тогда вероятность разориться после 100-го шага
у Алисы a / (a / (1 - q)) = 1 - q
у Боба b / (b / (1 - q)) = 1 - q
т.е. они равны

итак представим, что мы выписали последовательность, в которой Алиса проиграет $1. рассмотрим вероятности ВСЕХ таких исходов, в зависимости от длины “цепи выпавших монет”. заметим, что длина цепи всегда нечётна
1 решка = 0.49
1 орел + 2 решки = 0.51^1 * 0.49^2
2 орла + 3 решки = 0.51^2 * 0.49^3
...

рассмотрим теперь вероятность всех исходов, в которых Боб проигрывает $1, по длинам максимальной по включению последовательности:
1 орел = 0.51
2 орла + 1 решка = 0.51^2 * 0.49^1
3 орла + 2 решки = 0.51^3 * 0.49^2
...
эти последовательности одинаково быстро убывают. это значит, что

выписываем условное распределение вероятности разориться для Алисы, т.е. p(100), p(101), p(102) и.т.д., где p(t) есть "(условная) вероятность разорения Алисы за t шагов", при условии, что разорение произошло. [опустим мелкие очевидные детали, например то, что в нечетные моменты времени разорения точно не будет]. то же самое выписываем для Боба. с удивлением замечаем, что распределения совпадают. значит Алиса и Боб совершенно симметричны, как ни странно, поэтому с вероятностью 50% первой разорится Алиса, и с вероятностью 50% первым разорится Боб. ну или, если буквоедствовать, то с вероятностью w первой разорится Алиса, с вероятностью w первым разорится Боб, и с вероятностью ε они разорятся одновременно. w и ε можно, при желании, посчитать:

будем выигрыш записывать как “+”, а проигрыш как “-”. результат игры будет последовательность плюсов и минусов. очевидно, и у Алисы, и у Боба одинаковые способы разориться. только вероятности каждого способа разные: у Алисы плюс выпадает с вероятностью p=51, а минус с вероятностью q=49, а у Боба наоборот. но интересно, что для каждого способа разориться отношение вероятности этого способа у Алисы и у Боба будет всегда одинаково!

действительно, вариант, у которого w плюсов и l минусов,
у Алисы может случиться с вероятностью p^w * q^l
а у Боба с вероятностью q^w * p^l
любое разорение содержит ровно на 100 минусов больше, чем плюсов, т.е.
l - w = 100
значит отношение Бобовой вероятности к Алисиной будет
(q^w * p^l) / (p^w * q^l) = (p/q)^(l-w) = (p/q)^100

условие, что “и Боб и Алиса разорились”, говорит, что мы должны нормировать эти распределения, чтобы суммарная вероятность разориться в конечный момент времени была равна 1 (на самом деле нужно нормировать только распределение Алисы, - у Боба вероятность разориться и так равна 1). ну а если мы нормируем два пропорциональных распределения, то получим одно и то же. значит условные вероятности разориться в любую данную минуту (при условии, что разорение произойдет) у Алисы и Боба будут абсолютно одинаковы. а значит и вероятности разориться первым у обоих одинакова

P.S. кстати мы "за бесплатно" получили вероятность "не разориться" для Алисы: она равна (q/p)^100. и это - самый простой способ получить такую вероятность

при правильном подходе и решение и ответ совершенно естественны. ключевое понятие тут - так называемые условные марковские цепи. дело в том, что марковское свойство может быть сформулировано как условная независимость будущего и прошлого относительно настоящего. поэтому при наложении некоторого условия на поведение цепи в удаленном будущем она остается марковской, хотя и с другими переходными вероятностями

в начале задачи речь идет о несимметричном "случайном блуждании" rw на множестве неотрицательных целых чисел с вероятностями перехода p=0.51 из n в n+1 и q=0.49 из n в n-1 для n>0 и с поглощением (разорением) в нуле. в силу наличия сноса, направленного от точки поглощения, вероятность поглощения f(n) при выходе из точки n строго меньше 1. функция f удовлетворяет гармоническому уравнению f(n) = p * f(n+1) + q * f(n-1) с граничным условием f(0)=1, откуда следует, что f(n)=(q/p)^n. после этого легко увидеть, что переходные вероятности для условного блуждания при условии поглощения составят в точности p для перехода из n в n-1 и q для перехода из n в n+1, т.е., условные блуждания для Алисы и для Боба (который разоряется с вероятностью 1, т.е., для него условное блуждание при условии разорения совпадает с безусловным) попросту имеют одни и те же переходные вероятности. в силу этой симметрии Алиса и Боб разорятся первыми с одной и той же вероятностью

во второй половине задачи у нас есть "случайное блуждание" на всей целочисленной прямой с вероятностями перехода p=0.51 из n в n+1 и q=0.49 из n в n-1, выпущенное из точки N=100. условие заключается в том, что блуждание посетит точки 0 (разорение Алисы, обозначим это событие A) и 2*N (разорение Боба, событие B). событие B имеет полную вероятность, поэтому надо позаботиться только о событии A
как объяснено в решении основной задачи, соответствующее условное блуждание имеет обращенные переходные вероятности до достижения точки 0, которые сменяются на исходные переходные вероятности после этого. тем самым вероятность того, что A наступит раньше чем B (при условии, что A произойдет) равна вероятности того, что обращенное блуждание поглотится до посещения точки 2*N. эту вероятность легко явно вычислить, решив то же гармоническое уравнение для вероятностей поглощения, что и в основной задаче. она составляет (p/q)^N/(1+(p/q)^N), что больше 1/2 при всех значениях p>q=1-p и N (напомню, что в нашей ситуации p =0.51 и N=100). т.о., при условии, что Алиса разорится вообще, она с большей вероятностью разорится первой