Modelling Mixtures: the Dirichlet distribution
==================================================
A measure on discrete measures.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You are given a set :math:`\mathcal{X}` (here taken to be finite) and a
probability density :math:`p(x)` :math:`(p(x)\geq 0, \sum_x p(x)=1)`\ . Also given is a subset
:math:`A` of :math:`\mathcal{X}`\ . The problem is to compute or approximate
:math:`p(A)`\ . We consider the birthday problem from a
Bayesian standpoint.
In order to go further we need to extend what we did before for the
binomial and its conjugate prior to the multinomial and the
Dirichlet prior. The Dirichlet is a probability distribution on the
:math:`n`\ -simplex

.. math::

   \Delta_n=\{\tilde{p}=(p_1,\cdots,p_n):\; p_1+\cdots
   +p_n=1,\; p_i\geq 0\}

It is an :math:`n`\ -dimensional version of the beta density. Wilks
(1962) is a standard reference for Dirichlet computations.
The Dirichlet has a parameter vector
:math:`\tilde{a} =(a_1,\ldots,a_n)`\ . Throughout we write
:math:`A=a_1+\cdots+a_n`\ .
With respect to Lebesgue measure on :math:`\Delta_n`\ , normalised to have
total mass 1, the Dirichlet has density

.. math::

   D_{\tilde{a}}(\tilde{x})=\frac{\Gamma(A)}{\prod \Gamma (a_i)} x_1^{a_1-1}
   x_2^{a_2-1} \cdots
   x_n^{a_n-1}
The uniform distribution on :math:`\Delta_n` results from choosing all
:math:`a_i=1`\ . The multinomial distribution corresponding to :math:`k`
balls dropped into :math:`n` boxes with fixed probabilities
:math:`(p_1,\cdots,p_n)` (the :math:`i`\ th box containing :math:`k_i` balls)
is

.. math:: {k \choose {k_1 \ldots k_n}} p_1^{k_1} \cdots p_n^{k_n}

If this is averaged with respect to :math:`D_{\tilde{a}}` one gets the
marginal (the Dirichlet/multinomial):
.. math::

   P (k_1,\ldots,k_n) = P(\tilde{k})=
   {k \choose {k_1 \ldots k_n}}
   \frac{(a_1)_{(k_1)} (a_2)_{(k_2)} \cdots (a_n)_{(k_n)}} {A_{(k)}}
   \quad \mbox{ where }\;\; m_{(j)} \stackrel{\mbox{def}}{=} m(m+1)\cdots (m+(j-1))
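As a sanity check, the marginal can be evaluated numerically. Below is a minimal sketch in plain Python (the helper names are ours, not from any library): it computes the Dirichlet/multinomial probability from rising factorials and verifies that it sums to 1 over all compositions of :math:`k` into :math:`n` parts.

```python
from math import factorial

def rising(m, j):
    """Rising factorial m_(j) = m(m+1)...(m+(j-1))."""
    r = 1.0
    for t in range(j):
        r *= m + t
    return r

def compositions(k, n):
    """Yield every (k_1,...,k_n) of nonnegative integers summing to k."""
    if n == 1:
        yield (k,)
        return
    for first in range(k + 1):
        for rest in compositions(k - first, n - 1):
            yield (first,) + rest

def dirichlet_multinomial(ks, a):
    """Marginal P(k_1,...,k_n): the multinomial coefficient times
    prod_i (a_i)_(k_i) / A_(k), with A = sum(a) and k = sum(ks)."""
    k, A = sum(ks), sum(a)
    coeff = factorial(k)
    for ki in ks:
        coeff //= factorial(ki)
    num = 1.0
    for ki, ai in zip(ks, a):
        num *= rising(ai, ki)
    return coeff * num / rising(A, k)

# The probabilities over all compositions of k = 4 into n = 3 parts sum to 1:
total = sum(dirichlet_multinomial(ks, (1.5, 2.0, 0.5))
            for ks in compositions(4, 3))
```

With all :math:`a_i=1` every composition is equally likely, with chance :math:`1/{k+n-1 \choose k}` (the Bose-Einstein allocation used later in these notes).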
From a more practical point of view there are two simple procedures
worth recalling here:
- To pick :math:`\tilde{p}` from a Dirichlet prior, just pick
  :math:`X_1, X_2, \ldots ,X_n` independent from the gamma densities

  .. math::

     \frac{e^{-x}x^{a_i-1}}{\Gamma(a_i)}
     \mbox{ and set } p_i=\frac{X_i}{X_1+\cdots+X_n}, \quad 1\leq i\leq n

- To generate sequential samples from the marginal distribution use
  **Polya’s Urn**:
  Consider an urn containing :math:`a_i` balls of color :math:`i`
  (fractional balls are allowed).
  Each time, choose a color :math:`i` with probability proportional to
  the number of balls of that color in the urn. If :math:`i` is drawn,
  return it to the urn along with another ball of the same color.
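Both procedures are a few lines of plain Python. The following sketch uses only the standard library (the function names are ours):

```python
import random

def sample_dirichlet(a, rng=random):
    """Pick p from Dirichlet(a) by normalising independent gamma variates."""
    xs = [rng.gammavariate(ai, 1.0) for ai in a]
    s = sum(xs)
    return [x / s for x in xs]

def polya_urn_sequence(a, k, rng=random):
    """Draw k colors sequentially from Polya's urn started with weights a.

    Each draw is proportional to the current ball counts, and the drawn
    color gets one extra ball; the resulting sequence of colors has the
    Dirichlet/multinomial marginal distribution."""
    counts = list(a)
    draws = []
    for _ in range(k):
        i = rng.choices(range(len(counts)), weights=counts)[0]
        counts[i] += 1
        draws.append(i)
    return draws
```

For example, ``sample_dirichlet([1.0, 1.0, 1.0])`` gives a uniform point of :math:`\Delta_3`, and the first draw of the urn with weights :math:`(a_1,\ldots,a_n)` is color :math:`i` with probability :math:`a_i/A`.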
The Dirichlet is a convenient prior because the posterior for
:math:`\tilde{p}`\ , having observed counts :math:`(k_1,\cdots,k_n)`\ , is Dirichlet
with parameter :math:`(a_1+k_1,\cdots,a_n+k_n)`\ . Zabell (1982) gives
a nice account of W.E. Johnson’s characterization of the Dirichlet: it
is the only prior whose predictions are linear in the past counts. One
special case is the symmetric Dirichlet, when all
:math:`a_i=c >0`\ . We denote this prior by :math:`D_c`\ .
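Johnson's linearity is easy to check numerically: under the urn, the predictive probability of color :math:`i` after observing counts :math:`(k_1,\cdots,k_n)` with :math:`k=\sum k_i` is :math:`(a_i+k_i)/(A+k)`. A small sketch (helper names are ours):

```python
def urn_sequence_prob(a, seq):
    """Probability that Polya's urn with initial weights a produces
    the ordered color sequence seq."""
    counts = list(a)
    p = 1.0
    for i in seq:
        p *= counts[i] / sum(counts)
        counts[i] += 1
    return p

def predictive(a, seq, i):
    """Posterior predictive P(next color = i | seq) = (a_i + k_i)/(A + k)."""
    return (a[i] + seq.count(i)) / (sum(a) + len(seq))
```

Dividing the probability of an extended sequence by that of the observed one recovers the linear predictive, and permuting a sequence leaves its probability unchanged (the draws are exchangeable).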
The Birthday Problem
--------------------
In its simplest version the birthday problem involves :math:`k` balls
dropped uniformly at random into :math:`n` boxes. We declare a
*match* if two or more balls drop into the same box. Elementary
considerations (Feller, 1968, page 33) show that
.. math::

   P(\mbox{match})=
   1-P(\mbox{no match})
   =1-\prod^{k-1}_{i=1}\left(1-\frac{i}{n}\right)
   \doteq
   1-e^{-\frac{{{k}\choose{2}}}{n}}
To be more precise:

Proposition: If :math:`n` and :math:`k` are large in such a way that
:math:`\frac{{{k}\choose{2}}}{n} \longrightarrow \lambda`\ ,
then in the classical birthday problem

.. math:: P(\mbox{match}) \cong 1-e^{-\lambda}
Proof: Write

.. math::

   \prod^{k-1}_{i=1}\left(1-\frac{i}{n}\right)=\exp\left({\sum_{i=1}^{k-1}}\log
   \left(1-\frac{i}{n}\right)\right)

Expand the :math:`\log` using :math:`\log(1-x)=-x+O(x^2)` to see that
the exponent is

.. math:: -\frac{{{k}\choose{2}}}{n} +O\left(\frac{k^3}{n^2}\right)
Setting :math:`1-e^{-\lambda}` equal to :math:`\frac{1}{2}` and solving for
:math:`k` shows that if :math:`k\doteq 1.2 \sqrt{n}`\ , then
:math:`P(\mbox{match}) \doteq \frac{1}{2}`\ . When :math:`n=365`\ ,
:math:`k=1.2\sqrt{n}\doteq 22.9`\ . This gives the usual statement: it
is about even odds that there are two people with the same birthday in a
group of 23.
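The exact product and the exponential approximation are easy to compare numerically; a minimal sketch (function names are ours):

```python
import math

def p_match_classical(n, k):
    """Exact classical P(match): 1 - prod_{i=1}^{k-1} (1 - i/n)."""
    p_no = 1.0
    for i in range(1, k):
        p_no *= 1.0 - i / n
    return 1.0 - p_no

def p_match_approx(n, k):
    """The approximation 1 - exp(-C(k,2)/n) from the proposition."""
    return 1.0 - math.exp(-math.comb(k, 2) / n)
```

For :math:`n=365` the exact probability first exceeds :math:`\frac{1}{2}` at :math:`k=23`, and the approximation agrees with the exact value to two decimal places there.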
In contrast with the clean formulation above, consider
an introductory probability class with 25
students. When considering ‘doing’ the birthday problem, several things
come to mind:
- It is slightly better than a 50-50 chance of success with 25
  students.
- If it fails it’s quite a disaster: 10 minutes of class time, with a
  big build-up, devoted to a failure.
- If it succeeds, the students are usually captivated, and they are
  interested in learning how computations can dissipate the *‘intuitive
  fog’* surrounding probability.
- The students are practically all from the same year, and it is quite well
  known that birthdays within a year are not uniformly distributed; far
  more births occur on weekdays than on weekends (doctors don’t like to
  work on weekends, and the induced births and c-sections are never
  scheduled for weekends). There are also seasonal trends (more births
  in the summer) and lunar trends.
Taking these things into consideration, you realize that you have no
idea what the ‘true underlying probabilities’ are! The considerations
above make a match more likely. It seems sensible to carry through a
Bayesian version of the birthday problem.
The problem becomes: drop :math:`k` balls into :math:`n` boxes, where
the chance of a ball falling in box :math:`i` is :math:`p_i`\ . Here
:math:`\tilde{p}=(p_1,p_2,\ldots,p_n)` has some prior distribution on the :math:`n`\ -simplex
:math:`\Delta_n`\ . We will work with a Dirichlet prior :math:`D_{\tilde{a}}`\ ,
with :math:`\tilde{a}=(a_1,a_2,\ldots,a_n)`\ .
This has density proportional to
:math:`x_1^{a_1-1} x_2^{a_2-1} \cdots x_n^{a_n-1}`\ ;
it is the :math:`n`\ -dimensional version of the 2-dimensional
beta prior we have already studied.
For :math:`a_i=1` we get the classical uniform
prior on the simplex.
For :math:`a_i\equiv c` we get the symmetric
Dirichlet prior, which interpolates between the uniform prior
(:math:`c=1`\ ) and the classical case
:math:`p_i=\frac{1}{365}` (:math:`c\longrightarrow \infty`\ ). For more
general choices of :math:`a_i` we get a flexible family of priors.
Let us carry out the necessary computations in the following cases; in
each we give the :math:`k` required for a :math:`50-50` chance of a match
when :math:`n=365`\ :

- Uniform prior, :math:`c=1`\ : :math:`k\doteq .83 \sqrt{n}`\ ; for :math:`n=365`\ ,
  :math:`k\doteq 16`
- Symmetric prior, :math:`a_i=c`\ :

  .. math::

     \begin{array}{l|cccccc}
     c & .5 & 1 & 2 & 5 & 20 & \infty\\
     \hline
     k_c & 13.2 & 16.2 & 18.7 & 20.9 & 21.9 & 22.9\\
     \end{array}
In coin tossing the standard prior is the uniform on :math:`[0,1]`\ .
The standard prior on the :math:`n`\ -dimensional simplex
:math:`\Delta_n` is the uniform distribution :math:`U`\ , under which all
vectors :math:`\tilde{p}` are equally likely.
The probability of a match, averaged over :math:`(p_1,p_2,\ldots,p_n)`\ ,
represents the chance of success for a Bayesian statistician who has chosen the
uniform prior. As is well known (Bayes (1763), Good (1979),
Diaconis and Efron (1987)), such a uniform mixture of multinomials
results in a Bose-Einstein allocation of balls in boxes, each
configuration, or composition, :math:`(k_1,k_2,\ldots,k_n)` being
equally likely with chance :math:`\displaystyle{
\frac{1}{{{k+n-1}\choose {k}}}}`\ . For this simple prior, it is
again possible to do an exact calculation:
Under a uniform prior on :math:`\Delta_n`\ ,

.. math::

   \label{eq2.1}
   P_u(\mbox{match})=1-\prod_{i=1}^{k-1}\left(\frac{1-\frac{i}{n}}{1+\frac{i}{n}}\right)

If :math:`n` and :math:`k` are large in such a way that
:math:`\frac{k^2}{n}\longrightarrow \lambda`\ , then

.. math::

   \label{eq2.2}
   P(\mbox{match})\cong 1-e^{-\lambda}
**Proof**
Represent the uniform mixture of multinomials using Polya’s urn as
described above. The chance that the first :math:`k`
balls fall in different boxes is

.. math::

   \frac{n-1}{n+1}
   \times \frac{n-2}{n+2} \times \cdots \times \frac{n-(k-1)}{n+(k-1)}

This gives the exact formula, and the approximation follows by the same
calculus estimates detailed in the proof of the proposition for the
classical case.
Thus, in order to obtain a 50-50 chance of a match under a uniform
prior, :math:`k` must be about :math:`.83\sqrt{n}`\ . When :math:`n=365`\ ,
this becomes :math:`k\doteq 16`\ , and for :math:`k=23`\ ,
:math:`P_u(\mbox{match})\doteq .75`\ .
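The exact product under the uniform prior is again a one-liner to evaluate; a minimal sketch (the function name is ours), with the approximation :math:`1-e^{-k^2/n}` for comparison:

```python
import math

def p_match_uniform_prior(n, k):
    """Bayesian birthday problem with a uniform prior on the simplex:
    P_u(match) = 1 - prod_{i=1}^{k-1} (n - i)/(n + i)."""
    p_no = 1.0
    for i in range(1, k):
        p_no *= (n - i) / (n + i)
    return 1.0 - p_no
```

For :math:`n=365` and :math:`k=23` this gives about :math:`.75`, as stated above, in rough agreement with :math:`1-e^{-k^2/n}`.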
The uniform prior allows some mass far from
:math:`(\frac{1}{n},\frac{1}{n},
\ldots,
\frac{1}{n})` and such “lumpy” configurations make a match quite
likely.
The uniform prior studied above is a special case of a symmetric
Dirichlet prior :math:`D_c` on :math:`\Delta_n`\ , with :math:`c=1`\ .
We next extend the calculations above to a general :math:`c`\ . For
:math:`c` increasing to infinity, the prior converges to point mass
:math:`\delta_{(\frac{1}{n},\frac{1}{n},\ldots \frac{1}{n})}`
and thus gives the classical answer. When :math:`c`
converges to :math:`0`\ , :math:`D_c` becomes an improper prior
giving infinite mass to the corners of the simplex; thus for small
:math:`c`\ , the following proposition shows that matches are judged
likely even when :math:`k=2`\ .
Under a symmetric Dirichlet prior :math:`D_c` on :math:`\Delta_n`\ ,

.. math:: P_c(\mbox{match})=1-\prod^{k-1}_{i=1} \frac{(n-i)c}{nc+i} .

For a proof see the original paper.
In order for the probability of a match to be about
:math:`\frac{1}{2}`\ , :math:`k_c\doteq 1.2\sqrt{\frac{nc}{c+1}}` is
needed. The following table shows how :math:`k_c` depends on
:math:`c` when :math:`n=365`\ :

.. math::

   \begin{array}{l|cccccc}
   c & .5 & 1 & 2 & 5 & 20 & \infty\\
   \hline
   k_c & 13.2 & 16.2 & 18.7 & 20.9 & 21.9 & 22.9\\
   \end{array}
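The proposition makes it straightforward to reproduce such a table numerically: evaluate the stated product and search for the smallest integer :math:`k` giving at least even odds (the table's fractional entries come from the continuous approximation :math:`1.2\sqrt{nc/(c+1)}`). A sketch with our own helper names:

```python
def p_match_symmetric(n, k, c):
    """P_c(match) = 1 - prod_{i=1}^{k-1} (n - i) c / (n c + i)."""
    p_no = 1.0
    for i in range(1, k):
        p_no *= (n - i) * c / (n * c + i)
    return 1.0 - p_no

def smallest_even_odds_k(n, c):
    """Smallest integer k with P_c(match) >= 1/2."""
    k = 2
    while p_match_symmetric(n, k, c) < 0.5:
        k += 1
    return k
```

Small :math:`c` favors lumpy :math:`\tilde{p}`, so fewer people are needed; as :math:`c` grows, the answer increases toward the classical 23.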
- Honest priors: Construct a :math:`2`\ -“hyper”parameter family of
  Dirichlet priors by writing :math:`a_i= A \pi_i`\ , with
  :math:`\pi_1+\pi_2+\cdots+ \pi_n=1`\ . Assign weekdays the parameter
  :math:`\pi_i=a` and weekends :math:`\pi_i=\gamma a`\ , with
  :math:`260 a +104 \gamma a=1`\ . Here :math:`\gamma` is the parameter
  ‘ratio of weekends to weekdays’ (roughly, we said :math:`\gamma
  \doteq .7`\ ) and :math:`A` measures the strength of prior
  conviction. The table below shows how :math:`k` varies as a function
  of :math:`A` and :math:`\gamma`\ . We have assumed the year has
  :math:`7\times 52=364` days.

  .. math::

     \begin{array}{l|ccc}
     A \backslash \gamma & .5 & .7 & 1 \\
     \hline
     1 & 2.2 & 2.2 & 2.2\\
     364 & 16.1& 16.3 & 16.4\\
     728 & 18.4& 18.6& 18.8\\
     \infty & 22.2& 22.4& 22.6
     \end{array}
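A numerical sketch of how such entries can be computed, hedged as our own derivation rather than the original computation: by the Polya-urn representation, :math:`P(\mbox{no match})=k!\,e_k(a_1,\ldots,a_n)/A_{(k)}`, where :math:`e_k` is the elementary symmetric polynomial in the :math:`a_i`; with only two distinct values (weekdays and weekends) :math:`e_k` reduces to a binomial sum.

```python
from math import comb

def p_match_two_level(A, gamma, k, n_wd=260, n_we=104):
    """P(match) under the Dirichlet prior with a_i = A*pi_i, where
    pi_i = a on the n_wd weekdays and gamma*a on the n_we weekend days,
    and 260a + 104*gamma*a = 1.  Uses the Polya-urn identity
    P(no match) = k! * e_k(a_1,...,a_n) / A_(k)."""
    a = 1.0 / (n_wd + gamma * n_we)
    wd, we = A * a, A * gamma * a          # the two distinct a_i values
    # elementary symmetric polynomial e_k, grouping the equal values
    e_k = sum(comb(n_wd, j) * comb(n_we, k - j) * wd**j * we**(k - j)
              for j in range(max(0, k - n_we), min(k, n_wd) + 1))
    k_fact = 1.0
    rising = 1.0
    for t in range(k):
        k_fact *= t + 1
        rising *= A + t
    return 1.0 - k_fact * e_k / rising
```

With :math:`\gamma=1` and :math:`A=364` every :math:`a_i=1`, and the formula collapses to the uniform-prior product on 364 boxes, which gives a convenient consistency check.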
**Remarks**

#. Under the uniform prior, symmetric Dirichlet priors with small :math:`c`\ , and
   ‘honest priors’ with moderate :math:`A`\ , the prior concentrates on
   fairly ‘lumpy’ vectors :math:`\tilde{p}`\ , which will have matches with small
   :math:`k`\ .
#. The calculations show that some uncertainty about :math:`p` leads to
   a much improved chance of a match. Returning to the :math:`25`
   students in the probability class, the Dirichlet prior with
   :math:`A=365` and :math:`\gamma=.7` is a first approximation to the
   prior of the present authors. It leads to a chance of a match with
   :math:`25` students of 0.81. This may be the reason the birthday
   problem works as often as it does!