1. Binary Variables

Likelihood: $ p(D | \mu) = \prod_{n=1}^{N}\mu ^ {x_n} (1-\mu)^{1 - x_n}$

Beta distribution (prior): $ Beta(\mu | a, b) = \frac{\Gamma(a + b)}{\Gamma(a) \Gamma(b)}\mu^{a-1}(1 - \mu)^{b-1} $

* The $\Gamma$ factor is the normalization constant: it makes the density integrate to 1 over $\mu \in [0, 1]$.

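Posterior: $ p(\mu | m, l, a, b) = \frac{\Gamma(m + a + l + b)}{\Gamma (m + a) \Gamma (l + b)}\mu^{m + a - 1}(1 - \mu)^{l + b - 1} $, where $m$ is the number of observations with $x_n = 1$ and $l = N - m$ is the number with $x_n = 0$.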

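The beta distribution is chosen as the prior because the posterior then has the same functional form as the prior: both are proportional to $\mu^{x} (1 - \mu)^{y}$ for some exponents $x$ and $y$, since the likelihood is itself a product of powers of $\mu$ and $1 - \mu$. This useful property is called conjugacy.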
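The beta posterior update above amounts to adding counts to the prior parameters. Below is a minimal sketch using only the Python standard library; the function names `beta_posterior_params` and `beta_pdf`, the sample data, and the prior values `a=2.0, b=2.0` are illustrative assumptions rather than part of the original notes.

```python
import math

def beta_posterior_params(data, a, b):
    """Return the posterior parameters (a + m, b + l) for binary data."""
    m = sum(data)          # number of observations with x_n = 1
    l = len(data) - m      # number of observations with x_n = 0
    return a + m, b + l

def beta_pdf(mu, a, b):
    """Beta density for 0 < mu < 1; the Gamma ratio is the normalizer."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm
                    + (a - 1) * math.log(mu)
                    + (b - 1) * math.log(1 - mu))

# Hypothetical data set: six binary observations, four of them equal to 1.
data = [1, 0, 1, 1, 0, 1]
a_post, b_post = beta_posterior_params(data, a=2.0, b=2.0)

# The mean of Beta(a, b) is a / (a + b).
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)
```

Working in log space with `math.lgamma` avoids overflow in the Gamma functions when the counts are large.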

2. Multinomial Variables

Likelihood: $ Mult(m_1, m_2, \dots, m_K | \mu, N) = \binom{N}{m_1 m_2 \dots m_K} \prod_{k=1}^{K}\mu_k^{m_k} $

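Dirichlet distribution (prior): $ Dir(\mu | \alpha ) = \frac{\Gamma(\alpha_0)}{\Gamma (\alpha_1) \dots \Gamma(\alpha_K)} \prod_{k=1}^{K}\mu_k^{\alpha_k-1} $, where $\alpha_0 = \sum_{k=1}^{K}\alpha_k$

* The $\Gamma$ factor is the normalization constant: it makes the density integrate to 1 over the simplex $\sum_{k}\mu_k = 1$, $\mu_k \geq 0$.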

Posterior: $ p(\mu | D, \alpha) = \frac{\Gamma(\alpha_0 + N)}{\Gamma (\alpha_1 + m_1) \dots \Gamma(\alpha_K + m_K)} \prod_{k=1}^{K}\mu_k^{\alpha_k + m_k - 1} $

The Dirichlet distribution is chosen as the prior because the posterior then has the same functional form as the prior (a product of powers of the $\mu_k$), which is again the conjugacy property.
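As in the binary case, the Dirichlet posterior is obtained by adding the observed counts $m_k$ to the prior parameters $\alpha_k$. The following is a minimal sketch using only the Python standard library; the names `dirichlet_posterior_params` and `dirichlet_log_pdf`, the counts, and the uniform prior `alpha = [1.0, 1.0, 1.0]` are assumptions chosen for illustration.

```python
import math

def dirichlet_posterior_params(counts, alpha):
    """Return the posterior parameters alpha_k + m_k for K categories."""
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length K")
    return [a_k + m_k for a_k, m_k in zip(alpha, counts)]

def dirichlet_log_pdf(mu, alpha):
    """Log Dirichlet density at mu (all mu_k > 0, summing to 1)."""
    alpha_0 = sum(alpha)
    # Log of the Gamma ratio that normalizes the density.
    log_norm = math.lgamma(alpha_0) - sum(math.lgamma(a_k) for a_k in alpha)
    return log_norm + sum((a_k - 1) * math.log(mu_k)
                          for a_k, mu_k in zip(alpha, mu))

# Hypothetical counts m_k for K = 3 categories, so N = 6.
counts = [3, 1, 2]
alpha = [1.0, 1.0, 1.0]

alpha_post = dirichlet_posterior_params(counts, alpha)
alpha_post_0 = sum(alpha_post)   # equals alpha_0 + N

# The mean of Dir(alpha) is alpha_k / alpha_0 in each component.
posterior_mean = [a_k / alpha_post_0 for a_k in alpha_post]
print(alpha_post, posterior_mean)
```

With $K = 2$ this reduces to the beta update from the binary case, with $(\alpha_1, \alpha_2) = (a, b)$ and $(m_1, m_2) = (m, l)$.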