Naive Bayes Classification#

Naive Bayes classifiers are a class of relatively simple and computationally efficient classifiers for multiclass classification tasks. They are based on Bayes’ theorem for conditional probabilities and on the (naive) assumption that the features, interpreted as random variables, are mutually independent given the label.

Given training samples \((x_1, y_1),\ldots,(x_n, y_n)\) with \(m\)-dimensional feature vectors \(x_l\) and labels \(y_l\in\{1,\ldots,C\}\) we want to train a model which assigns a label \(y_0\) to non-training input \(x_0\).

Principal Approach#

The training samples can be regarded as \(n\) realizations of a pair \((X,Y)\) of random variables. Then a natural way to select a label for the unlabeled feature vector \(x_0\) is to postulate

\[\begin{equation*} P(Y=i|X=x_0)\to\max_{i\in\{1,\ldots,C\}}. \end{equation*}\]

That is, we choose \(y_0\) to be the maximizer of \(P(Y=i|X=x_0)\) with respect to \(i\).

The probabilities \(P(Y=i|X=x_0)\) are not accessible. But Bayes’ theorem states

\[\begin{equation*} P(Y=i|X=x_0)=\frac{p_{X|Y=i}(x_0)\,P(Y=i)}{p_X(x_0)} \end{equation*}\]

for all \(i\in\{1,\ldots,C\}\). Here, \(p_{X|Y=i}\) is a density of the conditional probability measure \(P(\,\cdot\,|Y=i)\) and \(p_X\) is the marginal density of \(X\). If all components of \(X\) are discrete random variables, then \(p_{X|Y=i}(x_0)\) and \(p_X(x_0)\) can be replaced by the probabilities \(P(X=x_0|Y=i)\) and \(P(X=x_0)\), respectively. If \(X\) contains continuous components, then we have to work with densities.

The denominator \(p_X(x_0)\) in Bayes’ theorem does not depend on \(i\) and, thus, does not influence maximization. Knowing \(p_{X|Y=i}(x_0)\) and \(P(Y=i)\) for all \(i\) we may compute

\[\begin{equation*} p_X(x_0)=\sum_{i=1}^C p_{X|Y=i}(x_0)\,P(Y=i), \end{equation*}\]

because the sum of all \(P(Y=i|X=x_0)\) has to be 1. The conditional density \(p_{X|Y=i}(x_0)\) and the probability \(P(Y=i)\) can be estimated from the training samples.
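
In code the decision rule is short once class-conditional densities and label probabilities are available. Below is a minimal sketch in Python, where `class_densities` is a hypothetical list of functions returning \(p_{X|Y=i}(x_0)\) and `priors` holds the corresponding \(P(Y=i)\); both would come from the estimation steps discussed in the following sections.

```python
import numpy as np

def predict(x0, class_densities, priors):
    """Return the maximizing label and all posteriors P(Y=i|X=x0).

    class_densities: list of C functions, class_densities[i](x0) ~ p_{X|Y=i}(x0)
    priors:          array of C label probabilities P(Y=i)
    (classes are indexed 0,...,C-1 here)
    """
    # numerators of Bayes' theorem for all classes
    joint = np.array([d(x0) for d in class_densities]) * np.asarray(priors)

    # dividing by their sum is dividing by p_X(x0); posteriors sum to 1
    posteriors = joint / joint.sum()

    return np.argmax(posteriors), posteriors
```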

Estimating Label Probabilities#

For estimating the probabilities \(P(Y=i)\) with \(i\in\{1,\ldots,C\}\) there exist two standard approaches:

  • If it is known from the context of the learning task that all labels are equally likely, then

    \[\begin{equation*} P(Y=i)=\frac{1}{C} \end{equation*}\]

    for all \(i\in\{1,\ldots,C\}\).

  • If there is no additional information about the label distribution and if the training samples reflect the unknown underlying distribution sufficiently accurately, then

    \[\begin{equation*} P(Y=i)=\frac{\bigl|\bigl\{l\in\{1,\ldots,n\}:\,y_l=i\bigr\}\bigr|}{n} \end{equation*}\]

    for all \(i\in\{1,\ldots,C\}\) is a reasonable choice.

For some learning tasks there might be additional information about label distribution available. Then a tailored estimate for \(P(Y=i)\) should be used.
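
A minimal sketch of the second approach (relative label frequencies), assuming the training labels are given as integers \(0,\ldots,C-1\) in a NumPy array `y`:

```python
import numpy as np

y = np.array([0, 2, 1, 1, 2, 2])               # labels of n = 6 training samples, C = 3

# relative label frequencies as estimates for P(Y=i)
priors = np.bincount(y, minlength=3) / len(y)
print(priors)                                  # [1/6, 1/3, 1/2]
```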

Estimating Classes’ Feature Probabilities#

To estimate the densities \(p_{X|Y=i}(x_0)\) in Bayes’ theorem above we assume that the components \(X^{(1)},\ldots,X^{(m)}\) of the random variable \(X\) are mutually independent given the label \(Y\). Usually this assumption is not satisfied, but it simplifies formulas and makes computations more efficient. Thus, the classification approach discussed here is called naive.

This conditional independence yields

\[\begin{equation*} p_{X|Y=i}(x_0)=p_{X^{(1)}|Y=i}\left(x_0^{(1)}\right)\cdots p_{X^{(m)}|Y=i}\left(x_0^{(m)}\right). \end{equation*}\]

So we only have to estimate one-dimensional densities. To simplify notation we write \(p_{U|Y=i}\left(u_0\right)\) as placeholder for one of the \(m\) densities.
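
In code the factorization is a product over \(m\) one-dimensional densities. Since products of many small values tend to underflow, implementations typically sum logarithms instead. A minimal sketch, where `feature_densities_i` is a hypothetical list of functions for class \(i\), the \(k\)-th returning \(p_{X^{(k)}|Y=i}\bigl(x_0^{(k)}\bigr)\):

```python
import numpy as np

def log_class_density(x0, feature_densities_i):
    """Logarithm of p_{X|Y=i}(x0) under the naive independence assumption."""
    # sum of log densities = log of the product of the m one-dimensional densities
    return sum(np.log(d(x0_k)) for d, x0_k in zip(feature_densities_i, x0))
```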

Gaussian Probabilities#

If \(U\) is a continuous random variable we may assume that \(p_{U|Y=i}\) is a Gaussian density with mean \(\mu_i\) and standard deviation \(\sigma_i\):

\[\begin{equation*} p_{U|Y=i}(u_0)=\frac{1}{\sigma_i\,\sqrt{2\,\pi}}\,\exp\left(-\frac{1}{2}\,\left(\frac{u_0-\mu_i}{\sigma_i}\right)^2\right) \end{equation*}\]

Parameters \(\mu_i\) and \(\sigma_i\) can be estimated from the training samples in class \(i\) with standard techniques (see statistics lecture).
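
A minimal sketch for one feature, where `u_class_i` is a hypothetical array holding the values of this feature for all class-\(i\) training samples:

```python
import numpy as np
from scipy.stats import norm

u_class_i = np.array([1.9, 2.3, 2.1, 2.6, 1.8])   # feature values of class i (made up)

mu_i = u_class_i.mean()               # estimated mean
sigma_i = u_class_i.std(ddof=1)       # estimated (sample) standard deviation

# Gaussian density p_{U|Y=i} evaluated at a new feature value u0
u0 = 2.0
density = norm.pdf(u0, loc=mu_i, scale=sigma_i)
```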

Bernoulli Probabilities#

If \(U\) takes values in \(\{0,1\}\) then \(p_{U|Y=i}(u_0)\) in Bayes’ theorem has to be replaced by \(P(U=u_0|Y=i)\) with

\[\begin{equation*} P(U=0|Y=i)=1-p_i\qquad\text{and}\qquad P(U=1|Y=i)=p_i. \end{equation*}\]

The parameter \(p_i\) can be estimated from the training samples in class \(i\) by counting the samples with value 1 for the feature under consideration.
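
A minimal sketch, with `u_class_i` a hypothetical array of the binary values of this feature for the class-\(i\) training samples:

```python
import numpy as np

u_class_i = np.array([1, 0, 1, 1, 0, 1])   # binary feature values of class i (made up)

p_i = u_class_i.mean()                     # fraction of samples with value 1

# P(U = u0 | Y = i) for a new binary value u0
u0 = 0
prob = p_i if u0 == 1 else 1 - p_i
```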

Multinomial Probabilities#

If \(X^{(1)},\ldots,X^{(m)}\) count the occurrences of \(m\) possible events for several independent trials of an experiment, then the corresponding random vector follows a multinomial distribution with parameters \(p_1,\ldots,p_m\) (probabilities of events) and \(\nu\) (number of trials):

\[\begin{equation*} P(X=x_0)=\frac{\nu!}{x_0^{(1)}!\cdots x_0^{(m)}!}\,p_1^{x_0^{(1)}}\cdots p_m^{x_0^{(m)}}, \end{equation*}\]

where \(p_1+\cdots+p_m=1\).
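
The formula can be checked numerically with SciPy’s multinomial distribution; the counts and probabilities below are made up:

```python
import numpy as np
from scipy.stats import multinomial

x0 = np.array([2, 1, 0])        # observed event counts, so nu = 3 trials
p = np.array([0.5, 0.3, 0.2])   # event probabilities, summing to 1

# P(X = x0) = 3!/(2! 1! 0!) * 0.5^2 * 0.3^1 * 0.2^0 = 0.225
prob = multinomial.pmf(x0, x0.sum(), p)
```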

A typical machine learning application with multinomially distributed inputs is language processing. Each feature counts the number of occurrences of some word in a text. Based on such word counts, models can then be trained to solve classification tasks.

Note that components of a multinomially distributed random vector are not independent. Thus, multinomial probabilities do not fit into the naive Bayes framework. Nonetheless multinomial naive Bayes is a fixed term, because non-independence does not prevent us from writing probabilities as products with each factor depending only on one component of the random vector. To see this we introduce probabilities \(p_{i,1},\ldots,p_{i,m}\) for each class \(i\). Then

\[\begin{equation*} P(X=x_0|Y=i)=\frac{\nu!}{x_0^{(1)}!\cdots x_0^{(m)}!}\,p_{i,1}^{x_0^{(1)}}\cdots p_{i,m}^{x_0^{(m)}} \end{equation*}\]

with \(\nu=x_0^{(1)}+\cdots+x_0^{(m)}\). The factor

\[\begin{equation*} c(x_0):=\frac{\left(x_0^{(1)}+\cdots+x_0^{(m)}\right)!}{x_0^{(1)}!\cdots x_0^{(m)}!} \end{equation*}\]

only depends on the sample \(x_0\), but neither on the probabilities \(p_{i,k}\) nor on the class \(i\). Thus, we have the typical product structure

\[\begin{equation*} P(X=x_0|Y=i)=\prod_{k=1}^m c(x_0)^{\frac{1}{m}}\,p_{i,k}^{x_0^{(k)}} \end{equation*}\]

of the naive Bayes approach, although the components of \(X\) are not independent. Moreover, we do not have to compute \(c(x_0)\) explicitly. It is a common scaling factor which can be recovered afterwards from the fact that all probabilities have to sum to 1 (cf. \(p_X(x_0)\) above).
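
For prediction we thus may drop \(c(x_0)\), accumulate \(\sum_k x_0^{(k)}\log p_{i,k}\) plus the log prior per class, and normalize at the end. A minimal sketch with made-up per-class event probabilities `P_events` (row \(i\) holds \(p_{i,1},\ldots,p_{i,m}\)) and priors:

```python
import numpy as np

P_events = np.array([[0.7, 0.2, 0.1],    # p_{0,k}
                     [0.2, 0.3, 0.5]])   # p_{1,k}
priors = np.array([0.5, 0.5])            # P(Y=i)
x0 = np.array([3, 1, 2])                 # observed event counts

# log of p_{i,1}^{x0^(1)} ... p_{i,m}^{x0^(m)} * P(Y=i); c(x0) is omitted
log_scores = x0 @ np.log(P_events.T) + np.log(priors)

# normalizing restores the omitted common factor; posteriors sum to 1
posteriors = np.exp(log_scores - log_scores.max())
posteriors /= posteriors.sum()
print(posteriors.argmax(), posteriors)
```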

The \(p_{i,k}\) can be estimated from training data by counting the occurrences of event \(k\) in class \(i\). Denoting the total number of occurrences of event \(k\) over all samples in class \(i\) by \(\nu_{i,k}\) and the total number of event occurrences in class \(i\) by \(n_i:=\nu_{i,1}+\cdots+\nu_{i,m}\), the estimate reads

\[\begin{equation*} p_{i,k}=\frac{\nu_{i,k}}{n_i}. \end{equation*}\]

If \(\nu_{i,k}\) is zero for some event, which might be the case for small training sets, then \(p_{i,k}\) is estimated to be zero. But if \(p_{i,k}\) is zero for some event \(k\), then due to the product structure the whole probability \(P(X=x_0|Y=i)\) becomes zero whenever \(x_0\) contains that event. Thus, a class for which some event never occurs in the training data is never assigned to samples containing that event.

Note

The problem with zero probabilities stems from estimation. The exact probabilities are always strictly positive; otherwise we would be counting occurrences of an event that has zero probability of occurring. Due to estimating probabilities from incomplete data we may estimate probabilities to be zero although they are not.

To avoid such failures, either sufficiently large training sets have to be used or event counts have to be set to at least one for each event and each class. A typical estimate for \(p_{i,k}\) which prevents problems with missing events is

\[\begin{equation*} p_{i,k}=\frac{\nu_{i,k}+1}{n_i+m}, \end{equation*}\]

known as Laplace smoothing. The total event count per class is increased by \(m\). The idea here is that for each of the \(m\) events we add one artificial occurrence to each class, so that no event has a zero count.
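
A minimal sketch of both estimates, where `X_i` is a hypothetical matrix holding the count vectors of the class-\(i\) training samples as rows:

```python
import numpy as np

X_i = np.array([[2, 0, 1],     # count vectors of the class-i training samples (made up)
                [1, 0, 3],
                [4, 1, 0]])

nu_i = X_i.sum(axis=0)         # nu_{i,k}: total occurrences of event k in class i
n_i = nu_i.sum()               # total number of event occurrences in class i
m = X_i.shape[1]               # number of events

p_plain = nu_i / n_i                  # may contain zeros
p_laplace = (nu_i + 1) / (n_i + m)    # Laplace smoothing, strictly positive
```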

Kernel Density Estimates#

If the underlying probability model is not known in advance, the densities \(p_{X^{(k)}|Y=i}\) can be estimated from training data using kernel density estimation.
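
A minimal sketch for one feature using scikit-learn’s KernelDensity; the sample values and the bandwidth are made up for illustration:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

u_class_i = np.array([1.1, 1.4, 2.2, 2.3, 2.9])   # feature values of class i (made up)

# fit a Gaussian kernel density estimate to the class-i samples
kde = KernelDensity(kernel="gaussian", bandwidth=0.5)
kde.fit(u_class_i.reshape(-1, 1))                 # expects a 2d array of samples

# score_samples returns log densities
u0 = 2.0
density = np.exp(kde.score_samples(np.array([[u0]])))[0]
```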

Self-study task

Read about kernel density estimation at Wikipedia (introduction, definition, example).

Naive Bayes Classification with Scikit-Learn#

Different variants of naive Bayes classification are implemented in Scikit-Learn’s naive_bayes module.
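
For instance, GaussianNB implements the Gaussian variant described above, while MultinomialNB and BernoulliNB cover the multinomial and Bernoulli variants. A minimal usage sketch with made-up data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# made-up training data: 6 samples, 2 continuous features, 2 classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.2],
              [3.1, 0.5], [2.9, 0.7], [3.3, 0.4]])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB()
model.fit(X, y)

x0 = np.array([[1.1, 2.0]])
print(model.predict(x0))          # predicted label y0
print(model.predict_proba(x0))    # estimated posteriors P(Y=i|X=x0)
```

The smoothing discussed above corresponds to the `alpha` parameter of MultinomialNB and BernoulliNB.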