Computing a classification means computing the class posterior probabilities \(Pr(G|X)\), that is, the probability that the class is \(G\) given the input \(X\). If \(f_k(x)\) is the class-conditional density of \(X\) in class \(G=k\), and \(\pi_k\) is the prior probability of class \(k\) with \(\sum_{k=1}^K \pi_k = 1\), then Bayes' theorem gives
\[\begin{equation}
Pr(G=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^K f_l(x)\pi_l}
\end{equation}\]
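For concreteness, here is a minimal numerical sketch of this formula, assuming a hypothetical two-class problem with hand-picked Gaussian class densities (all parameter values below are invented purely for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class setup: the class-conditional densities f_k(x)
# are Gaussians with hand-picked means and covariances; pi_k are priors.
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
priors = np.array([0.7, 0.3])

def posterior(x):
    """Pr(G=k | X=x) = f_k(x) pi_k / sum_l f_l(x) pi_l."""
    f = np.array([multivariate_normal(m, c).pdf(x)
                  for m, c in zip(means, covs)])
    joint = f * priors
    return joint / joint.sum()

print(posterior(np.array([1.0, 0.5])))  # the two posteriors sum to 1
```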
Depending on the model used for the class densities, different techniques emerge (see the sketch after this list):
* Linear and quadratic discriminant analysis use Gaussian densities.
* Mixtures of Gaussians lead to non-linear decision boundaries.
* Naive Bayes models assume that each class density is a product of marginal densities.
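As a rough illustration of how these families differ in practice (not part of the original derivation), scikit-learn ships an estimator for each: LDA fits one shared covariance, QDA a per-class covariance, and Gaussian naive Bayes a diagonal per-class covariance. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# LDA: shared covariance -> linear boundaries; QDA: per-class covariance
# -> quadratic boundaries; GaussianNB: per-class diagonal covariance
# (product of marginal densities).
for model in (LinearDiscriminantAnalysis(),
              QuadraticDiscriminantAnalysis(),
              GaussianNB()):
    print(type(model).__name__, model.fit(X, y).score(X, y))
```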
Derivation: Suppose that each class density is modelled as a multivariate Gaussian, \[\begin{equation} f_k(x)=\frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}. \end{equation}\] From the Bayes formula above, \[\begin{equation} Pr(G=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^K f_l(x)\pi_l} \\ Pr(G=l|X=x) = \frac{f_l(x)\pi_l}{\sum_{m=1}^K f_m(x)\pi_m} \\ \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)} = \frac{f_k(x)\pi_k}{f_l(x)\pi_l} \\ \log\left(\frac{Pr(G=k|X=x)}{Pr(G=l|X=x)} \right) = \log\frac{f_k(x)}{f_l(x)}+\log\frac{\pi_k}{\pi_l} \end{equation}\] The expression \(\log\frac{f_k(x)}{f_l(x)}\) can now be simplified. Note that \(\Sigma\) is assumed to be the same for all classes, so the normalizing constants cancel: \[\begin{equation} \log\left(\frac{f_k(x)}{f_l(x)}\right) = -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)+\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l) \end{equation}\] Expanding the quadratic forms gives \[\begin{equation} = -\frac{1}{2} [x^T\Sigma^{-1}x-x^T\Sigma^{-1}\mu_k-\mu_k^T\Sigma^{-1}x+\mu_k^T\Sigma^{-1}\mu_k - x^T\Sigma^{-1}x + x^T\Sigma^{-1}\mu_l + \mu_l^T \Sigma^{-1} x - \mu_l^T\Sigma^{-1}\mu_l ] \end{equation}\] and, after distributing the minus sign, \[\begin{equation} = \frac{1}{2} [-x^T\Sigma^{-1}x + x^T\Sigma^{-1}\mu_k + \mu_k^T\Sigma^{-1}x -\mu_k^T\Sigma^{-1}\mu_k+x^T\Sigma^{-1}x-x^T\Sigma^{-1}\mu_l-\mu_l^T\Sigma^{-1}x + \mu_l^T\Sigma^{-1}\mu_l ] \end{equation}\]
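Because transposes are easy to drop in this algebra, the following sketch (with made-up parameters; `Si` denotes \(\Sigma^{-1}\)) numerically checks that the expanded bracket above, in which the \(x^T\Sigma^{-1}x\) terms cancel, equals the log-density ratio computed directly:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
p = 3
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)      # shared covariance, symmetric PD
Si = np.linalg.inv(Sigma)
mu_k, mu_l, x = rng.normal(size=p), rng.normal(size=p), rng.normal(size=p)

# Direct: log f_k(x) - log f_l(x)
direct = (multivariate_normal(mu_k, Sigma).logpdf(x)
          - multivariate_normal(mu_l, Sigma).logpdf(x))

# Expanded bracket from the derivation; the x^T Si x terms cancel.
expanded = 0.5 * (x @ Si @ mu_k + mu_k @ Si @ x - mu_k @ Si @ mu_k
                  - x @ Si @ mu_l - mu_l @ Si @ x + mu_l @ Si @ mu_l)

assert np.isclose(direct, expanded)  # the normalizing constants cancel too
print(direct, expanded)
```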
Before simplifying further, we should note the following:
- Since \(\Sigma\) is a covariance matrix and the covariance between \(i\) and \(j\) is the same as between \(j\) and \(i\), \(\Sigma\) is symmetric: \(\Sigma^T=\Sigma\), and consequently \((\Sigma^{-1})^T=\Sigma^{-1}\).
- The mean vectors \(\mu_k\) for class \(k\) and \(\mu_l\) for class \(l\) are \(p \times 1\) column vectors.
- Terms such as \(\mu_k^T \Sigma^{-1}\mu_l\) have dimensions \((1\times p)(p \times p)(p \times 1)=1\times 1\), i.e., they are scalars (real numbers). In particular, \(x^T\Sigma^{-1}\mu_k = (x^T\Sigma^{-1}\mu_k)^T = \mu_k^T\Sigma^{-1}x\).
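These three observations are easy to confirm numerically; a quick sanity check with arbitrary made-up values:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)     # symmetric positive definite covariance
Si = np.linalg.inv(Sigma)
mu_k, mu_l, x = rng.normal(size=p), rng.normal(size=p), rng.normal(size=p)

assert np.allclose(Sigma, Sigma.T)  # Sigma is symmetric ...
assert np.allclose(Si, Si.T)        # ... and therefore so is its inverse
assert np.ndim(mu_k @ Si @ mu_l) == 0            # (1xp)(pxp)(px1) is a scalar
assert np.isclose(x @ Si @ mu_k, mu_k @ Si @ x)  # symmetry lets us transpose
```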