Naive Bayes classifiers are classifiers built on the assumption that features are statistically independent of one another. Unlike many other classifiers, which assume that for a given class there will be some correlation between features, naive Bayes explicitly models the features as conditionally independent given the class. While this may seem an overly simplistic (naive) restriction on the data, in practice naive Bayes is competitive with more sophisticated techniques and enjoys some theoretical support for its efficacy.
Because of the independence assumption, naive Bayes classifiers are highly scalable and can quickly learn to use high-dimensional features with limited training data. This is useful for many real-world datasets where the amount of data is small in comparison with the number of features for each individual piece of data, such as speech, text, and image data. Examples of modern applications include spam filtering, automatic medical diagnosis, medical image processing, and vocal emotion recognition.
Performing classification on input vectors of features involves assigning each feature vector \(\mathbf{x} = (x_1, \dots, x_n)\) to a class \(C_k\). For example, if one is seeking to classify trees using the features "height" and "sheds leaves in winter," the possible classes may be tree species such as oak and Douglas fir.
Modeling this classification function typically involves creating a probability distribution modeling the joint distribution \(P(C_k, x_1, \dots, x_n)\) and then using Bayes' theorem plus the class prior probabilities \(P(C_k)\) to calculate the class probability conditioned on the features \(x_1, \dots, x_n\). In the example above, the joint distribution would be a function that models the chances of both the class label, which specifies the species of tree, and specific values for the tree features co-occurring.
For example, the joint probability of the class Douglas fir together with the feature value "sheds leaves in winter" would be low, since most Douglas firs don't lose their leaves in the winter. Once this distribution is found, one can use Bayes' theorem to calculate the probability of the tree species given the tree features. For example, for a tree that sheds its leaves in the winter, the highest class probability would belong to oak, since oak trees are the only one of the candidate species that lose their leaves in the winter.
Curse of Dimensionality
Unfortunately, for large feature spaces, learning the joint distribution takes an amount of training data exponential in \(n\), the number of features. This is known as the curse of dimensionality, which intuitively says that as the number of features grows, the number of "bins" in the space spanned by those features grows exponentially fast. Accurately learning a distribution requires an amount of data proportional to the number of bins (i.e. empty bins can't be learned), so learning classifiers on high-dimensional data with small datasets presents a problem.
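The exponential blow-up can be made concrete with a few lines of Python. The feature counts and the choice of \(b = 10\) discrete values per feature below are illustrative assumptions, not from the text:

```python
# A quick sketch of how fast the number of "bins" grows: assuming each of the
# n features is discretized into b values, tabulating the joint distribution
# needs one bin per feature-value combination, i.e. b ** n bins.
def num_bins(n_features, values_per_feature=10):
    """Bins needed to tabulate a joint distribution over the features."""
    return values_per_feature ** n_features

for n in (1, 2, 5, 10):
    print(n, num_bins(n))  # 10, 100, 100000, 10000000000
```

Even at ten features, accurately filling ten billion bins is hopeless for any realistically sized dataset.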
The Naive Assumption
A simplifying assumption is that the features \(x_1, \dots, x_n\) are conditionally independent given the class \(C_k\). As demonstrated below, this avoids the curse of dimensionality by allowing the joint distribution to be decomposed into \(n + 1\) factors (\(n\) per-feature factors \(P(x_i \mid C_k)\) plus the class prior \(P(C_k)\)), thereby reducing the number of bins from roughly \(b^n\) to \(bn\) when each feature takes \(b\) possible values. In the tree example, this would be akin to making the assumption that, given the correct species label \(C_k\), whether a tree sheds its leaves in the winter doesn't affect how tall the tree is. This is probably true for most trees, since a species of tree's height is generally not affected by its shedding, but only by the species itself.
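To see the reduction concretely, here is a small sketch counting the parameters each approach must estimate, assuming \(n\) discrete features with \(b\) values each and \(K\) classes (the specific numbers are illustrative):

```python
# Comparing parameter counts with and without the naive assumption.
def joint_params(n, b, K):
    # tabulating P(C_k, x_1, ..., x_n) directly: one bin per combination
    return K * b ** n

def naive_params(n, b, K):
    # n per-feature tables P(x_i | C_k) of size b, per class, plus K priors
    return K * n * b + K

print(joint_params(10, 10, 3))  # 30000000000 -- exponential in n
print(naive_params(10, 10, 3))  # 303 -- linear in n
```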
The assumption that the features are conditionally independent is considered "naive" since it is, unlike the previous tree example, generally false. For example, in the classification of coins, given a sample coin's weight \(w\) and diameter \(d\), naive Bayes would assume that the weight and diameter are independent conditioned on the type of the coin. However, since the weight of a coin is roughly \(w = g \rho V\) (where \(g\) is the gravitational acceleration, \(V\) is the volume, and \(\rho\) is the density) and the volume is an increasing function of the diameter, the weight is very positively correlated with the diameter for every type of coin. However, even with this typically false assumption, it turns out that naive Bayes yields good performance in practice, even for large numbers of features.
Given a data point \(\mathbf{x} = (x_1, \dots, x_n)\) of \(n\) features, naive Bayes predicts the class \(C_k\) for \(\mathbf{x}\) according to the probability

\[P(C_k \mid x_1, \dots, x_n).\]
Using Bayes' theorem, this can be factored as

\[P(C_k \mid x_1, \dots, x_n) = \frac{P(C_k)\,P(x_1, \dots, x_n \mid C_k)}{P(x_1, \dots, x_n)}.\]
Using the chain rule, the factor \(P(x_1, \dots, x_n \mid C_k)\) in the numerator can be further decomposed as

\[P(x_1, \dots, x_n \mid C_k) = P(x_1 \mid x_2, \dots, x_n, C_k)\,P(x_2 \mid x_3, \dots, x_n, C_k) \cdots P(x_n \mid C_k).\]
At this point, the "naive" conditional independence assumption is put into play. Specifically, naive Bayes models assume that feature \(x_i\) is independent of feature \(x_j\) for \(j \neq i\) given the class \(C_k\). Applying this to the previous decomposition collapses each factor to \(P(x_i \mid C_k)\), which can be formulated as

\[P(x_1, \dots, x_n \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k).\]
Practically speaking, the class-conditional feature probabilities \(P(x_i \mid C_k)\) are usually all modeled using the same family of probability distribution, such as the binomial distribution or the Gaussian distribution. In the latter case, the class-conditional distribution is known as a product of Gaussians and has some convenient computational properties that simplify the optimization equations for parameter learning.
Naive Bayes gives the probability of a data point \(\mathbf{x}\) belonging to class \(C_k\) as proportional to a simple product of \(n + 1\) factors (the class prior \(P(C_k)\) plus \(n\) conditional feature probabilities \(P(x_i \mid C_k)\)). Since classification involves assigning the class \(C_k\) to the data point \(\mathbf{x}\) for which \(P(C_k \mid x_1, \dots, x_n)\) is greatest, this proportional product can be used to determine the most likely class assignment. Specifically,

\[P(C_k \mid x_1, \dots, x_n) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k).\]
Thus, the most likely class assignment for a data point \(\mathbf{x}\) can be found by calculating \(P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)\) for each of the \(K\) classes and assigning the class for which this value is largest. In mathematical notation, this is defined

\[\hat{C} = \underset{k}{\operatorname{argmax}}\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k),\]
where \(\hat{C}\) is the estimated class for \(\mathbf{x}\) given its features \(x_1, \dots, x_n\).
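The argmax rule above can be sketched in a few lines of Python. The priors and likelihood tables below are hypothetical toy values for a two-class model with two binary features; they are not from the text:

```python
# A minimal sketch of naive Bayes classification as an argmax over classes.
prior = {"A": 0.6, "B": 0.4}  # hypothetical P(C_k)
likelihood = {                # likelihood[c][i][v] = P(x_i = v | C_k = c)
    "A": [{0: 0.7, 1: 0.3}, {0: 0.2, 1: 0.8}],
    "B": [{0: 0.4, 1: 0.6}, {0: 0.5, 1: 0.5}],
}

def predict(x):
    """Return the class maximizing P(C_k) * prod_i P(x_i | C_k)."""
    def score(c):
        s = prior[c]
        for i, xi in enumerate(x):
            s *= likelihood[c][i][xi]
        return s
    return max(prior, key=score)

print(predict((0, 1)))  # "A": 0.6*0.7*0.8 = 0.336 beats 0.4*0.4*0.5 = 0.08
```

Note that the normalizing denominator \(P(x_1, \dots, x_n)\) is never computed: it is the same for every class, so it cannot change which class attains the maximum.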
Model the following dataset for males and females using a Gaussian naive Bayes classifier. Then, for a sample with height \(6\) ft, weight \(130\) lbs, and shoe size \(8\) in, predict whether that sample is male or female given the trained model.
| Height (ft) | Weight (lbs) | Shoe Size (in) | Gender |
|---|---|---|---|
| 6.00 | 180 | 12 | male |
| 5.92 | 190 | 11 | male |
| 5.58 | 170 | 12 | male |
| 5.92 | 165 | 10 | male |
| 5.00 | 100 | 6 | female |
| 5.50 | 150 | 8 | female |
| 5.42 | 130 | 7 | female |
| 5.75 | 150 | 9 | female |
Since the classifier is a Gaussian naive Bayes model, the feature distributions conditioned on the classes (\(C = \text{male}\) or \(C = \text{female}\)) will be Gaussian distributions. Thus,

\[P(x_i \mid C) = \mathcal{N}(x_i;\, \mu_{i,C},\, \sigma_{i,C}),\]
where \(\mathcal{N}(x;\, \mu,\, \sigma)\) is the Gaussian probability density function with mean \(\mu\) and standard deviation \(\sigma\).
Using maximum likelihood estimates of the mean and the unbiased variance gives
| Gender | Height Mean | Height Variance | Weight Mean | Weight Variance | Shoe Size Mean | Shoe Size Variance |
|---|---|---|---|---|---|---|
| male | 5.855 | 0.035033 | 176.25 | 122.92 | 11.25 | 0.91667 |
| female | 5.4175 | 0.097225 | 132.5 | 558.33 | 7.5 | 1.6667 |
Plugging these into the proportional class probabilities conditioned on the height, weight, and shoe size data gives

\[P(\text{male})\,P(\text{height} \mid \text{male})\,P(\text{weight} \mid \text{male})\,P(\text{shoe size} \mid \text{male}) \approx 6.20 \times 10^{-9}\]
\[P(\text{female})\,P(\text{height} \mid \text{female})\,P(\text{weight} \mid \text{female})\,P(\text{shoe size} \mid \text{female}) \approx 5.38 \times 10^{-4}.\]
Thus, because \(5.38 \times 10^{-4} > 6.20 \times 10^{-9}\), the classifier predicts that the sample data came from a female.
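The computation above can be checked numerically. The sketch below evaluates the Gaussian densities at the sample values using the means and unbiased variances computed earlier, with equal priors \(P(\text{male}) = P(\text{female}) = 0.5\):

```python
import math

# Unnormalized class scores P(C) * N(height|C) * N(weight|C) * N(shoe|C)
# for the sample (6 ft, 130 lbs, 8 in).
def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

sample = (6.00, 130.0, 8.0)
params = {  # per-feature (mean, unbiased variance): height, weight, shoe size
    "male":   [(5.855, 0.035033), (176.25, 122.92), (11.25, 0.91667)],
    "female": [(5.4175, 0.097225), (132.5, 558.33), (7.5, 1.6667)],
}
prior = {"male": 0.5, "female": 0.5}

scores = {
    c: prior[c] * math.prod(gaussian_pdf(x, mu, var)
                            for x, (mu, var) in zip(sample, params[c]))
    for c in params
}
print(scores["male"], scores["female"])  # roughly 6.2e-9 vs 5.4e-4
print(max(scores, key=scores.get))       # female
```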
Since determining the most likely class for a data point consists of calculating the product of \(n + 1\) factors once per class, i.e. \(K\) times, the big O notation for the runtime complexity of classification is \(O(nK)\). This is very computationally efficient (see embarrassingly parallel) and gives naive Bayes its high scalability, since the runtime scales linearly in the number of features \(n\) and the number of classes \(K\). This is especially useful in the domain of very high-dimensional data, such as bag-of-words classifiers for large-vocabulary corpora or high-resolution image data, such as MRIs.
Naive Bayes learns the conditional probability distribution for each feature separately, so it is very efficient for learning to classify small datasets, especially in the case that \(m \ll n\), where \(m\) is the number of training samples in the dataset. This is often the case in medical imaging datasets, where a single MRI scan can have millions of features (pixels), but MRI scans are costly to obtain for a particular classification task and, thus, few and far between. As long as \(m\) is sufficiently large to accurately estimate the individual factors \(P(x_i \mid C_k)\) (e.g. the mean and variance when \(P(x_i \mid C_k)\) is modeled by a Gaussian distribution), the number of features can be any size.
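This per-feature learning can be sketched as a single pass over the data: Gaussian naive Bayes training is nothing more than per-class, per-feature mean and variance estimates, so its cost grows linearly in the number of features. The data below is the height/weight/shoe size table from the example; the `fit` helper is an illustrative sketch, not a library API:

```python
from statistics import mean, variance  # variance() is the unbiased estimator

def fit(X, y):
    """X: list of feature vectors, y: class labels; returns per-class stats."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        model[c] = {
            "prior": len(rows) / len(X),
            # one (mean, variance) pair per feature column
            "stats": [(mean(col), variance(col)) for col in zip(*rows)],
        }
    return model

X = [[6.00, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10],
     [5.00, 100, 6], [5.50, 150, 8], [5.42, 130, 7], [5.75, 150, 9]]
y = ["male"] * 4 + ["female"] * 4
model = fit(X, y)
print(model["male"]["stats"][0])  # (height mean, height variance) for males
```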