Ensembles of classifiers
In machine learning, combining classifiers has been proposed as a way to improve on the performance of individual classifiers. The component classifiers can be based on a variety of classification methodologies and can achieve different rates of correctly classified instances. The goal of algorithms that integrate classification results is to produce more certain, precise and accurate system results. Dietterich (2001) gives an accessible, informal account, from statistical, computational and representational viewpoints, of why ensembles can improve results.
Methods
Numerous methods have been suggested for creating ensembles of classifiers. They fall into three main categories:
- Using different subsets of the training data with a single learning method
- Using different training parameters with a single training method (e.g. using different initial weights for each neural network in an ensemble)
- Using different learning methods
Weaknesses
- Increased storage
- Increased computation
- Decreased comprehensibility
The first weakness, increased storage, is a direct consequence of the requirement that all component classifiers, rather than a single classifier, be stored after training. The total storage depends on the size of each component classifier and on the size of the ensemble (the number of classifiers in it). The second weakness is increased computation: to classify an input query, all component classifiers (instead of a single classifier) must be evaluated, which requires more execution time. The last weakness is decreased comprehensibility: with multiple classifiers involved in decision-making, it is more difficult for users to follow the underlying reasoning process that leads to a decision.
Bagging
Bagging is a method of the first category (Breiman, 1996). Given a training set of size t, one draws t random instances from it with replacement (i.e. using a uniform distribution) and trains a classifier on them; this process is repeated several times. Because the draw is with replacement, each sample usually contains some duplicates and omits some instances of the original training set. Each cycle through the process produces one classifier. After several classifiers have been constructed, the final prediction is made by taking a vote of their predictions.
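The bagging procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not Breiman's implementation: the base learner (`train_stump`, a one-dimensional threshold classifier on labels 0 and 1) and all function names are assumptions made for the example, and any learning method could take its place.

```python
import random
from collections import Counter

def stump_predict(stump, x):
    """Predict 1 when x lies on the 'positive' side of the threshold, else 0."""
    threshold, polarity = stump
    return 1 if polarity * (x - threshold) > 0 else 0

def train_stump(sample):
    """Hypothetical base learner: pick the threshold/polarity with fewest errors.

    sample is a list of (x, label) pairs with label in {0, 1}.
    """
    xs = sorted({x for x, _ in sample})
    # Candidate thresholds: one below the minimum, plus midpoints between values.
    candidates = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for threshold in candidates:
        for polarity in (1, -1):
            stump = (threshold, polarity)
            errors = sum(1 for x, y in sample if stump_predict(stump, x) != y)
            if best is None or errors < best[0]:
                best = (errors, stump)
    return best[1]

def bagging(train, n_classifiers, rng):
    """Train one classifier per bootstrap sample of the training set."""
    t = len(train)
    ensemble = []
    for _ in range(n_classifiers):
        # t draws with replacement, each instance equally likely.
        bootstrap = [rng.choice(train) for _ in range(t)]
        ensemble.append(train_stump(bootstrap))
    return ensemble

def vote(ensemble, x):
    """Final prediction: the label predicted by the most classifiers."""
    counts = Counter(stump_predict(stump, x) for stump in ensemble)
    return counts.most_common(1)[0][0]
```

A call such as `bagging(train, 11, random.Random(0))` returns eleven stumps, and `vote` combines them. Because each instance is omitted from a given bootstrap sample with probability (1 - 1/t)^t, which approaches 1/e (about 0.37) for large t, the component classifiers see noticeably different data.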
Boosting
Another method of the first category is boosting. AdaBoost is a practical version of the boosting approach (Freund and Schapire, 1996). Boosting is similar in overall structure to bagging, except that it keeps track of the performance of the learning algorithm and forces it to concentrate on instances that have not been learned correctly. Instead of choosing the t training instances randomly from a uniform distribution, it chooses them so as to favour the instances that have not yet been learned accurately. After several cycles, the prediction is made by taking a weighted vote of the predictions of each classifier, with each classifier's vote weighted according to its accuracy on its training set.
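The boosting loop can be sketched as follows. This is a minimal sketch under stated assumptions rather than the algorithm exactly as published: it passes the instance weights directly to the base learner instead of resampling t instances according to them, it uses labels -1 and +1, and it weights each vote by the log-odds of the classifier's weighted accuracy (the usual AdaBoost choice, which grows with accuracy). The base learner `train_weighted_stump` and all names are hypothetical.

```python
import math

def stump_predict(stump, x):
    """Return polarity (+1 or -1) when x is above the threshold, else its negation."""
    threshold, polarity = stump
    return polarity if x > threshold else -polarity

def train_weighted_stump(data, weights):
    """Hypothetical base learner: threshold classifier with least weighted error."""
    xs = sorted({x for x, _ in data})
    candidates = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for threshold in candidates:
        for polarity in (1, -1):
            stump = (threshold, polarity)
            err = sum(w for (x, y), w in zip(data, weights)
                      if stump_predict(stump, x) != y)
            if best is None or err < best[0]:
                best = (err, stump)
    return best[1], best[0]

def adaboost(data, n_rounds):
    """data is a list of (x, label) pairs with label in {-1, +1}."""
    n = len(data)
    weights = [1.0 / n] * n          # start from a uniform distribution
    ensemble = []                    # list of (vote weight, stump)
    for _ in range(n_rounds):
        stump, err = train_weighted_stump(data, weights)
        if err >= 0.5:
            break                    # no better than chance: stop
        perfect = err == 0.0
        err = max(err, 1e-10)        # guard against division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        if perfect:
            break                    # nothing left to concentrate on
        # Raise the weight of misclassified instances, lower the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all component classifiers."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score > 0 else -1
```

The reweighting step is what distinguishes this loop from bagging: after each round, instances the latest classifier got wrong carry more weight, so the next classifier is pushed toward them. On data that no single threshold can separate, such as a positive interval surrounded by negatives, a few rounds can combine into a correct classifier where one stump cannot.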
Boosting algorithms are considered stronger than bagging on noise-free data. However, there are strong empirical indications that bagging is much more robust than boosting in noisy settings. For this reason, Kotsiantis and Pintelas (2004) built an ensemble that combines bagging and boosting ensembles through a voting methodology, and reported better classification accuracy.
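One way to picture combining a bagging ensemble and a boosting ensemble by voting is sketched below. This is an illustrative assumption, not the exact scheme of Kotsiantis and Pintelas (2004): each sub-ensemble reports the share of its (possibly weighted) votes per label, the shares are summed, and the label with the largest total wins. Classifiers are modelled as plain callables, and every name here is hypothetical.

```python
from collections import defaultdict

def vote_fractions(classifiers, x, weights=None):
    """Return {label: share of the sub-ensemble's (weighted) votes} for input x."""
    if weights is None:
        weights = [1.0] * len(classifiers)   # unweighted vote, as in bagging
    totals = defaultdict(float)
    for clf, w in zip(classifiers, weights):
        totals[clf(x)] += w
    norm = sum(totals.values())
    return {label: v / norm for label, v in totals.items()}

def combined_predict(bagged, boosted, boost_weights, x):
    """Sum the vote shares of both sub-ensembles and return the top label."""
    scores = defaultdict(float)
    for fractions in (vote_fractions(bagged, x),
                      vote_fractions(boosted, x, boost_weights)):
        for label, share in fractions.items():
            scores[label] += share
    return max(scores, key=scores.get)
```

Normalising each sub-ensemble's votes to shares before summing gives the two sub-ensembles equal say regardless of how many classifiers each contains; other weightings are equally possible.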
Ensemble Size
Although the number of component classifiers in an ensemble has a great impact on prediction accuracy, only a limited number of studies address this problem. The need to determine the ensemble size a priori, together with the volume and velocity of big data streams, makes this even more crucial for online ensemble classifiers. Statistical tests have mostly been used to determine the proper number of components. More recently, a theoretical framework suggested that there is an ideal number of component classifiers for an ensemble, such that having more or fewer classifiers than this number would deteriorate the accuracy. This is called “the law of diminishing returns in ensemble construction.” The framework shows that using the same number of independent component classifiers as class labels gives the highest accuracy.[1]
References
- [1] Bonab, Hamed R.; Can, Fazli (2016). A Theoretical Framework on the Ideal Number of Classifiers for Online Ensembles in Data Streams. CIKM. USA: ACM. p. 2053.
- Breiman L. (1996): Bagging Predictors. Machine Learning, 24(3), 123–140. Kluwer Academic Publishers.
- Dietterich, T.G. (2001): Ensemble methods in machine learning. In Kittler, J., Roli, F., eds.: Multiple Classifier Systems. LNCS Vol. 1857, Springer (2001) 1–15
- Yoav Freund and Robert E. Schapire, Experiments with a New Boosting Algorithm, Proceedings: ICML’96, p. 148-156, 1996
- S. Kotsiantis, P. Pintelas, Combining Bagging and Boosting, International Journal of Computational Intelligence, Vol. 1, No. 4 (324-333), 2004.
- Josef Kittler; Robert P.W. Duin; et al. (1998). "On combining classifiers". IEEE TPAMI. IEEE. 20 (3): 226–239. doi:10.1109/34.667881.