
Code Structure of Combining Classifiers via Majority Vote



Combining Classifiers via Majority Vote

Implementing a Simple Majority Vote Classifier

The algorithm we are going to implement will allow us to combine different classification algorithms, each associated with an individual weight for confidence. Our goal is to build a more robust meta-classifier that balances out the individual classifiers’ weaknesses on a particular dataset. In more precise mathematical terms, we can express the weighted majority vote in the following manner.

$\hat{y} = \text{arg} \ \text{max}_i \sum^m_{j=1} w_j \chi_A \big( C_j(x)=i \big) $


In this formula, $w_j$ is a weight associated with a base classifier, $C_j$, and $\hat{y}$ is the predicted class label of the ensemble. $A$ denotes the set of unique class labels, and $\chi_A$ is the characteristic (indicator) function, which returns 1 if the predicted class of the $j$th classifier matches $i$ ($C_j(x)=i$) and 0 otherwise. For equal weights, we can simplify the equation and express it as follows.

$\hat{y} = \text{mode} \big\{ C_1(x), C_2(x), \dots, C_m(x) \big\}$


Let’s assume we have three base classifiers to predict the class label of a given example. Two of three base classifiers predict class 0, and one predicts class 1. When we weigh the predictions of each base classifier equally, the majority vote predicts that the example belongs to class 0.

$C_1(x) \rightarrow 0,\ C_2(x) \rightarrow 0, \ C_3(x) \rightarrow 1$

$\hat{y} = \text{mode} \{ 0,0,1 \} = 0$


Let’s assign a weight of 0.6 to $C_3$ and let’s weight $C_1$ and $C_2$ by a coefficient of 0.2:

$\hat{y} = \text{arg} \ \text{max}_i \sum^m_{j=1} w_j \chi_A \big( C_j(x)=i \big)$

$= \text{arg} \ \text{max}_i[0.2 \times i_0 + 0.2 \times i_0 + 0.6 \times i_1] = 1$


More simply, since $3 \times 0.2 = 0.6$, we can say that the prediction made by $C_3$ has three times more weight than the predictions by $C_1$ or $C_2$. This can be expressed as follows.

$\hat{y} = \text{mode} \{ 0,0,1,1,1 \} = 1$


We can use NumPy’s argmax and bincount functions to translate this weighted majority vote into code:

>>> import numpy as np
>>> np.argmax(np.bincount([0, 0, 1], weights=[0.2, 0.2, 0.6]))
1
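
As a quick intermediate check, the weighted bincount itself shows the vote total collected by each class label before argmax picks the winner:

>>> np.bincount([0, 0, 1], weights=[0.2, 0.2, 0.6])
array([0.4, 0.6])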



If we would like to base the decision on the predicted class probabilities instead of class labels, we can write the weighted majority vote as follows.

$\hat{y} = \text{arg} \ \text{max}_i \sum^m_{j=1} w_j p_{ij}$


Let’s assume that we have a binary classification problem with class labels $i \in \{0, 1\}$ and an ensemble of three classifiers, $C_j\ (j \in \{1, 2, 3\})$. Let’s assume that the classifiers $C_j$ return the following class membership probabilities for a particular example, $x$:

$$ C_1(x) \rightarrow [0.9, 0.1],\ C_2(x) \rightarrow [0.8, 0.2], \ C_3(x) \rightarrow [0.4, 0.6] $$


Using the same weights as before (0.2, 0.2, and 0.6), we can calculate the individual class probabilities as follows.

$p(i_0 \vert x) = 0.2 \times 0.9 + 0.2 \times 0.8 + 0.6 \times 0.4 = 0.58$

$p(i_1 \vert x) = 0.2 \times 0.1 + 0.2 \times 0.2 + 0.6 \times 0.6 = 0.42$

$\hat{y} = \text{arg} \ \text{max}_i [p(i_0 \vert x), p(i_1 \vert x)] = 0$
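
To verify this in code, we can again lean on NumPy. The following snippet (reusing the three probability vectors and the weights from above) computes the weighted average and then picks the index of the largest value:

>>> ex = np.array([[0.9, 0.1],
...                [0.8, 0.2],
...                [0.4, 0.6]])
>>> p = np.average(ex, axis=0, weights=[0.2, 0.2, 0.6])
>>> p
array([0.58, 0.42])
>>> np.argmax(p)
0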




Putting everything together, let's now implement the MajorityVoteClassifier in Python.

Overview

The MajorityVoteClassifier is an ensemble classifier that aggregates predictions from multiple classifiers. It supports two voting strategies:

  • Hard Voting ('classlabel'): Each classifier votes for a class label, and the final prediction is the class with the majority of votes.
  • Soft Voting ('probability'): The class probabilities from each classifier are averaged, and the final prediction is the class with the highest average probability.


from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
from sklearn.base import clone
from sklearn.pipeline import _name_estimators
import numpy as np
import operator

class MajorityVoteClassifier(BaseEstimator, ClassifierMixin):
  """ A majority vote ensemble classifier
  
  Parameters
  ----------
  classifiers: array-like, shape = [n_classifiers]
  	Different classifiers for the ensemble
  	
  vote: str, {'classlabel', 'probability'}
  	Default: 'classlabel'
  	If 'classlabel' the prediction is based on the argmax of class labels.
  	Else if 'probability', the argmax of the sum of probabilities is used to predict
    the class label (recommended for calibrated classifiers)
  
  weights: array-like, shape = [n_classifiers]
  	Optional, default: None
    If a list of `int` or `float` values is provided, the classifiers are weighted by importance; uses uniform weights if `weights=None`.
  	
  """
  
  def __init__(self, classifiers, vote='classlabel', weights=None):
    self.classifiers = classifiers
    self.named_classifiers = {key: value for key, value in _name_estimators(classifiers)}
    self.vote = vote
    self.weights = weights
    
    
  def fit(self, X, y):
    """ Fit classifiers.
    
    Parameters
    ----------
    X: {array-like, sparse matrix}, shape = [n_examples, n_features]
    	Matrix of training examples.
    	
    y: array-like, shape = [n_examples]
    	Vector of target class labels.
    	
    
    Returns
    -------
    self: object
    
    """
    
    if self.vote not in ('probability', 'classlabel'):
      raise ValueError("vote must be 'probability' "
                       "or 'classlabel'; got (vote=%r)"
                       % self.vote)
    if self.weights and len(self.weights) != len(self.classifiers):
      raise ValueError("Number of classifiers and weights "
                       "must be equal; got %d weights, "
                       "%d classifiers"
                       % (len(self.weights), len(self.classifiers)))
     
    # Use LabelEncoder to ensure class labels start
    # with 0, which is important for np.argmax
    # call in self.predict
    
    self.lablenc_ = LabelEncoder()
    self.lablenc_.fit(y)
    self.classes_ = self.lablenc_.classes_
    self.classifiers_ = []
    for clf in self.classifiers:
      fitted_clf = clone(clf).fit(X, self.lablenc_.transform(y))
      self.classifiers_.append(fitted_clf)
    return self

Explanations

1. Package Explanations
  • BaseEstimator: Provides base methods like get_params and set_params for parameter tuning.
  • ClassifierMixin: Mixin class that adds a score method to classifiers.
  • LabelEncoder: Encodes target labels with values between 0 and n_classes - 1.
  • clone: Creates a deep copy of the estimator with the same parameters.
  • _name_estimators: Utility function to generate names for estimators in a pipeline.
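
For a concrete feel of what _name_estimators does, here is a small illustrative call (the two estimators are just placeholders); it lowercases each class name and would append an index if the same class appeared more than once:

>>> from sklearn.pipeline import _name_estimators
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.tree import DecisionTreeClassifier
>>> _name_estimators([LogisticRegression(), DecisionTreeClassifier()])
[('logisticregression', LogisticRegression()), ('decisiontreeclassifier', DecisionTreeClassifier())]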


2. Class MajorityVoteClassifier Parameters
  • classifiers: The base classifiers that make up the ensemble.
  • vote: Voting strategy—either 'classlabel' for hard voting or 'probability' for soft voting.
  • weights: Weights for each classifier; higher weights increase a classifier’s influence.



We utilized the BaseEstimator and ClassifierMixin parent classes to access basic functionality, which includes the get_params and set_params methods for setting and retrieving the classifier's parameters, as well as the score method for calculating prediction accuracy.

Furthermore, we will incorporate the predict method to predict the class label via a majority vote of the class labels when a new `MajorityVoteClassifier` object is initialized with `vote='classlabel'`. Alternatively, when the ensemble classifier is initialized with `vote='probability'`, it will predict the class label based on the class membership probabilities. Additionally, we will include a predict_proba method that returns the averaged probabilities, which is useful for computing the receiver operating characteristic area under the curve (ROC AUC).

  def predict(self, X):
    """ Predict class labels for X.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_examples, n_features]
        Matrix of training examples.

    Returns
    -------
    maj_vote : array-like, shape = [n_examples]
        Predicted class labels.

    """
    if self.vote == 'probability':
      maj_vote = np.argmax(self.predict_proba(X), axis=1)
    else:  # 'classlabel' vote
      # Collect results from clf.predict calls
      predictions = np.asarray([clf.predict(X)
                                for clf in self.classifiers_]).T
      maj_vote = np.apply_along_axis(
          lambda x: np.argmax(np.bincount(x, weights=self.weights)),
          axis=1, arr=predictions)
    maj_vote = self.lablenc_.inverse_transform(maj_vote)
    return maj_vote
                                   
  def predict_proba(self, X):
    """ Predict class probabilities for X.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_examples, n_features]
        Training vectors, where n_examples is the number of examples and
        n_features is the number of features.

    Returns
    -------
    avg_proba : array-like, shape = [n_examples, n_classes]
        Weighted average probability for each class per example.

    """
    probas = np.asarray([clf.predict_proba(X)
                         for clf in self.classifiers_])
    avg_proba = np.average(probas, axis=0, weights=self.weights)
    return avg_proba
                                   

  def get_params(self, deep=True):
    """ Get classifier parameter names for GridSearch"""
    if not deep:
      return super(MajorityVoteClassifier, self).get_params(deep=False)
    else:
      out = self.named_classifiers.copy()
      for name, step in self.named_classifiers.items():
        for key, value in step.get_params(deep=True).items():
          out['%s__%s' % (name, key)] = value
      return out

Explanations

1. predict Method

The method handles two voting mechanisms: soft voting (based on probabilities) and hard voting (based on class labels):

  • Soft Voting (self.vote == 'probability'):
    • Calls predict_proba() to get the class probabilities for each classifier.
    • Uses np.argmax to predict the class with the highest average probability for each example.
  • Hard Voting ('classlabel'):
    • Collects the predictions from each classifier using clf.predict().
    • Applies np.bincount() to count the votes for each class.
    • If weights are provided, each classifier’s vote is weighted according to its importance.
    • Uses np.argmax to determine the class with the most votes (a short standalone sketch of this step follows the list).
    • Finally, LabelEncoder is used to inverse-transform the predicted class labels back to their original format.
  • Returns: maj_vote: Array of predicted class labels, one for each input example
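
As a small standalone sketch of this hard-voting path (using hypothetical per-classifier predictions and the weights 0.2, 0.2, 0.6 from earlier), the bincount/argmax combination is applied row by row:

>>> predictions = np.array([[0, 0, 1],
...                         [1, 0, 1]])   # rows: examples, columns: classifiers
>>> np.apply_along_axis(lambda x: np.argmax(np.bincount(x, weights=[0.2, 0.2, 0.6])),
...                     axis=1, arr=predictions)
array([1, 1])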


2. predict_proba method
  • Collect Class Probabilities: The method calls clf.predict_proba() for each classifier to collect the predicted probabilities for all classes.
  • Compute Weighted Average: The np.average() function is used to compute the weighted average of the predicted probabilities across classifiers (a small shape example follows this list).
  • The result is a 2D array of probabilities, with one row for each example and one column for each class.
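
To make the shapes concrete, here is a tiny hand-built example (with the probabilities from the worked example earlier): probas stacks one [n_examples, n_classes] array per classifier, and averaging over axis 0 collapses the classifier dimension:

>>> probas = np.array([[[0.9, 0.1]],
...                    [[0.8, 0.2]],
...                    [[0.4, 0.6]]])    # shape: (3 classifiers, 1 example, 2 classes)
>>> np.average(probas, axis=0, weights=[0.2, 0.2, 0.6])
array([[0.58, 0.42]])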


3. get_params method
  • The get_params method is necessary for enabling GridSearchCV and similar hyperparameter optimization tools to access the parameters of each classifier within the ensemble.
  • Returns the dictionary of parameters, which is compatible with scikit-learn’s hyperparameter search utilities (like GridSearchCV).


It’s important to remember that we created a modified version of the get_params method to access the parameters of individual classifiers in the ensemble using the _name_estimators function. This might seem complicated initially, but it will make perfect sense when we implement a grid search for hyperparameter tuning in later sections.
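
As a brief usage sketch (the three base classifiers below are only illustrative placeholders, not necessarily the ones we will tune later), the nested parameter names generated by get_params are exactly the 'name__parameter' keys that GridSearchCV expects:

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> clf1 = LogisticRegression(random_state=1)
>>> clf2 = DecisionTreeClassifier(max_depth=1, random_state=0)
>>> clf3 = KNeighborsClassifier(n_neighbors=1)
>>> mv_clf = MajorityVoteClassifier(classifiers=[clf1, clf2, clf3], vote='probability')
>>> 'decisiontreeclassifier__max_depth' in mv_clf.get_params()
True
>>> 'logisticregression__C' in mv_clf.get_params()
True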




Side Note

<Class Membership Probabilities from Decision Trees>

In scikit-learn, the ROC AUC score is computed using the predict_proba method, if applicable. With decision trees, probabilities are calculated using a frequency vector created for each node during training. This vector gathers frequency values for each class label based on the distribution at that node, which is then normalized to add up to 1. Similarly, in the k-nearest neighbors algorithm, the class labels of the nearest neighbors are combined to provide normalized class label frequencies. While the normalized probabilities from decision trees and k-nearest neighbors may resemble those obtained from a logistic regression model, it's important to note that they are not derived from a probability mass function.
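
A minimal sketch of this behavior, using toy data deliberately constructed so that the tree ends up with an impure leaf, shows that the returned probabilities are just the normalized class frequencies at that leaf:

>>> from sklearn.tree import DecisionTreeClassifier
>>> X = [[0], [0], [0], [1], [1]]   # duplicate feature values force an impure leaf
>>> y = [0, 0, 1, 1, 1]
>>> tree = DecisionTreeClassifier().fit(X, y)
>>> tree.predict_proba([[0]])       # leaf holds labels [0, 0, 1] -> frequencies [2/3, 1/3]
array([[0.66666667, 0.33333333]])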



