Day26 ML Review - Perceptron (1)
Choosing a Classification Algorithm Step by Step - Training Perceptron (1)
(Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2, 3rd Edition)
Choosing the right classification algorithm for a particular problem involves practice and experience. Each algorithm has its distinct characteristics and is based on specific assumptions.
In practical terms, it’s always advisable to compare the performance of various learning algorithms to select the best model for the specific problem. The selection of algorithms may vary based on factors such as the number of features or examples, the level of noise in the dataset, and whether the classes are linearly separable or not.
Ultimately, the performance of a classifier, including computational performance and predictive power, heavily relies on the underlying data available for learning. The five main steps involved in training a supervised machine learning algorithm can be summarized as follows:
- Selecting features and collecting labeled training examples.
- Choosing a performance metric.
- Choosing a classifier and optimization algorithm.
- Evaluating the performance of the model.
- Tuning the algorithm.
Training a Perceptron Step by Step
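The split below assumes that X and y are already defined. A minimal sketch of that setup, assuming the Iris dataset bundled with scikit-learn (consistent with the class labels 0, 1, and 2 discussed further down) and its petal measurements as features:

```python
from sklearn import datasets
import numpy as np

# Assumed setup: load the Iris dataset shipped with scikit-learn and
# use petal length and petal width (columns 2 and 3) as the features.
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target           # integer class labels 0, 1, 2

print('Class labels:', np.unique(y))
```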
```python
from sklearn.model_selection import train_test_split

# 70/30 split with a fixed random seed, stratified by class label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
```
Using integer labels is recommended to avoid technical issues and improve computational performance due to a smaller memory footprint. Also, most machine-learning libraries follow the convention of encoding class labels as integers.
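As a small illustration of that convention (a side note, since the Iris labels above are already integers), scikit-learn's LabelEncoder maps arbitrary class labels to integers:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, used only to illustrate integer encoding;
# the Iris targets above are already 0, 1, 2.
labels = ['setosa', 'versicolor', 'virginica', 'setosa']
le = LabelEncoder()
y_encoded = le.fit_transform(labels)
print(y_encoded)    # [0 1 2 0]
print(le.classes_)  # ['setosa' 'versicolor' 'virginica']
```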
We will further split the dataset into training and test datasets to evaluate how well a trained model performs on unseen data.
Using the train_test_split function from scikit-learn's model_selection module with the test_size=0.3 parameter, we randomly split the X and y arrays into 30 percent test data and 70 percent training data.
Note that the train_test_split function already shuffles the dataset internally before splitting; otherwise, all examples from classes 0 and 1 would have ended up in the training dataset, and the test dataset would consist of 45 examples from class 2 only.
Via the random_state parameter, we provided a fixed random seed (random_state=1) for the internal pseudo-random number generator used for shuffling the dataset before splitting. Using such a fixed random_state ensures that our results are reproducible.
Lastly, we took advantage of the built-in support for stratification via stratify=y. In this context, stratification means that the train_test_split method returns training and test subsets that have the same class label proportions as the input dataset. We can use NumPy's bincount function, which counts the number of occurrences of each value in an array, to verify that this is indeed the case.
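A sketch of that check, assuming the Iris split above (150 examples, 50 per class, so a stratified 70/30 split should give 35 and 15 examples per class respectively):

```python
import numpy as np

# stratify=y preserves the class proportions in both subsets
print('Labels counts in y:', np.bincount(y))              # e.g. [50 50 50]
print('Labels counts in y_train:', np.bincount(y_train))  # e.g. [35 35 35]
print('Labels counts in y_test:', np.bincount(y_test))    # e.g. [15 15 15]
```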
Many machine learning and optimization algorithms also require feature scaling for optimal performance. We will standardize the features using the StandardScaler class from scikit-learn's preprocessing module.
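A minimal sketch of that step, continuing from the X_train/X_test split above:

```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)                      # estimate mean and standard deviation from the training data only
X_train_std = sc.transform(X_train)  # standardize the training features
X_test_std = sc.transform(X_test)    # reuse the same parameters for the test features
```

Fitting the scaler on the training data only and reusing those parameters for the test data keeps the two sets comparable and avoids leaking information from the test set.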