Machine_Learning_Scikit-learn
Scikit-learn
Scikit-learn (imported as sklearn) is a machine-learning library for the Python programming language. It provides classification, regression, and clustering algorithms, along with pre-processing tools and utilities for evaluating models. It is built on top of other popular Python libraries, such as NumPy, SciPy, and matplotlib, and is designed to work with the broader scientific Python ecosystem.
Link: https://scikit-learn.org/stable/
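As a rough illustration of the classification and evaluation features mentioned above, here is a minimal sketch. The choice of the bundled iris dataset, the LogisticRegression model, and the 75/25 split are assumptions made for this example only; they are not prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset (assumed here purely for illustration)
X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a classifier on the training split
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out split
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```

Most scikit-learn estimators follow this same pattern: create the object, call fit on training data, then call predict or transform on new data.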
Scikit-learn in data preprocessing
Scikit-learn offers a number of tools and methods for data pre-processing, such as:
- Missing value imputation: The SimpleImputer class provides basic strategies for imputing missing values, replacing each missing entry with the mean, median, or most frequent value of its feature (column), or with a constant.
Missing values
Simple imputer
# Simple imputer
from sklearn.impute import SimpleImputer
import numpy as np
# Create a sample dataset with missing values
X = np.array([[1, 2, np.nan], [3, np.nan, 4], [np.nan, 6, np.nan], [8, 8, 8]])
print(X)
output
[[ 1.  2. nan]
 [ 3. nan  4.]
 [nan  6. nan]
 [ 8.  8.  8.]]
# Create an instance of the SimpleImputer class
imputer = SimpleImputer(strategy='mean')  # strategy can also be 'median', 'most_frequent', or 'constant'
# Fit the imputer to the dataset (learns the mean of each column, ignoring NaN)
imputer.fit(X)
# Transform the dataset, replacing each missing value with its column mean
X_imputed = imputer.transform(X)
print(X_imputed)
output
[[1.         2.         6.        ]
 [3.         5.33333333 4.        ]
 [4.         6.         6.        ]
 [8.         8.         8.        ]]
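The other strategies can be tried on the same array. The following is a minimal sketch, not a recommendation of one strategy over another; the variable names (median_imputer, mode_imputer, X_median, X_mode) are chosen for this example, and fit_transform is used as a shorthand for calling fit and then transform on the same data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same sample dataset as above
X = np.array([[1, 2, np.nan], [3, np.nan, 4], [np.nan, 6, np.nan], [8, 8, 8]])

# Median imputation: each NaN is replaced by the median of its column
median_imputer = SimpleImputer(strategy='median')
X_median = median_imputer.fit_transform(X)

# Most-frequent imputation: each NaN is replaced by the mode of its column
mode_imputer = SimpleImputer(strategy='most_frequent')
X_mode = mode_imputer.fit_transform(X)

# The learned per-column fill values are stored in statistics_
print(median_imputer.statistics_)  # per-column medians: [3. 6. 6.]
print(X_median)
print(X_mode)
```

The median is less sensitive to outliers than the mean, while most_frequent is the usual choice for categorical features.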