Scikit-learn (also known as sklearn) is a machine-learning library for the Python programming language. It provides classification, regression, and clustering algorithms, along with preprocessing tools and utilities for evaluating models. It is built on top of other popular Python libraries, such as NumPy, SciPy, and matplotlib, and is designed to work with the broader scientific Python ecosystem.
Scikit-learn in data preprocessing
Scikit-learn offers several tools for data preprocessing, such as:
- Missing value imputation: The SimpleImputer class provides basic strategies for imputing missing values, such as replacing them with the mean, median, or most frequent value of each feature.
# Simple imputer
from sklearn.impute import SimpleImputer
import numpy as np

# Create a sample dataset with missing values
X = np.array([[1, 2, np.nan],
              [3, np.nan, 4],
              [np.nan, 6, np.nan],
              [8, 8, 8]])
print(X)
[[ 1. 2. nan]
[ 3. nan 4.]
[nan 6. nan]
[ 8. 8. 8.]]
# Create an instance of the SimpleImputer class
imputer = SimpleImputer(strategy='mean')  # we can replace mean with median too

# Fit the imputer to the dataset
imputer.fit(X)

# Transform the dataset, replacing missing values with the mean
X_imputed = imputer.transform(X)
print(X_imputed)
[[1. 2. 6. ]
[3. 5.33333333 4. ]
[4. 6. 6. ]
[8. 8. 8. ]]
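To compare the strategies mentioned above, the following minimal sketch runs the same sample dataset through the mean, median, and most frequent strategies. It assumes the same array X as in the example; the `results` dictionary is a name introduced here only for illustration. It uses `fit_transform`, which combines the fit and transform steps, and prints the `statistics_` attribute, which holds the fill value learned for each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same sample dataset as above, with missing values marked as np.nan
X = np.array([[1, 2, np.nan],
              [3, np.nan, 4],
              [np.nan, 6, np.nan],
              [8, 8, 8]])

# 'results' is an illustrative name: strategy -> imputed array
results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # fit_transform learns the per-column fill values and applies them in one call
    results[strategy] = imputer.fit_transform(X)
    # statistics_ holds the value used to fill each column
    print(strategy, imputer.statistics_)
```

With such a small dataset every observed value in a column occurs only once, so the most frequent strategy is not very informative here; it tends to be more useful for categorical or repeated values.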