Scikit-learn

Scikit-learn (imported as sklearn) is a machine-learning library for Python. It provides classification, regression, and clustering algorithms, along with preprocessing tools and utilities for evaluating models. It is built on NumPy, SciPy, and matplotlib, and is designed to integrate with the broader scientific Python ecosystem.

Link: https://scikit-learn.org/stable/

Scikit-learn in data preprocessing

Scikit-learn offers several tools for data preprocessing, such as:

  • Missing value imputation: The SimpleImputer class provides basic strategies for imputing missing values, using the mean, median, or most frequent value of each feature.

Missing values

Simple imputer

				
# Simple imputer
from sklearn.impute import SimpleImputer
import numpy as np

# Create a sample dataset with missing values
X = np.array([[1, 2, np.nan], [3, np.nan, 4], [np.nan, 6, np.nan], [8, 8, 8]])

print(X)

output

[[ 1.  2. nan]
 [ 3. nan  4.]
 [nan  6. nan]
 [ 8.  8.  8.]]

				
# Create an instance of the SimpleImputer class
imputer = SimpleImputer(strategy='mean')  # 'median' or 'most_frequent' also work

# Fit the imputer to the dataset (learns the mean of each column)
imputer.fit(X)

# Transform the dataset, replacing missing values with each column's mean
X_imputed = imputer.transform(X)

print(X_imputed)

output

[[1.         2.         6.        ]
 [3.         5.33333333 4.        ]
 [4.         6.         6.        ]
 [8.         8.         8.        ]]
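The other strategies mentioned earlier can be tried the same way. The following is a minimal sketch, assuming the same sample dataset as above. It uses fit_transform to fit and transform in one call, and reads the fitted statistics_ attribute to see which fill value was learned for each column. The variable names are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same sample dataset with missing values as in the example above
X = np.array([[1, 2, np.nan], [3, np.nan, 4], [np.nan, 6, np.nan], [8, 8, 8]])

# fit_transform learns the per-column fill values and applies them in one call
median_imputer = SimpleImputer(strategy='median')
X_median = median_imputer.fit_transform(X)

most_frequent_imputer = SimpleImputer(strategy='most_frequent')
X_most_frequent = most_frequent_imputer.fit_transform(X)

# statistics_ holds the fill value learned for each column
print(median_imputer.statistics_)
print(most_frequent_imputer.statistics_)

print(X_median)
print(X_most_frequent)
```

With this dataset the column medians are 3, 6, and 6. Every observed value occurs only once per column, so the most_frequent result depends on how ties are broken. The scikit-learn documentation describes returning the smallest of the tied values, which makes this strategy a poor fit for continuous data like this.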
