Scikit-learn, often abbreviated as sklearn, is an open-source machine learning library for Python. It is built on top of other Python libraries such as NumPy, SciPy, and matplotlib, making it a versatile and easy-to-use tool for machine learning tasks. Scikit-learn provides a wide range of machine learning algorithms and tools that facilitate tasks such as classification, regression, clustering, dimensionality reduction, and more.
Data Preprocessing
Level 1: Data Preprocessing
train_test_split: Splits a dataset into training and testing sets.
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
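A minimal, self-contained sketch of the split. The synthetic arrays, the random_state value, and the stratify argument are illustrative assumptions, not part of the snippet above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed toy data: 100 samples, 3 features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0, 1] * 50)

# random_state makes the shuffle reproducible; stratify=y keeps the
# class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```

Fixing random_state is useful when results need to be repeatable between runs.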
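StandardScaler: Standardizes features by removing the mean and scaling to unit variance. Fit the scaler on the training data only, then reuse it to transform the test data.
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from the training set
    X_test_scaled = scaler.transform(X_test)  # apply the same statistics to the test set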
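As a rough, runnable illustration of what standardization does, the sketch below uses made-up data (the means, scales, and array sizes are assumptions) and checks the result by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed toy data with a large offset and spread.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# After scaling, each training column has mean ~0 and standard deviation ~1.
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
```

The test set will generally not have exactly zero mean after transformation, because it is scaled with the training set's statistics.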
LabelEncoder: Encodes categorical target labels (y) as integers from 0 to n_classes - 1.
    from sklearn.preprocessing import LabelEncoder
    label_encoder = LabelEncoder()
    y_encoded = label_encoder.fit_transform(y)
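A small worked example with hypothetical string labels shows the mapping and how to reverse it:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, chosen only for illustration.
y = ["cat", "dog", "bird", "dog", "cat"]

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Classes are sorted, so bird -> 0, cat -> 1, dog -> 2.
print(label_encoder.classes_)
print(y_encoded)  # [1 2 0 2 1]

# inverse_transform maps the integers back to the original labels.
decoded = label_encoder.inverse_transform([0, 2])
```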
Regression
Level 2: Regression
LinearRegression: Fits an ordinary least squares linear regression model.
    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
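To see what the fitted model learns, here is a sketch on noise-free synthetic data where the true coefficients are known (the data-generating formula is an assumption made for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed toy data: y = 3*x0 - 2*x1 + 5, with no noise.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + 5.0

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# The fitted parameters should recover the formula above.
print(regressor.coef_)       # approximately [ 3. -2.]
print(regressor.intercept_)  # approximately 5.0
```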
RandomForestRegressor: Builds a random forest regression model.
    from sklearn.ensemble import RandomForestRegressor
    rf_regressor = RandomForestRegressor(n_estimators=100)
    rf_regressor.fit(X_train, y_train)
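The following sketch, on assumed synthetic data where only the first feature matters, shows how the fitted forest exposes feature importances:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Assumed toy data: the target depends on feature 0 only, plus small noise.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = 10.0 * X_train[:, 0] + rng.normal(scale=0.1, size=200)

# random_state fixes the bootstrap sampling so the run is repeatable.
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=0)
rf_regressor.fit(X_train, y_train)

# Importances are non-negative and sum to 1; feature 0 should dominate here.
importances = rf_regressor.feature_importances_
print(importances)
```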
cross_val_score: Evaluates an estimator with k-fold cross-validation and returns one score per fold. With cv=5 it uses 5 folds; for regressors the default score is R^2.
    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(regressor, X, y, cv=5)
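If a different metric is wanted, the scoring parameter can be set explicitly. The sketch below assumes synthetic data and uses "neg_mean_squared_error", which scikit-learn reports as a negative number so that higher is always better:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Assumed toy data: a linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100)

regressor = LinearRegression()
scores = cross_val_score(
    regressor, X, y, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to get the usual (positive) MSE per fold.
mse_per_fold = -scores
print(mse_per_fold.mean(), mse_per_fold.std())
```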
Classification
Level 2: Classification
LogisticRegression: Fits a logistic regression model for classification (binary or multiclass).
    from sklearn.linear_model import LogisticRegression
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
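Besides hard class predictions, a fitted logistic regression can return class probabilities. A minimal sketch, assuming a synthetic dataset from make_classification:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed synthetic binary classification data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)       # predicted class labels
proba = classifier.predict_proba(X_test)  # one column per class, rows sum to 1
print(proba[:3])
```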
SVM (Support Vector Machine): Trains a support vector machine classifier.
    from sklearn.svm import SVC
    svm_classifier = SVC(kernel='linear')
    svm_classifier.fit(X_train, y_train)
GridSearchCV: Performs an exhaustive, cross-validated grid search for hyperparameter tuning.
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    param_grid = {'C': [1, 10, 100], 'gamma': [0.1, 1, 10]}
    grid_search = GridSearchCV(SVC(), param_grid, cv=3)
    grid_search.fit(X_train, y_train)
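After fitting, the search object exposes the best combination it found. The sketch below reuses the same grid on an assumed synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumed synthetic data standing in for X_train / y_train.
X_train, y_train = make_classification(
    n_samples=120, n_features=4, random_state=0
)

# 3 values of C x 3 values of gamma = 9 candidates, each scored with 3 folds.
param_grid = {"C": [1, 10, 100], "gamma": [0.1, 1, 10]}
grid_search = GridSearchCV(SVC(), param_grid, cv=3)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)  # best C and gamma found
print(grid_search.best_score_)   # mean cross-validated accuracy of that pair

# By default the best model is refit on all of X_train and is ready to use.
best_model = grid_search.best_estimator_
```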
Model Evaluation
Level 3: Model Evaluation
accuracy_score: Computes accuracy for classification models.
    from sklearn.metrics import accuracy_score
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
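Accuracy is the fraction of predictions that match the true labels. A tiny hand-checkable example, with made-up label lists:

```python
from sklearn.metrics import accuracy_score

# Made-up labels: 4 of the 5 predictions match.
y_true = [0, 1, 1, 0, 1]
y_hat = [0, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_hat)
print(accuracy)  # 0.8
```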
mean_squared_error: Calculates mean squared error for regression models.
    from sklearn.metrics import mean_squared_error
    y_pred = regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
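The metric is the average of the squared differences between true and predicted values. The numbers below are made up so the result can be verified by hand:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values: errors are 0.5, -0.5, 0.0, -1.0.
y_true = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]

# (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)  # root mean squared error, in the target's own units
print(mse)  # 0.375
```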
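classification_report: Generates a text report of per-class precision, recall, F1-score, and support.
    from sklearn.metrics import classification_report
    y_pred = classifier.predict(X_test)  # predictions from the classifier, not the regressor
    report = classification_report(y_test, y_pred)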
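When the numbers are needed programmatically rather than as printed text, output_dict=True returns them as a nested dictionary. A sketch with made-up labels:

```python
from sklearn.metrics import classification_report

# Made-up labels: each class has 2 of its 3 samples predicted correctly.
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

print(classification_report(y_true, y_hat))  # human-readable table

# Same content as a dict; class labels become string keys ("0", "1").
report = classification_report(y_true, y_hat, output_dict=True)
print(report["0"]["precision"], report["1"]["recall"], report["accuracy"])
```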
Advanced Techniques
Level 4: Advanced Techniques
Ensemble Methods: Combine many base models, typically decision trees, into a single predictor for regression (e.g., RandomForestRegressor, GradientBoostingRegressor).
    from sklearn.ensemble import GradientBoostingRegressor
    ensemble = GradientBoostingRegressor(n_estimators=100)
    ensemble.fit(X_train, y_train)
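Different ensemble models can themselves be combined. One possible way, shown here as a sketch rather than as the approach intended above, is scikit-learn's VotingRegressor, which averages the predictions of its members. The data and the estimator names "rf" and "gb" are assumptions:

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)

# Assumed toy data with a nonlinear term.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 3))
y_train = (
    X_train[:, 0] ** 2
    + 2.0 * X_train[:, 1]
    + rng.normal(scale=0.1, size=150)
)

# VotingRegressor fits a clone of each member and averages their predictions.
ensemble = VotingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ]
)
ensemble.fit(X_train, y_train)
y_pred = ensemble.predict(X_train[:5])
print(y_pred)
```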
Neural Network: Uses scikit-learn's MLPClassifier for classification and MLPRegressor for regression.
    from sklearn.neural_network import MLPClassifier, MLPRegressor
    classifier = MLPClassifier(hidden_layer_sizes=(100, 50))
    classifier.fit(X_train, y_train)
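The regression counterpart works the same way. The sketch below assumes synthetic data and smaller hidden layers than the snippet above, and scales the inputs first because multilayer perceptrons tend to train better on standardized features:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Assumed toy data with a smooth nonlinear target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = np.sin(X_train[:, 0]) + 0.5 * X_train[:, 1]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Two hidden layers (32 and 16 units); max_iter and random_state are
# illustrative choices, not recommended defaults.
mlp_regressor = MLPRegressor(
    hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0
)
mlp_regressor.fit(X_scaled, y_train)
y_pred = mlp_regressor.predict(X_scaled[:5])
print(y_pred)
```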
Pipeline: Chains preprocessing steps and a final estimator into a single object that is fit and used like any other estimator.
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier
    pipe = Pipeline([('scaler', StandardScaler()), ('classifier', RandomForestClassifier())])
    pipe.fit(X_train, y_train)
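One practical benefit of a pipeline is that it can be cross-validated as a unit, so the scaler is refit inside each fold instead of seeing the held-out data. A minimal sketch on an assumed synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed synthetic classification data.
X, y = make_classification(n_samples=150, n_features=5, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Each fold fits the scaler and the classifier on that fold's training part only.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

# Individual steps stay accessible by name after fitting.
pipe.fit(X, y)
fitted_scaler = pipe.named_steps["scaler"]
```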
These definitions and examples cover a range of functions in scikit-learn for data preprocessing, regression, classification, model evaluation, and more. They are divided into different levels of complexity to provide a structured learning path for users.