When a machine learning model learns from data, it often performs well on the training data but underperforms on unseen or test data. This is known as overfitting: the model fits the training data too closely and fails to generalize. Underfitting, by contrast, occurs when the model performs poorly even on the training data.
Cross-validation is one of the techniques that helps ensure a machine learning model generalizes well to unseen data. Before getting started, make sure you have the following:
- Basic Knowledge of Machine Learning – Understanding model training, evaluation metrics, and overfitting.
- Python Programming Skills – Familiarity with Python and libraries like `scikit-learn`, `numpy`, and `pandas`.
- Dataset Preparation – A cleaned and preprocessed dataset ready for model training.
- Scikit-Learn Installed – Install it using `pip install scikit-learn` if not already available.
- Understanding of Model Performance Metrics – Knowledge of accuracy, precision, recall, RMSE, etc., depending on the task.
Cross-validation helps in selecting the best model and hyperparameters while preventing overfitting.
In this guide, we'll explore how to implement K-Fold Cross-Validation in Python using `scikit-learn`.
K-Fold Cross-Validation is a resampling technique used to evaluate machine learning models by splitting the dataset into K equal-sized folds. The model is trained on K-1 folds and validated on the remaining fold, repeating the process K times. The final performance score is the average of all iterations.
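To make that procedure concrete, here is a minimal sketch of the K-Fold loop written by hand. The synthetic dataset and the logistic regression model are illustrative stand-ins, not part of the Titanic workflow used later in this tutorial:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Illustrative synthetic data (the tutorial itself uses the Titanic dataset later)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in kf_demo.split(X_demo):
    # Train on K-1 folds, validate on the held-out fold
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    preds = clf.predict(X_demo[val_idx])
    fold_scores.append(accuracy_score(y_demo[val_idx], preds))

# The final score is the average across all K iterations
print(f'Fold scores: {fold_scores}')
print(f'Mean accuracy: {np.mean(fold_scores):.4f}')
```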
| Aspect | K-Fold Cross-Validation | Train-Test Split |
| --- | --- | --- |
| Data Utilization | Data is divided into multiple folds, so each data point is part of both the training and validation sets across different iterations. | Divides the data into fixed portions for training and testing. |
| Bias-Variance Tradeoff | Reduces variance because the model is trained and validated multiple times on different subsets, yielding a better bias-variance tradeoff. | A single split can produce high variance: the model may fit the training data closely yet fail to generalize to the test data (illustrated in the sketch after this table). |
| Overfitting Risk | Low risk of overfitting, as the model is tested across different folds. | Higher risk of overfitting if the train-test split is not representative. |
| Performance Evaluation | Provides a more reliable and generalized performance estimate. | Performance depends on a single train-test split, which may be biased. |
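To see the variance issue from the table in practice, you can score several single train-test splits and compare the spread against a K-Fold average. This is an illustrative sketch on synthetic data, not part of the tutorial's dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Accuracy of five single train-test splits with different random seeds
split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=seed)
    split_scores.append(clf.fit(X_tr, y_tr).score(X_te, y_te))
print(f'Single-split accuracies: {np.round(split_scores, 3)}')  # spread reflects split luck

# K-Fold averages over all folds, giving a more stable estimate
cv_scores = cross_val_score(clf, X_demo, y_demo, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(f'K-Fold mean accuracy: {cv_scores.mean():.3f}')
```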
Let's implement K-Fold Cross-Validation using `scikit-learn`.
First, we import the necessary libraries.
```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn import linear_model, tree, ensemble
```
For this demo, we will use the Titanic dataset, a famous dataset that is well suited for illustrating k-fold cross-validation.
```python
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head(3))
print(df.info())
```
```
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
```
Before building any model, it is good practice to perform data preprocessing and feature engineering.
```python
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]  # Select relevant features
df.dropna(inplace=True)  # Remove rows with missing values

# Encode the categorical 'Sex' column as integers
label_encoder = LabelEncoder()
df['Sex'] = label_encoder.fit_transform(df['Sex'])

# Split features and target
X = df.drop(columns=['Survived'])
y = df['Survived']

df.shape
```

```
(714, 7)
```
```python
kf = KFold(n_splits=5, shuffle=True, random_state=42)
```
Here, we specify `n_splits=5`, meaning the data is divided into five folds. Setting `shuffle=True` shuffles the data before splitting, and `random_state=42` makes the shuffling reproducible.
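You can inspect what the splitter produces before using it for scoring. This small sketch (assuming the `X` feature matrix defined above) prints the size of each train/validation split:

```python
# Inspect the five train/validation splits produced by kf
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    print(f'Fold {fold}: {len(train_idx)} training rows, {len(val_idx)} validation rows')
```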
```python
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

print(f'Cross-validation accuracy scores: {scores}')
print(f'Average Accuracy: {np.mean(scores):.4f}')
```
```
Cross-validation accuracy scores: [0.77622378 0.8041958  0.79020979 0.88111888 0.80985915]
Average Accuracy: 0.8123
```
For comparison, let's evaluate a decision tree classifier on the same folds:

```python
score = cross_val_score(tree.DecisionTreeClassifier(random_state=42), X, y, cv=kf, scoring="accuracy")

print(f'Scores for each fold are: {score}')
print(f'Average score: {score.mean():.2f}')
```
```
Scores for each fold are: [0.72727273 0.79020979 0.76923077 0.81818182 0.8028169 ]
Average score: 0.78
```
For datasets with imbalanced classes, Stratified K-Fold ensures each fold preserves the same class distribution as the full dataset, which makes it the preferred choice for imbalanced classification problems.
```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f'Average Accuracy (Stratified K-Fold): {np.mean(scores):.4f}')
```
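To confirm the stratification, you can compare the class balance in each validation fold against the overall survival rate. A quick sketch using the `skf`, `X`, and `y` objects defined above:

```python
# Overall positive-class proportion vs. the proportion in each validation fold
print(f'Overall survival rate: {y.mean():.3f}')
for fold, (_, val_idx) in enumerate(skf.split(X, y), start=1):
    print(f'Fold {fold} survival rate: {y.iloc[val_idx].mean():.3f}')
```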
Repeated K-Fold runs K-Fold multiple times with different random splits, further reducing the variance of the performance estimate. It is most practical when the dataset is small or the model is cheap to train, such as logistic regression.
```python
from sklearn.model_selection import RepeatedKFold

rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=rkf, scoring='accuracy')
print(f'Average Accuracy (Repeated K-Fold): {np.mean(scores):.4f}')
```
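Note that `scores` now holds one value per fold per repetition. A quick check, assuming the `rkf` setup above:

```python
# 5 folds x 10 repeats = 50 individual scores
print(len(scores))                          # 50
print(f'Std of scores: {np.std(scores):.4f}')
```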
Nested K-Fold performs hyperparameter tuning in an inner loop while evaluating generalization performance in an outer loop, which prevents the tuning process from producing overly optimistic estimates.
```python
from sklearn.model_selection import GridSearchCV, cross_val_score

# Inner loop: grid search tunes hyperparameters on each training split
param_grid = {'n_estimators': [50, 100, 150], 'max_depth': [None, 10, 20]}
gs = GridSearchCV(model, param_grid, cv=5)

# Outer loop: estimates generalization performance of the tuned pipeline
scores = cross_val_score(gs, X, y, cv=5)
print(f'Average Accuracy (Nested K-Fold): {np.mean(scores):.4f}')
```
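The outer `cross_val_score` only reports generalization performance. If you also want the tuned hyperparameters for a final model, a common follow-up step is to fit the grid search on the full dataset afterwards, sketched here:

```python
# Fit the grid search on all data to obtain a final tuned model
gs.fit(X, y)
print(f'Best hyperparameters: {gs.best_params_}')
print(f'Best inner-CV score: {gs.best_score_:.4f}')
```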
If your dataset has groups (e.g., multiple images from the same patient), Group K-Fold ensures samples from the same group are not split across training and validation, which is useful for hierarchical data.
```python
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
# Random group labels for demonstration only; in practice, use real group IDs
# (e.g., a patient ID shared by all of that patient's samples)
groups = np.random.randint(0, 5, size=len(y))
scores = cross_val_score(model, X, y, cv=gkf, groups=groups, scoring='accuracy')
print(f'Average Accuracy (Group K-Fold): {np.mean(scores):.4f}')
```
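Because the `groups` array above is randomly generated just for demonstration, it is worth verifying that no group leaks across a split. A minimal verification sketch using the objects defined above:

```python
# Confirm that training and validation folds never share a group
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    shared = set(groups[train_idx]) & set(groups[val_idx])
    print(f'Shared groups between train and validation: {shared or "none"}')
```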
To perform K-Fold Cross-Validation in Python, use `cross_val_score()` from `scikit-learn` with a `KFold` object as the `cv` parameter.
K-Fold randomly splits data, whereas Stratified K-Fold maintains class balance in each fold.
The `KFold` class in Python divides the dataset into `n_splits` folds for training and validation.
To ensure that a machine learning model performs well on unseen data, cross-validation is a crucial step in making the model reliable. K-Fold cross-validation is one of the best ways to keep a model from overfitting the training data, thereby maintaining the bias-variance tradeoff. Dividing the data into folds and iteratively training and validating the model on each of them provides a better estimate of how the model will perform on an unknown dataset.
In Python, implementing K-Fold Cross-Validation is straightforward using libraries like `scikit-learn`, which offers `KFold` as well as `StratifiedKFold` for handling imbalanced datasets. Integrating K-Fold Cross-Validation into your workflow allows you to fine-tune hyperparameters effectively, compare models with confidence, and enhance generalization for real-world applications.
Whether you are building regression, classification, or deep learning models, this validation approach is a key component of any machine learning pipeline.