Multiple Linear Regression is a fundamental statistical technique used to model the relationship between one dependent variable and multiple independent variables. In Python, libraries such as scikit-learn and statsmodels provide robust implementations for regression analysis. This tutorial walks you through implementing, interpreting, and evaluating multiple linear regression models in Python.
Before diving into the implementation, ensure you have the following packages installed: numpy, pandas, matplotlib, seaborn, scikit-learn, and statsmodels.
Multiple Linear Regression (MLR) is a statistical method that models the relationship between a dependent variable and two or more independent variables. It is an extension of simple linear regression, which models the relationship between a dependent variable and a single independent variable. In MLR, the relationship is modeled using the formula:
y = β0 + β1x1 + β2x2 + … + βnxn + ε
Where:
y is the dependent variable (the value to be predicted).
x1, x2, …, xn are the independent variables (the predictors).
β0 is the intercept, and β1, β2, …, βn are the coefficients that quantify each predictor's effect on y.
ε is the error term, capturing variation not explained by the predictors.
Example: predicting the price of a house based on its size, number of bedrooms, and location. Here there are three independent variables (size, number of bedrooms, and location) and one dependent variable (price), which is the value to be predicted.
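As a minimal numeric sketch of this formula (the coefficients and inputs below are made up purely for illustration, not fitted to any data), the prediction is just a weighted sum of the inputs plus an intercept:
import numpy as np
# Hypothetical coefficients, for illustration only
beta_0 = 50.0                          # intercept
betas = np.array([0.3, 10.0, 5.0])     # weights for size, bedrooms, location score
x = np.array([1200, 3, 8])             # one house: 1200 sq ft, 3 bedrooms, location score 8
# y_hat = beta_0 + beta_1*x1 + beta_2*x2 + beta_3*x3
y_hat = beta_0 + betas @ x
print(y_hat)  # 50 + 0.3*1200 + 10*3 + 5*8 = 480.0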
Before implementing multiple linear regression, it is essential to ensure that the following assumptions are met:
Linearity: The relationship between the dependent variable and independent variables is linear.
Independence of Errors: Residuals (errors) are independent of each other. This is often verified using the Durbin-Watson test.
Homoscedasticity: The variance of residuals is constant across all levels of the independent variables. A residual plot can help verify this.
No Multicollinearity: Independent variables are not highly correlated. Variance Inflation Factor (VIF) is commonly used to detect multicollinearity.
Normality of Residuals: Residuals should follow a normal distribution. This can be checked using a Q-Q plot.
Outlier Influence: Outliers or high-leverage points should not disproportionately influence the model.
These assumptions ensure that the regression model is valid and the results are reliable. Failing to meet these assumptions may lead to biased or misleading results.
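As a rough, self-contained sketch of how a couple of these checks look in practice (using synthetic data purely for illustration), you can plot residuals against fitted values to assess homoscedasticity and draw a Q-Q plot to assess normality:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Synthetic data purely for illustration: two predictors and a linear target with noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 1.5 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=200)

# Fit an ordinary least squares model and inspect its residuals
results = sm.OLS(y_demo, sm.add_constant(X_demo)).fit()

# Homoscedasticity check: residuals vs fitted values should show no clear pattern
plt.scatter(results.fittedvalues, results.resid, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.show()

# Normality check: points on the Q-Q plot should track the reference line
sm.qqplot(results.resid, line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()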
In this section, you will learn to use the Multiple Linear Regression model in Python to predict house prices based on features from the California Housing Dataset. You’ll learn how to preprocess data, fit a regression model, and evaluate its performance while addressing common challenges like multicollinearity, outliers, and feature selection.
You will use the California Housing Dataset, a popular dataset for regression tasks. It contains eight features describing California census block groups, along with the corresponding median house value for each block group.
First, let’s install the necessary packages:
pip install numpy pandas matplotlib seaborn scikit-learn statsmodels
from sklearn.datasets import fetch_california_housing # Import the fetch_california_housing function from sklearn.datasets to load the California Housing dataset.
import pandas as pd # Import pandas for data manipulation and analysis.
import numpy as np # Import numpy for numerical computing.
# Load the California Housing dataset using the fetch_california_housing function.
housing = fetch_california_housing()
# Convert the dataset's data into a pandas DataFrame, using the feature names as column headers.
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
# Add the target variable 'MedHouseValue' to the DataFrame, using the dataset's target values.
housing_df['MedHouseValue'] = housing.target
# Display the first few rows of the DataFrame to get an overview of the dataset.
print(housing_df.head())
You should see the following output, showing the first few rows of the dataset:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseValue
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422
Here is what each of the attributes means:
Variable | Description |
---|---|
MedInc | Median income in block |
HouseAge | Median house age in block |
AveRooms | Average number of rooms |
AveBedrms | Average number of bedrooms |
Population | Block population |
AveOccup | Average house occupancy |
Latitude | House block latitude |
Longitude | House block longitude |
Next, confirm that there are no missing values in the dataset, since missing data can affect the analysis.
print(housing_df.isnull().sum())
Output:
MedInc 0
HouseAge 0
AveRooms 0
AveBedrms 0
Population 0
AveOccup 0
Latitude 0
Longitude 0
MedHouseValue 0
dtype: int64
Let’s first create a correlation matrix to understand the dependencies between the variables.
correlation_matrix = housing_df.corr()
print(correlation_matrix['MedHouseValue'])
Output:
MedInc 0.688075
HouseAge 0.105623
AveRooms 0.151948
AveBedrms -0.046701
Population -0.024650
AveOccup -0.023737
Latitude -0.144160
Longitude -0.045967
MedHouseValue 1.000000
You can analyze the above correlation matrix to select the dependent and independent variables for our regression model. The correlation matrix provides insights into the relationships between each pair of variables in the dataset.
In this correlation matrix, MedHouseValue is the dependent variable, since it is the value you are trying to predict. Based on the matrix, you can identify the independent variables that correlate most meaningfully with MedHouseValue:
MedInc: Strong positive correlation (0.688075) with MedHouseValue, indicating that as median income increases, median house value also tends to increase.
AveRooms: Moderate positive correlation (0.151948) with MedHouseValue, suggesting that as the average number of rooms per household increases, median house value also tends to increase.
AveOccup: Weak negative correlation (-0.023737) with MedHouseValue, indicating that as average occupancy per household increases, median house value tends to decrease, although the effect is relatively small.
By selecting these independent variables, you can build a regression model that captures their relationships with MedHouseValue, allowing you to predict median house value from median income, average number of rooms, and average occupancy.
You can also plot the correlation matrix in Python using the code below:
import seaborn as sns
import matplotlib.pyplot as plt
# Assuming 'housing_df' is the DataFrame containing the data
# Plotting the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(housing_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Based on the analysis above, you’ll focus on a few key features for simplicity: MedInc (median income), AveRooms (average rooms per household), and AveOccup (average occupancy per household).
selected_features = ['MedInc', 'AveRooms', 'AveOccup']
X = housing_df[selected_features]
y = housing_df['MedHouseValue']
The code block above selects specific features from the housing_df DataFrame for analysis. The selected features, MedInc, AveRooms, and AveOccup, are stored in the selected_features list. The DataFrame housing_df is then subset to include only these columns, and the result is stored in the DataFrame X. The target variable MedHouseValue is extracted from housing_df and stored in the Series y.
You will use Standardization to ensure all features are on the same scale, improving model performance and comparability.
Standardization is a preprocessing technique that scales numerical features to have a mean of 0 and a standard deviation of 1. This process ensures that all features are on the same scale, which is essential for machine learning models sensitive to the input features’ scale. By standardizing the features, you can improve model performance and comparability by reducing the effect of features with large ranges dominating the model.
from sklearn.preprocessing import StandardScaler
# Initialize the StandardScaler object
scaler = StandardScaler()
# Fit the scaler to the data and transform it
X_scaled = scaler.fit_transform(X)
# Print the scaled data
print(X_scaled)
Output:
[[ 2.34476576 0.62855945 -0.04959654]
[ 2.33223796 0.32704136 -0.09251223]
[ 1.7826994 1.15562047 -0.02584253]
...
[-1.14259331 -0.09031802 -0.0717345 ]
[-1.05458292 -0.04021111 -0.09122515]
[-0.78012947 -0.07044252 -0.04368215]]
The output shows the scaled values of the features MedInc, AveRooms, and AveOccup after applying the StandardScaler. The values are now centered around 0 with a standard deviation of 1, ensuring all features are on the same scale.
The first row [ 2.34476576 0.62855945 -0.04959654] indicates that for the first data point the scaled MedInc value is 2.34476576, AveRooms is 0.62855945, and AveOccup is -0.04959654; the second row gives the scaled values for the second data point, and so on.
The features are now standardized and directly comparable. This matters for machine learning models that are sensitive to the scale of the input features, since it prevents features with large ranges from dominating the model.
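If you want to confirm the effect of scaling on the full array (continuing from the X_scaled array above), a quick check is to look at the per-column mean and standard deviation, which should be approximately 0 and 1:
# Means should be ~0 and standard deviations ~1 after standardization
print("Column means:", X_scaled.mean(axis=0).round(6))
print("Column std devs:", X_scaled.std(axis=0).round(6))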
Now that the data preprocessing is done, let’s implement multiple linear regression in Python.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# The 'LinearRegression' model is initialized and fitted to the training data.
model = LinearRegression()
model.fit(X_train, y_train)
# The model is used to predict the target variable for the test set.
y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))
The train_test_split function splits the data into training and testing sets; here, 80% of the data is used for training and 20% for testing.
The model is evaluated using Mean Squared Error and R-squared. Mean Squared Error (MSE) measures the average of the squares of the errors or deviations.
R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model.
Output:
Mean Squared Error: 0.7006855912225249
R-squared: 0.4652924370503557
The output above provides two key metrics to evaluate the performance of the multiple linear regression model:
Mean Squared Error (MSE): 0.7006855912225249. The MSE measures the average squared difference between the predicted and actual values of the target variable; a lower MSE means the model is making more accurate predictions. Here, an MSE of about 0.70 indicates the model is not perfect but has a reasonable level of accuracy.
R-squared (R²): 0.4652924370503557. R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variables; it ranges from 0 to 1, where 1 is perfect prediction and 0 indicates no linear relationship. Here, an R-squared of about 0.4653 means roughly 46.53% of the variance in the target variable is explained by the independent variables in the model, so the model captures a meaningful portion of the relationship, but far from all of it.
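To make these two metrics concrete, here is a short sketch that recomputes them directly from their definitions with NumPy; the values should match the mean_squared_error and r2_score output above (it reuses y_test and y_pred from the previous code block):
import numpy as np
y_true = np.asarray(y_test)
errors = y_true - y_pred
# MSE: the mean of the squared residuals
mse_manual = np.mean(errors ** 2)
# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
print("Manual MSE:", mse_manual)
print("Manual R-squared:", r2_manual)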
Let’s check out some important plots:
# Residual Plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='red', linestyle='--')
plt.show()
# Predicted vs Actual Plot
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=4)
plt.show()
Next, fit the same model with statsmodels. The statsmodels library in Python is a powerful tool for statistical analysis, providing a wide range of statistical models and tests, including linear regression, time series analysis, and nonparametric methods.
In the context of multiple linear regression, statsmodels can be used to fit a linear model to the data and then perform various statistical tests and analyses on it. This is particularly useful for understanding the relationships between the independent and dependent variables and for making predictions based on the model.
import statsmodels.api as sm
# Add a constant to the model
X_train_sm = sm.add_constant(X_train)
model_sm = sm.OLS(y_train, X_train_sm).fit()
print(model_sm.summary())
# Q-Q Plot for residuals
sm.qqplot(model_sm.resid, line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()
Output:
==============================================================================
Dep. Variable: MedHouseValue R-squared: 0.485
Model: OLS Adj. R-squared: 0.484
Method: Least Squares F-statistic: 5173.
Date: Fri, 17 Jan 2025 Prob (F-statistic): 0.00
Time: 09:40:54 Log-Likelihood: -20354.
No. Observations: 16512 AIC: 4.072e+04
Df Residuals: 16508 BIC: 4.075e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.0679 0.006 320.074 0.000 2.055 2.081
x1 0.8300 0.007 121.245 0.000 0.817 0.843
x2 -0.1000 0.007 -14.070 0.000 -0.114 -0.086
x3 -0.0397 0.006 -6.855 0.000 -0.051 -0.028
==============================================================================
Omnibus: 3981.290 Durbin-Watson: 1.983
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11583.284
Skew: 1.260 Prob(JB): 0.00
Kurtosis: 6.239 Cond. No. 1.42
==============================================================================
Here is the summary of the above table:
The model is an Ordinary Least Squares (OLS) regression model, which is a type of linear regression model. The dependent variable is MedHouseValue, and the model has an R-squared value of 0.485, indicating that about 48.5% of the variation in MedHouseValue can be explained by the independent variables. The adjusted R-squared value is 0.484, a modified version of R-squared that penalizes the model for including additional independent variables.
The model was fit using the least squares method. The F-statistic is 5173 with a p-value of approximately 0, meaning the probability of observing an F-statistic at least this extreme under the null hypothesis is essentially zero; the regression as a whole is therefore statistically significant.
The model coefficients are as follows:
const: The intercept is approximately 2.0679, the predicted MedHouseValue when all (standardized) predictors are zero.
x1 (MedInc): The coefficient is 0.8300, indicating that for every one-unit increase in scaled MedInc, the predicted MedHouseValue increases by approximately 0.83 units, holding the other independent variables constant.
x2 (AveRooms): The coefficient is -0.1000, indicating that for every one-unit increase in scaled AveRooms, the predicted MedHouseValue decreases by approximately 0.10 units, holding the other independent variables constant.
x3 (AveOccup): The coefficient is -0.0397, indicating that for every one-unit increase in scaled AveOccup, the predicted MedHouseValue decreases by approximately 0.04 units, holding the other independent variables constant.
The model diagnostics are as follows: the Durbin-Watson statistic of 1.983 is close to 2, suggesting little autocorrelation in the residuals; the Omnibus and Jarque-Bera tests (with a skew of 1.260 and kurtosis of 6.239) indicate that the residuals deviate from a normal distribution; and the condition number of 1.42 indicates no multicollinearity concerns among the standardized predictors.
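If you prefer to pull these diagnostics out programmatically rather than reading them off the summary table, statsmodels exposes them as functions of the residuals; here is a brief sketch, reusing model_sm from above:
from statsmodels.stats.stattools import durbin_watson, jarque_bera
# Durbin-Watson: values near 2 suggest little autocorrelation in the residuals
print("Durbin-Watson:", durbin_watson(model_sm.resid))
# Jarque-Bera: tests whether residual skew and kurtosis match a normal distribution
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(model_sm.resid)
print("Jarque-Bera:", jb_stat, "p-value:", jb_pvalue)
print("Skew:", skew, "Kurtosis:", kurtosis)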
Multicollinearity is a common issue in multiple linear regression, where two or more independent variables are highly correlated with each other. This can lead to unstable and unreliable estimates of the coefficients.
To detect and handle multicollinearity, you can use the Variance Inflation Factor. The VIF measures how much the variance of an estimated regression coefficient increases if your predictors are correlated. A VIF of 1 means that there is no correlation between a given predictor and the other predictors. VIF values exceeding 5 or 10 indicate a problematic amount of collinearity.
In the code block below, let’s calculate the VIF for each independent variable in our model. If any VIF value is above 5, you should consider removing the variable from the model.
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = pd.DataFrame()
vif_data['Feature'] = selected_features
vif_data['VIF'] = [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])]
print(vif_data)
# Bar Plot for VIF Values
vif_data.plot(kind='bar', x='Feature', y='VIF', legend=False)
plt.title('Variance Inflation Factor (VIF) by Feature')
plt.ylabel('VIF Value')
plt.show()
Output:
Feature VIF
0 MedInc 1.120166
1 AveRooms 1.119797
2 AveOccup 1.000488
The VIF values for each feature are as follows:
MedInc: The VIF value is 1.120166, indicating very low correlation with the other independent variables.
AveRooms: The VIF value is 1.119797, indicating very low correlation with the other independent variables.
AveOccup: The VIF value is 1.000488, indicating essentially no correlation with the other independent variables.
All of these VIF values are well below 5, so there is no significant multicollinearity between the independent variables in the model. The coefficient estimates are therefore stable and not inflated by correlated predictors.
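To see what the VIF is actually measuring, here is a small sketch that reproduces it by hand: each feature is regressed on the remaining features, and VIF = 1 / (1 − R²) of that auxiliary regression. It reuses X_scaled and selected_features from above, and the numbers should closely match the statsmodels output:
import numpy as np
from sklearn.linear_model import LinearRegression
for i, name in enumerate(selected_features):
    # Regress feature i on all of the other features
    others = np.delete(X_scaled, i, axis=1)
    r2 = LinearRegression().fit(others, X_scaled[:, i]).score(others, X_scaled[:, i])
    # VIF = 1 / (1 - R^2) of the auxiliary regression
    print(f"{name}: VIF = {1.0 / (1.0 - r2):.6f}")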
Cross-validation is a technique used to evaluate the performance of a machine learning model on a limited data sample. It is a resampling procedure with a single parameter, k, which refers to the number of groups a given data sample is split into; the procedure is therefore often called k-fold cross-validation.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
print("Cross-Validation Scores:", scores)
print("Mean CV R^2:", scores.mean())
# Line Plot for Cross-Validation Scores
plt.plot(range(1, 6), scores, marker='o', linestyle='--')
plt.xlabel('Fold')
plt.ylabel('R-squared')
plt.title('Cross-Validation R-squared Scores')
plt.show()
Output:
Cross-Validation Scores: [0.42854821 0.37096545 0.46910866 0.31191043 0.51269138]
Mean CV R^2: 0.41864482644003276
The cross-validation scores indicate how well the model performs on unseen data. The scores range from 0.31191043 to 0.51269138, indicating that the model’s performance varies across different folds. A higher score indicates better performance.
The mean CV R^2 score is 0.41864482644003276, which suggests that, on average, the model explains about 41.86% of the variance in the target variable. This is a moderate level of explanation, indicating that the model is somewhat effective in predicting the target variable but may benefit from further improvement or refinement.
These scores can be used to evaluate the model’s generalizability and identify potential areas for improvement.
The Recursive Feature Elimination method is a feature selection technique that recursively eliminates the least important features until a specified number of features is reached. This method is particularly useful when dealing with a large number of features and the goal is to select a subset of the most informative features.
In the code below, you first import the RFE class from sklearn.feature_selection. You then create an instance of RFE with a specified estimator (in this case, LinearRegression) and set n_features_to_select to 2, indicating that you want to keep the top 2 features.
Next, you fit the RFE object to the scaled features X_scaled and target variable y. The support_ attribute of the RFE object returns a boolean mask indicating which features were selected.
To visualize the ranking of features, you create a DataFrame with the feature names and their corresponding rankings. The ranking_ attribute of the RFE object returns the ranking of each feature, with lower values indicating more important features. You then plot a bar chart of the feature rankings, sorted by their ranking values; this plot helps you understand the relative importance of each feature in the model.
from sklearn.feature_selection import RFE
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X_scaled, y)
print("Selected Features:", rfe.support_)
# Bar Plot of Feature Rankings
feature_ranking = pd.DataFrame({
    'Feature': selected_features,
    'Ranking': rfe.ranking_
})
feature_ranking.sort_values(by='Ranking').plot(kind='bar', x='Feature', y='Ranking', legend=False)
plt.title('Feature Ranking (Lower is Better)')
plt.ylabel('Ranking')
plt.show()
Output:
Selected Features: [ True True False]
Based on the output and chart above, the two most suitable features are MedInc and AveRooms. This is consistent with the regression results earlier, where the dependent variable MedHouseValue depended most strongly on MedInc and AveRooms.
To implement multiple linear regression in Python, you can use libraries like statsmodels or scikit-learn. Here’s a quick, self-contained overview using scikit-learn:
from sklearn.linear_model import LinearRegression
import numpy as np
# Example data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]]) # Predictor variables
y = np.array([5, 7, 9, 11]) # Target variable
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
# Get coefficients and intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
# Make predictions
predictions = model.predict(X)
print("Predictions:", predictions)
This demonstrates how to fit the model, obtain the coefficients, and make predictions.
Multiple linear regression relies on several assumptions to ensure valid results: linearity, independence of errors, homoscedasticity, normality of residuals, and no multicollinearity among the predictors. You can test these assumptions using tools like residual plots, the Variance Inflation Factor (VIF), Q-Q plots, or statistical tests such as Durbin-Watson.
Key metrics from regression results include (a short, self-contained sketch showing where to read each of these off a fitted model follows this list):
1. Coefficients (coef_): Represent the change in the target variable for a one-unit change in the corresponding predictor, holding the other predictors constant. Example: a coefficient of 2 for X1 means the target variable increases by 2 for every 1-unit increase in X1, holding other variables constant.
2. Intercept (intercept_): Represents the target variable’s predicted value when all predictors are zero.
3. R-squared: The proportion of the variance in the target variable explained by the predictors. Example: an R² of 0.85 means 85% of the variability in the target variable is explained by the model.
4. P-values (in statsmodels): Assess the statistical significance of predictors. A p-value < 0.05 typically indicates a predictor is significant.
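Here is that sketch (synthetic data, purely for illustration), showing where each of these quantities lives on a fitted model in scikit-learn and statsmodels:
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Synthetic data purely for illustration
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

# scikit-learn: coefficients, intercept, and R-squared
sk_model = LinearRegression().fit(X_demo, y_demo)
print("Coefficients:", sk_model.coef_)
print("Intercept:", sk_model.intercept_)
print("R-squared:", sk_model.score(X_demo, y_demo))

# statsmodels: the same fit also exposes p-values for each predictor
sm_result = sm.OLS(y_demo, sm.add_constant(X_demo)).fit()
print("P-values:", sm_result.pvalues)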
Feature | Simple Linear Regression | Multiple Linear Regression |
---|---|---|
Number of Independent Variables | One | More than one |
Model Equation | y = β0 + β1x + ε | y = β0 + β1x1 + β2x2 + … + βnxn + ε |
Assumptions | Linearity, independence of errors, homoscedasticity, and normality of residuals | The same assumptions, plus no multicollinearity among the independent variables |
Interpretation of Coefficients | The change in the target variable for a unit change in the single independent variable | The change in the target variable for a unit change in one independent variable, while holding all other independent variables constant |
Model Complexity | Less complex | More complex |
Model Flexibility | Less flexible | More flexible |
Overfitting Risk | Lower | Higher |
Interpretability | Easier to interpret | More challenging to interpret |
Applicability | Suitable for simple relationships | Suitable for complex relationships with multiple factors |
Example | Predicting house prices based on the number of bedrooms | Predicting house prices based on the number of bedrooms, square footage, and location |
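As a small illustration of the contrast in that last row (synthetic data and made-up coefficients, purely for demonstration), fitting both a simple and a multiple regression on the same dataset shows how adding predictors changes the fit:
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic housing data, for illustration only
rng = np.random.default_rng(1)
n = 200
bedrooms = rng.integers(1, 6, size=n).astype(float)
sqft = rng.normal(1500, 400, size=n)
price = 50_000 + 20_000 * bedrooms + 100 * sqft + rng.normal(0, 10_000, size=n)

# Simple linear regression: one predictor (bedrooms)
simple = LinearRegression().fit(bedrooms.reshape(-1, 1), price)
# Multiple linear regression: two predictors (bedrooms and square footage)
X_both = np.column_stack([bedrooms, sqft])
multiple = LinearRegression().fit(X_both, price)

print("Simple R^2:  ", simple.score(bedrooms.reshape(-1, 1), price))
print("Multiple R^2:", multiple.score(X_both, price))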
In this tutorial, you learned to implement multiple linear regression using the California Housing Dataset. You tackled crucial aspects such as multicollinearity, cross-validation, and feature selection, and used visualizations to illustrate residuals, feature importance, and overall model performance. You can now construct robust regression models in Python and apply these skills to real-world problems.