Real Estate Price Prediction Using Multiple Linear Regression

juliamtw20
Oct 22, 2024

Introduction

Real estate pricing is a complex phenomenon influenced by various factors such as the size of the property, the number of rooms, amenities, and location-specific attributes. Predicting property prices is crucial for buyers, sellers, and investors who wish to make informed decisions. In this project, we aim to explore and build a regression model to predict real estate prices using multiple linear regression techniques.

The data for this project come from a dataset of property listings that records each property's size (in square meters), type, number of rooms, elevator availability, terraces, and more. We combine the source data, create new features, and analyze the result before building our predictive model.

Dataset and Data Cleaning

The dataset includes 413 records and 28 variables. These include Price, m2 (square meters of the property), Rooms, Bathrooms, and other features such as Elevator, Terrasse, Parking, and Kitchen.

We checked the data for missing values and duplicate records.
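These checks can be sketched in base R as follows. The `toy` data frame is hypothetical stand-in data with made-up values, not the project's dataset; the same calls should apply to the real data frame.

```r
# Hypothetical stand-in data; column names follow the article, values are made up.
toy <- data.frame(
  Price     = c(250000, 310000, NA, 250000),
  m2        = c(70, 95, 60, 70),
  Rooms     = c(2, 3, 2, 2),
  Bathrooms = c(1, 2, 1, 1)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(toy))

# Keep only rows that are neither duplicates nor incomplete
clean <- toy[!duplicated(toy) & complete.cases(toy), ]

print(na_per_column)
print(n_duplicates)
print(nrow(clean))
```

Whether to drop or impute incomplete rows depends on how many there are; dropping them is only the simplest option.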

Feature Engineering

We created several new features based on the existing ones to capture non-linear relationships between the independent variables and the target variable (price). These transformations include:

SQm2: Square of the area in square meters.
CUBE_m2: Cube of the area in square meters.
LnRooms: Natural logarithm of the number of rooms.
SQBathrooms: Square of the number of bathrooms.
Interaction terms such as R_M2 (Rooms multiplied by m2) and TE_M2 (Terrasse multiplied by m2).
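As a rough sketch, the transformations listed above could be computed like this. The `toy` data frame and its values are hypothetical, and treating Terrasse as a 0/1 indicator is an assumption.

```r
# Hypothetical stand-in data; values are made up for illustration.
toy <- data.frame(
  m2        = c(50, 80, 120),
  Rooms     = c(1, 3, 4),
  Bathrooms = c(1, 2, 2),
  Terrasse  = c(0, 1, 1)  # assumed to be a 0/1 indicator
)

# Polynomial terms of the area
toy$SQm2    <- toy$m2^2
toy$CUBE_m2 <- toy$m2^3

# log() is the natural logarithm in R; it requires Rooms > 0
toy$LnRooms <- log(toy$Rooms)

toy$SQBathrooms <- toy$Bathrooms^2

# Interaction terms: products of two variables
toy$R_M2  <- toy$Rooms * toy$m2
toy$TE_M2 <- toy$Terrasse * toy$m2

print(toy)
```

With a 0/1 indicator, an interaction such as TE_M2 lets the effect of area on price differ between properties with and without a terrace.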

Exploratory Data Analysis (EDA)

We performed an exploratory data analysis to understand how the variables relate to each other and to the target variable (Price). A correlation matrix and scatterplots showed clear relationships between Price and features such as m2, Rooms, and Bathrooms.

Some key findings from the EDA include:

A positive correlation between property size (m2) and price.
Properties with elevators and terraces tend to have higher prices.
Cubic and squared terms of property size are important for capturing non-linear relationships in pricing.
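A correlation matrix and scatterplots of this kind can be produced with base R. The sketch below runs on simulated stand-in data (not the project's dataset), so the numbers it prints say nothing about the real properties.

```r
# Simulated stand-in data; the relationships are built in for illustration only.
set.seed(1)
n <- 100
toy <- data.frame(m2 = runif(n, 40, 150))
toy$Rooms     <- pmax(1, round(toy$m2 / 30 + rnorm(n, sd = 0.5)))
toy$Bathrooms <- pmax(1, round(toy$m2 / 60 + rnorm(n, sd = 0.3)))
toy$Price     <- 3000 * toy$m2 + rnorm(n, sd = 30000)

# Pairwise Pearson correlations between the numeric variables
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# Price against size, then all pairwise scatterplots
plot(toy$m2, toy$Price, xlab = "m2", ylab = "Price")
pairs(toy)
```

Curvature in the m2 versus Price scatterplot is the kind of pattern that would motivate squared and cubic size terms.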

Model Building

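We built a multiple linear regression model with R's lm() function on the data frame BARdata1 to predict real estate prices. The model includes the original features along with several of their transformations and interaction terms. lm() fits the coefficients by ordinary least squares, that is, by minimizing the sum of squared residuals, which for a given set of predictors also maximizes the explained variance (R-squared).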

model <- lm(
  Price ~ m2 + SQm2 + CUBE_m2 + T_M2 + Type + Elevator + E_M2 + Bathrooms +
    Terrasse + Atico + TE_M2 + Parking + Kitchen + Yard +
    CV + EX + GR + HG + LC + NB + SA + SAM + SM + SAS,
  data = BARdata1
)

Model Evaluation

We evaluated the model using standard metrics such as the R-squared value, which indicates how much variance in the price is explained by the model, and the p-values of the coefficients to determine their statistical significance.
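The sketch below shows one way to pull these metrics out of a fitted lm object. It uses a small simulated data frame and a two-predictor model (`toy`, `toy_model`) as stand-ins; the same accessors should work on the full model.

```r
# Simulated stand-in data; not the project's dataset.
set.seed(42)
n <- 200
toy <- data.frame(
  m2       = runif(n, 40, 150),
  Elevator = rbinom(n, 1, 0.5)
)
toy$Price <- 50000 + 3000 * toy$m2 + 20000 * toy$Elevator +
  rnorm(n, sd = 25000)

toy_model   <- lm(Price ~ m2 + Elevator, data = toy)
fit_summary <- summary(toy_model)

# Share of the variance in Price explained by the model
r_squared <- fit_summary$r.squared

# R-squared penalized for the number of predictors
adj_r_squared <- fit_summary$adj.r.squared

# p-value of the t-test for each coefficient (including the intercept)
p_values <- fit_summary$coefficients[, "Pr(>|t|)"]

print(r_squared)
print(adj_r_squared)
print(p_values)
```

Because R-squared never decreases when predictors are added, the adjusted value is usually the more informative of the two for a model with many terms.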

The residuals were analyzed to check if they met the assumptions of linear regression:

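A mean of residuals close to zero is a basic sanity check. When the model includes an intercept, least squares forces the residuals to average zero, so this alone does not show that the model is unbiased.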
A histogram of residuals showing a roughly normal distribution suggests that the errors are normally distributed.
A plot of residuals versus fitted values showing no patterns suggests homoscedasticity (constant variance of errors).

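hist(residuals(model), breaks = 50)    # shape of the residual distribution
mean(residuals(model))                 # approximately 0 when the model has an intercept
sd(residuals(model))                   # spread of the residuals
plot(fitted(model), residuals(model))  # look for patterns or a funnel shape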

Conclusion

In this project, we explored a real estate dataset, performed feature engineering, and built a multiple linear regression model to predict real estate prices. The model demonstrates a strong relationship between property characteristics and their prices. By transforming variables and adding interaction terms, we captured non-linearities and improved the model’s performance.

While the model provides insights into the factors affecting real estate prices, further improvements could be made by exploring non-linear models or using regularization techniques (e.g., Ridge or Lasso regression) to prevent overfitting.

This analysis can be useful for real estate professionals, investors, and buyers interested in estimating property values based on key features.

Future Work

Future extensions of this project could include:

Cross-validation to estimate how well the model generalizes to unseen data.
Exploring other machine learning models such as Random Forest or Gradient Boosting for more complex relationships.
Adding geographical data (location-based features) for improved price predictions.
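As an illustration of the first item, k-fold cross-validation can be written in a few lines of base R. This is a minimal sketch on simulated stand-in data with a one-predictor model; the choice of k = 5 and of RMSE as the error metric are assumptions, not part of the original analysis.

```r
# Simulated stand-in data; not the project's dataset.
set.seed(123)
n <- 200
toy <- data.frame(m2 = runif(n, 40, 150))
toy$Price <- 50000 + 3000 * toy$m2 + rnorm(n, sd = 25000)

k <- 5
# Randomly assign each row to one of k folds of equal size
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, predict the held-out fold
rmse_per_fold <- sapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  fit   <- lm(Price ~ m2, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$Price - pred)^2))
})

# Average out-of-sample error across the folds
cv_rmse <- mean(rmse_per_fold)

print(rmse_per_fold)
print(cv_rmse)
```

A cross-validated error much larger than the in-sample residual standard deviation would suggest overfitting.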