Using AI to Predict Housing Prices and Classify Properties

Part A – Regression Modelling for Business Decision Making

1. Background & Objective

As a data analyst at a real-estate company, my role involves supporting the sales team by providing insights and trends from data. I used the Paris Housing Price Prediction dataset from Kaggle (https://www.kaggle.com/datasets/mssmartypants/paris-housing-price-prediction) to predict housing prices in Paris. The goal is to help customers make informed decisions about selling their properties.

2. Data Exploration & Preprocessing

Dataset Overview: 17 columns, 10,000 rows. Features include size, number of rooms, yard/pool presence, exclusivity of neighborhood, and more.
Handling Missing Data: No missing values found.
Feature Engineering: Categorical variables were already encoded.
Splitting Data: 80% training, 20% testing.
Scaling: Applied MinMaxScaler to ensure features are within [0,1].
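
The split and scaling steps above might look like the following minimal sketch. It uses synthetic data as a stand-in for the Kaggle file, and it assumes the scaler is fitted on the training split only (the write-up does not say which split was used for fitting). All variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for three housing features (not the real Kaggle data):
# a size-like column, a room-count-like column, and a 0/1-like column.
X = rng.uniform(low=[50, 1, 0], high=[1000, 10, 1], size=(200, 3))
y = 100.0 * X[:, 0] + rng.normal(0, 50, size=200)

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: fit the scaler on the training split only, then reuse it on the
# test split so no information from the test data leaks into preprocessing.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this approach the training features land exactly in [0, 1], while test values can fall slightly outside that range if they exceed the training minimum or maximum.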

3. Model Selection & Hyperparameter Tuning

Initial Models: Evaluated Linear Regression and Support Vector Regression (SVR) with default parameters.
Performance Comparison:
- Linear Regression: Low MSE and high R2 (~0.99), but with a risk of overfitting.
- SVR (RBF Kernel): Lower MSE than Linear Regression, high R2, and better generalization.
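
A baseline comparison of this kind can be sketched as follows. The data here is synthetic and already scaled, so the printed numbers are not the results reported above. The dictionary keys and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic, already-scaled features and target (placeholder for the real data).
X = rng.uniform(0, 1, size=(300, 4))
y = X @ np.array([0.7, 0.1, 0.1, 0.1]) + rng.normal(0, 0.02, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Both models are used with their default hyperparameters.
models = {
    "LinearRegression": LinearRegression(),
    "SVR (RBF)": SVR(kernel="rbf"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
    print(f"{name}: MSE={scores[name]['mse']:.5f}, R2={scores[name]['r2']:.4f}")
```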
SVR Hyperparameter Tuning:
- GridSearchCV with parameters C, epsilon, gamma.
- Best Parameters: C=1, epsilon=0.1, gamma=1.
- Lowest RMSE: 0.0348.
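
The tuning step could be set up roughly as below. This is a sketch under assumptions: the candidate values in the grid, the 5-fold setting, and the RMSE scorer are guesses, since only the tuned parameter names and the best values are given above. The data is synthetic.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic scaled data (placeholder for the real training split).
X = rng.uniform(0, 1, size=(120, 3))
y = X @ np.array([0.6, 0.3, 0.1]) + rng.normal(0, 0.01, size=120)

# Hypothetical candidate values; the original grid is not stated.
param_grid = {
    "C": [0.1, 1, 10],
    "epsilon": [0.01, 0.1],
    "gamma": [0.1, 1],
}

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence "neg"
    cv=5,
)
search.fit(X, y)

best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print("Best parameters:", search.best_params_)
print(f"Best cross-validated RMSE: {best_rmse:.4f}")
```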

4. Model Evaluation

Metrics: MSE (0.0348), R2 (0.9850).
K-Fold Cross-Validation: Tested k=5, 10, 12, 15; k=5 was chosen as a balance between efficiency and accuracy.
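
Comparing several values of k might be done along these lines. The sketch reuses the best SVR parameters reported above on synthetic data. Shuffling the folds and scoring with R2 are assumptions, not details from the project.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic scaled data (placeholder for the real dataset).
X = rng.uniform(0, 1, size=(150, 3))
y = X @ np.array([0.6, 0.3, 0.1]) + rng.normal(0, 0.02, size=150)

model = SVR(kernel="rbf", C=1, epsilon=0.1, gamma=1)

cv_results = {}
for k in (5, 10, 12, 15):
    # Assumption: shuffled folds with a fixed seed for reproducibility.
    folds = KFold(n_splits=k, shuffle=True, random_state=42)
    r2 = cross_val_score(model, X, y, cv=folds, scoring="r2")
    cv_results[k] = (r2.mean(), r2.std())
    print(f"k={k:2d}: mean R2={r2.mean():.4f}, std={r2.std():.4f}")
```

Larger k means more model fits per evaluation, which is the efficiency side of the trade-off mentioned above.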

5. Business Impact

Predicted selling price for a new property: ~$3,063,964.40. Recommended this price to the customer.

Part B – Classification Modelling for Business Decision Making

1. Problem Statement

Classify a property as “Basic” or “Luxury” and provide renovation insights to increase its value.

2. Data Preprocessing

Handling Outliers: Used IQR method; replaced outliers with median.
Feature Scaling: Applied MinMaxScaler.
Dataset Split: 80% training, 20% testing.
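
The IQR-based outlier handling above can be illustrated with a small sketch. It assumes the common 1.5 x IQR fences and a median computed over the whole column (outliers included); neither detail is stated in the write-up. The helper name `replace_outliers_with_median` is hypothetical.

```python
import numpy as np


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the column median."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    median = np.median(values)  # assumption: median of the full column
    is_outlier = (values < lower) | (values > upper)
    return np.where(is_outlier, median, values)


# Toy column with one obvious outlier (1000).
sizes = [50, 55, 60, 62, 65, 70, 1000]
cleaned = replace_outliers_with_median(sizes)
print(cleaned)  # the 1000 is replaced by the median, 62
```

Here Q1 = 57.5 and Q3 = 67.5, so the fences are 42.5 and 82.5, and only the value 1000 falls outside them.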

3. Model Building & Hyperparameter Tuning

Algorithm Choice: Random Forest due to its ability to handle high-dimensional data and mixed feature types.
Tuning: Random Search with parameters n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap.
Best Parameters: n_estimators=400, min_samples_split=10, min_samples_leaf=15, max_features='log2', max_depth=None, bootstrap=False.
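
A random search over these six parameters could be sketched as follows. The candidate values, the number of sampled configurations, and the fold count are assumptions chosen to keep the example fast. The data comes from `make_classification`, not from the housing dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (placeholder for Basic vs. Luxury).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# Hypothetical search space; the original candidate values are not stated.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 5, 15],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # number of random configurations to try
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```

Unlike a grid search, this samples only `n_iter` combinations, which keeps the cost manageable when six parameters are tuned at once.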

4. Model Evaluation

Confusion Matrix: Perfect classification with no errors.
Metrics: Precision, Recall, F1-score all at 1.0.
K-Fold Cross-Validation: Tested k=3,5,7,10,12; all showed 100% accuracy with zero variance.
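
The evaluation steps above might be reproduced with a sketch like this one. It runs on synthetic data, so it will not show the perfect scores reported here. Stratified splits and the reduced forest size are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic binary data: 0 stands in for "Basic", 1 for "Luxury".
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred, target_names=["Basic", "Luxury"]))

# Repeat cross-validation for each k and record mean and spread of accuracy.
cv_accuracy = {}
for k in (3, 5, 7, 10, 12):
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    acc = cross_val_score(clf, X, y, cv=folds, scoring="accuracy")
    cv_accuracy[k] = (acc.mean(), acc.std())
    print(f"k={k:2d}: mean accuracy={acc.mean():.4f}, std={acc.std():.4f}")
```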

5. Business Decision & Recommendations

Classification Result: New property classified as “Basic” (0).
Feature Importance: Key features for value increase are hasYard, isNewBuilt, and hasPool.
Recommendations: Adding or upgrading these features can move the property into the “Luxury” category and boost its market value.
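
A feature-importance ranking of this kind can be read from a fitted Random Forest, as in the sketch below. The data is synthetic, the `squareMeters` column is a hypothetical extra feature, and the labelling rule (Luxury only when all three flags are set) is an assumption made purely so the example has a known answer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 600

feature_names = ["hasYard", "isNewBuilt", "hasPool", "squareMeters"]
has_yard = rng.integers(0, 2, size=n)
is_new_built = rng.integers(0, 2, size=n)
has_pool = rng.integers(0, 2, size=n)
square_meters = rng.uniform(50, 1000, size=n)  # unrelated to the label here

X = np.column_stack([has_yard, is_new_built, has_pool, square_meters])

# Assumed labelling rule, for illustration only: 1 = Luxury, 0 = Basic.
y = (has_yard & is_new_built & has_pool).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
ranking = sorted(
    zip(feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name:>12}: {importance:.3f}")

top_three = {name for name, _ in ranking[:3]}
```

Impurity-based importances show which features the model relies on for the Basic/Luxury split; they do not by themselves measure how much a renovation would change the sale price.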

Key Takeaways

Regression: SVR outperformed Linear Regression due to better generalization and handling of non-linear relationships.
Classification: Random Forest provided perfect accuracy, highlighting its strength in high-dimensional datasets with mixed variables.
Actionable Insights: Focus on 3 key features for renovations to significantly enhance property value.

The code is available on GitHub: https://github.com/wanlichen2024/708A2_MachineLearning_RegressionClassification.git

Wanli's Data Story