Using AI to Predict Housing Prices and Classify Properties

Part A – Regression Modelling for Business Decision Making

1. Background & Objective

As a data analyst at a real-estate company, my role involves supporting the sales team by providing insights and trends from data. I used this Kaggle dataset (https://www.kaggle.com/datasets/mssmartypants/paris-housing-price-prediction) to predict housing prices in Paris. The goal is to assist customers in making informed decisions about selling their properties.

2. Data Exploration & Preprocessing

  • Dataset Overview: 17 columns, 10,000 rows. Features include size, number of rooms, yard/pool presence, exclusivity of neighborhood, and more.
  • Handling Missing Data: No missing values found.
  • Feature Engineering: Categorical variables were already encoded.
  • Splitting Data: 80% training, 20% testing.
  • Scaling: Applied MinMaxScaler to ensure features are within [0,1].
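
The split and scaling steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the array shapes, the coefficients, and `random_state=42` are assumptions. It also assumes the scaler is fitted on the training rows only, which the summary does not state.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and price column (not the real data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(200, 4))
y = X @ np.array([100.0, 5.0, 2.0, 1.0]) + rng.normal(0, 10, size=200)

# 80% training, 20% testing; random_state is an arbitrary choice for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows, then reuse the same scaling on the test rows.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this approach the training features fall exactly within [0,1], while test values can land slightly outside that range if a test row exceeds the training minimum or maximum.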

3. Model Selection & Hyperparameter Tuning

  • Initial Models: Evaluated Linear Regression and Support Vector Regression (SVR) with default parameters.
  • Performance Comparison:
    • Linear Regression: Low MSE, high R2 (~99%), but potential overfitting.
    • SVR (RBF Kernel): Low MSE, high R2, better generalization.
  • SVR Hyperparameter Tuning:
    • GridSearchCV with parameters C, epsilon, gamma.
    • Best Parameters: C=1, epsilon=0.1, gamma=1.
    • Lowest RMSE: 0.0348.
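
A grid search over C, epsilon, and gamma might look like the sketch below. The candidate values in `param_grid`, the 5-fold setting, and the synthetic data are assumptions chosen for illustration; the summary reports only the best parameters, not the grid that was searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic data already in [0, 1], standing in for the scaled housing features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 3))
y = X @ np.array([0.6, 0.3, 0.1]) + rng.normal(0, 0.01, size=120)

# Hypothetical candidate values; the real grid is not given in the write-up.
param_grid = {
    "C": [0.1, 1, 10],
    "epsilon": [0.01, 0.1],
    "gamma": [0.1, 1],
}

# scikit-learn maximizes scores, so RMSE is exposed as a negated value.
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse, 4))
```

Flipping the sign of `best_score_` recovers the cross-validated RMSE of the best parameter combination.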

4. Model Evaluation

  • Metrics: MSE (0.0348), R2 (0.9850).
  • K-Fold Cross-Validation: Tested k=5,10,12,15; k=5 was chosen for balance between efficiency and accuracy.
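
One way to compare the fold counts listed above is to run the tuned SVR through `cross_val_score` for each k and record the mean and spread of the scores. The data here is synthetic, and the shuffling, the seed, and the choice of R2 as the scoring metric are assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the scaled features and target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(150, 3))
y = X @ np.array([0.6, 0.3, 0.1]) + rng.normal(0, 0.01, size=150)

# Parameters reported as best in the tuning step.
model = SVR(kernel="rbf", C=1, epsilon=0.1, gamma=1)

results = {}
for k in (5, 10, 12, 15):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    # Store mean and standard deviation so accuracy and stability can be compared.
    results[k] = (scores.mean(), scores.std())

for k, (mean_r2, std_r2) in results.items():
    print(f"k={k}: mean R2={mean_r2:.4f}, std={std_r2:.4f}")
```

Larger k means more model fits, which is the efficiency side of the trade-off mentioned above.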

5. Business Impact

The tuned model predicted a selling price of about $3,063,964.40 for a new property, and this price was recommended to the customer.


Part B – Classification Modelling for Business Decision Making

1. Problem Statement

Classify a property as “Basic” or “Luxury” and provide renovation insights to increase its value.

2. Data Preprocessing

  • Handling Outliers: Used IQR method; replaced outliers with median.
  • Feature Scaling: Applied MinMaxScaler.
  • Dataset Split: 80% training, 20% testing.
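
The IQR-based outlier handling can be sketched as below. The helper name `replace_outliers_with_median`, the conventional 1.5 multiplier, and the choice to compute the median over all values (outliers included) are assumptions; the summary does not specify them.

```python
import numpy as np

def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the column median."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (values < lower) | (values > upper)
    cleaned = values.copy()
    # The median is taken over the original values, outliers included (an assumption).
    cleaned[is_outlier] = np.median(values)
    return cleaned

# Hypothetical column with one extreme value.
sample = [10, 12, 11, 13, 12, 500]
cleaned = replace_outliers_with_median(sample)
print(cleaned)  # the 500 is replaced by the median, 12.0
```

In practice the function would be applied column by column to the numeric features before scaling.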

3. Model Building & Hyperparameter Tuning

  • Algorithm Choice: Random Forest due to its ability to handle high-dimensional data and mixed feature types.
  • Tuning: Random Search with parameters n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap.
  • Best Parameters: n_estimators=400, min_samples_split=10, min_samples_leaf=15, max_features='log2', max_depth=None, bootstrap=False.
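
A random search over these parameters could be set up roughly as follows. The candidate lists, `n_iter=5`, `cv=3`, and the synthetic labels are assumptions kept small so the sketch runs quickly; the search space actually used is not given in the summary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data standing in for Basic (0) vs Luxury (1).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 6))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Hypothetical search space covering the parameters named above.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 5, 15],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

# Sample n_iter random combinations instead of trying every one.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 4))
```

Compared with an exhaustive grid, random search evaluates only a sample of combinations, which keeps the cost manageable when six parameters are tuned at once.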

4. Model Evaluation

  • Confusion Matrix: Perfect classification with no errors.
  • Metrics: Precision, Recall, F1-score all at 1.0.
  • K-Fold Cross-Validation: Tested k=3,5,7,10,12; all showed 100% accuracy with zero variance.
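
For reference, the metrics above can be computed with scikit-learn as in this sketch. The label lists are hand-written toy values that mimic a perfect classifier; they are not the project's predictions.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Toy labels: 0 = Basic, 1 = Luxury. Predictions match the truth exactly.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(cm)
print(precision, recall, f1)
```

With no misclassifications, the off-diagonal cells of the confusion matrix are zero and all three metrics equal 1.0.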

5. Business Decision & Recommendations

  • Classification Result: New property classified as “Basic” (0).
  • Feature Importance: Key features for value increase are hasYard, isNewBuilt, and hasPool.
  • Recommendations: Improvements targeting these features can elevate the property to the “Luxury” category and boost its market value.
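
Feature importances of this kind can be read from a fitted Random Forest through its `feature_importances_` attribute. The sketch below uses synthetic data with an invented labelling rule, made up purely for the demonstration; the column names other than hasYard, hasPool, and isNewBuilt are also assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["hasYard", "hasPool", "isNewBuilt", "squareMeters", "numberOfRooms"]

# Synthetic features: three binary flags plus two numeric columns.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(0, 2, n),
    rng.integers(0, 2, n),
    rng.integers(0, 2, n),
    rng.uniform(50, 1000, n),
    rng.integers(1, 10, n),
])

# Invented rule for the demo only: "Luxury" when all three flags are set.
y = (X[:, 0].astype(bool) & X[:, 1].astype(bool) & X[:, 2].astype(bool)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each feature with its impurity-based importance and sort descending.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The importances sum to 1, so each value can be read as a feature's share of the model's total impurity reduction.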

Key Takeaways

  • Regression: SVR outperformed Linear Regression due to better generalization and handling of non-linear relationships.
  • Classification: Random Forest provided perfect accuracy, highlighting its strength in high-dimensional datasets with mixed variables.
  • Actionable Insights: Focus on 3 key features for renovations to significantly enhance property value.

The code is available on GitHub: https://github.com/wanlichen2024/708A2_MachineLearning_RegressionClassification.git
