Part A – Regression Modelling for Business Decision Making
1. Background & Objective
As a data analyst at a real-estate company, my role involves supporting the sales team by providing insights and trends from data. I used this Kaggle dataset (https://www.kaggle.com/datasets/mssmartypants/paris-housing-price-prediction) to predict housing prices in Paris. The goal is to assist customers in making informed decisions about selling their properties.
2. Data Exploration & Preprocessing
- Dataset Overview: 17 columns, 10,000 rows. Features include size, number of rooms, yard/pool presence, exclusivity of neighborhood, and more.
- Handling Missing Data: No missing values found.
- Feature Engineering: Categorical variables were already encoded.
- Splitting Data: 80% training, 20% testing.
- Scaling: Applied MinMaxScaler to ensure features are within [0,1].
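The split and scaling steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the Kaggle table. The column count, the random seed, and the choice to fit the scaler on the training rows only are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the housing features (hypothetical values).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(200, 4))
y = X @ np.array([100.0, 5.0, 2.0, 1.0]) + rng.normal(0, 10, size=200)

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: the scaler is fitted on the training rows only and then
# reused on the test rows, so no test information leaks into training.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

With this arrangement the training features land exactly in [0, 1], while test features can fall slightly outside that range.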
3. Model Selection & Hyperparameter Tuning
- Initial Models: Evaluated Linear Regression and Support Vector Regression (SVR) with default parameters.
- Performance Comparison:
  - Linear Regression: Lower MSE, high R2 (~99%), but potential overfitting.
  - SVR (RBF Kernel): Lower MSE, high R2, better generalization.
- SVR Hyperparameter Tuning:
  - GridSearchCV with parameters C, epsilon, gamma.
  - Best Parameters: C=1, epsilon=0.1, gamma=1.
  - Lowest RMSE: 0.0348.
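A grid search over C, epsilon, and gamma for an RBF-kernel SVR might look roughly like the sketch below. The synthetic data, the candidate values in the grid, the 5-fold setting, and the RMSE scoring string are assumptions for illustration; the grid actually searched in the project is not stated here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic features and target, both already in a [0, 1]-like range.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 3))
y = X @ np.array([0.6, 0.3, 0.1]) + rng.normal(0, 0.01, size=120)

# Hypothetical candidate values for the three tuned parameters.
param_grid = {
    "C": [0.1, 1, 10],
    "epsilon": [0.01, 0.1],
    "gamma": [0.1, 1],
}

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises, hence "neg"
    cv=5,
)
search.fit(X, y)

best_rmse = -search.best_score_  # flip the sign back to a plain RMSE
print(search.best_params_, best_rmse)
```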
4. Model Evaluation
- Metrics: MSE (0.0348), R2 (0.9850).
- K-Fold Cross-Validation: Tested k=5,10,12,15; k=5 was chosen for balance between efficiency and accuracy.
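Comparing several values of k could be done along these lines. The data is synthetic, and the shuffling, seed, and R2 scoring are assumptions made for the sketch; only the SVR parameters and the list of k values come from the text above.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(150, 3))
y = X @ np.array([0.6, 0.3, 0.1]) + rng.normal(0, 0.01, size=150)

model = SVR(kernel="rbf", C=1, epsilon=0.1, gamma=1)

results = {}
for k in (5, 10, 12, 15):
    folds = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=folds, scoring="r2")
    results[k] = scores
    print(f"k={k}: mean R2={scores.mean():.4f}, std={scores.std():.4f}")
```

Larger k means more model fits and smaller validation folds, which is the efficiency versus accuracy trade-off mentioned above.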
5. Business Impact
Predicted selling price for a new property: ~$3,063,964.40. Recommended this price to the customer.
Part B – Classification Modelling for Business Decision Making
1. Problem Statement
Classify a property as “Basic” or “Luxury” and provide renovation insights to increase its value.
2. Data Preprocessing
- Handling Outliers: Used IQR method; replaced outliers with median.
- Feature Scaling: Applied MinMaxScaler.
- Dataset Split: 80% training, 20% testing.
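The IQR-based outlier handling can be illustrated with a small helper. The function name and the 1.5 multiplier are assumptions (1.5 is the conventional choice, but the value used in the project is not stated), and the sample values are made up.

```python
import numpy as np

def replace_outliers_with_median(values, k=1.5):
    """Replace points outside [Q1 - k*IQR, Q3 + k*IQR] with the column median."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    median = np.median(values)
    is_outlier = (values < lower) | (values > upper)
    return np.where(is_outlier, median, values)

sample = [10, 12, 11, 13, 12, 11, 500]  # 500 is an obvious outlier
cleaned = replace_outliers_with_median(sample)
print(cleaned)  # 500 is replaced by the median, 12
```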
3. Model Building & Hyperparameter Tuning
- Algorithm Choice: Random Forest due to its ability to handle high-dimensional data and mixed feature types.
- Tuning: Random Search with parameters n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap.
- Best Parameters: n_estimators=400, min_samples_split=10, min_samples_leaf=15, max_features='log2', max_depth=None, bootstrap=False.
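A random search over these six parameters might be set up as in the sketch below. The data comes from make_classification, and the candidate values, n_iter, and cv are deliberately small assumptions so that the example runs quickly; they are not the ranges used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic two-class data standing in for the Basic/Luxury labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Hypothetical candidate values for the six tuned parameters.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 5, 15],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,            # number of random combinations to try
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```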
4. Model Evaluation
- Confusion Matrix: Perfect classification with no errors.
- Metrics: Precision, Recall, F1-score all at 1.0.
- K-Fold Cross-Validation: Tested k=3,5,7,10,12; all showed 100% accuracy with zero variance.
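For reference, the confusion matrix and the three metrics can be computed as shown here. The labels are hand-made and deliberately imperfect so that the off-diagonal cells are visible; they are not the project's predictions, which had no errors.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 0 = Basic, 1 = Luxury.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(cm)
print(precision, recall, f1)
```

A perfect classifier, as reported above, would put every count on the diagonal and give 1.0 for all three metrics.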
5. Business Decision & Recommendations
- Classification Result: New property classified as “Basic” (0).
- Feature Importance: Key features for value increase are hasYard, isNewBuilt, and hasPool.
- Recommendations: Renovating these features can elevate the property to the “Luxury” category and boost its market value.
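Feature importances from a fitted Random Forest can be ranked roughly as follows. The data is synthetic, and the labelling rule (a property counts as Luxury only when all three flags are set) and the extra squareMeters column are assumptions invented for this sketch, not facts about the dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
feature_names = ["squareMeters", "hasYard", "isNewBuilt", "hasPool"]

# One continuous column plus three binary flags (all synthetic).
square_meters = rng.uniform(50, 500, size=n)
flags = rng.integers(0, 2, size=(n, 3))
X = np.column_stack([square_meters, flags])

# Hypothetical rule: Luxury (1) only if yard, new build, and pool are all present.
y = (flags.sum(axis=1) == 3).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Sort features from most to least important (impurity-based importances).
ranked = sorted(
    zip(feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances show which features the model relies on, not how much a renovation would change the price, so they are best read as a starting point for the recommendation.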
Key Takeaways
- Regression: SVR outperformed Linear Regression due to better generalization and handling of non-linear relationships.
- Classification: Random Forest provided perfect accuracy, highlighting its strength in high-dimensional datasets with mixed variables.
- Actionable Insights: Focus on 3 key features for renovations to significantly enhance property value.
The code is available on GitHub: https://github.com/wanlichen2024/708A2_MachineLearning_RegressionClassification.git


