Here is the Github repo for the project code.
In this project, I built a full end-to-end machine learning pipeline on the Kaggle Ames House Prices dataset. Using careful feature engineering and Gradient Boosting, I achieved a leaderboard score of 0.13339 (top ~2k). The main focus was learning classical ML workflows on tabular data rather than leaderboard chasing.
This post is aimed at beginners and intermediates who want to see someone else's process for learning classical machine learning on tabular data.
Goal: Predict the sales price for houses. For each Id in the test set, predict the value of SalePrice.
Metric: Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.
For cross-validation, RMSE is reported on the original SalePrice scale for interpretability. Kaggle leaderboard scores are computed on log-transformed values.
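As a quick illustration, here is a minimal sketch of how the leaderboard metric can be reproduced locally (the prices in the example are made up):

```python
import numpy as np

def kaggle_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """RMSE between the log of the predicted and the log of the observed SalePrice."""
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Illustrative prices only
y_true = np.array([200_000, 150_000, 320_000])
y_pred = np.array([210_000, 140_000, 300_000])
print(kaggle_rmse(y_true, y_pred))  # roughly 0.06
```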
1 entry. Scored 0.13339, placing 1964th on the leaderboard. Here is the Github repo.
I started with an Exploratory Data Analysis (EDA) checklist. EDA focused on missingness, sparsity, feature correlations, and target distribution — with the goal of informing imputation and scaling decisions.
Here are some interesting findings.
The datasets are in good condition. Their columns match (apart from the target SalePrice), and there is a good description in the documentation for almost all of the features.
Quick missing and sparse checks on the combined training and test data.
We can see that there are missing values for several features. (yellow indicates missing/sparse values)


This directed me towards an imputation strategy, categorising all features as:

- numeric or categorical
- structural (missing = non-existent, e.g. PoolArea) or substantive (missing = unknown, e.g. LotArea)
- ordinal or nominal, for categorical features

Here is one example of the ordinal categories for one feature:
- Ex: Excellent
- Gd: Good
- TA: Typical - slight dampness allowed
- Fa: Fair - dampness or some cracking or settling
- Po: Poor - severe cracking, settling, or wetness
- NA: No Basement

I was comfortable assuming that the implied ordering reflects true value differences, while acknowledging this may not always hold perfectly.
At this point, I created a feature descriptor database in src/ames/data/features.py to capture the different classes of feature outlined above. Here is a view of it in a table:
| Feature | Type | Sub Type | Ordinal? |
|---|---|---|---|
| 1stFlrSF | num | substantive | False |
| 2ndFlrSF | num | structural | False |
| 3SsnPorch | num | structural | False |
| Alley | cat | structural | False |
| BedroomAbvGr | num | substantive | False |
| BldgType | cat | substantive | False |
| BsmtCond | cat | structural | True |
| BsmtExposure | cat | structural | True |
| BsmtFinSF1 | num | structural | False |
| BsmtFinSF2 | num | structural | False |
| BsmtFinType1 | cat | structural | True |
| BsmtFinType2 | cat | structural | True |
| BsmtFullBath | num | substantive | False |
| BsmtHalfBath | num | substantive | False |
| BsmtQual | cat | structural | True |
| BsmtUnfSF | num | structural | False |
| CentralAir | cat | substantive | False |
| Condition1 | cat | structural | False |
| Condition2 | cat | structural | False |
| Electrical | cat | substantive | False |
| EnclosedPorch | num | structural | False |
| ExterCond | cat | substantive | True |
| Exterior1st | cat | substantive | False |
| Exterior2nd | cat | substantive | False |
| ExterQual | cat | substantive | True |
| Fence | cat | structural | False |
| FireplaceQu | cat | structural | True |
| Fireplaces | num | structural | False |
| Foundation | cat | substantive | False |
| FullBath | num | substantive | False |
| Functional | cat | substantive | True |
| GarageArea | num | structural | False |
| GarageCars | num | structural | False |
| GarageCond | cat | structural | True |
| GarageFinish | cat | structural | True |
| GarageQual | cat | structural | True |
| GarageType | cat | structural | False |
| GarageYrBlt | num | substantive | False |
| GrLivArea | num | substantive | False |
| HalfBath | num | substantive | False |
| Heating | cat | substantive | False |
| HeatingQC | cat | substantive | True |
| HouseStyle | cat | substantive | False |
| Id | num | substantive | False |
| KitchenAbvGr | num | substantive | False |
| KitchenQual | cat | substantive | True |
| LandContour | cat | substantive | False |
| LandSlope | cat | substantive | True |
| LotArea | num | substantive | False |
| LotConfig | cat | substantive | False |
| LotFrontage | num | substantive | False |
| LotShape | cat | substantive | False |
| LowQualFinSF | num | substantive | False |
| MasVnrArea | num | structural | False |
| MasVnrType | cat | structural | False |
| MiscFeature | cat | structural | False |
| MiscVal | num | structural | False |
| MoSold | num | substantive | False |
| MSSubClass | cat | substantive | False |
| MSZoning | cat | substantive | False |
| Neighborhood | cat | substantive | False |
| OpenPorchSF | num | structural | False |
| OverallCond | num | substantive | False |
| OverallQual | num | substantive | False |
| PavedDrive | cat | substantive | True |
| PoolArea | num | structural | False |
| PoolQC | cat | structural | True |
| RoofMatl | cat | substantive | False |
| RoofStyle | cat | substantive | False |
| SaleCondition | cat | substantive | False |
| SaleType | cat | substantive | False |
| ScreenPorch | num | structural | False |
| Street | cat | substantive | False |
| TotalBsmtSF | num | structural | False |
| TotRmsAbvGrd | num | substantive | False |
| Utilities | cat | substantive | True |
| WoodDeckSF | num | structural | False |
| YearBuilt | num | substantive | False |
| YearRemodAdd | num | structural | False |
| YrSold | num | substantive | False |
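I have not reproduced features.py here, but a minimal sketch of the kind of descriptor structure it captures might look like the following (the class and field names are illustrative, not the actual module contents):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureDescriptor:
    name: str
    kind: str        # "num" or "cat"
    sub_type: str    # "structural" (missing = non-existent) or "substantive" (missing = unknown)
    ordinal: bool = False

# A few rows of the table above, expressed as descriptors
FEATURES = [
    FeatureDescriptor("PoolArea", "num", "structural"),
    FeatureDescriptor("LotArea", "num", "substantive"),
    FeatureDescriptor("BsmtQual", "cat", "structural", ordinal=True),
    FeatureDescriptor("Neighborhood", "cat", "substantive"),
]

# Convenience selectors for building the preprocessing pipelines later on
numeric_structural = [f.name for f in FEATURES if f.kind == "num" and f.sub_type == "structural"]
ordinal_categorical = [f.name for f in FEATURES if f.kind == "cat" and f.ordinal]
```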
The next step was to explore numeric feature<->feature correlation and feature<->target correlation.
The following image shows the top correlating numeric features across both training and test datasets. Looking at the red squares above 0.8 on the heatmap, we can sanity-check it: for example, the highest feature-to-feature correlation is 0.89, between GarageArea and GarageCars, and it is self-explanatory why these two features correlate highly.
Seeing this heatmap sparked ideas on feature ablation tests for the Feature Engineering phase.
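A correlation heatmap like this can be produced along these lines (a sketch, assuming the Kaggle train.csv and test.csv files and that seaborn is installed; the plotting code in the repo may differ):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
combined = pd.concat([train.drop(columns=["SalePrice"]), test], ignore_index=True)

# Pearson correlation between numeric features only
corr = combined.select_dtypes("number").corr()

# Keep the most strongly inter-correlated columns so the plot stays readable
top = corr.abs().mean().sort_values(ascending=False).head(15).index
sns.heatmap(corr.loc[top, top], annot=True, fmt=".2f", cmap="coolwarm")
plt.tight_layout()
plt.show()
```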

The following image shows the features that correlate most with the target SalePrice. Looking at the first column, we can see that OverallQual and GrLivArea correlate most highly. These are likely to be strong predictors of SalePrice in the modelling phase.

The following image shows a visual-check overview of how each feature scatter-plots against SalePrice. I found this very useful for seeing which features have high sparsity, which are discrete, and which show strong linear correlation. Looking at this chart, it became clear to me that:

- MSSubClass is a categorical feature in disguise.
- MoSold and YrSold should be dropped, since they show no clear relationship with SalePrice.
The following image shows the top Mutual Information scores for numeric features against SalePrice. Mutual Information captures both linear and non-linear dependency by estimating how much knowing one variable reduces uncertainty about the other. Again, we can see some familiar features topping the list.

The following image shows the top Mutual Information scores for categorical features against SalePrice. Here we can see for the first time how strong Neighborhood is as a predictor of SalePrice.
This checks out intuitively: anyone who has thought about house prices for more than a few seconds knows that location really matters.
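The mutual information scores can be computed with scikit-learn's mutual_info_regression; a rough sketch of one way to do it (assuming X is the training frame without the target, with missing values already filled, and y is SalePrice):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def mi_scores(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Mutual information of each feature with SalePrice; categoricals are integer-coded first."""
    X = X.copy()
    discrete = []
    for col in X.columns:
        if X[col].dtype == object:
            X[col], _ = X[col].factorize()  # integer-code categorical values
            discrete.append(True)
        else:
            discrete.append(False)
    scores = mutual_info_regression(X, y, discrete_features=discrete, random_state=0)
    return pd.Series(scores, index=X.columns).sort_values(ascending=False)

# e.g. mi_scores(train.drop(columns=["SalePrice"]).fillna(0), train["SalePrice"]).head(10)
```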

Encoding the target was straightforward, since Kaggle specifies it in the competition brief: they want competitors to predict the logarithm of the target. I did some analysis to see why this makes sense.
When we look at the distribution for SalePrice we observe that it is never negative, it is right skewed, and it has some high outliers. We don't want to ignore the high outliers since big house prices are important data points. But we want to find a way to make the distribution more balanced and suitable for linear modelling.
The left-hand side of the image below shows the distribution of SalePrice; the right-hand side shows the log-transformed distribution. We can see that the log distribution is more balanced and the effect of the outliers is reduced, which makes it more suitable for linear modelling.
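A quick way to quantify what the image shows is to compare skewness before and after the transform (a sketch; np.log1p is used here, which for house prices behaves essentially like a plain log):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

train = pd.read_csv("train.csv")
price = train["SalePrice"]
log_price = np.log1p(price)  # log(1 + x)

print(f"skew before: {skew(price):.2f}")      # strongly right-skewed
print(f"skew after:  {skew(log_price):.2f}")  # much closer to symmetric
```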

Categorical features can be split into ordinal and nominal features. For the ordinal features, the decision was straightforward: I would use OrdinalEncoder. For the nominal features, the right encoding depends on cardinality, so I investigated the highest-cardinality categorical nominal features across the training and test datasets:
| Feature | Cardinality |
|---|---|
| Neighborhood | 25 |
| Exterior2nd | 15 |
| Exterior1st | 13 |
| SaleType | 9 |
| Condition1 | 9 |
You can see that there isn't high cardinality in these datasets. Neighborhood has the highest, but we know this is a strong predictor of SalePrice so we don't want to lose any information here. And even then it's only 25.
My thinking at this point was to one-hot encode all categorical nominal features.
One experiment might be to ignore feature categories with low frequency using the min_frequency parameter of OneHotEncoder.
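As a sketch of what that encoding step could look like (the column lists and the category ordering shown here are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Illustrative subsets of the ordinal and nominal categorical features
ordinal_cols = ["BsmtCond", "KitchenQual"]
nominal_cols = ["Neighborhood", "MSZoning"]

# Explicit ordering so "no basement" < Poor < ... < Excellent
quality_scale = ["NA", "Po", "Fa", "TA", "Gd", "Ex"]

categorical_encoder = ColumnTransformer([
    ("ordinal", OrdinalEncoder(categories=[quality_scale] * len(ordinal_cols),
                               handle_unknown="use_encoded_value", unknown_value=-1), ordinal_cols),
    # min_frequency groups rare categories into an "infrequent" bucket (the experiment mentioned above)
    ("nominal", OneHotEncoder(handle_unknown="ignore", min_frequency=10), nominal_cols),
], remainder="passthrough")
```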
In this project I wanted to explore both linear and tree-based models. Linear models require appropriate scaling of all numeric features. So I went through all the numeric features and analysed what scaling might be appropriate. This is what I came up with:
| Feature | Key stats | Strategy | Reason |
|---|---|---|---|
| 1stFlrSF | skew=1.37, sparse=0.0, outlier=0.013 | Log Scale | Right skew, meaningful high outliers |
| 2ndFlrSF | skew=0.81, sparse=0.56, outlier=0.001 | Split in "has2ndFlr" and "SecondFlrSF". Then Standard. | High sparsity. once split has limited skew |
| 3SsnPorch | skew=10.29, sparse=0.98, outlier=0.016 | Make binary: "has3SsnPorch" | Very high sparsity. Poor predictor of SalePrice |
| BedroomAbvGr | skew=0.21, sparse=0.00, outlier=0.023 | Standard Scale | Discrete distribution |
| BsmtFinSF1 | skew=1.68, sparse=0.31, outlier=0.004 | Split in "hasBsmt1" and "BsmtSF1". Then Standard | High sparsity. once split has limited skew |
| BsmtFinSF2 | skew=4.25, sparse=0.88, outlier=0.114 | Make binary: "hasBsmt2" | Very high sparsity. Poor predictor of SalePrice |
| BsmtFullBath | skew=0.59, sparse=0.58, outlier=0.00 | Standard Scale | Discrete distribution with low max |
| BsmtHalfBath | skew=4.0, sparse=0.94, outlier=0.05 | Standard Scale | Very high sparsity, discrete distribution |
| BsmtUnfSF | skew=0.91, sparse=0.08, outlier=0.019 | Robust Scale | Poisson-style dist. |
| EnclosedPorch | skew=3.08, sparse=0.85, outlier=0.142 | Split in "hasEnclosedPorch" and "EnclosedPorchSF". | High sparsity. once split has limited skew |
| Fireplaces | skew=0.648, sparse=0.47, outlier=0.003 | Standard Scale | Discrete distribution with low max |
| FullBath | skew=0.036, sparse=0.00, outlier=0.000 | Standard Scale | Discrete distribution with low max |
| GarageArea | skew=0.179, sparse=0.05, outlier=0.014 | Split in "hasGarage" and "GarageSF". Then Standard | Sparsity takes away from distribution |
| GarageCars | skew=-0.342, sparse=0.05, outlier=0.003 | Standard Scale | Discrete distribution with low max |
| GarageYrBlt | skew=-0.677, sparse=0.00, outlier=0.000 | Standard Scale | Preserve increase in years |
| GrLivArea | skew=1.365, sparse=0.00, outlier=0.021 | Log Scale | Right skew, meaningful high outliers |
| HalfBath | skew=0.675, sparse=0.62, outlier=0.000 | Standard Scale | Discrete distribution with low max |
| Id | skew=0.000, sparse=0.00, outlier=0.000 | Drop | Id not required |
| KitchenAbvGr | skew=4.483, sparse=0.00, outlier=0.046 | Standard Scale | Discrete distribution with low max |
| LotArea | skew=12.195, sparse=0.00, outlier=0.047 | Log Scale | Right skew, meaningful high outliers |
| LotFrontage | skew=2.406, sparse=0.00, outlier=0.072 | Log Scale | Right skew, meaningful high outliers |
| LowQualFinSF | skew=9.002, sparse=0.98, outlier=0.017 | Make binary: "hasLowQualSF" | Very high sparsity. Poor predictor of SalePrice |
| MasVnrArea | skew=2.674, sparse=0.59, outlier=0.067 | Split in "HasMasVnrArea" and "MasVnrSF". Then Standard | High sparsity. once split has limited skew |
| MiscVal | skew=24.451, sparse=0.96, outlier=0.035 | Drop | Even when split, I can't see a linear correlation with SalePrice |
| MoSold | skew=0.211, sparse=0.00, outlier=0.000 | Drop | No relationship to sale price |
| MSSubClass | skew=1.406, sparse=0.00, outlier=0.070 | Turn to categorical. one-hot encode. | Not a numeric feature |
| OpenPorchSF | skew=2.361, sparse=0.44, outlier=0.052 | Split in "hasOpenPorch" and "OpenPorchSF". | High sparsity. once split has limited skew |
| OverallCond | skew=0.692, sparse=0.00, outlier=0.085 | Standard Scale | Discrete distribution with low max |
| OverallQual | skew=0.216, sparse=0.00, outlier=0.001 | Standard Scale | Discrete distribution with low max |
| PoolArea | skew=14.813, sparse=0.99, outlier=0.004 | Make binary: "hasPool" | Very high sparsity. Poor predictor of SalePrice |
| ScreenPorch | skew=4.117, sparse=0.92, outlier=0.079 | Split in "hasScreenPorch" and "ScreenPorchSF" | High sparsity. once split has limited skew |
| TotalBsmtSF | skew=1.522, sparse=0.02, outlier=0.041 | Split in "hasBsmt" and "BsmtSF". Then Standard. | Exclude zeros to reveal distribution |
| TotRmsAbvGrd | skew=0.675, sparse=0.00, outlier=0.020 | Standard Scale | Discrete distribution with low max |
| WoodDeckSF | skew=1.539, sparse=0.52, outlier=0.021 | Split in "hasWoodDeck" and "WoodDeckSF". Then Standard. | High sparsity. once split has limited skew |
| YearBuilt | skew=-0.612, sparse=0.00, outlier=0.004 | Standard Scale | Preserve year to year scale |
| YearRemodAdd | skew=-0.503, sparse=0.00, outlier=0.000 | Standard Scale | Preserve year to year scale |
| YrSold | skew=0.096, sparse=0.00, outlier=0.000 | Drop | No relationship to sale price |
The process of building this table allowed me to group the features into classes. When grouped, these are the overarching heuristics:

- Log-scale right-skewed features with meaningful high outliers (e.g. LotArea, GrLivArea).
- Split high-sparsity features into a "has" indicator plus a standard-scaled value (e.g. 2ndFlrSF, GarageArea).
- Reduce very-high-sparsity features that are poor predictors to a single binary indicator (e.g. PoolArea, 3SsnPorch).
- Standard-scale discrete features with a low maximum, and year features, to preserve their scale (e.g. FullBath, YearBuilt).
- Drop features with no apparent relationship to SalePrice (Id, MoSold, YrSold, MiscVal).
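A sketch of how some of these heuristics could be expressed as scikit-learn transformers (the feature lists are abbreviated, and the indicator-splitting via FunctionTransformer is one possible implementation, not necessarily the repo's):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

log_cols = ["LotArea", "GrLivArea", "1stFlrSF"]           # right skew, meaningful outliers
standard_cols = ["OverallQual", "FullBath", "YearBuilt"]  # discrete / year-like
split_cols = ["2ndFlrSF", "GarageArea", "WoodDeckSF"]     # high sparsity: indicator + value
binary_cols = ["PoolArea", "3SsnPorch"]                   # very high sparsity: indicator only

def split_indicator(X):
    """Append a has-feature indicator column for each sparse column."""
    X = np.asarray(X, dtype=float)
    return np.hstack([(X > 0).astype(float), X])

def to_binary(X):
    return (np.asarray(X, dtype=float) > 0).astype(float)

numeric_scaling = ColumnTransformer([
    ("log", make_pipeline(FunctionTransformer(np.log1p), StandardScaler()), log_cols),
    ("standard", StandardScaler(), standard_cols),
    ("split", make_pipeline(FunctionTransformer(split_indicator), StandardScaler()), split_cols),
    ("binary", FunctionTransformer(to_binary), binary_cols),
])
```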
With the EDA complete, my next step was to choose some models and establish baselines.
The goal of this project was to explore linear and tree-based models. The models I wanted to evaluate were the following:
| Base Type | Model | Reason |
|---|---|---|
| Linear | Ordinary Least Squares (OLS) | Rough baseline, zero-cost training |
| Linear | Lasso | See how it compares with OLS |
| Linear | Ridge | L2 regularization could work well with this dataset |
| Linear | ElasticNet | Curious how this would compare against Lasso and Ridge |
| Tree | RandomForest | Get a baseline for how tree models cope with this dataset |
| Tree | GradientBoost | Compare against RandomForest. Could the shallow tree depth be stronger against overfitting here? |
| Tree | AdaBoost | Comparison to other tree-based models |
Two decisions I made up front:

- Use a Pipeline to avoid gotchas with preprocessing.
- Use n_splits=10 as a reasonable default for cross-validation.

The next step was to establish a baseline pipeline that does imputing, encoding, and modelling. My decision for the baseline was to:
- Drop Id and MoSold
- Use StandardScaler for all numeric features
- Use OneHotEncoder for all categorical nominal features
- Use OrdinalEncoder for all categorical ordinal features

My baseline imputation strategy was as follows:
```python
from sklearn.impute import SimpleImputer

# numeric structural (missing means the item doesn't exist, e.g. no pool)
numeric_structural_imputer = SimpleImputer(strategy="constant", fill_value=0)
# numeric substantive (missing means unknown)
numeric_substantive_imputer = SimpleImputer(strategy="median", add_indicator=True)
# categorical structural
categorical_structural_imputer = SimpleImputer(strategy="constant", fill_value="NA")
# categorical substantive
categorical_substantive_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
# categorical ordinal structural
ordinal_structural_imputer = SimpleImputer(strategy="constant", fill_value="NA")
# categorical ordinal substantive
ordinal_substantive_imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
```
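These imputers are then wired up with the encoders and a model inside a single Pipeline. A sketch of the overall baseline and its 10-fold evaluation, shown here with GradientBoost (the column groups are abbreviated and the names illustrative; the other models slot into the same pipeline):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.read_csv("train.csv")
y = train["SalePrice"]
X = train.drop(columns=["SalePrice", "Id"])

# Abbreviated column groups; the full lists come from the feature descriptors
num_structural = ["PoolArea", "GarageArea"]
num_substantive = ["LotArea", "GrLivArea"]
cat_structural = ["GarageType", "Fence"]
cat_substantive = ["MSZoning", "Neighborhood"]

preprocess = ColumnTransformer([
    ("num_struct", make_pipeline(SimpleImputer(strategy="constant", fill_value=0), StandardScaler()), num_structural),
    ("num_subst", make_pipeline(SimpleImputer(strategy="median", add_indicator=True), StandardScaler()), num_substantive),
    ("cat_struct", make_pipeline(SimpleImputer(strategy="constant", fill_value="NA"),
                                 OneHotEncoder(handle_unknown="ignore")), cat_structural),
    ("cat_subst", make_pipeline(SimpleImputer(strategy="constant", fill_value="Unknown"),
                                OneHotEncoder(handle_unknown="ignore")), cat_substantive),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", GradientBoostingRegressor(n_estimators=450, learning_rate=0.2, max_depth=2)),
])

cv = KFold(n_splits=10, shuffle=True, random_state=0)
rmse = -cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
print(rmse.mean())
```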
With some rough grid search for hyperparameters, these were the initial baseline results with cv_params = {"n_splits": 10} and commit short hash 5a67de1:
| Model | Hyperparameters | RMSE |
|---|---|---|
| Ordinary Least Squares (OLS) | n/a | 40986.26 |
| Lasso | {"alpha": 0.01} | 42407.55 |
| Ridge | {"alpha": 1} | 37591.72 |
| ElasticNet | {"l1_ratio": 1e-7, "alpha": 1} | 37825.40 |
| RandomForest | {"n_estimators": 600, "max_depth": 11} | 28678.29 |
| GradientBoost | {"n_estimators": 450, "learning_rate": 0.2, "max_depth": 2} | 24304.67 |
| AdaBoost | {"n_estimators": 650, "learning_rate": 1} | 33985.59 |
We can see that GradientBoost outperforms all other models, and the linear models perform worse than the tree-based models on the baseline pipeline.
At this stage, I decided to move forward with Ridge, RandomForest & GradientBoost.
I was curious to see if I could get the Ridge model working better with feature scaling.
This experiment implemented the custom scaling described above for all numeric features, with the aim of improving the linear models. As expected, we observe a strong improvement for the Ridge model, but no improvement for GradientBoost.
commit short hash 5a67de1:
| Model | Preprocessor | Hyperparameters | RMSE |
|---|---|---|---|
| Ridge | Baseline | {"alpha": 1} | 37591.72 |
| Ridge | Scaling | {"alpha": 1} | 32298.81 |
| GradientBoost | Baseline | {"n_estimators": 450, "learning_rate": 0.2, "max_depth": 2} | 24437.50 |
| GradientBoost | Scaling | {"n_estimators": 450, "learning_rate": 0.2, "max_depth": 2} | 24877.34 |
At this point, I decided to move forward just with GradientBoost, due to its performance.
I attempted a number of different ablation strategies with GradientBoost and couldn't improve on the score. In practice, Gradient Boosting proved robust to feature ablation; removing correlated or weak features did not improve validation RMSE.
In this experiment, I introduced 2 new features:

- Total square footage: TotalSF = 1stFlrSF + 2ndFlrSF + TotalBsmtSF
- Square footage per room above ground: SqftPerRmAbvGr = TotalSF / (TotRmsAbvGrd + FullBath)

commit short hash 5a67de1:
| Model | Preprocessor | Hyperparameters | RMSE |
|---|---|---|---|
| GradientBoost | Baseline | {"n_estimators": 450, "learning_rate": 0.2, "max_depth": 2} | 24426.31 |
| GradientBoost | Computed | {"n_estimators": 450, "learning_rate": 0.2, "max_depth": 2} | 23962.42 |
We observe that these computed features do improve GradientBoost. Computed feature interactions helped more than any of the ablation or scaling experiments.
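One way such computed features can be added ahead of the preprocessor is with a small pandas function wrapped in a FunctionTransformer (a sketch; the repo's implementation may differ):

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_computed_features(X: pd.DataFrame) -> pd.DataFrame:
    X = X.copy()
    X["TotalSF"] = X["1stFlrSF"] + X["2ndFlrSF"] + X["TotalBsmtSF"]
    X["SqftPerRmAbvGr"] = X["TotalSF"] / (X["TotRmsAbvGrd"] + X["FullBath"])
    return X

computed = FunctionTransformer(add_computed_features)
# e.g. Pipeline([("computed", computed), ("preprocess", preprocess), ("regressor", gbr)])
```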
Now with GradientBoost running on the Computed Values Preprocessor, I went forward with some final tuning.
Sticking with cross validation n_splits=10, I was able to grid search improved parameters for GradientBoost:
commit short hash 5a67de1:
| Model | Preprocessor | Hyperparameters | RMSE |
|---|---|---|---|
| GradientBoost | Computed | {"n_estimators": 500, "learning_rate": 0.2, "max_depth": 2} | 23364.31 |
This was the final tuned model, achieving an RMSE of $23,364.31 under 10-fold cross-validation.
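The final parameters came out of a grid search along these lines (a sketch; the exact grid ranges shown are illustrative):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

param_grid = {
    "n_estimators": [400, 450, 500, 550],
    "learning_rate": [0.1, 0.2],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
# search.fit(...) on the Computed-preprocessed data, then inspect search.best_params_
```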
The final model is stored in src/ames/models/best.py
The predictions can be run with:
```bash
# Create submission.csv in folder
conda activate ames
python ./src/ames/models/best.py
```
I learned a huge amount on this project. Beforehand I had never run any ML models, and I had never entered a Kaggle competition. I learned from scratch all about:

- sklearn, pandas, scipy

The biggest surprise to me was how much time I had to spend on preprocessing compared to running models. Perhaps 70% of the time was spent on EDA and preprocessing. I also had to spend a lot of time understanding the architecture of the models to gain an intuitive sense of how to improve their performance, for example why scaling improves linear models but not tree-based models.
A number of more advanced techniques were deliberately avoided to keep the focus on fundamentals.
If I were to take this further I might look into: