Predicting Rent Prices in Albuquerque Metropolitan Area

Project Goals

Contracted to analyze apartment metrics to build a model to predict rent prices and apartment occupancy from relevant features. Data will be used to competetively price apartment rent and predict occupancy for profit analysis. Resulted in a model with less than 5% error rate using ensemble machine learning methods.

Data Cleaning

The first step in the data analysis was to examine the data. There were many missing or incorrect values, so we replaced invalid entires with either the mean or mode, depending on the specific feature. This was done to improve the quality of the dataset, without removing rows from an already sparse data source.

We also remove features such as the apartment complex name, sub market, etc as they do not have any relationship with our predictors.

Here are the statistics for each feature:

Feature Mean Median Mode Minimum Maximum Standard Deviation
Units 209.2893 77 240 46 573 114.4006
Bedrooms 1.6309 2 2 1 4 0.7115
Size 854.8595 840 800 109 1751 271.1651
Bathrooms 1.3795 1 1 1 3 0.4931
Rent $601.04 $715 $599 $310 $2,437 $152.82
Occupancy 89.3106% 100% 100 36% 100% 9.0944%
Zip Code 67671.6736 87111 null 87048 87123 17132.2395
Age 1820.5799 1965 null 2 2011 196.1572

Data Visualization & Analysis

The data then needed to be visualized to examine given features and their relationship to our two output features- rent and occupancy. The graphs found below start telling the story of our data:

As you can see, the size of a unit and the number of bedrooms (and bathrooms) are positively correlated with the rent. This intuitively makes sense, as we expect large apartments with more amenities to be more expensive.

Interestingly enough, the number of units in an apartment complex is negatively correlated with occupancy. This is likely due to large apartment complexes having longer occupancy between renters.

We then use a correlation matrix to confirm our visual suspicions:

Feature Rent Occupancy
Units 9% -22%
Bedrooms 40% 8%
Size 39% -6%
Bathrooms 34% -3%
ZIP 100% 100%
Age 100% 100%

The correlation matrix reflects what we learned from visualizing the data, but also brings up an alarming feature- both the zip code and age of the apartment complex are both highly correlated with rent and occupancy. Further inspection reveals that due to the sparse data, most zip codes and ages are unique, which explains the high correlation. We will remove these features, as it is unlikely they will generalize well to new sample data.

Model Building and Analysis

The next step in our process is to build a model. We use a simple feedforward network with a small number of hidden layers. Data is normalized to prevent training errors. Initial results were promising, but using a gradient boosting technique and hyperparameter tuning lowered the error of our model. Evaluation on the test dataset revealed the following errors:

Feature Mean Percent Absolute Error
Rent 4.87%
Occupancy 5.07%

Query Model:



Results and Future Work:

The generated model has great accuracy for predicting rent prices, with an acceptable error rate. The main issue with the current model centers around the lack of market data. The model could be improved with more data points and extended to other areas and markets. More data would likely make location information more relevant in the model, which we know to be correlated with price.

It would also be interesting to include a single image of the proeprty in the pipeline, to try and quantify the aesthetic value of the property. This could be trained using the same network with a convolutional neural network used to downsample features of the image into an input vector.