Predicting House Prices With Machine Learning

In this article, we use machine learning to tackle the California housing problem: predicting the median house value of a district from census data. The California housing dataset has the following characteristics.

Input : Data Features


MedInc        median income in block group
HouseAge      median house age in block group
AveRooms      average number of rooms per household
AveBedrms     average number of bedrooms per household
Population    block group population
AveOccup      average number of household members
Latitude      block group latitude
Longitude     block group longitude

Output : Target variable


MedHouseVal   median house value in block group, expressed in units of $100,000 (a value of 2.5 means $250,000)
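
Because the target is stored in units of $100,000, it is easy to misread a raw value as a dollar amount. The sketch below shows the conversion; the helper name med_house_val_to_dollars is hypothetical and is not part of scikit-learn.

```python
# Hypothetical helper (not part of scikit-learn): convert a MedHouseVal
# value, which is stored in units of $100,000, into dollars.
UNIT_IN_DOLLARS = 100_000


def med_house_val_to_dollars(value):
    """Return the dollar amount for a target value given in $100,000 units."""
    return value * UNIT_IN_DOLLARS


print(med_house_val_to_dollars(2.5))  # 250000.0
```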

Brief Intro About California Housing Dataset

This dataset was derived from the 1990 U.S. census, using one row per census block group. You can read more about the dataset and download it as a tar file using the link below.

https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). A household is a group of people residing within a home.

Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.
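
A small illustration of this effect, using made-up numbers rather than values from the dataset: the average is the total number of rooms divided by the number of households, so it grows quickly when only a few of the houses are occupied. The function name is hypothetical.

```python
# Illustrative only: the numbers below are invented, not taken from the dataset.
def average_rooms_per_household(total_rooms, households):
    """Average rooms per household, as used for the AveRooms feature."""
    return total_rooms / households


# Same number of rooms, very different number of occupied households.
typical_block = average_rooms_per_household(total_rooms=2000, households=400)
resort_block = average_rooms_per_household(total_rooms=2000, households=20)

print(typical_block)  # 5.0
print(resort_block)   # 100.0
```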

Solution :

Given the input data above, we have to predict MedHouseVal (the median house value). Let's first import the required libraries and load the dataset.


import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
california = fetch_california_housing()

Separate the data into input features and target value


X = california.data  # Input Data features
y = california.target  # Output Target variable (median house values)

# Split the data into a training set (80 percent) and a test set (20 percent, test_size=0.2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
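
To see what the split does without downloading the dataset, here is a minimal sketch on a toy array of 10 rows. The names X_toy and y_toy are made up for this example. With test_size=0.2, 8 rows go to training and 2 to testing, and each feature row stays paired with its target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): row i is [2*i, 2*i + 1], target is i.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```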

Create an instance of the linear regression model


model = LinearRegression()

Train the model and make predictions on the test set


model.fit(X_train, y_train)
y_pred = model.predict(X_test)
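
LinearRegression fits an ordinary least squares model: it finds the coefficients and intercept that minimize the squared error on the training data. The following is a rough sketch of that idea using numpy on synthetic, noise-free data. It is not scikit-learn's implementation, and all names in it are made up for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3*x1 - 2*x2 + 1.5
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
true_coef = np.array([3.0, -2.0])
true_intercept = 1.5
y_toy = X_toy @ true_coef + true_intercept

# Append a column of ones so the last fitted weight acts as the intercept.
A = np.column_stack([X_toy, np.ones(len(X_toy))])
solution, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
coef, intercept = solution[:-1], solution[-1]


def predict(X_new):
    """Apply the fitted linear model to new rows."""
    return X_new @ coef + intercept


print(np.round(coef, 3), round(float(intercept), 3))
```

On noise-free data like this, the fitted weights recover the values used to generate the targets, up to floating-point error.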

Evaluate the model


mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
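
The mean squared error is expressed in squared target units, which is hard to interpret. Taking its square root (the RMSE) gives an error in the target's own units of $100,000. For the MSE of about 0.556 shown in the output further below, that is about 0.746, or roughly $74,600. The sketch below computes both quantities by hand; the function names are hypothetical.

```python
import numpy as np


def mean_squared_error_manual(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def rmse_in_dollars(mse, unit_in_dollars=100_000):
    """Square root of the MSE, converted from $100,000 units to dollars."""
    return float(np.sqrt(mse)) * unit_in_dollars


# Errors are 1 and 2, so the squared errors are 1 and 4 and their mean is 2.5.
print(mean_squared_error_manual([1.0, 2.0], [2.0, 4.0]))  # 2.5
```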

Provide new input values and predict the target variable. Note that the prediction is in units of $100,000, not in dollars.


new_data = np.array([[8.234, 41.0, 6.9841, 1.0238, 322.0, 2.5556, 37.88, -142.23]])

predicted_price = model.predict(new_data)
print(f"Predicted Price: {predicted_price}")
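
The eight values in new_data must appear in the same order as the features listed at the top of the article, because the model only sees positions, not names. One way to avoid ordering mistakes is to build the row from named values, as in this sketch. The helper make_sample is hypothetical.

```python
import numpy as np

# Feature order as listed in the article (and used by the dataset).
FEATURE_ORDER = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]


def make_sample(values_by_name):
    """Build a (1, 8) input row in the expected feature order."""
    missing = [name for name in FEATURE_ORDER if name not in values_by_name]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    return np.array([[float(values_by_name[name]) for name in FEATURE_ORDER]])


# Same values as new_data above, given by name in arbitrary order.
sample = make_sample({
    "Longitude": -142.23, "Latitude": 37.88, "MedInc": 8.234,
    "HouseAge": 41.0, "AveRooms": 6.9841, "AveBedrms": 1.0238,
    "Population": 322.0, "AveOccup": 2.5556,
})
print(sample.shape)  # (1, 8)
```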

Complete Code for Predicting House Prices with Machine Learning :


import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
california = fetch_california_housing()

# Split the data into input features and target variable
X = california.data  # Input features
y = california.target  # Target variable (median house values)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Predict the target for a new sample (the result is in units of $100,000)
new_data = np.array([[8.234, 23.0, 6.9841, 1.0238, 322.0, 2.5556, 37.88, -222.23]])

predicted_price = model.predict(new_data)
print(f"Predicted Price: {predicted_price}")

Output 1 :


Input : new_data = np.array([[8.234, 23.0, 6.9841, 1.0238, 322.0, 2.5556, 37.88, -222.23]])
Mean Squared Error: 0.5558915986952425
Predicted Price: [47.30678911]

Output 2:


Input Data : new_data = np.array([[8.234, 41.0, 6.9841, 1.0238, 322.0, 2.5556, 37.88, -142.23]])
Predicted Price: [12.78518055]
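
The longitudes used in the two sample inputs (-142.23 and -222.23) fall outside California, so the linear model is extrapolating far beyond its training data, which helps explain the very large predicted values. A simple range check can flag such inputs before predicting. The sketch below uses approximate bounds that are assumptions for illustration, not values read from the dataset.

```python
# Approximate bounding box for California (assumed, rounded values).
CA_LATITUDE_RANGE = (32.5, 42.0)
CA_LONGITUDE_RANGE = (-124.5, -114.0)


def is_within_california(latitude, longitude):
    """Return True if the coordinates fall inside the approximate bounding box."""
    lat_ok = CA_LATITUDE_RANGE[0] <= latitude <= CA_LATITUDE_RANGE[1]
    lon_ok = CA_LONGITUDE_RANGE[0] <= longitude <= CA_LONGITUDE_RANGE[1]
    return lat_ok and lon_ok


print(is_within_california(37.88, -142.23))  # False
print(is_within_california(37.88, -122.23))  # True
```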

Conclusion :

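The Python program above predicts housing prices using the California housing dataset as input. The only modules it requires are sklearn and numpy.

The program loads the data from sklearn by importing fetch_california_housing, takes the input features as X and the value to be predicted as y, splits X and y into training and test sets, fits a LinearRegression model, predicts on the test set, and then calculates and prints the mean squared error.

Finally, it takes a new input sample and predicts the target variable MedHouseVal (median house value), which is expressed in units of $100,000 rather than in dollars. Both sample inputs above use longitudes (-142.23 and -222.23) that lie well outside California (roughly -124 to -114), so the model is extrapolating, and the resulting predictions of about 12.8 and 47.3 (about $1.28 million and $4.73 million) should not be read as realistic prices.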