ESTIMATING THE PRICE OF HOUSES USING MACHINE LEARNING
ABSTRACT
This paper explores how house prices in five different counties are affected by
housing characteristics, both internal (such as the number of bathrooms and
bedrooms) and external (such as public schools’ scores or the walkability
score of the neighborhood). Using data on sold houses listed on Zillow,
Trulia and Redfin, three prominent housing websites, this paper utilizes both
the hedonic pricing model (Linear Regression) and various machine learning
algorithms, such as Random Forest (RF) and Support Vector Regression (SVR), to
predict house prices. The models’ prediction scores, as well as the ratio of
overestimated houses to underestimated houses, are compared against Zillow’s price estimation
scores and ratio. Results show that SVR gives a better price prediction score
than Zillow’s baseline on the same dataset for Hunt County (TX), and RF gives prediction scores
close to or equal to the baseline on three other counties.
Moreover, this paper’s models reduce the ratio of overestimated to underestimated houses from 3:2
under Zillow’s estimation to 1:1. This paper also identifies the four
most important attributes in housing price prediction across the counties as
assessment, comparable houses’ sold price, listed price and number of bathrooms.
INTRODUCTION
According to the US
Census Bureau, 560,000 houses were sold in the United States in 2016 [11]. In
addition, 65% of all American families owned houses in 2016 [12]. For the
Americans who sold and bought these houses, a good housing price prediction
would better prepare them for what to expect before they make one of the most
important financial decisions in their lives. A recent report from the Zillow
Group, a popular housing database website, indicates that house sellers and
buyers are increasingly turning to online research in order to estimate house
price before contacting real estate agents [4]. Researching on your own how much a house
you are interested in is worth can be difficult for multiple
reasons. One particular reason is that there are many factors that influence the
potential price of a house, making it harder for an individual to
decide how much a house is worth without external help. This can
lead to people making poorly informed decisions about whether to buy or sell
their houses and which prices are reasonable. Because houses are long-term
investments, it is imperative that people make their decisions with the most
accurate information possible. Housing websites such as Zillow,
Trulia and Redfin therefore exist to provide estimates of housing valuations based
on the houses’ characteristics, at no cost. However, the estimates provided by
these housing websites are not always accurate. For example, Zillow states that
its housing price prediction algorithm, called “Zestimate”, estimates only
54.4% of houses within 5% of their actual sale prices [22]. For Trulia,
only 48.2% of houses have Trulia-estimated prices within 5% of
their actual sold prices [20]. Therefore, the first question of this project is
whether I can outperform Zestimate’s prediction score or come close to it. In this
project, I define the prediction score as the percentage of houses whose
estimated prices fall within 5% of their actual sold prices. Using
this project’s datasets and Zestimates as the predictions, I compute Zillow’s prediction scores
and use them as the baselines to see how well my own models perform. I chose
Zillow’s estimator as a benchmark instead of its competitors’ because Zillow is widely regarded as the most
popular housing website, owing to its large database of 110 million houses and
its 11 years of experience in price estimation. According to Hitwise, a
consumer analytics company, Zillow’s market share, based on online visits to the
site, was 27.2% in 2016, while the figures for Trulia and Redfin were 9.4% and
3.7%, respectively. Zillow tends to overestimate the value of its listed properties,
meaning the Zestimates are higher than the actual sold prices of the houses. In
the dataset of 1,457 sold houses I collected, the ratio of overestimated houses
to underestimated houses is 3 to 2. Hollas, Rutherford and Thomson (2010)
study Zillow’s estimations of single-family houses and find that 80% of their
housing sample gathered from Zillow is overpriced by Zestimate [8]. A
house seller who prices a house based on Zillow’s suggestion
is therefore likely to list it for more than it is worth. According to
Zillow research from 2016, if a house is priced above its true market valuation,
it tends to stay on the market five times longer than a house that is
well priced, suggesting a strong penalty for overpricing houses [19]. Moreover,
the same research suggests that houses that have been on the market for two
months can lose 5% of their original listed price. Asabere and Huffman (1993)
also support the theory of a negative correlation between a house’s time on the market
and its final sold price [1]. Therefore, the second question of this project is
whether my models can eliminate this overestimation problem. The final
question of this project is what the most important factors affecting housing
prices are. In order to answer the three questions listed above, this project
proposes using both the hedonic pricing model and various machine learning
algorithms.
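The prediction score and the overestimated-to-underestimated ratio defined above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names and the five example prices are hypothetical, and the 5% tolerance follows the definition given in this section.

```python
def prediction_score(estimates, sold_prices, tolerance=0.05):
    """Percentage of houses whose estimate is within `tolerance` of the sold price."""
    hits = sum(
        abs(est - sold) <= tolerance * sold
        for est, sold in zip(estimates, sold_prices)
    )
    return 100.0 * hits / len(sold_prices)


def over_under_counts(estimates, sold_prices):
    """Count houses estimated above and below their sold price."""
    over = sum(est > sold for est, sold in zip(estimates, sold_prices))
    under = sum(est < sold for est, sold in zip(estimates, sold_prices))
    return over, under


# Hypothetical prices in dollars, for illustration only.
sold = [200_000, 310_000, 150_000, 420_000, 275_000]
estimated = [206_000, 350_000, 148_000, 400_000, 300_000]

score = prediction_score(estimated, sold)
over, under = over_under_counts(estimated, sold)
print(f"prediction score: {score:.1f}%")
print(f"overestimated: {over}, underestimated: {under}")
```

The same two functions can be applied to Zestimates to obtain the baseline, and to a model's predictions to obtain the figures being compared against it.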
EXISTING SYSTEM
In the past,
houses were sold through third-party agents, who take
a share of the money themselves; this can lead to fraud and to owners not receiving
a fair price. To avoid such a system, we propose a system that uses
machine learning algorithms to predict the price of a house, so that
the house can be sold more easily and without
third-party agents.
ALGORITHMS
LINEAR REGRESSION
Simple linear regression is useful for
finding the relationship between two continuous variables. One is the predictor or
independent variable, and the other is the response or dependent variable. It looks for
a statistical relationship rather than a deterministic one. The relationship
between two variables is said to be deterministic if one variable can be
expressed exactly in terms of the other. For example, given a temperature in degrees
Celsius, it is possible to compute the temperature in Fahrenheit exactly. A statistical
relationship does not determine one variable exactly from the other.
An example is the relationship between height and weight.
The core idea is to obtain a line that
best fits the data. The best-fit line is the one for which the total prediction
error (over all data points) is as small as possible. The error is the vertical distance from
a data point to the regression line.
Real-world example
We have a dataset that contains
information about the relationship between ‘number of hours studied’ and ‘marks obtained’. Many students have been observed, and their hours of study
and marks have been recorded. This will be our training data. The goal is to design a
model that can predict marks given the number of hours studied. Using the
training data, a regression line is obtained that gives the minimum error. This
linear equation is then used for any new data. That is, if we give the number of
hours studied by a student as an input, our model should predict their marks
with minimum error.
Y(pred) = b0 + b1*x
The values b0 and b1 must be chosen so
that they minimize the error. If the sum of squared errors is taken as the metric to
evaluate the model, then the goal is to obtain the line that minimizes this sum.

Figure 2: Error Calculation
If we don’t square the errors, then positive and
negative errors will cancel each other out.
For a model with one predictor, the intercept and the coefficient are calculated as shown in Figures 3 and 4.
Figure 3: Intercept Calculation

Figure 4: Co-efficient Formula
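The images for Figures 2 to 4 are not reproduced here. Assuming they showed the standard least-squares expressions for the line Y(pred) = b0 + b1*x, these can be written as follows, where n is the number of data points and \bar{x}, \bar{y} are the sample means of x and y (the symbols n, \bar{x} and \bar{y} are notation introduced for this sketch):

```latex
\begin{align}
\text{Error} &= \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
              = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2 \\
b_0 &= \bar{y} - b_1 \bar{x} \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
            {\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align}
```

Both expressions follow from setting the partial derivatives of the squared error with respect to b0 and b1 to zero.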
Exploring ‘b1’
· If b1 > 0, then x (predictor) and y (target) have a positive
relationship. That is, an increase in x will increase y.
· If b1 < 0, then x (predictor) and y (target) have a negative
relationship. That is, an increase in x will decrease y.
Exploring ‘b0’
· If the data do not include x = 0, then the prediction at x = 0, which is
b0 alone, is meaningless. For example, suppose we have a dataset that relates
height (x) and weight (y). Taking x = 0 (that is, a height of 0) leaves
only the b0 value, which is meaningless because in reality height and
weight can never be zero. This results from applying the model
beyond the range of its data.
· If the data include x = 0, then ‘b0’ is the average predicted value when x = 0. However,
setting all the predictor variables to zero is often impossible.
· The b0 term guarantees that the residuals have mean zero. If
there is no ‘b0’ term, the regression line is forced to pass through the
origin, and both the regression coefficient and the predictions will be biased.
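The roles of b0 and b1 can be checked numerically. The sketch below fits Y(pred) = b0 + b1*x with the standard least-squares formulas (slope as the covariance of x and y over the variance of x, intercept from the sample means) and then inspects the sign of b1 and the mean of the residuals. The hours and marks values are hypothetical, and the function name is chosen only for this example.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors for y = b0 + b1*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: forces the line through the point (x_mean, y_mean).
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical training data: hours studied and marks obtained.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 68, 74]

b0, b1 = fit_simple_linear_regression(hours, marks)
residuals = [y - (b0 + b1 * x) for x, y in zip(hours, marks)]

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"mean residual = {sum(residuals) / len(residuals):.6f}")
print(f"predicted marks for 6 hours = {b0 + b1 * 6:.2f}")
```

With these values b1 is positive, so more hours studied predicts higher marks, and the residuals average to zero up to floating-point rounding because the model includes the b0 term.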
Coefficients from the normal equation
Apart from the above equations, the coefficients
of the model can also be calculated from the normal equation.
Figure 5: Co-efficient calculation using Normal Equation
Theta contains the coefficients of all
predictors, including the constant term ‘b0’. The normal equation requires inverting a
matrix built from the input matrix. The cost of this computation increases as the number of
features increases, and it becomes very slow when the number of features grows large.
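A minimal sketch of the normal equation, assuming it takes the standard form theta = (X^T X)^(-1) X^T y, where X is the input matrix with a leading column of ones for the constant term. The example uses only the Python standard library and solves the linear system (X^T X) theta = X^T y by Gaussian elimination instead of forming the inverse; in practice a numerical library would normally do this step. The data and function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ theta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so the input lists are not modified.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    theta = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * theta[k] for k in range(row + 1, n))
        theta[row] = (aug[row][n] - known) / aug[row][row]
    return theta


def normal_equation(features, targets):
    """Return theta solving (X^T X) theta = X^T y, with a column of ones for b0."""
    x = [[1.0] + list(row) for row in features]
    cols = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(cols)] for i in range(cols)]
    xty = [sum(r[i] * y for r, y in zip(x, targets)) for i in range(cols)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise).
features = [[1, 1], [2, 1], [3, 2], [4, 3], [5, 5]]
targets = [1 + 2 * a + 3 * b for a, b in features]

theta = normal_equation(features, targets)
print([round(t, 6) for t in theta])
```

The matrix X^T X has one row and one column per coefficient, which is why the cost grows quickly with the number of features. The sketch does not handle a singular X^T X, which occurs when predictors are linearly dependent.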
SUPPORT VECTOR REGRESSION
Those who work in Machine Learning or Data Science are usually
familiar with the term SVM, or Support Vector Machine. SVR is a bit
different from SVM. As the name suggests, SVR is a regression algorithm, so we can use it to predict
continuous values, whereas SVM is used for classification.
The terms that we are going to use
frequently in this section are:
1. Kernel: The function used to map lower-dimensional data into a
higher-dimensional space.
2. Hyperplane: In SVM this is the separating line between the
data classes. In SVR we define it as the line that
helps us predict the continuous target value.
3. Boundary line: In SVM there are two lines other than the hyperplane, which
create a margin. The support vectors can be on the
boundary lines or outside them. In SVM these boundary lines separate the two classes. In
SVR the concept is the same.
4. Support vectors: These are the data points closest to the boundary,
that is, the points whose distance to it is smallest.
“Support Vector Machine” (SVM) is a
supervised machine learning algorithm that can be used for both classification
and regression problems. However, it is mostly used for classification.
In this algorithm, we plot each data item as a point in n-dimensional space
(where n is the number of features), with the value of each feature being
the value of a particular coordinate. Then, we perform classification by
finding the hyperplane that best separates the two classes. The SVM algorithm is implemented in practice using a
kernel. The learning of the hyperplane in linear SVM is done by transforming
the problem using some linear algebra, which is beyond the scope of this
introduction to SVM. A powerful insight is that the linear SVM can be rephrased
using the inner product of any two given observations, rather than the
observations themselves. The inner product of two vectors is the sum of
the products of each pair of input values. For example, the inner product
of the vectors [2, 3] and [5, 6] is 2*5 + 3*6, or 28. The equation for making a
prediction for a new input using the inner product between the input (x) and each
support vector (xi) is:
f(x) = B0 + sum(ai * (x,xi))
Here (x,xi) denotes the inner product of x and xi, and the coefficients B0 and ai are estimated from the training data.
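The inner product and the prediction equation above can be written directly in Python. This is a sketch of the formula only, not of how the coefficients are learned: the support vectors, the coefficients ai and the value of B0 below are hypothetical placeholders.

```python
def inner_product(u, v):
    """Sum of the products of each pair of input values."""
    return sum(a * b for a, b in zip(u, v))


def predict(x, support_vectors, coefficients, b0):
    """Evaluate f(x) = B0 + sum(ai * <x, xi>) over the support vectors."""
    return b0 + sum(
        a * inner_product(x, xi) for a, xi in zip(coefficients, support_vectors)
    )


print(inner_product([2, 3], [5, 6]))  # 28, matching the example in the text

# Hypothetical support vectors and coefficients; in practice these are
# learned from the training data by the optimization procedure.
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
coefficients = [0.5, -0.25]
b0 = 2.0

value = predict([2.0, 4.0], support_vectors, coefficients, b0)
print(value)
```

Replacing `inner_product` with another kernel function changes the model from linear to non-linear while leaving the prediction formula in the same form.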
DECISION
TREE
A tree
has many analogies in real life, and it turns out that it has influenced a wide
area of machine learning, covering both classification and regression. In
decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making. As the name suggests, it uses a tree-like
model of decisions. Though it is a commonly used tool in data mining for deriving a
strategy to reach a particular goal, it is also widely used in machine learning.
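As an illustration of the tree-like model of decisions, the sketch below represents a small regression tree as nested dictionaries and predicts a price by following the decisions from the root to a leaf. The tree is written by hand rather than learned from data, and every feature name, threshold and price in it is a hypothetical value.

```python
# A hand-written regression tree. Internal nodes test one feature against a
# threshold; leaves hold a predicted price. All features, thresholds and
# prices are hypothetical values chosen for illustration.
tree = {
    "feature": "bathrooms",
    "threshold": 2,
    "left": {  # bathrooms <= 2
        "feature": "school_score",
        "threshold": 7,
        "left": {"value": 150_000},   # school_score <= 7
        "right": {"value": 190_000},  # school_score > 7
    },
    "right": {"value": 320_000},  # bathrooms > 2
}


def predict(node, house):
    """Follow the decisions from the root until a leaf is reached."""
    while "value" not in node:
        branch = "left" if house[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["value"]


print(predict(tree, {"bathrooms": 1, "school_score": 9}))
print(predict(tree, {"bathrooms": 3, "school_score": 5}))
```

A learning algorithm such as CART chooses the features and thresholds automatically, typically by picking at each node the split that most reduces the prediction error.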
RANDOM FOREST
With the increase in
computational power, we can now choose algorithms that perform very
intensive calculations. One such algorithm is “Random Forest”. Random
forest is like a bootstrapping algorithm applied to the decision tree (CART) model.
Say we have 1000 observations in the complete population, with 10 variables.
Random forest tries to build multiple CART models with different samples and
different initial variables. For instance, it will take a random sample of 100
observations and 5 randomly chosen initial variables to build a CART model. It
will repeat the process (say) 10 times and then make a final prediction for each
observation. The final prediction is a function of the individual predictions; it
can simply be their mean.
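The procedure described above can be sketched with the standard library alone. To keep the example short, each tree here is a depth-1 regression tree (a single split) rather than a full CART model, and the sample is drawn with replacement, which is one common reading of "bootstrapping". The data, seeds and function names are hypothetical; the sizes (1000 observations, 10 variables, samples of 100, 5 variables per tree, 10 trees) follow the text.

```python
import random


def fit_stump(rows, targets, feature_ids):
    """Fit a depth-1 regression tree using only the given feature indices."""
    overall_mean = sum(targets) / len(targets)
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = sum((t - left_mean) ** 2 for t in left)
            sse += sum((t - right_mean) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:  # no valid split: always predict the sample mean
        return lambda row: overall_mean
    _, f, threshold, left_mean, right_mean = best
    return lambda row: left_mean if row[f] <= threshold else right_mean


def fit_forest(rows, targets, n_trees=10, sample_size=100, n_features=5, seed=0):
    """Train each tree on a bootstrap sample and a random subset of features."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in range(sample_size)]
        feature_ids = rng.sample(range(len(rows[0])), n_features)
        trees.append(
            fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feature_ids)
        )
    return trees


def predict_forest(trees, row):
    """Final prediction: the mean of the individual tree predictions."""
    return sum(tree(row) for tree in trees) / len(trees)


# Hypothetical data: 1000 observations, 10 variables; the target depends
# on the first variable only, plus a little noise.
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(10)] for _ in range(1000)]
targets = [100.0 * row[0] + data_rng.gauss(0, 1) for row in rows]

forest = fit_forest(rows, targets)
low = predict_forest(forest, [0.1] + [0.5] * 9)
high = predict_forest(forest, [0.9] + [0.5] * 9)
print(f"prediction at x0=0.1: {low:.1f}, at x0=0.9: {high:.1f}")
```

Trees that were not given the first variable cannot use it, so they contribute the same value to both predictions; averaging over many trees is what lets the informative variable show through.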