ESTIMATING THE PRICE OF HOUSES USING MACHINE LEARNING


ABSTRACT
This paper explores how house prices in five different counties are affected by housing characteristics, both internal (such as the number of bathrooms and bedrooms) and external (such as public school scores or the walkability score of the neighborhood). Using data from sold houses listed on Zillow, Trulia and Redfin, three prominent housing websites, this paper uses both the hedonic pricing model (Linear Regression) and various machine learning algorithms, such as Random Forest (RF) and Support Vector Regression (SVR), to predict house prices. The models' prediction scores, as well as the ratio of overestimated houses to underestimated houses, are compared against Zillow's price estimation scores and ratio. Results show that SVR gives a better price prediction score than the Zillow baseline on the same dataset for Hunt County (TX), and RF gives prediction scores close or equal to the baseline in three other counties. Moreover, this paper's models reduce the ratio of overestimated to underestimated houses from 3:2 under Zillow's estimation to 1:1. This paper also identifies the four most important attributes in housing price prediction across the counties: assessment, sold price of comparable houses, listed price and number of bathrooms.

INTRODUCTION
According to the US Census Bureau, 560,000 houses were sold in the United States in 2016 [11]. In addition, 65% of all American families owned houses in 2016 [12]. For the Americans who sold and bought these houses, a good housing price prediction would better prepare them for what to expect before they make one of the most important financial decisions in their lives. A recent report from the Zillow Group, a popular housing database website, indicates that house sellers and buyers are increasingly turning to online research to estimate house prices before contacting real estate agents [4]. Researching on your own how much a house is worth can be difficult for multiple reasons. One particular reason is that there are many factors that influence the potential price of a house, making it hard for an individual to decide how much a house is worth without external help. This can lead to people making poorly informed decisions about whether to buy or sell their houses and which prices are reasonable. Because houses are long-term investments, it is imperative that people make their decisions with the most accurate information possible. Therefore, housing websites such as Zillow, Trulia and Redfin exist to provide estimates of housing valuations based on the houses' characteristics, at no cost. However, the estimates provided by these housing websites are not always accurate. For example, Zillow states that its housing price prediction algorithm, called Zestimate, only estimates 54.4% of houses within 5% of their actual sale prices [22]. For Trulia, only 48.2% of houses have Trulia-estimated prices within 5% of their actual sold prices [20]. Therefore, the first question of this project is whether I can outperform Zestimate's prediction score or come close to it.
In this project, I define the prediction score as the percentage of houses whose estimated prices fall within 5% of their actual sold prices. Using this project's datasets and Zestimates as the predictions, I compute Zillow's prediction scores and use them as the baselines to see how well my own models perform. I chose Zillow's estimator as a benchmark instead of its competitors because Zillow is widely regarded as the most popular housing website due to its large database of 110 million houses and its 11 years of expertise in pricing estimation. According to Hitwise, a consumer analytics company, Zillow's market share, based on online visits to the site, was 27.2% in 2016, while the numbers for Trulia and Redfin were 9.4% and 3.7%, respectively. Zillow tends to overestimate its listed properties, meaning the Zestimates are higher than the actual sold prices of the houses. In the dataset of 1,457 sold houses I collected, the ratio of overestimated houses to underestimated houses is 3 to 2. Hollas, Rutherford and Thomson (2010) study Zillow's estimations of single-family houses and find that 80% of their housing sample gathered from Zillow is overpriced by Zestimate [8]. A house seller who prices a house based on Zillow's suggestion is therefore likely to list it for more than it is worth. According to Zillow research from 2016, a house priced above its true market valuation tends to stay on the market five times longer than a house that is well priced, suggesting a strong penalty for overpricing houses [19]. Moreover, the same research suggests that houses that have been on the market for two months can lose 5% of their original listed price. Asabere and Huffman (1993) also support the theory of an inverse correlation between a house's time on the market and its final sold price [1]. Therefore, the second question of this project is whether my models can get rid of this overestimation problem.
The final question of this project is what the most important factors affecting housing prices are. In order to answer the three questions listed above, this project proposes using both the hedonic pricing model and various machine learning algorithms.
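The prediction score and the overestimation ratio defined in this introduction can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the toy prices are hypothetical and are not taken from the project's dataset or code.

```python
def prediction_score(estimates, sold_prices, tolerance=0.05):
    """Percentage of houses whose estimate is within `tolerance` of the sold price."""
    hits = sum(
        1
        for est, sold in zip(estimates, sold_prices)
        if abs(est - sold) <= tolerance * sold
    )
    return 100.0 * hits / len(sold_prices)


def over_under_counts(estimates, sold_prices):
    """Count estimates above and below the actual sold price."""
    over = sum(1 for est, sold in zip(estimates, sold_prices) if est > sold)
    under = sum(1 for est, sold in zip(estimates, sold_prices) if est < sold)
    return over, under


# Hypothetical toy data, not from the paper's dataset.
sold = [200_000, 150_000, 320_000, 275_000, 410_000]
estimates = [206_000, 165_000, 318_000, 300_000, 395_000]

score = prediction_score(estimates, sold)
over, under = over_under_counts(estimates, sold)
print(f"prediction score: {score:.1f}%")
print(f"overestimated: {over}, underestimated: {under}")
```

On this toy data, three of the five estimates fall within 5% of the sold price, and the overestimated-to-underestimated ratio is 3 to 2.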
EXISTING SYSTEM
In the past, houses were sold through third-party agents, who take a share of the sale price. This can lead to fraud and to owners not receiving a fair amount for their houses. To avoid this, we propose a system that uses machine learning algorithms to predict the price of a house, so that the house can be sold more easily without third-party agents.
ALGORITHMS
LINEAR REGRESSION
Simple linear regression is useful for finding the relationship between two continuous variables. One is the predictor, or independent variable, and the other is the response, or dependent variable. It looks for a statistical relationship rather than a deterministic one. The relationship between two variables is said to be deterministic if one variable can be exactly expressed by the other. For example, given a temperature in degrees Celsius, it is possible to compute the temperature in Fahrenheit exactly. A statistical relationship does not determine one variable exactly from the other; the relationship between height and weight is an example.
The core idea is to obtain a line that best fits the data. The best-fit line is the one for which the total prediction error (over all data points) is as small as possible. The error is the vertical distance between a data point and the regression line.
Real-world example
We have a dataset that relates the number of hours studied to the marks obtained. Many students have been observed, and their hours of study and marks have been recorded. This will be our training data. The goal is to design a model that can predict marks given the number of hours studied. Using the training data, we obtain the regression line that gives the minimum error. This linear equation is then used for any new data. That is, if we give the number of hours studied by a student as an input, our model should predict their mark with minimum error.
Y(pred) = b0 + b1*x
The values b0 and b1 must be chosen so that they minimize the error. If the sum of squared errors is taken as the metric to evaluate the model, then the goal is to obtain the line that minimizes it.
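Error = sum((actual_output - predicted_output)^2)
Figure 2: Error Calculation
If we don't square the errors, positive and negative errors will cancel each other out.
For a model with one predictor,
b0 = mean(y) - b1 * mean(x)
Figure 3: Intercept Calculation
b1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))^2)
Figure 4: Co-efficient Formula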
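The closed-form least-squares intercept and coefficient for a single predictor can be sketched as follows. The hours and marks below are hypothetical values chosen to mirror the example above, and the function names are illustrative only.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) minimizing the sum of squared errors for y ~ b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # b1 = sum((xi - mean_x) * (yi - mean_y)) / sum((xi - mean_x)^2)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # b0 = mean_y - b1 * mean_x
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sum_squared_error(x, y, b0, b1):
    """Sum of squared differences between actual and predicted outputs."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical training data: hours studied and marks obtained.
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]

b0, b1 = fit_simple_linear_regression(hours, marks)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"predicted mark for 6 hours: {b0 + b1 * 6:.1f}")
print(f"sum of squared errors: {sum_squared_error(hours, marks, b0, b1):.2f}")
```

Shifting either fitted value slightly and recomputing `sum_squared_error` gives a larger error, which is a quick way to confirm that the fitted line is the minimizer.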
Exploring b1
· If b1 > 0, then x (predictor) and y (target) have a positive relationship. That is, an increase in x will increase y.
· If b1 < 0, then x (predictor) and y (target) have a negative relationship. That is, an increase in x will decrease y.
Exploring b0
· If the data do not include x=0, then the prediction at x=0, which is just b0, is meaningless. For example, suppose we have a dataset that relates height (x) and weight (y). Taking x=0 (that is, a height of 0) leaves only the b0 term in the equation, which is meaningless because in reality height and weight can never be zero. This results from applying the model beyond its scope.
· If the data include x=0, then b0 will be the average of all predicted values when x=0. However, setting all the predictor variables to zero is often impossible.
· The b0 term guarantees that the residuals have mean zero. If there is no b0 term, the regression line is forced to pass through the origin, and both the regression coefficient and the predictions will be biased.
Co-efficient from Normal equations
Apart from the equations above, the coefficients of the model can also be calculated from the normal equation.
theta = (X^T X)^-1 X^T y
Figure 5: Co-efficient calculation using Normal Equation
Theta contains the coefficients of all predictors, including the constant term b0. The normal equation requires inverting the matrix X^T X, which is built from the input matrix X. The cost of this computation increases with the number of features, and it gets very slow when the number of features grows large.
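A minimal sketch of the normal equation in pure Python is given below. Instead of forming the inverse explicitly, it solves the equivalent linear system (X^T X) theta = X^T y by Gaussian elimination; in practice a numerical library would normally be used. The helper names and the toy data (bedrooms, bathrooms, price in thousands) are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def normal_equation(features, targets):
    """Return theta = [b0, b1, ...] solving (X^T X) theta = X^T y."""
    X = [[1.0] + list(row) for row in features]  # leading 1 for the constant term b0
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(col, targets)) for col in Xt]
    return solve(XtX, Xty)


# Hypothetical data generated exactly from price = 50 + 20*bedrooms + 30*bathrooms.
features = [[2, 1], [3, 2], [4, 2], [3, 1], [5, 3]]
targets = [120, 170, 190, 140, 240]

theta = normal_equation(features, targets)
print([round(t, 6) for t in theta])
```

Because the toy targets are an exact linear function of the features, the recovered coefficients match the generating values up to floating-point error.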

SUPPORT VECTOR REGRESSION
Those who work in machine learning or data science are familiar with the term SVM, or Support Vector Machine. SVR is a bit different from SVM. As the name suggests, SVR is a regression algorithm, so we can use it for continuous values, whereas SVM is used for classification.
The terms used frequently in this section are:
1. Kernel: The function used to map lower-dimensional data into a higher-dimensional space.
2. Hyperplane: In SVM this is the separation line between the data classes. In SVR we define it as the line that helps us predict the continuous target value.
3. Boundary lines: In SVM there are two lines, other than the hyperplane, which create a margin. The support vectors can be on the boundary lines or outside them. The margin between these boundary lines separates the two classes. In SVR the concept is the same.
4. Support vectors: These are the data points closest to the boundary, that is, the points whose distance to it is smallest.
“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems. However, it is mostly used for classification. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that best separates the two classes. The SVM algorithm is implemented in practice using a kernel. The learning of the hyperplane in linear SVM is done by transforming the problem using some linear algebra, which is out of the scope of this introduction to SVM. A powerful insight is that the linear SVM can be rephrased using the inner product of any two given observations, rather than the observations themselves. The inner product between two vectors is the sum of the products of each pair of input values. For example, the inner product of the vectors [2, 3] and [5, 6] is 2*5 + 3*6, or 28. The prediction for a new input is calculated from the inner products between the input (x) and each support vector (xi) as follows:

f(x) = B0 + sum(ai * (x,xi))

Here (x,xi) denotes the inner product of x and xi, and the coefficients B0 and ai are estimated from the training data.
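The inner-product form of the prediction can be sketched directly in Python. The support vectors, coefficients and B0 below are hypothetical numbers chosen for illustration; a real model would learn them from training data.

```python
def inner_product(u, v):
    """Sum of the products of each pair of input values."""
    return sum(a * b for a, b in zip(u, v))


def predict(x, support_vectors, coefficients, b0):
    """f(x) = B0 + sum(ai * inner_product(x, xi)) over all support vectors xi."""
    return b0 + sum(
        a * inner_product(x, xi) for a, xi in zip(coefficients, support_vectors)
    )


# The worked example from the text: 2*5 + 3*6.
print(inner_product([2, 3], [5, 6]))

# Hypothetical model parameters, for illustration only.
support_vectors = [[1.0, 2.0], [3.0, 1.0]]
coefficients = [0.5, -0.25]
b0 = 1.0

prediction = predict([2.0, 2.0], support_vectors, coefficients, b0)
print(prediction)
```

Replacing `inner_product` with another kernel function is what lets the same prediction rule work in a higher-dimensional feature space.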
DECISION TREE
A tree has many analogies in real life, and it turns out that it has influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. As the name suggests, it uses a tree-like model of decisions. Though it is a commonly used tool in data mining for deriving a strategy to reach a particular goal, it is also widely used in machine learning.
RANDOM FOREST
With the increase in computational power, we can now choose algorithms that perform very intensive calculations. One such algorithm is “Random Forest”. Random forest is like a bootstrapping algorithm with a decision tree (CART) model. Say we have 1000 observations in the complete population, with 10 variables. Random forest tries to build multiple CART models with different samples and different initial variables. For instance, it will take a random sample of 100 observations and 5 randomly chosen initial variables to build a CART model. It will repeat the process (say) 10 times and then make a final prediction on each observation. The final prediction is a function of the individual predictions; it can simply be their mean.
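The sampling-and-averaging procedure described above can be sketched as follows. To keep the example short, each "tree" is a depth-1 regression tree (a stump) rather than a full CART model, which is a deliberate simplification. The data are synthetic, and all function names are hypothetical.

```python
import random


def fit_stump(rows, targets, feature_ids):
    """Fit a depth-1 regression tree using only the given features."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = sum((t - left_mean) ** 2 for t in left)
            sse += sum((t - right_mean) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:  # no valid split: always predict the sample mean
        mean = sum(targets) / len(targets)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, left_mean, right_mean = stump
    return left_mean if row[f] <= threshold else right_mean


def fit_forest(rows, targets, n_trees=10, sample_size=100, n_features=5, seed=0):
    """Build n_trees stumps, each on a random sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap-style sample: observations drawn with replacement.
        idx = [rng.randrange(len(rows)) for _ in range(sample_size)]
        feats = rng.sample(range(len(rows[0])), n_features)
        sample_rows = [rows[i] for i in idx]
        sample_targets = [targets[i] for i in idx]
        forest.append(fit_stump(sample_rows, sample_targets, feats))
    return forest


def predict_forest(forest, row):
    """Final prediction: the mean of the individual tree predictions."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Synthetic data: 1000 observations, 10 variables; only variable 0 drives the target.
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(10)] for _ in range(1000)]
targets = [100 * row[0] + data_rng.gauss(0, 1) for row in rows]

forest = fit_forest(rows, targets)
low = [0.0] + [0.5] * 9
high = [1.0] + [0.5] * 9
print(predict_forest(forest, low), predict_forest(forest, high))
```

The defaults (10 trees, 100 observations and 5 variables per tree) follow the numbers in the paragraph above; library implementations typically grow deeper trees and choose a fresh random feature subset at every split.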


