House prices challenge

Organized by friend

Previous phase: Development, started Oct. 22, 2017, 6:53 p.m. UTC

Current phase: Final, started April 30, 2018, 6:53 p.m. UTC

Competition ends: Never

House Pricing in King County challenge.

Brought to you by Friend team.

 

This dataset contains house sale prices for King County, which includes Seattle. It covers homes sold between May 2014 and May 2015. It's a great dataset for evaluating simple regression models. The dataset contains 18 house features plus the price (the target), along with 21613 observations.

The aim of this project is to predict house prices, so that the Friend agency can make capital gains by buying undervalued houses and selling them thereafter.

Background

"How much can I sell a house for?"

As a real estate agent, you will try to answer this question by digging into the data you have been given. It is a tricky question: you want to sell at the highest possible price, but you also want the house to sell, and nobody will buy it if it is heavily overpriced. Your real estate agency in King County, a county in the state of Washington in the US, has all the records needed. You will look at the features of many houses and the prices they sold for. Some properties are obviously important: a bigger house will probably sell for more than a smaller one in the same location. But other, less obvious factors may also need to be taken into account when determining the price of a house. We want you to find those hidden factors, which will allow you to estimate the right price of a house precisely, so that we can make capital gains by buying undervalued houses and selling them thereafter.

 

King County real estate market analysis for October 2017

According to a study made in October 2017, the average sale price for King County homes was 673,628 dollars. Five years earlier, the average sale price was 414,403 dollars. That is an increase of about 62.6 percent over five years.
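The percentage can be checked directly from the two averages quoted above. This is a minimal sketch; the function and variable names are illustrative and not part of the challenge materials.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


# Figures quoted in the market analysis above (US dollars).
avg_five_years_earlier = 414_403
avg_october_2017 = 673_628

increase = percent_increase(avg_five_years_earlier, avg_october_2017)
print(f"Increase over 5 years: {increase:.1f}%")
```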

Contact us 

Friend Team: friend@chalearn.org

References and credit:

[1] Kaggle dataset

[2] King County Housing Market Report

The competition protocol was designed by Isabelle Guyon. 
This challenge was generated using ChaLab.

Friend group: Evaluation

The problem is a regression problem. Each sample (a house) is characterized by 18 features. You must predict the house price. For training, you are given a data matrix X_train of dimension 12967 x 18 and an array y_train of 12967 house prices. You must train a model that predicts house prices for two test matrices, X_valid and X_test.
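Before training anything, it can help to confirm the shapes described above and to set up a trivial baseline. The sketch below uses tiny made-up lists in place of the real data (the real X_train would be 12967 x 18). The helper names `check_shapes` and `mean_baseline` are hypothetical and are not part of the starting kit.

```python
from statistics import mean


def check_shapes(X, y, n_features=18):
    """Raise ValueError unless X is n x n_features and y has n entries."""
    if len(X) != len(y):
        raise ValueError(f"{len(X)} rows in X but {len(y)} targets in y")
    for i, row in enumerate(X):
        if len(row) != n_features:
            raise ValueError(
                f"row {i} has {len(row)} features, expected {n_features}"
            )


def mean_baseline(y_train, X_new):
    """Predict the mean training price for every row of X_new."""
    avg = mean(y_train)
    return [avg for _ in X_new]


# Toy stand-ins for the real matrices (all values are invented).
X_train = [[0.0] * 18, [1.0] * 18, [2.0] * 18]
y_train = [300_000.0, 450_000.0, 600_000.0]
X_valid = [[1.5] * 18, [0.5] * 18]

check_shapes(X_train, y_train)
y_valid_pred = mean_baseline(y_train, X_valid)
print(y_valid_pred)
```

A constant predictor like this gives a floor against which real models can be compared.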

Each example is characterized by the following features:


The easiest way to prepare your submission is to use the starting kit.

There are 2 phases:

  • Phase 1: development phase. We provide you with training data that includes the values of the target variable, plus validation and test data without prices. Make predictions for both datasets. However, you will receive feedback on your performance on the validation set only. The performance of your LAST submission will be displayed on the leaderboard.
  • Phase 2: final phase. You do not need to do anything. Your last submission from phase 1 will be forwarded automatically. Your performance on the test set will appear on the leaderboard when the organizers finish checking the submissions.

This sample competition allows you to submit either:

  • Only prediction results (no code).
  • A pre-trained prediction model.
  • A prediction model that must be trained and tested.

The submissions are evaluated using the R-squared metric. 

What Is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is fairly straightforward: it is the percentage of the response variable variation that is explained by a linear model. Or:

R-squared = Explained variation / Total variation

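R-squared lies between 0 and 100% for a least-squares linear model evaluated on its training data (on held-out data it can fall below 0 if the model predicts worse than the mean):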

  • 0% indicates that the model explains none of the variability of the response data around its mean.
  • 100% indicates that the model explains all the variability of the response data around its mean.
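The two extremes above can be reproduced with a few lines of code. The sketch below uses the common form 1 - SS_res / SS_tot, which matches "Explained variation / Total variation" for a least-squares linear fit with an intercept scored on its own training data. Whether the leaderboard uses exactly this implementation is an assumption. The prices are invented, and the third case shows that predictions worse than the mean give a score below 0.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination computed as 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [200.0, 300.0, 400.0, 500.0]  # invented prices, in thousands

# Every prediction is exact: all variation is explained.
perfect = r_squared(y_true, y_true)

# Always predicting the mean (350) explains none of the variation.
constant = r_squared(y_true, [350.0] * 4)

# Predictions in reverse order do worse than the mean.
reversed_order = r_squared(y_true, [500.0, 400.0, 300.0, 200.0])

print(perfect, constant, reversed_order)
```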

In general, the higher the R-squared, the better the model fits your data. However, this guideline comes with important caveats, notably overfitting, discussed below.

Graphical Representation of R-squared

Plotting fitted values by observed values graphically illustrates different R-squared values for regression models.

 The regression model on the left accounts for 38.0% of the variance while the one on the right accounts for 87.4%. The more variance that is accounted for by the regression model the closer the data points will fall to the fitted regression line. Theoretically, if a model could explain 100% of the variance, the fitted values would always equal the observed values and, therefore, all the data points would fall on the fitted regression line.

While building your model, be careful not to overfit. A model that has learned the noise instead of the signal is considered “overfit” because it fits the training dataset but has a poor fit with new datasets.

To avoid overfitting, you can learn about cross-validation and regularization.
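As a rough illustration of how the two ideas combine, the sketch below fits a one-feature ridge regression (an L2 penalty on the slope) and uses k-fold cross-validation to compare penalty strengths. Everything here is an assumption made for illustration: the data are invented, the helper names are hypothetical, and a real entry would use all 18 features and most likely an existing library.

```python
def fit_ridge_1d(xs, ys, lam):
    """Fit y = a + b*x, shrinking the slope b with an L2 penalty lam."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / (sxx + lam)  # lam = 0 gives ordinary least squares
    intercept = y_bar - slope * x_bar
    return intercept, slope


def k_fold_mse(xs, ys, lam, k=4):
    """Average held-out mean squared error over k interleaved folds."""
    fold_errors = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        a, b = fit_ridge_1d(
            [xs[i] for i in train], [ys[i] for i in train], lam
        )
        errs = [(ys[i] - (a + b * xs[i])) ** 2 for i in test]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k


# Invented data: living area (hundreds of sq ft) vs. price (thousands).
area = [8.0, 10.0, 12.0, 15.0, 18.0, 20.0, 24.0, 30.0]
price = [210.0, 260.0, 280.0, 390.0, 410.0, 500.0, 560.0, 720.0]

# Score each candidate penalty on data the model did not see while fitting.
scores = {lam: k_fold_mse(area, price, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
best_lam = min(scores, key=scores.get)
print(scores, best_lam)
```

The penalty with the lowest held-out error is the one to prefer, since error measured on the training data alone tends to reward fitting the noise.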

 

Learn more about R-squared:

[1] Wikipedia

[2] How Do I Interpret R-squared. 

Friend group: Rules

Submissions must be made before the end of phase 1. You may make up to 5 submissions per day and 100 in total.

This challenge is governed by the general ChaLearn contest rules.

This competition is organized solely for test purposes. No prizes will be awarded.

The authors decline responsibility for mistakes, incompleteness or lack of quality of the information provided in the challenge website. The authors are not responsible for any contents linked or referred to from the pages of this site, which are external to this site. The authors intended not to use any copyrighted material or, if not possible, to indicate the copyright of the respective object. The authors intended not to violate any patent rights or, if not possible, to indicate the patents of the respective objects. The payment of royalties or other fees for use of methods, which may be protected by patents, remains the responsibility of the users.

ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". THE ORGANIZERS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT SHALL ISABELLE GUYON AND/OR OTHER ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE THROUGH THIS WEBSITE.

Participation in the organized challenge is non-binding and without obligation. Parts of the pages or the complete publication and information might be extended, changed or partly or completely deleted by the authors without notice.

Development

Start: Oct. 22, 2017, 6:53 p.m.

Description: Development phase: create models and submit them or directly submit results on validation and/or test data; feedback is provided on the validation set only.

Final

Start: April 30, 2018, 6:53 p.m.

Description: Final phase: submissions from the previous phase are automatically cloned and used to compute the final score. The results on the test set will be revealed when the organizers make them available.

Competition Ends

Never

# Username Score
1 auto-sklearn 0.8190
2 reference4 0.7976
3 reference2 0.7970