Air Quality Challenge

Organized by ecolo

Development (previous phase): Oct. 22, 2017, 6:53 p.m. UTC
Final (current phase): April 30, 2018, 6:53 p.m. UTC
Competition ends: Never

Air Quality Challenge

Brought to you by ECOLO TEAM

 

The goal of this challenge is to predict NOx levels in the air in Northern Taiwan, which are an indicator of pollution. The dataset is from Kaggle[1] and was initially provided by the Environmental Protection Administration, Executive Yuan, R.O.C. (Taiwan). Data was collected for several regions in Taiwan over one year at a one-hour sampling rate. For our analysis, we assumed that the features are independent.

A good way of evaluating the pollution rate is with the levels of NOx, a generic term for the nitrogen oxides, namely nitric oxide (NO) and nitrogen dioxide (NO2). These gases contribute to the formation of smog and acid rain, as well as tropospheric ozone.[7]

Background

Pollution, the introduction of waste materials into our environment, has negative effects on the ecosystems we rely on. With modernization and development, pollution has reached very high levels, contributing to global warming and human illness.
In this project, we focus on air pollution in northern Taiwan. Air pollution is one of the most prominent and dangerous forms of pollution. It has many causes, including excessive burning of fuel, road traffic and other industrial activities.

The effects of air pollution are evident. Releasing hazardous gases into the air contributes to global warming and acid rain, which in turn lead to higher temperatures, erratic rainfall and droughts worldwide, making it harder for animals to survive. We also breathe in polluted particles from the air, which results in more cases of asthma and lung cancer.[3]
It is interesting to note that 13% of diagnosed cancers worldwide in 2012 were lung cancers, a significant part being caused by air pollution.[3][4]

Temperature Evolution since 1880[5]

Air pollution in Taiwan is both produced domestically and blown over from China (People's Republic of China). Taiwan's topography has been noted as a contributing factor to its air pollution problem. Taipei, Taiwan's capital and largest city, for example, is surrounded by mountains, as are other industrial centers along the northern and western coasts of Taiwan.[6]

 

NOx represents a family of several compounds. In atmospheric chemistry, the term NOx means the total concentration of NO and NO2. NO2 is not only an important air pollutant by itself; it also reacts in the atmosphere to form ozone (the ozone in the air we breathe, not stratospheric ozone) and acid rain. The European Union annual limit value for NO2 is 40 micrograms per cubic meter; this limit has been exceeded for the last decade.[8]

Contact us 

Ecolo Team : ecolo@chalearn.org

 

References and credit

[1] Kaggle dataset
[2] Pollution types
[3] Research on cancer
[4] How air pollution can cause cancer
[5] Global temperature evolution
[6] Air pollution in Taiwan (Wikipedia)
[7] NOx (Wikipedia)
[8] Why should we control NOx?

The competition protocol was designed by Isabelle Guyon. 
This challenge was generated using ChaLab.

Air Quality: Instructions & Evaluation 

Instructions

This is a regression problem: the goal is to predict NOx levels. The dataset contains about 70,000 examples for each of the training, validation and test sets.

Each example is characterized by the following features:

In order to submit your results, go to "Participate", download the starting kit and follow the instructions. We also provide a submission example.

There are 2 phases:

  • Phase 1: development phase. We provide you with labeled training data and unlabeled validation and test data. Make predictions for both datasets. However, you will receive feedback on your performance on the validation set only. The performance of your LAST submission will be displayed on the leaderboard.
  • Phase 2: final phase. You do not need to do anything. Your last submission of phase 1 will be automatically forwarded. Your performance on the test set will appear on the leaderboard when the organizers finish checking the submissions.

This sample competition allows you to submit either:

  • Only prediction results (no code).
  • A pre-trained prediction model.
  • A prediction model that must be trained and tested.

Evaluation

The metric used for evaluation is R-squared.

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is fairly straightforward: it is the fraction of the variation in the response variable that is explained by the model. Or:

R-squared = Explained variation / Total variation
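As a sketch of how this ratio is usually written, assuming the common residual-based definition (the symbols y_i for observed values, \hat{y}_i for predictions and \bar{y} for the mean of the observations are notation introduced here, and the challenge's scoring program may differ in detail):

```latex
R^2 \;=\; 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}
    \;=\; 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

For a least-squares linear model with an intercept, evaluated on the data it was fitted to, this equals explained variation divided by total variation.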

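R-squared is between 0 and 1 for a least-squares fit evaluated on its own training data (on held-out data, as in this challenge, it can be negative when the predictions are worse than always predicting the mean):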

  • 0 indicates that the model explains none of the variability of the response data around its mean.
  • 1 indicates that the model explains all the variability of the response data around its mean.

In general, the higher the R-squared, the better the model fits your data. 
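The following is a minimal sketch of how such a score can be computed, assuming the usual 1 - SS_res / SS_tot definition. The function name `r_squared` and the sample values are made up for illustration and are not taken from the starting kit.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: variation the predictions fail to explain.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variation of the observations around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical NOx readings, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]

perfect = r_squared(observed, observed)                          # exact predictions
mean_only = r_squared(observed, [25.0] * 4)                      # always predict the mean
worse_than_mean = r_squared(observed, [40.0, 30.0, 20.0, 10.0])  # reversed predictions

print(perfect, mean_only, worse_than_mean)  # prints 1.0 0.0 -3.0
```

The third case shows that predictions on data the model was not fitted to can score below 0.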

Plotting fitted values against observed values illustrates different R-squared values for regression models.

The regression model on the left has an R-squared of 0.38 while the one on the right has an R-squared of 0.87. The more variance the regression model accounts for, the closer the data points fall to the fitted regression line. Theoretically, if a model could explain 100% of the variance, the fitted values would always equal the observed values and, therefore, all the data points would fall on the fitted regression line.

 

However, you need to be careful not to overfit your model. Overfitting occurs when a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem arises when the model is too complex for the amount of data available. One symptom of an overfit model is an R-squared that is very high on the training data but noticeably lower on new data.
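To make this symptom concrete, here is a small sketch on synthetic data (not the challenge dataset). It uses a nearest-neighbour regressor, chosen here only because it is easy to write without external libraries: with k=1 it memorizes the training set, while a larger k averages out some of the noise. All names and parameter values are illustrative assumptions.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return sum(train_y[i] for i in order[:k]) / k


rng = random.Random(0)  # fixed seed so the run is reproducible


def make_data(n):
    """Synthetic data: a linear trend plus Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 3.0) for x in xs]
    return xs, ys


train_x, train_y = make_data(300)
valid_x, valid_y = make_data(300)

scores = {}
for k in (1, 15):
    train_pred = [knn_predict(train_x, train_y, x, k) for x in train_x]
    valid_pred = [knn_predict(train_x, train_y, x, k) for x in valid_x]
    scores[k] = (r_squared(train_y, train_pred), r_squared(valid_y, valid_pred))
    print(f"k={k}: train R2={scores[k][0]:.3f}, validation R2={scores[k][1]:.3f}")
```

With k=1 each training point is its own nearest neighbour, so the training score is perfect, yet the validation score is lower. That gap between training and validation scores is the warning sign to look for.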

 

Left: the blue line underfits the data; the model is too simple or has not been trained enough. Middle: a good model. Right: an overfitted model. While the model on the right follows the training data most closely, it is too dependent on that data and is likely to have a higher error rate on new, unseen data than the one in the middle.

You can reduce the risk of overfitting with cross-validation or regularization.
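As an illustration of the first of these, the sketch below runs k-fold cross-validation on synthetic data: the examples are split into k folds, and each fold is held out once while a model is fitted on the rest. The one-feature least-squares line and the helper names (`fit_line`, `k_fold_r_squared`) are assumptions made for this example, not part of the starting kit.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def k_fold_r_squared(xs, ys, k=5, seed=0):
    """Return one held-out R-squared score per fold."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Every index lands in exactly one fold.
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit on the other k - 1 folds only, then score on the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [a * xs[i] + b for i in held_out]
        scores.append(r_squared([ys[i] for i in held_out], preds))
    return scores


# Synthetic data: a linear trend plus Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(100)]
ys = [3.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]

fold_scores = k_fold_r_squared(xs, ys, k=5)
mean_score = sum(fold_scores) / len(fold_scores)
print([round(s, 3) for s in fold_scores], round(mean_score, 3))
```

Because every score is computed on examples the model did not see during fitting, the mean over folds gives a more honest estimate of performance on new data than the training score does.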

Learn more: Wikipedia on R-squared, R-squared explained, the danger of overfitting regression models, how to avoid overfitting

Air Quality: Rules

Submissions must be made before the end of phase 1. You may make up to 5 submissions per day and 100 in total.

This challenge is governed by the general ChaLearn contest rules.

This competition is organized solely for test purposes. No prizes will be awarded.

The authors decline responsibility for mistakes, incompleteness or lack of quality of the information provided in the challenge website. The authors are not responsible for any contents linked or referred to from the pages of this site, which are external to this site. The authors intended not to use any copyrighted material or, if not possible, to indicate the copyright of the respective object. The authors intended not to violate any patent rights or, if not possible, to indicate the patents of the respective objects. The payment of royalties or other fees for use of methods, which may be protected by patents, remains the responsibility of the users.

ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". THE ORGANIZERS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT SHALL ISABELLE GUYON AND/OR OTHER ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE THROUGH THIS WEBSITE.

Participation in the organized challenge is non-binding and without obligation. Parts of the pages or the complete publication and information might be extended, changed or partly or completely deleted by the authors without notice.

 

Development

Start: Oct. 22, 2017, 6:53 p.m.

Description: Development phase: create models and submit them, or directly submit results on validation and/or test data; feedback is provided on the validation set only.

Final

Start: April 30, 2018, 6:53 p.m.

Description: Final phase: submissions from the previous phase are automatically cloned and used to compute the final score. The results on the test set will be revealed when the organizers make them available.

Competition Ends

Never

# Username Score
1 Zhengying 0.8481
2 friend 0.8239
3 reference4 0.8225