Activity molecule against HIV infection

Organized by blue-chalearn.org

Overview

The objective is to predict which compounds are active against HIV, the virus that causes AIDS. The dataset has two classes, active and inactive, so this is a binary classification problem. The variables represent properties of the molecule inferred from its structure.

The problem is therefore to relate structure to activity (a QSAR, or quantitative structure-activity relationship, problem) in order to screen new compounds before actually testing them (an HTS, or high-throughput screening, problem).

Background

According to a recent report from UNAIDS, more than 34 million people worldwide are estimated to be living with an HIV-1 infection, and 2.5 million new HIV infections occur every year. Currently, 14.8 million people are eligible for HIV treatment, but only 8 million are receiving it, for reasons that include economic constraints.

HTS is an experimental method used in drug discovery that links the fields of biology and chemistry. It remains a very costly process despite many recent technological advances in biotechnology, which is why applying machine learning methods would be of great benefit to the pharmaceutical industry.

QSAR models are regression or classification models used in the chemical and biological sciences and engineering. Like other regression models, QSAR regression models relate a set of "predictor" variables (X) to the potency of the response variable (Y), while classification QSAR models relate the predictor variables to a categorical value of the response variable [wikipedia]. Machine learning methods such as SVMs and random forests have been applied to bioinformatics problems with some success. As more data become available and more complex problems are tackled, deep learning methods may also become useful for learning representations directly from the molecular structure, extracting better QSAR features in their hidden layers.

When everything else fails, ask for additional domain knowledge…

Contact us:

Vu bach:   backvu@gmail.com

Alvine Lambert:   Lambert.alwine@gmail.com

Amal Targhi:    Targiamal@gmail.com

Abdelhadi Temmar:   abdelhadi.temmar@gmail.com

Divya GRoger:   revorg7@gmail.com

Hafed Rhouma: rhoumahafed83@gmail.com

 

Thanks to Isabelle Guyon for providing us with the dataset.


References

  1. Guyon, I., Saffari, A., Dror, G., & Cawley, G. (2007, August). Agnostic learning vs. prior knowledge challenge. In 2007 International Joint Conference on Neural Networks (pp. 829-834). IEEE.
  2. Guyon, I., Saffari, A., Dror, G., & Cawley, G. (2008). Analysis of the IJCNN 2007 agnostic learning vs. prior knowledge challenge. Neural Networks, 21(2), 544-550.
  3. Guyon, I., Cawley, G., Dror, G., & Saffari, A. (Eds.) (2011, May). Hands-On Pattern Recognition.
  4. Guyon, I. (2006, September). Datasets for the Agnostic Learning vs. Prior Knowledge Competition.
  5. Guyon, I., Alamdari, A. R. S. A., Dror, G., & Buhmann, J. M. (2006, July). Performance prediction challenge. In The 2006 IEEE International Joint Conference on Neural Network Proceedings (pp. 1649-1656). IEEE.
  6. Slides created by the authors to explain the problem better. http://bit.ly/2j0gx9E

 

Now comes the real challenge!

The idea of the challenge is to predict which compounds are active or inactive against HIV (binary classification).

Datasets and instructions

This challenge relies on the HIVA dataset.

This version of the database was prepared by Isabelle Guyon for the WCCI 2006 performance prediction challenge and the IJCNN 2007 agnostic learning vs. prior knowledge challenge. Each file contains several molecule records, and each record consists of a header block (molecule name, dimension, program name, date...), a connection table, a data block (atom and bond blocks) and a terminator line.
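The record layout described above can be walked through with a short sketch. It assumes the raw files follow the MDL SD convention, in which the terminator line is `$$$$`; that terminator, the function name `split_records` and the sample text are our assumptions for illustration, not details confirmed by the challenge material.

```python
def split_records(text, terminator="$$$$"):
    """Split file contents into records, one list of lines per molecule.

    Assumption: each record ends with a line containing only `terminator`
    (the MDL SD convention).
    """
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == terminator:
            if any(l.strip() for l in current):
                records.append(current)
            current = []
        else:
            current.append(line)
    # Tolerate a final record that is missing its terminator line.
    if any(l.strip() for l in current):
        records.append(current)
    return records


# Made-up sample with two records; real records hold full connection tables.
sample = (
    "mol_A\n  prog 2D\n\n(connection table)\n$$$$\n"
    "mol_B\n  prog 2D\n\n(connection table)\n$$$$\n"
)
records = split_records(sample)
print(len(records), [r[0] for r in records])
```

The first line of each record is taken here as the molecule name from the header block. Parsing the atom and bond blocks themselves is better left to a chemistry toolkit.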

The chemical structural data for 42,390 compounds was obtained from the web page and converted to structural features. The 1,617 features selected include:


   - unbranched fragments : 750 features
   - pharmacophores : 495 features
   - branched fragments : 219 features
   - internal fingerprints : 132 features

We provide the dataset in two different formats:

  • "Agnostic" learning track dataset (the one described above): the data are preprocessed into a feature representation as close as possible to the raw data. You will have no knowledge of what the features are, and therefore no opportunity to use knowledge about the task to improve your method. You should use a completely self-contained learning machine and not use the information disclosed to the "prior knowledge" track participants about the nature of the data.
  • "Prior knowledge" track dataset: the data are in their original format and you have access to all the information about what they are. Make use of this information to create learning machines that are smarter than those trained on the agnostic data: better feature extraction, better kernels, etc.

During the challenge, you will only have access to labeled training data and to unlabeled validation and test data. If you want to compare yourself with the other participants, you can make a submission and you will get your score on the validation set. The score on the test set will be displayed only on the day of the challenge.

Evaluation

Submissions will be evaluated using the Balanced Accuracy (BAC).

Understanding BAC

BAC: balanced accuracy, the average of class-wise accuracy for classification problems (or, in the special case of binary classification, the average of sensitivity (true positive rate) and specificity (true negative rate)). For binary classification problems, the class-wise accuracy is the fraction of correct class predictions for each class when the prediction q_i is thresholded at 0.5. For multi-label problems, the class-wise accuracy is averaged over all classes. For multi-class classification problems, the predictions are binarized by selecting the class with the maximum prediction value, argmax_k q_ik, before computing the class-wise accuracy. We normalize the BAC with the formula BAC := (BAC - R)/(1 - R), where R is the expected value of BAC for random predictions (i.e. R = 0.5 for binary classification and R = 1/C for C-class classification problems).
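The binary case of this definition can be sketched in a few lines of Python. This is a minimal illustration, not the platform's scoring program: the function names `balanced_accuracy` and `normalized_bac` are ours, labels are assumed to be 0 (inactive) or 1 (active), and a prediction counts as positive when it is strictly greater than 0.5 (how the scorer treats a value of exactly 0.5 is not stated here).

```python
def balanced_accuracy(y_true, q, threshold=0.5):
    """Average of sensitivity and specificity for binary labels in {0, 1}."""
    tp = tn = pos = neg = 0
    for label, score in zip(y_true, q):
        predicted = 1 if score > threshold else 0
        if label == 1:
            pos += 1
            tp += int(predicted == 1)
        else:
            neg += 1
            tn += int(predicted == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    return 0.5 * (tp / pos + tn / neg)


def normalized_bac(bac, n_classes=2):
    """Rescale BAC so that random guessing scores 0 and a perfect model scores 1."""
    r = 1.0 / n_classes  # expected BAC of random predictions
    return (bac - r) / (1.0 - r)


# Made-up labels and prediction scores, for illustration only.
y_true = [1, 1, 0, 0, 0, 0]
q = [0.9, 0.2, 0.1, 0.4, 0.7, 0.3]
bac = balanced_accuracy(y_true, q)
print(bac, normalized_bac(bac))  # 0.625 0.25
```

Here one of the two actives and three of the four inactives are classified correctly, so the BAC is (0.5 + 0.75) / 2 = 0.625, and the normalized score is (0.625 - 0.5) / 0.5 = 0.25.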

Example of Balanced Accuracy (BAC)

If P >> N, b_acc is good.

During the development period, the ranking is performed according to the validation BAC.
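To see why balanced accuracy matters when one class dominates, here is a small worked example. We assume that P and N denote the numbers of positive and negative examples and that b_acc = (TP/P + TN/N) / 2, in line with the definition given under "Understanding BAC". The counts are made up for illustration.

```python
def accuracy(tp, tn, p, n):
    """Plain accuracy: fraction of all examples classified correctly."""
    return (tp + tn) / (p + n)


def b_acc(tp, tn, p, n):
    """Balanced accuracy: mean of the per-class accuracies."""
    return 0.5 * (tp / p + tn / n)


# Made-up counts: 950 positives, 50 negatives,
# and a trivial classifier that always answers "positive".
p, n = 950, 50
tp, tn = 950, 0
print(accuracy(tp, tn, p, n))  # 0.95
print(b_acc(tp, tn, p, n))     # 0.5
```

Plain accuracy rewards the trivial classifier with 0.95, while balanced accuracy gives it 0.5, which is the chance level for binary classification.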

Submission

To submit your results, go to "Participate". We provide a starting kit, which is an example submission (you have to submit the .zip file). If you want the platform to recompute the predictions when you make a submission, remove all the ".predict" files from the result directory. You can also precompute the results and put them in the result directory (follow the Jupyter notebook ReadMe).
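As a rough sketch of the precomputed-results route, the snippet below writes prediction scores to ".predict" files and bundles them into a .zip. The file names (`hiva_valid.predict`, `hiva_test.predict`), the directory name `res`, the one-score-per-line format and the flat archive layout are all hypothetical; the starting kit and its ReadMe notebook define what the platform actually expects.

```python
import os
import tempfile
import zipfile


def write_predictions(path, scores):
    """Write one prediction score per line (assumed format)."""
    with open(path, "w") as f:
        for s in scores:
            f.write(f"{s:.6f}\n")


def zip_results(result_dir, zip_path):
    """Zip every .predict file found in result_dir into a flat archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(result_dir)):
            if name.endswith(".predict"):
                zf.write(os.path.join(result_dir, name), arcname=name)
    return zip_path


# Hypothetical file names and scores, for illustration only.
workdir = tempfile.mkdtemp()
result_dir = os.path.join(workdir, "res")
os.makedirs(result_dir)
write_predictions(os.path.join(result_dir, "hiva_valid.predict"), [0.12, 0.87])
write_predictions(os.path.join(result_dir, "hiva_test.predict"), [0.45])
archive = zip_results(result_dir, os.path.join(workdir, "submission.zip"))
with zipfile.ZipFile(archive) as zf:
    print(zf.namelist())
```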

Get the starting kit and the data

You can find the two datasets (original and preprocessed) and a starting kit here: Download Here

Good luck!

The main goal of this challenge is to help participants become familiar with machine learning algorithms. No prizes will be awarded.

The authors decline responsibility for mistakes, incompleteness or lack of quality of the information provided in the challenge website. The authors are not responsible for any contents linked or referred to from the pages of this site, which are external to this site. The authors intended not to use any copyrighted material or, if not possible, to indicate the copyright of the respective object. The authors intended not to violate any patent rights or, if not possible, to indicate the patents of the respective objects. The payment of royalties or other fees for use of methods, which may be protected by patents, remains the responsibility of the users.

ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". THE ORGANIZERS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT SHALL ISABELLE GUYON AND/OR OTHER ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE THROUGH THIS WEBSITE.

Participation in the organized challenge is non-binding and without obligation. Parts of the pages, or the complete publication and information, might be extended, changed, or partly or completely deleted by the authors without notice.

Submissions must be made before April 30, 2017. You may make up to 5 submissions per day and 100 in total.

Development

Start: Nov. 26, 2016, 10:40 a.m.

Description: Development phase: create models and submit them, or directly submit results on validation and/or test data; feedback is provided on the validation set only.

Final

Start: April 30, 2017, 10:40 a.m.

Description: Final phase: submissions from the previous phase are automatically cloned and used to compute the final score. The results on the test set will be revealed when the organizers make them available.

Competition Ends

Never
