The objective is to predict which compounds are active against HIV, the virus that causes AIDS. The dataset has two classes: active or inactive (binary classification). The variables represent properties of the molecule inferred from its structure.
The problem is therefore to relate structure to activity (a QSAR = quantitative structure-activity relationship problem) in order to screen new compounds before actually testing them (an HTS = high-throughput screening problem).
According to a recent report from UNAIDS, more than 34 million people worldwide are living with an HIV-1 infection, and 2.5 million new HIV infections occur every year. Currently, 14.8 million people are eligible for HIV treatment, but only 8 million are receiving it, for various reasons including economic ones.
HTS is a method for scientific experimentation used in drug discovery, linking the fields of biology and chemistry. It remains a very costly process despite many recent technological advances in biotechnology, which is why applying machine learning methods would be of great benefit to the pharmaceutical industry. QSAR models are regression or classification models used in the chemical and biological sciences and engineering. Like other regression models, QSAR regression models relate a set of "predictor" variables (X) to the potency of the response variable (Y), while classification QSAR models relate the predictor variables to a categorical value of the response variable [wikipedia]. Machine learning methods such as SVM and Random Forest have been applied to bioinformatics problems with some success. As more data become available and more complex problems are tackled, deep learning methods may also become useful for learning representations directly from the molecular structure, by extracting better QSAR features in their hidden layers.
When everything else fails, ask for additional domain knowledge…
Vu bach: firstname.lastname@example.org
Alvine Lambert: Lambert.email@example.com
Amal Targhi: Targiamal@gmail.com
Abdelhadi Temmar: firstname.lastname@example.org
Divya GRoger: email@example.com
Hafed Rhouma: firstname.lastname@example.org
The goal of the challenge is to predict which compounds are active or inactive against HIV, the virus that causes AIDS (binary classification).
This challenge relies on the HIVA dataset.
This version of the database was prepared by Isabelle Guyon for the WCCI 2006 performance prediction challenge and the IJCNN 2007 agnostic learning vs. prior knowledge challenge. Each file contains several blocks, one per compound. A block consists of a header block (molecule name, dimension, program name, date...), a connection table, a data block (atom and bond blocks) and a terminator line.
Chemical structural data for 42,390 compounds was obtained from the web page and converted to structural features. The 1617 features selected include:
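The record layout described above can be illustrated with a minimal sketch that splits the text of a raw file into per-compound records. It rests on assumptions that the challenge page does not confirm: the terminator line is taken to be `$$$$` (the MDL SD file convention), the first line of each record is taken to be the molecule name, and the function names are hypothetical.

```python
def split_records(text, terminator="$$$$"):
    """Split raw file text into records, one list of lines per compound.

    ASSUMPTION: records end with a line equal to `terminator`
    ("$$$$" is the MDL SD convention, not confirmed by the challenge page).
    """
    records = []
    current = []
    for line in text.splitlines():
        if line.strip() == terminator:
            if current:
                records.append(current)
            current = []
        else:
            current.append(line)
    # Keep a trailing record that lacks a terminator, unless it is blank.
    if any(line.strip() for line in current):
        records.append(current)
    return records


def molecule_names(records):
    """Return the first line of each record.

    ASSUMPTION: the header block starts with the molecule name.
    """
    return [record[0].strip() for record in records]
```

For real work, a cheminformatics toolkit is likely a better choice than hand-written parsing; the sketch only shows how the header, connection table and terminator fit together in one record.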
- unbranched fragments : 750 features
- pharmacophores : 495 features
- branched fragments : 219 features
- internal fingerprints : 132 features
We provide the dataset in two different formats: the original dataset and a preprocessed dataset.
During the challenge, you will only have access to labeled training data and to unlabeled validation and test data. To compare yourself with the other participants, you can make a submission and receive your score on the validation set. The score on the test set will be displayed only on the day of the challenge.
The submission will be evaluated using the Balanced Accuracy (BAC).
BAC: Balanced accuracy is the average of the class-wise accuracies for classification problems; in the special case of binary classification, it is the average of sensitivity (true positive rate) and specificity (true negative rate). For binary classification problems, the class-wise accuracy is the fraction of correct class predictions when the prediction q_i is thresholded at 0.5, computed separately for each class. For multi-label problems, the class-wise accuracy is averaged over all classes. For multi-class classification problems, the predictions are binarized by selecting the class with the maximum prediction value, argmax_k q_ik, before computing the class-wise accuracy. We normalize the BAC with the formula BAC := (BAC - R)/(1 - R), where R is the expected value of BAC for random predictions (i.e. R = 0.5 for binary classification and R = 1/C for C-class classification problems).
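As an illustration of the definition above, here is a minimal sketch of the binary BAC and its normalization in plain Python. It is not the platform's scoring program. It assumes labels encoded as 0 (inactive) and 1 (active) and treats a score strictly greater than 0.5 as a positive prediction; both the encoding and the handling of scores exactly equal to 0.5 are assumptions, and the function names are hypothetical.

```python
def balanced_accuracy(y_true, y_score, threshold=0.5):
    """Average of sensitivity and specificity for binary labels.

    ASSUMPTIONS: labels are 0/1, and score > threshold means class 1.
    """
    tp = tn = fp = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score > threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return 0.5 * (sensitivity + specificity)


def normalized_bac(bac, n_classes=2):
    """Apply BAC := (BAC - R) / (1 - R) with R = 1 / n_classes."""
    r = 1.0 / n_classes
    return (bac - r) / (1.0 - r)


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 0, 0]
    y_score = [0.9, 0.2, 0.1, 0.3, 0.7, 0.4]
    bac = balanced_accuracy(y_true, y_score)
    print(bac, normalized_bac(bac))  # 0.625 0.25
```

In this example the sensitivity is 1/2 and the specificity is 3/4, so the BAC is 0.625 and the normalized BAC is 0.25. Under the normalization, random predictions score about 0 and perfect predictions score 1.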
When the classes are heavily imbalanced (P >> N), balanced accuracy (b_acc) is a good choice of metric. During the development period, the ranking is performed according to the validation BAC.
For more information, click here.
To submit your results, go to "Participate". We provide a starting kit, which is an example submission (you have to submit the .zip file). If you want the platform to recompute the predictions when you make a submission, remove all the ".predict" files from the result directory. You can also precompute the results and put them in the result directory (follow the Jupyter notebook ReadMe).
You can find the two datasets (original dataset and preprocessed dataset) and a starting kit here: Download here.
The main goal of this challenge is to become familiar with machine learning algorithms. No prizes will be awarded.
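The clean-up step described above can be scripted. The sketch below deletes every file ending in ".predict" under a result directory and then packs a submission folder into a .zip archive. The directory layout, the file names and both function names are hypothetical; the starting kit's ReadMe remains the reference for what the archive must contain.

```python
import zipfile
from pathlib import Path


def remove_predictions(result_dir):
    """Delete every '*.predict' file under result_dir; return removed names."""
    removed = []
    # sorted() materializes the listing before any file is deleted.
    for path in sorted(Path(result_dir).rglob("*.predict")):
        path.unlink()
        removed.append(path.name)
    return removed


def zip_directory(source_dir, zip_path):
    """Pack all files under source_dir into zip_path, using relative paths.

    Write the archive outside source_dir so it does not include itself.
    """
    source = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(source).as_posix())
    return zip_path
```

A typical call, with placeholder paths, would be `remove_predictions("starting_kit/res")` followed by `zip_directory("starting_kit", "submission.zip")`.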
The authors decline responsibility for mistakes, incompleteness or lack of quality of the information provided in the challenge website. The authors are not responsible for any contents linked or referred to from the pages of this site, which are external to this site. The authors intended not to use any copyrighted material or, if not possible, to indicate the copyright of the respective object. The authors intended not to violate any patent rights or, if not possible, to indicate the patents of the respective objects. The payment of royalties or other fees for use of methods, which may be protected by patents, remains the responsibility of the users.
ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". THE ORGANIZERS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT SHALL ISABELLE GUYON AND/OR OTHER ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE THROUGH THIS WEBSITE.
Participation in the organized challenge is non-binding and without obligation. Parts of the pages, or the complete publication and information, might be extended, changed, or partly or completely deleted by the authors without notice.
Submissions must be made before April 30, 2017. You may make up to 5 submissions per day and 100 in total.
Start: Nov. 26, 2016, 10:40 a.m.
Description: Development phase: create models and submit them, or directly submit results on validation and/or test data; feedback is provided on the validation set only.
Start: April 30, 2017, 10:40 a.m.
Description: Final phase: submissions from the previous phase are automatically cloned and used to compute the final score. The results on the test set will be revealed when the organizers make them available.