Health Data Challenge

Organized by hadaca


Development Phase
Nov. 15, 2018, midnight UTC


Final Phase
April 30, 2019, midnight UTC


Competition Ends

HADACA : Identifying cancer stages based on tumor heterogeneity


This is a machine learning challenge focused on a classification task that requires substantial preprocessing, including a matrix factorisation. This work was carried out in collaboration with the University of Grenoble. The goal of this challenge is to learn how to reach good classification scores in a high-dimensional and unbalanced context. First, challengers will be asked to find dimension reduction techniques to apply to the data. Second, challengers will have to classify the resulting data, starting from a labeled training set. We provide a dataset containing artificial medical data, generated from an original dataset of real cancer data. This challenge was designed for the project class under the supervision of Isabelle Guyon.

The dataset is a set of patients who have been diagnosed at different stages of cancer. Patients are labelled over ten classes: stage ib, stage ia, stage i, stage iib, stage iv, stage iiia, stage iia, stage iiib, not reported and Nan. Your task is to improve the classification results regarding the stages of those patients. The final result of this project is the identification of 10 different cancer stages among a specific population.

The data is a matrix with (number of patients) rows and (number of features per patient) columns. The features correspond to methylation information related to the medical condition of each patient.

In order to perform the classification task, the original matrix D is factorized into the product of two matrices, A and T, and A is used instead of D for classification. Note that the factorization is essential if you want to be able to run your classification on a laptop. It projects the 5000 original features of the initial matrix D onto n_components dimensions (n_components is a parameter of the TruncatedSVD function), which makes the data used for model training substantially smaller.
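The factorization described above can be sketched as follows. This is a minimal illustration, not the starting kit's code: it uses NumPy's SVD as a stand-in for TruncatedSVD, the helper name `factorize` is hypothetical, and the matrix is synthetic with made-up sizes (40 patients, 500 features) rather than the challenge data.

```python
import numpy as np


def factorize(D, n_components):
    """Return (A, T) with D ~= A @ T, keeping the top n_components
    singular directions of D (a truncated SVD, no centering)."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    A = U[:, :n_components] * s[:n_components]  # (n_patients, n_components)
    T = Vt[:n_components]                       # (n_components, n_features)
    return A, T


# Synthetic stand-in for the methylation matrix (hypothetical sizes).
rng = np.random.default_rng(0)
D = rng.random((40, 3)) @ rng.random((3, 500))  # rank 3 by construction

A, T = factorize(D, n_components=3)
print(A.shape, T.shape)  # A is the reduced matrix used for classification
```

Because the synthetic D has rank 3, three components reconstruct it up to floating-point error. On real data the reconstruction is only approximate, and the choice of n_components trades size against information kept.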

References and credits:
The competition protocol was designed by Isabelle Guyon.

The hadaca team members are
Salim Tabarani, Sid Ali Hamideche, Luc Gibaud, Martin Bauw, Malik Kazi Aoual and Cedric des Lauriers.
The hadaca team was supervised by Magali Richard and Raphael Bacher. They are also the database donors.
The starting kit was adapted from a Jupyter notebook designed by Balazs Kegl for the RAMP platform.
This challenge was generated using Chalab, a competition wizard designed by Laurent Senta.




The problem is a multiclass classification problem. Each sample (a patient) is characterized by methylation information related to the medical condition (5000 features). You must predict the cancer stage (10 categories).

For training, you are given a data matrix X_train of dimension num_training_samples x num_features and an array y_train of labels of dimension num_training_samples. You must train a model which predicts the labels for two test matrices, X_valid and X_test.
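As a rough sketch of this workflow, the example below learns the reduced representation from X_train only, applies the same projection to X_valid, and predicts labels with a nearest-centroid rule. Everything here is an assumption made for illustration: the data is synthetic with two well-separated classes, the sizes are invented, and the nearest-centroid classifier is a simple stand-in rather than the model used in the starting kit. X_test would be handled exactly like X_valid.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, n_components = 50, 5

# Synthetic stand-ins for the challenge arrays (two classes, shifted means).
y_train = np.array([0] * 20 + [1] * 20)
X_train = rng.normal(size=(40, num_features)) + 3.0 * y_train[:, None]
y_valid_true = np.array([0] * 5 + [1] * 5)  # unknown to participants in practice
X_valid = rng.normal(size=(10, num_features)) + 3.0 * y_valid_true[:, None]

# Learn the components T from the training matrix only.
_, _, Vt = np.linalg.svd(X_train, full_matrices=False)
T = Vt[:n_components]


def project(X):
    """Map samples to the reduced space with the training components."""
    return X @ T.T


# Nearest-centroid classifier in the reduced space.
A_train = project(X_train)
classes = np.unique(y_train)
centroids = np.stack([A_train[y_train == c].mean(axis=0) for c in classes])


def predict(X):
    A = project(X)
    dists = np.linalg.norm(A[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


y_valid_pred = predict(X_valid)
print(y_valid_pred)
```

The point of the design is that the projection is fitted once on the training data and reused unchanged for the validation and test matrices, so all three live in the same reduced space.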

There are 2 phases: 

  • Phase 1: development phase. We provide you with labeled training data and unlabeled validation and test data. Make predictions for both datasets. However, you will receive feedback on your performance on the validation set only. The performance of your LAST submission will be displayed on the leaderboard. 
  • Phase 2: final phase. You do not need to do anything. Your last submission of phase 1 will be automatically forwarded. Your performance on the test set will appear on the leaderboard when the organizers finish checking the submissions. 

This sample competition allows you to submit either:

  • Only prediction results (no code). 
  • A pre-trained prediction model. 
  • A prediction model that must be trained and tested. 

The submissions are evaluated using the macro-average precision metric. A macro-average computes the metric independently for each class and then takes the average (hence treating all classes equally), whereas a micro-average aggregates the contributions of all classes to compute the average metric.
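The difference between the two averages shows up clearly on unbalanced data. The following is a minimal pure-Python sketch, not the challenge's scoring program: the function name `precision_scores` is hypothetical, the labels are a toy example, and a class that is never predicted is given precision 0, which is an assumption about how that edge case is scored.

```python
from collections import Counter


def precision_scores(y_true, y_pred):
    """Return (per-class precision, macro-average, micro-average)."""
    labels = sorted(set(y_true) | set(y_pred))
    predicted = Counter(y_pred)  # how often each class was predicted
    true_pos = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    # Assumption: a class that is never predicted gets precision 0.
    per_class = {
        c: (true_pos[c] / predicted[c] if predicted[c] else 0.0)
        for c in labels
    }
    macro = sum(per_class.values()) / len(labels)   # every class weighs the same
    micro = sum(true_pos.values()) / len(y_pred)    # every sample weighs the same
    return per_class, macro, micro


# Toy unbalanced example: the model always predicts the majority class.
y_true = ["stage i"] * 8 + ["stage iv"] * 2
y_pred = ["stage i"] * 10

per_class, macro, micro = precision_scores(y_true, y_pred)
print(per_class, macro, micro)
```

Here the micro-average is 0.8 because most samples belong to the majority class, while the macro-average drops to 0.4 because the ignored minority class counts as much as the majority one. This is why a macro-averaged metric penalizes models that neglect rare stages.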

The evaluation does not involve the matrix factorization. However, the quality of this factorization is important both for obtaining good classification results and for studying biological insights.


Submissions must be made before the end of phase 1. You may make up to 5 submissions per day and 100 in total.

This challenge is governed by the general ChaLearn contest rules.

Development Phase

Start: Nov. 15, 2018, midnight

Description: Development phase: tune your models and submit prediction results, trained model, or untrained model.

Final Phase

Start: April 30, 2019, midnight

Description: Final phase (no submission, your last submission from the previous phase is automatically forwarded).

Competition Ends


# Username Score
1 takfarinas.nait-larbi 1.0000
2 HEALTH 1.0000
3 SaraHammache 1.0000