This is a machine learning challenge focused on a classification task that requires substantial preprocessing, including a matrix factorisation. This work was carried out in collaboration with the University of Grenoble. The goal of this challenge is to learn how to reach good classification scores in a high-dimensional and unbalanced context. First, challengers will be asked to find dimension reduction techniques to apply to the data. Second, challengers will have to classify the resulting data, starting from a labeled training set. We provide a dataset containing artificial medical data, generated from an original dataset of real cancer data. This challenge was designed for the project class under the supervision of Isabelle Guyon.
The dataset is a set of patients who have been diagnosed at different stages of cancer. Patients are labelled over ten classes: stage ib, stage ia, stage i, stage iib, stage iv, stage iiia, stage iia, stage iiib, not reported and Nan. Your task is to improve the classification results regarding the stages of those patients. The final result of this project is the identification of these 10 cancer stage classes among a specific population.
The data is a matrix of (number of patients) rows x (number of features per patient) columns. The features correspond to methylation information related to the medical condition of each patient.
In order to perform the classification task, the original matrix D will be factorized into the product of two matrices A and T. A will be used instead of D for the classification task. Note that the factorization is essential if you want to be able to run your classification on a laptop. This factorization projects the 5000 original features of the initial matrix D onto n_components components (n_components is a parameter of the TruncatedSVD function), which makes the data used for model training substantially smaller.
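The factorization described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's TruncatedSVD is the function referred to. The matrix D here is random synthetic data standing in for the real methylation matrix, and the values of n_components and the number of patients are arbitrary choices made for the example, not settings from the challenge.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Synthetic stand-in for the data matrix D: 200 patients x 5000 features.
# (The real challenge data is not reproduced here.)
D = rng.random((200, 5000))

# Illustrative value only; the best n_components has to be tuned.
n_components = 20
svd = TruncatedSVD(n_components=n_components, random_state=0)

# A holds the reduced representation of each patient (one row per patient).
A = svd.fit_transform(D)

# T holds the components, so that D is approximated by the product A @ T.
T = svd.components_

# Relative reconstruction error: a rough indicator of how much
# information the factorization keeps.
D_approx = A @ T
rel_error = np.linalg.norm(D - D_approx) / np.linalg.norm(D)

print("A shape:", A.shape)
print("T shape:", T.shape)
print("relative reconstruction error:", rel_error)
```

With these assumed sizes, A has 20 columns instead of 5000, which is what makes training on a laptop practical.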
References and credits:
The competition protocol was designed by Isabelle Guyon.
The hadaca team members are Salim Tabarani, Sid Ali Hamideche, Luc Gibaud, Martin Bauw, Malik Kazi Aoual and Cedric des Lauriers.
The hadaca team was supervised by Magali Richard and Raphael Bacher. They are also the database donors.
The starting kit was adapted from a Jupyter notebook designed by Balazs Kegl for the RAMP platform.
This challenge was generated using Chalab, a competition wizard designed by Laurent Senta.
The problem is a multiclass classification problem. Each sample (a patient) is characterized by methylation information related to the medical condition (5000 features). You must predict the cancer stage (10 categories).
For training, you are given a data matrix X_train of dimension num_training_samples x num_features and an array y_train of labels of dimension num_training_samples. You must train a model which predicts the labels for two test matrices, X_valid and X_test.
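The training and prediction workflow could look like the sketch below. It is not the starting kit's code: the arrays are random placeholders with the same roles as X_train, y_train, X_valid and X_test, the number of features is reduced to 500 to keep the example fast, and the choice of LogisticRegression with class_weight="balanced" is only one possible assumption for handling unbalanced classes.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Placeholder data: sizes are assumptions made for this sketch.
num_features = 500  # the challenge uses 5000 features
X_train = rng.random((300, num_features))
y_train = rng.integers(0, 10, size=300)  # 10 classes encoded as 0..9
X_valid = rng.random((50, num_features))
X_test = rng.random((50, num_features))

# Dimension reduction followed by a classifier, chained in one pipeline so
# the same factorization is applied to training, validation and test data.
model = make_pipeline(
    TruncatedSVD(n_components=20, random_state=0),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

model.fit(X_train, y_train)

# One predicted label per row of each test matrix.
y_valid_pred = model.predict(X_valid)
y_test_pred = model.predict(X_test)

print(y_valid_pred.shape, y_test_pred.shape)
```

Fitting the dimension reduction inside the pipeline keeps the validation and test matrices out of the fitting step, so they are only transformed, never learned from.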
There are 2 phases:
This sample competition allows you to submit either:
The submissions are evaluated using the macro-average precision metric. A macro-average computes the metric independently for each class and then takes the average (hence treating all classes equally), whereas a micro-average aggregates the contributions of all classes to compute the average metric.
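The difference between the two averages can be illustrated with a small pure-Python sketch. The function name precision_scores and the toy labels are hypothetical, and two conventions are assumptions of this example rather than documented rules of the challenge: a class that is never predicted gets a precision of 0, and the macro-average is taken over every class seen in either the true or the predicted labels.

```python
from collections import Counter


def precision_scores(y_true, y_pred):
    """Return per-class, macro-average and micro-average precision."""
    classes = sorted(set(y_true) | set(y_pred))
    true_positives = Counter()
    predicted = Counter()
    for t, p in zip(y_true, y_pred):
        predicted[p] += 1
        if t == p:
            true_positives[p] += 1

    # Precision of a class = correct predictions of it / all predictions of it.
    # Assumption: a class that is never predicted scores 0.
    per_class = {
        c: (true_positives[c] / predicted[c] if predicted[c] else 0.0)
        for c in classes
    }

    # Macro: average the per-class scores, so each class weighs the same.
    macro = sum(per_class.values()) / len(classes)

    # Micro: pool all predictions first, so frequent classes dominate.
    micro = sum(true_positives.values()) / sum(predicted.values())
    return per_class, macro, micro


# Toy unbalanced example: 8 patients at "stage i", 2 at "stage iv",
# and a model that always predicts the majority class.
y_true = ["stage i"] * 8 + ["stage iv"] * 2
y_pred = ["stage i"] * 10

per_class, macro, micro = precision_scores(y_true, y_pred)
print(per_class)
print("macro:", macro)
print("micro:", micro)
```

In this toy case the micro-average is 0.8 while the macro-average is only 0.4, because the ignored minority class counts as much as the majority class. This is why a macro-averaged metric penalizes models that neglect rare stages.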
The evaluation does not involve the matrix factorization. However, the quality of this factorization is important both for obtaining good classification results and for gaining biological insights.
Submissions must be made before the end of phase 1. You may make up to 5 submissions per day and 100 in total.
This challenge is governed by the general ChaLearn contest rules.
Start: Nov. 15, 2018, midnight
Description: Development phase: tune your models and submit prediction results, trained model, or untrained model.
Start: April 30, 2019, midnight
Description: Final phase (no submission, your last submission from the previous phase is automatically forwarded).