Leaderboard on Results tab has been DEPRECATED, please view your results on the "Leader Board".
Temporal relational data is very common in industrial machine learning applications, such as online advertising, recommender systems, financial market analysis, medical treatment, fraud detection, etc. With timestamps to indicate the timings of events and multiple related tables to provide different perspectives, such data contains useful information that can be exploited to improve machine learning performance. However, currently, the exploitation of temporal relational data is often carried out by experienced human experts with in-depth domain knowledge in a labor-intensive trial-and-error manner.
In this challenge, participants are invited to develop AutoML solutions to binary classification problems for temporal relational data. The provided datasets are in the form of multiple related tables, with timestamped instances. Five public datasets (without labels in the testing part) are provided to the participants so that they can develop their AutoML solutions. Afterward, solutions will be evaluated with five unseen datasets without human intervention. The results of these five datasets determine the final ranking.
This is the first AutoML competition that focuses on temporal relational data and it will pose new challenges to the participants, as listed below:
- How to automatically generate useful temporal information?
- How to efficiently merge the information provided by multiple related tables?
- How to automatically capture meaningful inter-table interactions?
- How to avoid data leak in an automatic manner, when the data is temporal?
Additionally, participants should also consider:
- How to automatically and efficiently select appropriate machine learning model and hyper-parameters?
- How to make the solution more generic, i.e., how to make it applicable for unseen tasks?
- How to keep the computational and memory cost acceptable?
Visit "Learn the Details - Instructions" and follow the steps to participate.
June 26th, 2019, 11:59 p.m. UTC: End of the Feedback Phase; Beginning of the Check Phase; Codes from Feedback phase is migrated automatically to Check Phase.
July 4th, 2019, 08:00 a.m. UTC: Notification of check results.
July 9th, 2019, 11:59 p.m. UTC: End of the Check Phase; Deadline of code resubmission; Beginning of the AutoML Phase; Organizers start code verification.
July 10th, 2019, 11:59 p.m. UTC: Deadline for submitting the fact sheets.
July 15th, 2019, 11:59 p.m. UTC: End of the AutoML Phase; Beginning of the post-competition process
July 20th, 2019: Announcement of the KDD Cup Winner.
Aug 4th, 2019: Beginning of KDD 2019.
The annual KDD conference is the premier interdisciplinary conference bringing together researchers and practitioners from data science, data mining, knowledge discovery, large-scale data analytics, and big data.
KDD 2019 Conference:
Anchorage, Alaska USA
KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners. SIGKDD-2019 will take place in Anchorage, Alaska, US from August 4 - 8, 2019. The KDD Cup competition is anticipated to last for 2-4 months, and the winners will be notified by mid-July 2019. The winners will be honored at the KDD conference opening ceremony and will present their solutions at the KDD Cup workshop during the conference.
In KDD Cup 2019, there are three competition tracks:
Automated Machine Learning Competition Track (Auto-ML Track) (This competition)
Regular Machine Learning Competition Track (Regular ML Track)
“Research for Humanity” Reinforcement Learning Competition Track (Humanity RL Track)
Taposh Dutta-Roy (KAISER PERMANENTE)
Wenjun Zhou (UNIVERSITY OF TENNESSEE KNOXVILLE)
Iryna Skrypnyk (PFIZER)
Please follow the instructions below to begin your KDD Cup 2019:
1. Read "Learn the Details - Terms and Conditions" carefully;
2. To participate: click "Participate" to register the competition;
3. Read "Learn the Details - Setting" to learn about the settings of this challenge;
4. Read "Learn the Details - Data" to learn the data format;
5. Read "Learn the Details - Submission" to learn the requirements of submissions. Please pay special attention to the running time constraints and the running environment.
6. Read "Learn the Details - Evaluation" to learn the evaluation method that scores your submission;
7. Visit "Learn the Details - Get Started" and download the starting-kit. Follow the README file to set up your environment;
8. After registration, you can get the data for the Feedback Phase in "Participate - Get Data".
9. Develop your solutions and submit it in "Participate - Submit".
10. View your results on the "Leader Board".
Prizes sponsored by 4paradigm will be granted to top ranking participants (Interpretability of your solution will be used as tie-breaking criterion), provided they comply with the rules of the challenge (see the terms and conditions, section). The distribution of prizes will be as follows.
To be eligible for prizes you must: publicly release your code under an open source license, submit factsheet describing your solution, present the solution at KDD Cup Workshop 2019, sign the prize acceptance format and adhering to the rules of the challenge.
This challenge focuses on the problem of binary classification for temporal relational data collected from real-world businesses. According to the temporal nature of most real-world applications, the datasets are chronologically split into training and testing parts. Both the training and testing parts consist of a main table, a set of related tables, and a relation graph:
- The main table contains instances (with labels in the training part), some features, and timestamps. This is the target of the binary classification.
- Related tables contain valuable auxiliary information about the instances in the main table and can be utilized to improve predictive performance. Entries in the related tables occasionally have timestamps.
- The relations among data in different tables are described by a Relation Graph. It should be noted that any two tables (main or related table) can have a relation, and any pair of tables can have at most one relation. It is guaranteed that the Relation Graphs are the same in training and testing parts.
The following figure illustrates the form of the datasets:
More details about the data can be found on "Learn the Details - Data" page.
Participants should form teams made of one or multiple persons. Team members should share one single account, see rules.
Teams are required to submit AutoML solutions that automatically build machine learning models by using training main table, related tables and relation graph. Once trained, the models should take the testing main table (labels excluded), related tables, and relation graph as input and predict the testing labels. Solutions will be tested under restricted resources and time that will be the same for every competitor (see Learn the Details - Submissions for more details).
A practical AutoML solution should be able to generalize to a wide range of unseen learning tasks. In order to enable the participants to develop and evaluate these solutions, we prepared a total of 10 temporal relational datasets for the competition, five out of which are termed as ‘public datasets’ and the others ‘private datasets’. The challenge comprises three phases:
- Feedback Phase: In this phase, the participants are provided with the training data of five public datasets to develop their AutoML solutions. The code will be uploaded to the platform and participants will receive immediate feedback on the performance of their method in a holdout set. The maximum submission number every day is restricted. Participants can download the labeled training data and the unlabeled testing sets of five public datasets so they can prepare their code submissions at home. The LAST code submission of each team in this phase will be forwarded to the next phase for final testing.
- Check Phase: In this phase, participants will be able to check whether their submissions (migrated from the feedback phase) can be successfully evaluated with the five private datasets. Participants will get informed if their submissions fail on any of the five private datasets due to time or memory exceeding. If a failure occurs, each team has only one chance to resubmit a new version. No testing performance will be revealed.
- AutoML Phase: This is the blind test phase with no submission. In this phase, solutions will be tested with their performances on private datasets. Participants’ submissions will automatically train machine learning models without human intervention. The final score will be evaluated by the result of the blind testing.
Score defined in "Learn the Details - Evaluation" will be used as the evaluation metric and the average score on all five public/private datasets will be used to score a solution in the Feedback/AutoML Phases (see "Learn the Details - Evaluation" for more details). Solutions with better interpretability will be given preference in case there is a tie in the final score. The interpretability of the solutions will be judged by a committee of experts (some experts from the organization team, and some invited experts).
Please note that the final score evaluates the LAST code submission, i.e., the last code submission in the Feedback Phase, or the resubmission in the Check Phase if there is one.
Computational resources are limited in all three phases to ensure that solutions are adequately efficient.
The following figure illustrates how submissions work in both Feedback and AutoML Phases:
More details about submission/evaluation can be found on "Learn the Details - Submission/Evaluation" page.
This page describes the datasets used in Automatic Machine Learning Challenge for Temporal Relational Data.
Each dataset is split into two subsets, namely the training set and the testing set.
Both sets have:
Each table file is a CSV file that stores a table (main or related), with '\t' as the delimiter. The first row indicates the names of features, a.k.a 'schema', and the following rows are the records.
The type of each feature can be found in the info dictionary that will be introduced soon.
There are 4 types of features, indicated by "cat", "num", "multi-cat", and "time", respectively:
Note: Categorical/Multi-value Categorical features with a large number of values that follows a power law might be included.
The label file is associated only with the main table in the training set. It is a CSV file that contains only one column, with the first row as the header and the remaining indicating labels associated with instances in the main table.
Important information about each dataset is stored in a python dictionary structure named as info, which acts as an input of the participants' AutoML solutions. For public datasets, we will provide an info.json file that contains the dictionary. Here we give details about info.
Descriptions of the keys in info:
Four table relations are considered in this challenge:
Different types of table relations may lead to different ways to capture temporal information and inter-table interactions.
Participants can download the public datasets in the "Participate - Files - Public Data" page.
Each submission should be a compressed directory (.zip file). Inside the directory, we expect to find a model.py file, where a Model class is defined with at least three methods, __init__(), fit()and predict(). In both feedback and AutoML stages, we will instantiate a Model object, automatically call its fit method to train a machine learning model with the training set, and call its predict method to make predictions on the testing set, and evaluate the performance. Below we define the interface of Model class in detail.
The python version on our platform is 3.6.4+.
The initialization method is __init__(self, info).
The training method is fit(self, train_data, train_label, time_remain), where:
The predicting method is predict(self, test_data, time_remain), where
We restrict the running time of the submitted solution on each dataset: the total time of executing Model.__init__(), Model.fit() and Model.predict() should be less than the "time_budget" defined in info (sec). Model.fit() and Model.predict() should receive a time_remain variable that indicates the remaining time to finish the whole procedure (Model.__init__() -> Model.fit() -> Model.predict).
In the starting-kit (see "Learn the Details - Get Started") we provide a docker that simulates the running environment of our challenge platform. Participants can check the python version and installed python packages with the following commands:
python3 --version
pip3 list
On our platform, for each submission, the allocated computational resources are:
Please note that the final score evaluates the LAST code submission, i.e., the last code submission in the Feedback Phase, or the resubmission in the Check Phase if there is one.
The evaluation scheme of this Challenge is depicted in the following figure.
This challenge receives code submissions that satisfy the requirements detailed in the Submission Requirements Section. The submission will be evaluated with five private datasets in a fully automatic manner. For each dataset, AUC on the testing set will be recorded and used to calculate the participant/team's ranking on this dataset.
The score of the solution on a dataset is calculated as:
score = (auc - auc_base) / (auc_max - auc_base),
where auc is the resulting AUC of the solution on the dataset, auc_base is the AUC of the baseline method on the dataset, and auc_max is the AUC of the best solution on the dataset over all teams' submissions. In case that auc_max < auc_base, all submissions will get 0 scores on the dataset. The baseline method can be found in the starting-kit (see "Learn the Details - Get Started" for more details). The scores of the baseline method, submitted by team "chalearn", are also listed on the leaderboard.
The average score on all five public/private datasets will be used to score a solution in the Feedback/AutoML Phases. And the Average score on all five private datasets is used as the final score of a team.
The final score evaluates the LAST code submission, i.e., the last code submission in the Feedback Phase, or the resubmission in the Check Phase if there is one.
Submissions of all participants/teams will be executed automatically with the same computing resources.
For each dataset, the submission should finish model training and prediction with a limited time budget (indicated by info['time_budget']). If the running time constraint is violated on any dataset, the evaluation will be terminated and the submission is considered "failed". If a submission fails in the AutoML Phase, there will be no final score/ranking for its corresponding team.
We provide to the participants a baseline method which automatically solves the binary classification for temporal relational data problem, and is compliant with the submission file format. It can be referred to as a template and start point for the participants to develop their own AutoML solutions. We also give some tips on the challenge.
Additionally, we provide a starting-kit which includes the baseline, some demo data, and a running environment that can be deployed on the PCs of participants so they can test their solutions locally.
Please download the starting-kit from "Participate - Files - Starting Kit" and refer to the README file for instructions.
The following figure shows the flow of our provided baseline method. Its major difference from general AutoML solutions is the Auto-Table-Join step, where temporal information and inter-table interactions are extracted from the given data.
Auto-Table-Join (ATJ) first constructs a tree-structure based on the relations among tables, where the root is the main table and the edges indicate the relation types, as shown in the following figure.
Afterwards, ATJ traverses the tree. A recursive function dfs(u) is defined to: 1) apply dfs() on all the children of u, and 2) join these children to u. When joining a table v to a table u, ATJ ensures that the resulting table has the same row number as u. Two cases should be considered during table joining:
For more details about the baseline method, please refer to the code.
Please download the starting-kit from "Participate - Files - Starting Kit" and find detailed information in the README file.
Participants can use the docker included in the starting-kit to simulate the running environment and test their submissions on their PCs.
You can test your code in the exact same environment as the Codalab environment using docker. You are able to run the ingestion program (to produce predictions) and the scoring program (to evaluate your predictions) on toy sample data.
1. If you are new to docker, install docker from https://docs.docker.com/get-started/.
2. At the shell, change to the starting-kit directory, run
docker run -it --rm -u root -v $(pwd):/app/kddcup codalab/codalab-legacy:py3 bash
3. Now you are in the bash of the docker container, run ingestion program
cd /app/kddcup
python3 ingestion_program/ingestion.py
It runs sample_code_submission and the predictions will be in sample_predictions directory
4. Now run the scoring program:
python3 scoring_program/score.py
It will score the predictions and the results will be in sample_scoring_output directory
Participants can put public datasets into sample_data and test it. The full call of the ingestion program is:
python3 ingestion_program/ingestion.py local sample_data sample_predictions ingestion_program sample_code_submission
The full call of the scoring program is:
python3 scoring_program/score.py local sample_predictions sample_ref sample_scoring_output
Start: April 1, 2019, midnight
Description: Practice on five datasets similar to those of the second phase. Code submission only.
Start: June 26, 2019, 11:59 p.m.
Description: AutoML blind Phase
July 20, 2019, 11:59 p.m.
You must be logged in to participate in competitions.
Sign In