KDD Cup 2019: AutoML for Temporal Relational Data

Organized by LingYue


KDD Cup 2019

The 5th AutoML Challenge:

AutoML for Temporal Relational Data

(Provided and Sponsored by 4Paradigm, ChaLearn, Microsoft and Amazon)

The leaderboard on the Results tab has been DEPRECATED; please view your results on the "Leader Board" tab.

Temporal relational data is very common in industrial machine learning applications, such as online advertising, recommender systems, financial market analysis, medical treatment, and fraud detection. With timestamps that indicate the timing of events and multiple related tables that provide different perspectives, such data contains useful information that can be exploited to improve machine learning performance. Currently, however, the exploitation of temporal relational data is usually carried out by experienced human experts with in-depth domain knowledge, in a labor-intensive, trial-and-error manner.

In this challenge, participants are invited to develop AutoML solutions to binary classification problems on temporal relational data. The provided datasets are in the form of multiple related tables, with timestamped instances. Five public datasets (without labels in the testing part) are provided to the participants so that they can develop their AutoML solutions. Afterward, the solutions will be evaluated on five unseen datasets without human intervention; the results on these five datasets determine the final ranking.

This is the first AutoML competition that focuses on temporal relational data and it will pose new challenges to the participants, as listed below:

- How to automatically generate useful temporal information?

- How to efficiently merge the information provided by multiple related tables?

- How to automatically capture meaningful inter-table interactions?

- How to automatically avoid data leakage when the data is temporal?

Additionally, participants should also consider:

- How to automatically and efficiently select appropriate machine learning models and hyper-parameters?

- How to make the solution more generic, i.e., applicable to unseen tasks?

- How to keep the computational and memory costs acceptable?

 

Visit "Learn the Details - Instructions" and follow the steps to participate.

 

Timeline (UTC)

  • Apr 1st, 2019: Beginning of the competition; release of the public datasets. Participants can start submitting code and obtain immediate feedback on the leaderboard.
  • June 26th, 2019, 11:59 p.m. UTC: End of the Feedback Phase; beginning of the Check Phase. Code from the Feedback Phase is migrated automatically to the Check Phase.
  • July 4th, 2019, 08:00 a.m. UTC: Notification of check results.
  • July 9th, 2019, 11:59 p.m. UTC: End of the Check Phase; deadline for code resubmission; beginning of the AutoML Phase. Organizers start code verification.
  • July 10th, 2019, 11:59 p.m. UTC: Deadline for submitting the fact sheets.
  • July 15th, 2019, 11:59 p.m. UTC: End of the AutoML Phase; beginning of the post-competition process.
  • July 20th, 2019: Announcement of the KDD Cup winners.
  • Aug 4th, 2019: Beginning of KDD 2019.

 

Brought to you by

4Paradigm · Microsoft · Amazon

 

About

About KDD 2019 Conference

The annual KDD conference is the premier interdisciplinary conference bringing together researchers and practitioners from data science, data mining, knowledge discovery, large-scale data analytics, and big data.

KDD 2019 Conference:

  • August 4 - 8, 2019
  • Anchorage, Alaska, USA
  • Dena’ina Convention Center and William Egan Convention Center

About Other 2019 KDD Cup Competitions

KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), the leading professional organization of data miners. SIGKDD 2019 will take place in Anchorage, Alaska, USA from August 4 - 8, 2019. The KDD Cup competition is anticipated to last for 2-4 months, and the winners will be notified by mid-July 2019. The winners will be honored at the KDD conference opening ceremony and will present their solutions at the KDD Cup workshop during the conference.

In KDD Cup 2019, there are three competition tracks; this challenge is the AutoML track.

About KDD Cup Chairs

  • Taposh Dutta-Roy (KAISER PERMANENTE)

  • Wenjun Zhou (UNIVERSITY OF TENNESSEE KNOXVILLE)

  • Iryna Skrypnyk (PFIZER)

 

Contact the organizers.

Instructions

Please follow the instructions below to begin your KDD Cup 2019:

1. Read "Learn the Details - Terms and Conditions" carefully;

2. Click "Participate" to register for the competition;

3. Read "Learn the Details - Setting" to learn about the settings of this challenge;

4. Read "Learn the Details - Data" to learn the data format;

5. Read "Learn the Details - Submission" to learn the requirements for submissions; please pay special attention to the running time constraints and the running environment;

6. Read "Learn the Details - Evaluation" to learn the evaluation method that scores your submission;

7. Visit "Learn the Details - Get Started", download the starting kit, and follow the README file to set up your environment;

8. After registration, get the data for the Feedback Phase in "Participate - Get Data";

9. Develop your solution and submit it in "Participate - Submit";

10. View your results on the "Leader Board".


Contact the organizers.

Terms and Conditions

Prizes

Prizes sponsored by 4Paradigm will be granted to top-ranking participants (interpretability of the solution will be used as a tie-breaking criterion), provided they comply with the rules of the challenge (see the Terms and Conditions section). The distribution of prizes is as follows.

  • First place: 15,000 USD + Certificate
  • Second place: 10,000 USD + Certificate
  • Third place: 5,000 USD + Certificate
  • 4th - 10th places: 500 USD each


To be eligible for prizes you must: publicly release your code under an open-source license, submit a fact sheet describing your solution, present the solution at the KDD Cup 2019 Workshop, sign the prize acceptance form, and adhere to the rules of the challenge.

Challenge Rules

  • General Terms: This challenge is governed by the General ChaLearn Contest Rule Terms, the Codalab Terms and Conditions, and the specific rules set forth on this page.
  • Announcements: To receive announcements and be informed of any change in rules, the participants must provide a valid email.
  • Conditions of participation: Participation requires complying with the rules of the challenge. Prize eligibility is restricted by the USA and Chinese government export regulations. The organizers, sponsors, their students, close family members (parents, sibling, spouse or children) and household members, as well as any person having had access to the truth values or to any information about the data or the challenge design giving him (or her) an unfair advantage, are excluded from participation. A disqualified person may submit one or several entries in the challenge and request to have them evaluated, provided that they notify the organizers of their conflict of interest. If a disqualified person submits an entry, this entry will not be part of the final ranking and does not qualify for prizes. The participants should be aware that ChaLearn and the organizers reserve the right to evaluate for scientific purposes any entry made in the challenge, whether or not it qualifies for prizes.
  • Dissemination: Top-ranked participants will be invited to attend KDD Cup Workshop 2019 to describe their methods and findings. The challenge is part of the KDD 2019 conference. 
  • Registration: The participants must register on Codalab and provide a valid email address. Teams must be registered as a regular Codalab user, with the team name as the Codalab ID and a group email shared by all team members. One person can be part of only one team. Teams or solo participants registering multiple times to gain an advantage in the competition may be disqualified. Each team needs to designate a leader, who is responsible for communication with the organizers. Each team/solo participant should have at least one successful submission two weeks before the start of the Check Phase; otherwise, it may be disqualified.
  • Anonymity: The participants who do not present their results at the workshop can elect to remain anonymous by using a pseudonym. Their results will be published on the leaderboard under that pseudonym, and their real name will remain confidential. However, the participants must disclose their real identity to the organizers to claim any prize they might win. See our privacy policy for details.
  • Submission method: The results must be submitted through this CodaLab competition site. The participants can make up to 2 submissions per day in the Feedback Phase. Using multiple accounts to increase the number of submissions is NOT permitted. The entries must be formatted as specified on the "Learn the Details - Submission" page. Each team/solo participant is obligated to provide a short report (fact sheet) describing their final solution to be eligible for prizes. There is NO submission in the AutoML blind test phase (the submissions from the previous Feedback Phase migrate automatically). In case of any problem, send an email to kddcup2019@4paradigm.com.
  • Evaluation method: Winners of the competition are chosen based on the final score in the AutoML Phase. In case there is a tie in the final scores, solutions with better interpretability will be given preference. For more details of the evaluation, please see the "Learn the Details - Evaluation" page. By enrolling in this competition you grant the organizers the right to process your submissions for the purpose of evaluation and post-competition research.
  • Cheating: During the development phase, participants are forbidden to attempt to get hold of the solution labels on the server (though this may be technically feasible). For the final phase, the evaluation method will make it impossible to cheat in this way. Generally, participants caught cheating will be disqualified.

 

Contact the organizers.

Setting

This challenge focuses on the problem of binary classification for temporal relational data collected from real-world businesses. Owing to the temporal nature of most real-world applications, the datasets are chronologically split into training and testing parts. Both the training and testing parts consist of a main table, a set of related tables, and a relation graph:

- The main table contains the instances (with labels in the training part), some features, and timestamps; these instances are the targets of the binary classification.

- Related tables contain valuable auxiliary information about the instances in the main table and can be utilized to improve predictive performance. Entries in the related tables occasionally have timestamps.

- The relations among data in different tables are described by a Relation Graph. Note that any two tables (main or related) can have a relation, and any pair of tables can have at most one relation. It is guaranteed that the Relation Graphs are the same in the training and testing parts.

The following figure illustrates the form of the datasets:

More details about the data can be found on "Learn the Details - Data" page.

Participants may form teams of one or more persons. Team members must share a single account (see the rules).

Teams are required to submit AutoML solutions that automatically build machine learning models using the training main table, related tables, and relation graph. Once trained, the models should take the testing main table (labels excluded), related tables, and relation graph as input and predict the testing labels. Solutions will be tested under restricted resources and time limits that are the same for every competitor (see "Learn the Details - Submission" for more details).

A practical AutoML solution should be able to generalize to a wide range of unseen learning tasks. To enable the participants to develop and evaluate such solutions, we prepared a total of 10 temporal relational datasets for the competition, five of which are termed 'public datasets' and the other five 'private datasets'. The challenge comprises three phases:

- Feedback Phase: In this phase, the participants are provided with the training data of the five public datasets to develop their AutoML solutions. The code will be uploaded to the platform and participants will receive immediate feedback on the performance of their method on a holdout set. The number of submissions per day is limited. Participants can download the labeled training data and the unlabeled testing sets of the five public datasets so they can prepare their code submissions offline. The LAST code submission of each team in this phase will be forwarded to the next phase for final testing.

- Check Phase: In this phase, participants will be able to check whether their submissions (migrated from the Feedback Phase) can be successfully evaluated on the five private datasets. Participants will be informed if their submissions fail on any of the five private datasets due to exceeding time or memory limits. If a failure occurs, each team has only one chance to resubmit a new version. No testing performance will be revealed.

- AutoML Phase: This is the blind test phase with no submissions. In this phase, solutions are evaluated by their performance on the private datasets. Participants' submissions will automatically train machine learning models without human intervention. The final score is determined by the results of the blind testing.

The score defined in "Learn the Details - Evaluation" is used as the evaluation metric, and the average score over all five public/private datasets is used to score a solution in the Feedback/AutoML Phase (see "Learn the Details - Evaluation" for more details). Solutions with better interpretability will be given preference in case of a tie in the final score. The interpretability of the solutions will be judged by a committee of experts (some from the organization team, and some invited).

Please note that the final score evaluates the LAST code submission, i.e., the last code submission in the Feedback Phase, or the resubmission in the Check Phase if there is one.

Computational resources are limited in all three phases to ensure that solutions are adequately efficient.

The following figure illustrates how submissions work in both the Feedback and AutoML Phases:

More details about submission/evaluation can be found on "Learn the Details - Submission/Evaluation" page.


Contact the organizers.

Data

This page describes the datasets used in Automatic Machine Learning Challenge for Temporal Relational Data.

Components

Each dataset is split into two subsets, namely the training set and the testing set.

Both sets have:

  • a main table file that stores the main table (label excluded);
  • multiple related table files that store the related tables;
  • an info dictionary that contains important information about the dataset, including table relations.

The training set additionally has a label file that stores the labels associated with the main table.

Table files

Each table file is a CSV file that stores a table (main or related), with '\t' as the delimiter. The first row indicates the names of the features, a.k.a. the 'schema', and the following rows are the records.

The type of each feature can be found in the info dictionary that will be introduced soon. 

There are 4 types of features, indicated by "cat", "num", "multi-cat", and "time", respectively:

  • cat: categorical feature, an integer;
  • num: numerical feature, a real value;
  • multi-cat: multi-value categorical feature, a set of integers separated by commas. The size of the set is not fixed and can differ across instances; examples include the topics of an article, the words in a title, or the items bought by a user;
  • time: time feature, an integer that indicates the timestamp.

Note: categorical/multi-value categorical features with a large number of values whose frequencies follow a power law might be included. A sketch of loading a table file is given below.
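
The file path and the 'tags' multi-cat column name in this sketch are hypothetical:

  import pandas as pd

  # Table files are tab-separated; the first row is the schema.
  df = pd.read_csv('train/main_train.data', sep='\t')

  # A 'multi-cat' value arrives as a comma-separated string of integers;
  # parse it into a set of ints (empty set for missing values).
  df['tags'] = df['tags'].map(
      lambda s: set(map(int, str(s).split(','))) if pd.notna(s) else set())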

Label file

The label file is associated only with the main table in the training set. It is a CSV file that contains only one column, with the first row as the header and the remaining rows indicating the labels associated with the instances in the main table.

info dictionary

Important information about each dataset is stored in a python dictionary named info, which acts as an input to the participants' AutoML solutions. For the public datasets, we provide an info.json file that contains this dictionary. Here we give details about info.

Descriptions of the keys in info:

  • time_budget: time budget for this dataset (sec). See "Learn the Details - Submissions" page for details about the running time constraint;
  • time_col: the column name of the primary timestamp; each dataset has one unique time_col; time_col is guaranteed to appear in the main table, but not necessarily in a related table;
  • start_time: DEPRECATED.
  • tables: a dictionary that stores information about tables. Each key indicates a table, and its corresponding value is a dictionary that indicates the type of each column in this table. Two kinds of keys are contained in tables:
    • main: the main table;
    • table_{i}: the i-th related table.
    • The feature types are the four described above ("cat", "num", "multi-cat", and "time").
  • relations: a list that stores the table relations in the dataset. Each relation is represented by an ordered table pair (table_A, table_B), a key column key that appears in both tables and acts as the pivot for table joining, and a relation type type. The relation types are introduced below (a hypothetical example of info follows this list).
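
All table, column, and key names in this sketch are invented, and the exact layout of the relations entries is our assumption based on the description above:

  info = {
      "time_budget": 600,       # seconds, covering __init__ + fit + predict
      "time_col": "t_click",    # hypothetical primary timestamp column
      "tables": {
          "main":    {"u_id": "cat", "t_click": "time", "price": "num"},
          "table_1": {"u_id": "cat", "tags": "multi-cat"},
      },
      "relations": [
          # main.u_id may repeat while table_1.u_id is unique: many-to-one
          {"table_A": "main", "table_B": "table_1",
           "key": "u_id", "type": "M-1"},
      ],
  }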

Relations Between Tables

Four table relations are considered in this challenge:

  • one-to-one (1-1): the key columns in both table_A and table_B have no duplicated values;
  • one-to-many (1-M): the key column in table_A has no duplicated values, but that in table_B may have duplicated values;
  • many-to-one (M-1): the key column in table_A may have duplicated values, but that in table_B has no duplicated values;
  • many-to-many (M-M): the key columns in both table_A and table_B may have duplicated values.

Different types of table relations may lead to different ways to capture temporal information and inter-table interactions.

Participants can download the public datasets on the "Participate - Files - Public Data" page.


Contact the organizers.

Submission Requirements

Interface

Each submission should be a compressed directory (.zip file). Inside the directory, we expect to find a model.py file, where a Model class is defined with at least three methods: __init__(), fit(), and predict(). In both the Feedback and AutoML Phases, we will instantiate a Model object, automatically call its fit method to train a machine learning model on the training set, call its predict method to make predictions on the testing set, and evaluate the performance. Below we define the interface of the Model class in detail; a minimal sketch follows the method descriptions.

The python version on our platform is 3.6.4+.

Initialization

The initialization method is __init__(self, info), where info is the dictionary described on the "Learn the Details - Data" page.

Model training

The training method is fit(self, train_data, train_label, time_remain), where:

  • train_data is a python dictionary. Each of its keys corresponds to a key defined in info['tables'], e.g., "main" or "table_{i}". For each key (table), the corresponding value is a pandas DataFrame that holds the corresponding data, with a header that indicates the feature names.
  • train_label is a pandas Series that stores the label, with its name set as "label".
  • time_remain is a scalar that indicates the remaining time (sec) for executing Model.fit() and Model.predict().

Predicting

The predicting method is predict(self, test_data, time_remain), where

  • test_data is a pandas DataFrame that stores the data of the main table (without labels) for testing.
  • time_remain is a scalar that indicates the remaining time (sec) for executing Model.predict().
  • predict() should return a pandas Series object that contains the predicted labels for instances in the testing main table.
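
To illustrate the interface, here is a minimal sketch of a conforming model.py; it is NOT the official baseline and, for brevity, ignores the related tables and fits a logistic regression on the numeric columns of the main table only:

  import time

  import pandas as pd
  from sklearn.linear_model import LogisticRegression

  class Model:

      def __init__(self, info):
          self.info = info      # dataset metadata, including time_budget
          self.clf = LogisticRegression(max_iter=200)
          self.columns = None

      def fit(self, train_data, train_label, time_remain):
          start = time.time()
          main = train_data['main']    # the main table as a DataFrame
          X = main.select_dtypes('number').fillna(0)
          self.columns = X.columns
          self.clf.fit(X, train_label)
          # A real solution would monitor
          # time_remain - (time.time() - start) and stop early if needed.

      def predict(self, test_data, time_remain):
          X = test_data[self.columns].fillna(0)
          # Return a pandas Series of scores, as required.
          return pd.Series(self.clf.predict_proba(X)[:, 1])

Since AUC is the evaluation metric (see "Learn the Details - Evaluation"), predict() returns probability scores rather than hard labels.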

Running Time Constraint

We restrict the running time of the submitted solution on each dataset: the total time of executing Model.__init__(), Model.fit(), and Model.predict() must be less than the time_budget defined in info (in seconds). Model.fit() and Model.predict() receive a time_remain argument that indicates the time remaining to finish the whole procedure (Model.__init__() -> Model.fit() -> Model.predict()).
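
One simple pattern for respecting the budget is to convert time_remain into an absolute deadline once and query it throughout; below is a hypothetical helper, not part of the starting kit:

  import time

  class TimeBudget:

      def __init__(self, time_remain):
          # Fix an absolute deadline when fit() or predict() starts.
          self.deadline = time.time() + time_remain

      def remain(self):
          # Seconds left before the budget is exhausted.
          return self.deadline - time.time()

Inside fit(), for instance, a solution might stop its model search once budget.remain() falls below a safety margin reserved for predict().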

Running Environment

In the starting kit (see "Learn the Details - Get Started") we provide a Docker image that simulates the running environment of our challenge platform. Participants can check the python version and the installed python packages with the following commands:

 python3 --version

 pip3 list

On our platform, for each submission, the allocated computational resources are:

  • CPU: 4 Cores
  • Memory: 16 GB

 

Contact the organizers.

Evaluation

Please note that the final score evaluates the LAST code submission, i.e., the last code submission in the Feedback Phase, or the resubmission in the Check Phase if there is one.

The evaluation scheme of this Challenge is depicted in the following figure.

This challenge receives code submissions that satisfy the requirements detailed in the Submission Requirements Section. The submission will be evaluated with five private datasets in a fully automatic manner. For each dataset, AUC on the testing set will be recorded and used to calculate the participant/team's ranking on this dataset.

 

Score

The score of the solution on a dataset is calculated as:

score = (auc - auc_base) / (auc_max - auc_base),

where auc is the resulting AUC of the solution on the dataset, auc_base is the AUC of the baseline method on the dataset, and auc_max is the AUC of the best solution on the dataset over all teams' submissions. If auc_max < auc_base, all submissions get a score of 0 on the dataset. The baseline method can be found in the starting kit (see "Learn the Details - Get Started" for more details). The scores of the baseline method, submitted by team "chalearn", are also listed on the leaderboard.
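
In code, the per-dataset score is a direct transcription of the formula; the guard against a zero denominator is our own assumption:

  def dataset_score(auc, auc_base, auc_max):
      # All submissions score 0 when no solution beats the baseline.
      if auc_max <= auc_base:
          return 0.0
      return (auc - auc_base) / (auc_max - auc_base)

For example, with auc_base = 0.70 and auc_max = 0.85, a solution achieving auc = 0.80 scores (0.80 - 0.70) / (0.85 - 0.70) ≈ 0.67.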

The average score over all five public/private datasets is used to score a solution in the Feedback/AutoML Phase, and the average score over all five private datasets is used as the final score of a team.

The final score evaluates the LAST code submission, i.e., the last code submission in the Feedback Phase, or the resubmission in the Check Phase if there is one.

Submissions of all participants/teams will be executed automatically with the same computing resources.

For each dataset, the submission should finish model training and prediction within a limited time budget (indicated by info['time_budget']). If the running time constraint is violated on any dataset, the evaluation will be terminated and the submission is considered "failed". If a submission fails in the AutoML Phase, there will be no final score/ranking for the corresponding team.


Contact the organizers.

Get Started

We provide participants with a baseline method that automatically solves binary classification problems on temporal relational data and complies with the submission file format. It can serve as a template and starting point for participants to develop their own AutoML solutions. We also give some tips on the challenge.

Additionally, we provide a starting-kit which includes the baseline, some demo data, and a running environment that can be deployed on the PCs of participants so they can test their solutions locally.

Please download the starting-kit from "Participate - Files - Starting Kit" and refer to the README file for instructions.

Baseline

The following figure shows the flow of our provided baseline method. Its major difference from general AutoML solutions is the Auto-Table-Join step, where temporal information and inter-table interactions are extracted from the given data.


Auto-Table-Join (ATJ) first constructs a tree structure based on the relations among the tables, where the root is the main table and the edges indicate the relation types, as shown in the following figure.

Afterwards, ATJ traverses the tree. A recursive function dfs(u) is defined to: 1) apply dfs() to all the children of u, and 2) join these children to u. When joining a table v to a table u, ATJ ensures that the resulting table has the same number of rows as u. Two cases should be considered during table joining (a sketch of both cases follows this list):

  • Non-temporal Case: In this case, at least one of u or v does not contain the timestamp column.
    • For 1-1 and M-1 relations, ATJ simply joins v to u with the pandas function join().
    • For 1-M and M-M relations, an intermediate table w is generated by applying groupby operations to v according to the key. Afterwards, a 1-1 or M-1 relation holds between u and w, and ATJ applies the join() function to them.
  • Temporal Case: If both u and v have timestamp columns, we need to prevent data leakage during the table join. For each record in u, only the records in v that occur prior to it can be utilized during joining. In this case, ATJ first merges u and v into an intermediate table w in which the records are sorted in chronological order. A sliding time-window (covering the current record and the last several ones) is then used to traverse w. Whenever a record i_u from u is visited, the records from v with the same key that fall inside the window are used in joining; specifically, groupby operations are applied to these records and the results are appended to i_u.
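
Below is a rough pandas sketch of the two cases, not the baseline's actual code; the aggregation choices are simplified assumptions, and the temporal case uses pandas.merge_asof, which takes only the single most recent prior record per key, whereas the real baseline aggregates over a sliding window:

  import pandas as pd

  def join_tables(u, v, key, relation, time_col=None):
      if time_col and time_col in u.columns and time_col in v.columns:
          # Temporal case: for each row of u, use only rows of v with the
          # same key and a strictly earlier timestamp (no data leakage).
          return pd.merge_asof(u.sort_values(time_col),
                               v.sort_values(time_col),
                               on=time_col, by=key,
                               allow_exact_matches=False,
                               suffixes=('', '_v'))
      if relation in ('1-1', 'M-1'):
          # The key is unique in v, so joining cannot duplicate rows of u.
          return u.join(v.set_index(key), on=key, rsuffix='_v')
      # 1-M / M-M: aggregate v per key first, then join the aggregates.
      num_cols = v.select_dtypes('number').columns.difference([key])
      w = v.groupby(key)[list(num_cols)].agg(['mean', 'count'])
      w.columns = ['%s_%s' % (c, s) for c, s in w.columns]
      return u.join(w, on=key)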

For more details about the baseline method, please refer to the code.

 

Starting Kit

Please download the starting-kit from "Participate - Files - Starting Kit" and find detailed information in the README file.

Participants can use the Docker image included in the starting kit to simulate the running environment and test their submissions on their own machines.

Local development and testing:

You can test your code in exactly the same environment as the Codalab platform using Docker. You are able to run the ingestion program (to produce predictions) and the scoring program (to evaluate your predictions) on toy sample data.

Quick Start

1. If you are new to docker, install docker from https://docs.docker.com/get-started/.

2. At the shell, change to the starting-kit directory and run:

  docker run -it --rm -u root -v $(pwd):/app/kddcup codalab/codalab-legacy:py3 bash

3. Now you are in a bash shell inside the Docker container; run the ingestion program:

  cd /app/kddcup

  python3 ingestion_program/ingestion.py

It runs sample_code_submission, and the predictions will be written to the sample_predictions directory.

4. Now run the scoring program:

  python3 scoring_program/score.py

It will score the predictions, and the results will be written to the sample_scoring_output directory.

Remark

Participants can put the public datasets into sample_data and test on them. The full call of the ingestion program is:

python3 ingestion_program/ingestion.py local sample_data sample_predictions ingestion_program sample_code_submission

The full call of the scoring program is:

python3 scoring_program/score.py local sample_predictions sample_ref sample_scoring_output


Contact the organizers.

Feed-back

Start: April 1, 2019, midnight

Description: Practice on five datasets similar to those of the second phase. Code submission only.

AutoML5 blind test

Start: June 26, 2019, 11:59 p.m.

Description: AutoML blind test phase.

Competition Ends

July 20, 2019, 11:59 p.m.
