The team “Bestpetting”, including Kaggle Grandmaster Pavel Pleskov, was caught cheating after winning first place and $10,000 in the Kaggle “PetFinder.my Adoption Prediction” challenge. Three aspects are particularly interesting and worth a closer look: How did they do it? How did they get caught? How could such cheating be prevented?
How did they do it?
The goal of the competition was to predict how quickly pets in a shelter get adopted. The dataset included the “adoption speed” as label, and various features such as “species”, a picture, “color”, “gender”, “sterilization”, “health”, “fee” and more. (I am slightly uneasy about the ethical implications of such a prediction, but let’s set that aside here.)
The key to the cheat was obtaining data from the private test set, which is not known to the participants and is used only to evaluate the submitted models. The organizers of the challenge speculated that the data, including the labels, was scraped from the PetFinder.my website.
Simply using this fraudulent dataset would quickly have become apparent, so the team took great pains to conceal their tactics:
They hid the 3,500 scraped samples, each consisting of the label and an obfuscated hash of some features, inside another external dataset of ~33,000 samples, “Cute Cats and Dogs from Pixabay.com” (using external public datasets to improve the model was generally allowed).
Then they implemented something reminiscent of “test bench detection”: during the prediction phase, they generated the same obfuscated hash from each test sample and looked it up in their external dataset. If they got a hit (meaning the test sample was identical to one of the scraped, hidden samples), they delivered the actual scraped target value instead of the value predicted by their model.
To avoid raising suspicion, they did this only for every 10th sample, so they wouldn’t get results that are “too good to be true”.
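The lookup mechanism described above can be sketched roughly like this. Note that the function names, the feature-hashing scheme and the exact sampling logic are my assumptions for illustration, not the team’s actual code:

```python
import hashlib

def obfuscated_hash(features):
    """Derive a lookup key from a sample's features (hypothetical scheme)."""
    key = "|".join(str(features[k]) for k in sorted(features))
    return hashlib.md5(key.encode("utf-8")).hexdigest()

def predict_with_cheat(test_samples, model_predict, scraped_labels, rate=10):
    """Return predictions, but substitute the scraped label for every
    `rate`-th test sample that matches a hidden scraped sample."""
    predictions, hits = [], 0
    for sample in test_samples:
        h = obfuscated_hash(sample)
        if h in scraped_labels:
            hits += 1
            if hits % rate == 0:  # only every 10th hit, to stay inconspicuous
                predictions.append(scraped_labels[h])
                continue
        predictions.append(model_predict(sample))
    return predictions
```

The rate limiting is the crucial part: leaking only a fraction of the true labels lifts the score enough to win, while the overall result still looks plausible.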
The details of the cheat, especially the effort the team put into obfuscating their actions, are explained by Benjamin Minixhofer, who also discovered the cheat in the first place. It is definitely worth reading. For a more journalistic summary of the events and some background on the cheaters and the consequences they faced, I recommend this article on The Register.
How did they get caught?
The fraud was not discovered by the operators of Kaggle. Benjamin Minixhofer, a 19-year-old Austrian who had also participated in the challenge very successfully, was contacted by the CEO of PetFinder.my (the initiator of the competition) and asked to combine tactics from the winning solutions into a production system. For this purpose, Benjamin got access to the source code of the submitted solutions, which wasn’t publicly available. He discovered the cheat that put the “Bestpetting” team in first place instead of around 100th place, where they would have ended up without their fraudulent dataset.
Two other aspects Benjamin describes are interesting, as they are (retrospectively) indicators that something unusual was going on:
- The gap between the winning team’s score and second place was quite large.
- The distribution of the last digit in the “ID” column of the external dataset, where the obfuscated hash had been stored to be looked up during testing, was quite uneven. This indicated that those were not ordinary MD5 hashes, whose characters are uniformly distributed.
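The second indicator is easy to check: for genuine MD5 hashes, the last hex character is close to uniformly distributed over 0–9 and a–f, so a heavily skewed last-character distribution suggests the values are not real hashes. A minimal sketch with synthetic data (not the actual competition files):

```python
import hashlib
from collections import Counter

def last_char_counts(values):
    """Count the last character of each string value."""
    return Counter(v[-1] for v in values)

# Genuine MD5 hashes: the last hex digit is roughly uniform.
real_hashes = [hashlib.md5(str(i).encode()).hexdigest() for i in range(16000)]
counts = last_char_counts(real_hashes)

expected = 16000 / 16  # 1000 per hex character if uniform
max_deviation = max(abs(c - expected) for c in counts.values())
```

For real hashes, `max_deviation` stays small relative to the expected count per character; a column whose values were constructed rather than hashed would show a much larger skew. A chi-squared goodness-of-fit test would formalize this check.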
How could similar cheating be prevented?
I think there are mainly three things that could have stopped the cheat:
The Kaggle operators could have been more suspicious about the (relatively) good results and could have inspected the source code of the winning team more closely. But if cheats are well hidden, it will always be difficult to detect them within time and budget constraints. I therefore consider this option useful, but not a solution.
Some commentators on HackerNews discussed the partial liability of the organizers, because the private test data was available for scraping. Putting more effort into preventing upfront access to the test data would raise the bar for cheaters. But even if it becomes less likely, it will probably always be possible for someone to get the data. So this would be helpful, but also not a real solution.
In my opinion, the only real chance is to force at least the winning teams to make their source code reproducible and public: if a crowd of data scientists inspects the code out of curiosity, the chance that such a cheat will be uncovered is much higher than if a limited number of people inspect the code “out of duty”. If tempted cheaters know about this hard-to-estimate risk, I’m sure they will be less likely to attempt the fraud.
As a side effect, the winning solutions (which are also the most interesting solutions to study) would be available for learning purposes, which I think is one of the main goals of the Kaggle platform.
BTW: The members of the team showed remorse: they apologized publicly on Twitter and promised to return the prize money, which they had already received.