knowledge: A Challenge in More than One Way

Well, the submission deadline for the first Android Developer Challenge has come and gone, the apps are in, the judges are finished, and the waiting is over. We got a lot of great submissions, and I can tell you personally that the competition was fierce. I didn't see all 1,788 submissions, but I saw quite a lot of them, and I uttered more than one wail of despair as some of my favorite submissions didn't quite make the cut, by razor-thin margins in some cases. But, the judges have spoken.

Speaking of the judges...we'll soon publish a list of who the judges are, but I know many of our developers are still curious: what were all those judges doing? Well, the short answer is that they were judging applications using a custom laptop configuration that we provided. But we thought some people might be interested in the "long" answer, so we put together this blog post. If you're not interested in the gory details of the judging, you can stop here; but if you are interested, read on!

How We Got Started

Making the Challenge fair was by far our primary goal. We knew we had to do whatever we could to make sure that the judges' scores are based solely on their review of the application. We automated as much as possible, to make it easy for judges to focus on judging, and not on administrivia or complicated setup.

The first thing we realized was that we were going to have way more submissions than any single judge could look at. No one could review all 1,788 submissions in a reasonable amount of time. On the other hand, we definitely needed more than one judge reviewing each submission. Our goal was to have each submission reviewed by four different judges, with a minimum of three.

The big question was then: how many judges would we need?

For 1,788 submissions, a panel of 4 judges per application meant that we needed a whopping 7,152 reviews to be performed. Since our judges would have to be crazy to agree to do more than 75 reviews, we needed at least 95 judges. In the end we recruited around 125, including backup judges.

Making Order out of Chaos

The next thing we realized was that judges need to be able to actually review the submissions. Since the judges came from our Open Handset Alliance partners and many are not engineers, we knew that we couldn't send instructions like "run the M5-RC15 emulator, open a terminal window, and run the command adb push geodb /data/misc/location�and don't forget the --sdcard option!" They'd think we were quoting Star Wars.

Besides that, we also knew that once we gave the judges their assignments, what they did was out of our hands. We couldn't control how the judges review the applications, but we could certainly make it as easy as possible for the judges to do a thorough review.

So, we built a program in wxPython that automates judging. This application launches a clean emulator for each submission, supports emulator features like SD card images and mock location providers, and allows judges to launch multiple emulators and simulate calls and SMS messages for applications which need that functionality. We asked our friendly neighborhood Google Tech Stop for 140 laptops, installed Ubuntu Linux and our software on one, and then cloned that installation for use on all the others. We then had a huge shipping party, where we imaged, boxed, and shipped 115 or so laptops in one day.

An important side effect of these custom laptops is that they are all identical. This means that each judge's experience of the submissions was the same, which eliminated the risk of one judge rating an app poorly just because it ran slowly on his personal computer.

Managing All that Data

Once we sent 100+ laptops all over the world, we needed a way to get the data back. Another goal was to eliminate as many sources of human error as possible. With 7,152 reviews to complete, and 4 categories per review, that's 28,608 scores to keep track of. Mistakes would be bound to happen, so filing paperwork or transcribing scores by hand from one file to another was out of the question.

Our solution was the Google Data web API for accessing things like Google Spreadsheets and Google Base. Here's how it worked.

We wrote a Python program to randomly assign applications to judges for review.
Using the Spreadsheets API, that program generated a Google Spreadsheet for each judge, pre-filled with that judge's assigned submissions and space to enter scores.
The program installed on the laptops also used the Spreadsheets API to fetch a given judge's assignments.
When the judge scores a submission, those scores were posted back into the spreadsheet.
After the judging period concluded, a separate program walks over all the judges' spreadsheets, computing the final scores.

This approach had two great things about it: first, it didn't require any new server infrastructure to make it work. Second, our "database" had a built-in rich "admin" UI for managing the data � namely, Google Spreadsheets itself. If any of our judges ran into problems or needed help, we could simply open that spreadsheet in our browser and review or fix problems.

This approach worked quite well, and I'd bet that the judges didn't even know the Spreadsheets API was being used, unless they actively poked around.

Tying Up the Loose Ends

Of course, our work wasn't done once we retrieved all the submission scores. We couldn't just average up the scores, you see. First, judges could recuse themselves from scoring specific submissions; perhaps they were assigned an application similar to one their own company is working on, or perhaps they realized they knew one of the authors. Second, despite our best efforts there was a chance that some judges might have a problem � for instance, if one judge had a poor network connection but reviewed an application that requires the network, then that judge might have scored the application unfairly poorly.

Here are the major outlier scenarios that we were concerned about:

Cases where judges recused themselves.
Submissions where one judge reported a problem with the application, but all the other judges reported good scores. (It seems odd for only one judge to have a problem.)
Cases where one judge's scores were an outlier compared to the other judges' scores.

For the first two cases, we simply discarded the outlying data points, if we had enough. For instance, if three judges reported good scores and one recused herself, we simply dropped that fourth score. If dropping the conflicting score would have brought the application below three reviews, we sent it back for review by a new judge to bring it up to our minimum number of judges per application.

The third case is more subtle. Just because a judge rated an application differently than others doesn't mean that that review is invalid, so we can't simply discard outliers. Instead, we took the highest and lowest scores in each category and gave them half weight. The effect is to bring the average scores a bit closer to the median scores, which helps minimize the impact of unusually high or low scores. This process was applied to all submissions (not just "suspicious" scores) since it has a minimal effect on submissions that don't have a large outlier.

We actually ran the whole process above twice: first we ran it to choose a first cut of the top 100 submissions from the original 1,788, and we then sent those 100 to a second group of judges for selection of the final 50. (Actually, the "top 100" were really "top 119", since we added a few more submissions to accommodate scoring ties in the first round.)

Wrapping Up

Now you know what we've been spending all our time on, and what's been keeping us up at night (sometimes literally)! Throughout, our key objectives were to keep the process fair, let the judges focus on judging, and give applications the benefit of the doubt in cases of scoring outliers.

What's next? Well, the 50 submissions that were awarded a prize now begin the refinement process for their Round 2 submissions, which will award the final, larger prizes to the top 20 applications. I also hope that the developers of the other great apps that didn't receive prizes will consider the second Android Developer Challenge, which should begin later this year.

To everyone, I'd also like to say thanks for participating, and congratulations on your hard work!

knowledge

Tuesday, May 13, 2008

A Challenge in More than One Way

No comments:

Post a Comment