Tuesday, March 09, 2010

Shuffling tricks

I recently read this interesting article about an error in the EU Microsoft Browser Ballot. Briefly:
  • In the EU, Microsoft must give users a choice of browsers in Windows.
  • This is done via a screen showing 5 top choices (Firefox, Opera, Safari, MSIE, and Chrome), plus a number of less prominent additional choices.
  • The box should display the 5 top-choices in random order.
As it turns out, the order is not exactly random! For instance, MSIE is more likely than not to appear toward the right end of the list, and Chrome toward the left end.

The reason for this is programmer error, as described in the article. The shuffle was implemented by sorting the array with a comparison function that returns a random result, a trick whose output depends on which sorting algorithm the JavaScript engine happens to use. The programmer apparently assumed Quicksort, when in reality the engines involved use other algorithms.
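The flawed idea looks roughly like this (a sketch of the technique, not the actual ballot code): hand `Array.prototype.sort` a comparator that ignores its arguments and answers randomly.

```javascript
// Sketch of the flawed "shuffle via sort" idea (not the actual ballot code).
// The comparator ignores a and b and answers at random, so the outcome
// depends entirely on how the engine's sort algorithm probes the array.
// A comparator like this also violates sort's consistency contract,
// so the result is not even well-defined by the language spec.
function brokenShuffle(array) {
  return array.slice().sort(function (a, b) {
    return 0.5 - Math.random(); // random answer, unrelated to a and b
  });
}

const browsers = ["Firefox", "Opera", "Safari", "MSIE", "Chrome"];
console.log(brokenShuffle(browsers));
```

Because the comparator gives inconsistent answers, different sort algorithms walk the array differently, and each one skews the output in its own way.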

It turns out that shuffling is notoriously hard to get right.


Consider the following two simple shuffling algorithms:
  1. For each entry Ex in the array: swap Ex with a randomly chosen entry anywhere in the array.
  2. For each entry Ex in the array: swap Ex with a randomly chosen entry at or after Ex's own position (i.e., with itself or an entry after it).
Intuitively, it seems like either approach should produce a good shuffle. That intuition is wrong: the first approach produces a non-uniform shuffle, while the second (the classic Fisher-Yates shuffle) produces a uniform one.
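The two algorithms can be sketched as follows (illustrative function names, not from any particular library):

```javascript
// Algorithm 1 (naive, NON-uniform): swap each entry with any entry at all.
function naiveShuffle(array) {
  const a = array.slice();
  for (let i = 0; i < a.length; i++) {
    const j = Math.floor(Math.random() * a.length); // j anywhere in the array
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

// Algorithm 2 (Fisher-Yates, uniform): swap each entry only with itself
// or an entry after it.
function fisherYatesShuffle(array) {
  const a = array.slice();
  for (let i = 0; i < a.length; i++) {
    const j = i + Math.floor(Math.random() * (a.length - i)); // i <= j < length
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}
```

The only difference is the range `j` is drawn from, which makes the gap in quality all the more surprising.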

Better writers than I have explained how and why this happens. One good explanation of this is at Coding Horror: The Dangers of Naivete. The crux of the proof is:
  • The total number of distinct orderings (permutations) of n cards is n! (n factorial).
  • The algorithm that swaps each card with any other card makes n random choices, each among n positions, so it has n^n equally likely ways of executing.
  • n^n is not evenly divisible by n! (for n > 2), so those n^n outcomes cannot map evenly onto the n! permutations: the shuffle cannot be uniform.
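You can see this counting argument concretely for n = 3 by brute force: enumerate all 3^3 = 27 equally likely runs of the naive algorithm and tally which permutation each run produces (a small sketch, with a hypothetical helper name):

```javascript
// Enumerate all 3^3 = 27 equally likely runs of the naive shuffle on
// [0, 1, 2] and count how often each final permutation occurs.
function naivePathCounts() {
  const counts = {};
  for (let j0 = 0; j0 < 3; j0++) {
    for (let j1 = 0; j1 < 3; j1++) {
      for (let j2 = 0; j2 < 3; j2++) {
        const a = [0, 1, 2];
        const picks = [j0, j1, j2]; // the three random choices, fixed
        for (let i = 0; i < 3; i++) {
          const j = picks[i];
          [a[i], a[j]] = [a[j], a[i]];
        }
        const key = a.join(",");
        counts[key] = (counts[key] || 0) + 1;
      }
    }
  }
  return counts;
}

// 27 is not divisible by 6, so the counts CANNOT all be equal.
// Tallying shows three permutations occur 5 times and three occur 4 times.
console.log(naivePathCounts());
```

Since 27 runs must be split among 6 permutations, some permutations necessarily get more runs than others, which is the non-uniformity in miniature.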
Prove to yourself that the 2nd algorithm above does not suffer from this counting problem (hint: count its equally likely ways of executing). For extra credit, also prove that it is uniform.
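For the second algorithm, the same brute-force enumeration works out cleanly: at step i there are only n - i allowed choices, so on 3 entries there are 3 * 2 * 1 = 3! = 6 equally likely runs, and a sketch like this (hypothetical helper name) confirms each run yields a different permutation:

```javascript
// Enumerate every run of the Fisher-Yates shuffle on [0, 1, 2]:
// step 0 picks j in {0,1,2}, step 1 picks j in {1,2}, step 2 forces j = 2.
// That is 3 * 2 * 1 = 6 equally likely runs in total.
function fisherYatesPathCounts() {
  const counts = {};
  for (let j0 = 0; j0 < 3; j0++) {
    for (let j1 = 1; j1 < 3; j1++) {
      const a = [0, 1, 2];
      const picks = [j0, j1, 2]; // the last step can only pick itself
      for (let i = 0; i < 3; i++) {
        const j = picks[i];
        [a[i], a[j]] = [a[j], a[i]];
      }
      const key = a.join(",");
      counts[key] = (counts[key] || 0) + 1;
    }
  }
  return counts;
}

console.log(fisherYatesPathCounts()); // each of the 6 permutations appears exactly once
```

Six runs, six permutations, one run per permutation: that is exactly what uniformity means here.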

In thinking about this problem, I was struck by one non-obvious fact: how can it be that, in the first (non-uniform) algorithm, some configurations occur more often than others? I mean, think about it:
  • Start with an array in some order.
  • For each element, randomly swap it with some other element.
Somehow, this makes it so that some configurations occur more often than others, and reliably so! Although the algorithm is random, somehow it favors certain configurations. This is quite surprising indeed. Really, spend a minute to let this sink in.

I haven't found a good "intuitive" explanation for why this favoritism happens. This explanation, from stackoverflow.com, comes close: "And that's the "intuitive" explanation: in your first algorithm, earlier items are much more likely to be swapped out of place than later items, so the permutations you get are skewed towards patterns in which the early items are not in their original places."
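That skew can be made concrete with the same n = 3 brute-force enumeration: tally, over all 27 equally likely runs of the naive algorithm, how often each starting value lands in each final position (a sketch with a hypothetical helper name; a uniform shuffle would put every value in every position in exactly 9 of the 27 runs):

```javascript
// For every one of the 3^3 = 27 equally likely runs of the naive shuffle
// on [0, 1, 2], count how often value v ends up at position p.
function naivePositionCounts() {
  // landed[v][p] = number of runs in which value v finishes at position p
  const landed = [[0, 0, 0], [0, 0, 0], [0, 0, 0]];
  for (let j0 = 0; j0 < 3; j0++) {
    for (let j1 = 0; j1 < 3; j1++) {
      for (let j2 = 0; j2 < 3; j2++) {
        const a = [0, 1, 2];
        const picks = [j0, j1, j2];
        for (let i = 0; i < 3; i++) {
          const j = picks[i];
          [a[i], a[j]] = [a[j], a[i]];
        }
        for (let p = 0; p < 3; p++) landed[a[p]][p]++;
      }
    }
  }
  return landed;
}

// A uniform shuffle would put 9 in every cell; the naive shuffle does not:
// the middle value lands first in 10 of 27 runs, the last value in only 8.
console.log(naivePositionCounts());
```

Some values are measurably more likely than chance to end up at particular positions, which is the same flavor of positional bias the ballot exhibited.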

Magician-turned-statistics-professor Persi Diaconis has done a lot of very interesting work in randomness, nature, and our perception thereof. One interesting question he answered was: how many times do you have to shuffle a deck of cards so that it's "random"? Another is: how random is a coin flip, really? Both have surprisingly non-obvious and interesting answers.
