Wednesday, 19 September 2012

Puzzle 94: Lost in the Shuffle


The following shuffle method purports to shuffle its input array fairly. In other words, it purports to generate all permutations with equal likelihood, assuming that the underlying pseudorandom number generator is fair. Does it make good on its promise? If not, how do you fix it?
import java.util.Random;



public class Shuffle {

    private static Random rnd = new Random();



    public static void shuffle(Object[] a) {

        for (int i = 0; i < a.length; i++)

            swap(a, i, rnd.nextInt(a.length));

    }



    private static void swap(Object[] a, int i, int j) {

        Object tmp = a[i];

        a[i] = a[j];

        a[j] = tmp;

    }

}


Solution 94: Lost in the Shuffle

Looking at the shuffle method, there's nothing obviously wrong with it. It iterates through the array, swapping a randomly chosen element from the array into each location. That ought to shuffle the array fairly, right? Wrong. There's a big difference between saying "There's nothing obviously wrong with this code" and "There's obviously nothing wrong with this code." In this case, there's something very wrong, but it isn't obvious unless you specialize in algorithms.
If you call the shuffle method on an array of length n, the loop iterates n times. In each iteration, the method chooses one of the n integers between 0 and n - 1. Therefore, there are nn possible executions of the method. We assumed the random number generator is fair, so each execution occurs with equal likelihood. Each execution generates one permutation of the array. There is, however, one small problem: There are n! distinct permutations of an array of length n. (The exclamation point after n indicates the factorial operation: n factorial is defined as n x (n - 1) x (n - 2) x ...x 1.) The problem is that nn is not divisible by n! for any n greater than 2, because n! has every prime factor from 2 through n, but nn has only the prime factors that make up n. This proves beyond a shadow of a doubt that the shuffle method generates some permutations more often than others.
To make this concrete, let's consider an array of length 3 containing the strings "a", "b", and "c". There are 33 = 27 possible executions of the shuffle method. All are equally likely, and each generates some permutation. There are 3! = 6 distinct permutations of the array: {"a", "b", "c"}, {"a", "c", "b"}, {"b", "a", "c"}, {"b", "c", "a"}, {"c", "a", "b"}, and {"c", "b", "a"}. Because 27 is not divisible by 6, some of these permutations must be generated by more executions than others, so the shuffle method is not fair.
One problem with this proof is that it offers no intuition into the bias induced by the method; it merely proves that a bias exists. Often the best way to gain some insight is to perform an experiment. We ran a program that calculates the expected value of the element at each position when the method is run on the "identity array," where a[i] = i. Loosely speaking, the expected value is the average value that you'll see in the element if you run the shuffle method repeatedly. If the shuffle method were fair, the expected value would be the same for each element: ((n -1 ) / 2). Figure 10.1 shows the expected value for each element in an array of length 9. Note the distinctive shape of the graph: It starts low, increases beyond the fair value (4), and settles down to the fair value in the last element.
Figure 10.1. Expected values for the shuffle method on the identity array.


Why does the graph have this shape? We don't know all the details but we can offer some intuition. Let's restrict our attention to the array's first element. After the first iteration of the loop, it has the correct expected value of (n - 1) / 2. In the second iteration, however, there is 1 chance in n that the random number generator will return 0 and the value in the first element will be replaced by 1 or 0. In other words, the second iteration systematically reduces the expected value of the first element. In the third iteration, there's a further 1 chance in n that the first element is replaced by 2, 1, or 0, and so on. For the first n / 2 iterations of the loop, the expected value of the first element decreases. For the second n / 2 iterations, it increases but never catches up to its fair value. Note that the last element in the array is guaranteed to have the correct expected value, as the last step in the execution of the method is to select it at random from all the elements in the array.
OK, so our shuffle method is broken. How do we fix it? Use the shuffle method provided by the library:
import java.util.*;

public class Shuffle {

    public static void shuffle(Object[] a) {

        Collections.shuffle(Arrays.asList(a));

    }

}


Whenever the libraries provide a method that does what you need, use it [EJ Item 30]. Generally speaking, the libraries provides high-quality solutions requiring a minimum of effort on your part.
On the other hand, after you suffered through all that math, it seems unfair not to tell you how to fix the broken shuffle method. The fix is actually quite straightforward. In the body of the loop, swap the current element with an element selected at random from the portion of the array starting at the current element and extending to the end of the array. Do not touch an element once you've swapped a value into it. This is essentially the algorithm that is used by the library method:
public static void shuffle(Object[] a) {

    for (int i = 0; i < a.length; i++)

        swap(a, i, i + rnd.nextInt(a.length - i));

}


It's easy to prove this method fair by induction. For the base case, observe that it is trivially fair for an array of length 0. For the induction step, if you apply the method to an array of length n > 0, it correctly selects a random element for the zeroth position of the result array. Then, it iterates over the remainder of the array: At each position, it selects an element chosen at random from the "subarray" beginning at that position and extending to the end of the original array. But that is exactly what the method would do if it were applied directly to the subarray of length n - 1, starting at position 1 in the original array. This completes the proof. It also suggests a recursive formulation of the shuffle method, whose details are left as an exercise for the reader.
You might think that that's all there is to this story, but there's one final chapter. Do you suppose that the fixed shuffle method generates all permutations of a 52-card deck, represented as a 52-element array, with equal likelihood? After all, we just proved it fair. It probably won't surprise you at this point that the answer is an emphatic no. The problem is that we assumed, way back at the start of the puzzle, that "the underlying pseudorandom number generator is fair." It isn't.
The random number generator, java.util.Random, takes a 64-bit seed, and the sequence of numbers it generates is fully determined by that seed. There are 52! permutations of a 52-card deck, but only 264 seeds. What fraction of the permutations does that cover? Would you believe 2.3 x 10-47 percent? That is a polite way of saying "practically none." If you use java.security.SecureRandom in place of java.util.Random, you get a 160-bit seed, but that buys you surprisingly little: The shuffle method still fails to return some permutations for arrays with more than 40 elements (because 40! > 2160). For a 52-element array, you still get only 1.8 x 10-18 percent of the possible permutations.
Does that mean that you shouldn't trust these pseudorandom number generators for shuffling cards? It depends. They generate a negligible fraction of the possible permutations, but they have no systematic bias that we're aware of. It seems fair to say that these generators are good enough for casual use. If you need a state-of-the-art random number generator, you'll have to look elsewhere.
In summary, shuffling an array, like many algorithms, is tricky. It's easy to get it wrong and hard to tell that you did. All other things being equal, you should use trusted libraries in preference to handwritten code. If you want to learn more about the issues discussed in this puzzle, see [Knuth98 3.4.2].

No comments:

Post a Comment

Your comments are welcome!