Wednesday, 19 September 2012

Puzzle 57: What's in a Name?


This program consists of a simple immutable class that represents a name, with a main method that puts a name into a set and checks whether the set contains the name. What does the program print?
import java.util.*;



public class Name {

    private final String first, last;



    public Name(String first, String last) {

        this.first = first;

        this.last = last;

    }



    public boolean equals(Object o) {

        if (!(o instanceof Name))

            return false;

        Name n = (Name)o;

        return n.first.equals(first) && n.last.equals(last);

    }

    

    public static void main(String[] args) {

        Set<Name> s = new HashSet<Name>();

        s.add(new Name("Mickey", "Mouse"));

        System.out.println(

            s.contains(new Name("Mickey", "Mouse")));

    }

}


Solution 57: What's in a Name?

A Name instance consists of a first name and a last name. Two Name instances are equal, as computed by the equals method, if their first names are equal and their last names are equal. First names and last names are compared using the equals method defined in String. Two strings are equal if they consist of the same characters in the same order. Therefore, two Name instances are equal if they represent the same name. For example, the following method invocation returns true:
new Name("Mickey", "Mouse").equals(new Name("Mickey", "Mouse"))


The main method of the program creates two Name instances, both representing Mickey Mouse. The program puts the first instance into a hash set and then checks whether the set contains the second. The two Name instances are equal, so it might seem that the program should print true. If you ran it, it almost certainly printed false. What is wrong with the program?
The bug is that Name violates the hashCode contract. This might seem strange, as Name doesn't even have a hashCode method, but that is precisely the problem. The Name class overrides the equals method, and the hashCode contract demands that equal objects have equal hash codes. To fulfill this contract, you must override hashCode whenever you override equals [EJ Item 8].
Because it fails to override hashCode, the Name class inherits its hashCode implementation from Object. This implementation returns an identity-based hash code. In other words, distinct objects are likely to have unequal hash values, even if they are equal. Name does not fulfill the hashCode contract, so the behavior of a hash set containing Name elements is unspecified.
When the program puts the first Name instance into the hash set, the set puts an entry for this instance into a hash bucket. The set chooses the hash bucket based on the hash value of the instance, as computed by its hashCode method. When it checks whether the second Name instance is contained in the hash set, the program chooses which bucket to search based on the hash value of the second instance. Because the second instance is distinct from the first, it is likely to have a different hash value. If the two hash values map to different buckets, the contains method will return false: The beloved rodent is in the hash set, but the set can't find him.
Suppose that the two Name instances map to the same bucket. Then what? All HashSet implementations that we know of have an optimization in which each entry stores the hash value of its element in addition to the element itself. When searching for an element, the implementation selects the appropriate hash bucket and traverses its entries, comparing the hash value stored in each entry with the hash value of the desired element. Only if the two hash values are equal does the implementation check the elements for equality. This optimization makes sense because it is usually much cheaper to compare hash codes than elements.
Because of this optimization, it is not enough for the hash set to search in the right bucket; the two Name instances must have equal hash values in order for the hash set to recognize them as equal. The odds that the program prints TRue are therefore the odds that two consecutively created objects have the same identity hash code. A quick experiment showed these odds to be about one in 25,000,000. Results may vary depending on which Java implementation is used, but you are highly unlikely to see the program print TRue on any JRE we know of.
To fix the problem, simply add an appropriate hashCode method to the Name class. Although any method whose return value is determined solely by the first and last name will satisfy the contract, a high-quality hash function should attempt to return different hash values for different names. The following method will do nicely [EJ Item 8]. Once this method is added, the program will print true as expected:
public int hashCode() {

    return 37 * first.hashCode() + last.hashCode();

}


In summary, always override hashCode when you override equals. More generally, obey the general contract when you override a method that has one. This is an issue for most of the non-final methods declared in Object [EJ Chapter 3]. Failure to follow this advice can result in arbitrary, unspecified behavior.

No comments:

Post a Comment

Your comments are welcome!