Hashing, equality, and comparison

If two items in a collection are equal, they must hash to the same value. This is important to understand if you are defining a subclass of MCollectible or if you are defining a subclass of TComparator. For two elements in a collection to be considered equal, they must produce identical hash values.

Although hash values must be equal for objects to be equal, the converse need not be true. Two objects hashing to the same value does not necessarily mean the objects are equal. For reasons of efficiency, object hash values of two objects are first compared to test for equality, before a more expensive comparison is made. If two objects do not hash to the same value, they cannot be equal, so the second, more expensive test can be avoided. Because most objects are not equal when compared, the quick hash comparison, based on the comparison of two longs, can be used to determine that two objects are not equal.

If hash values are not equal, elements are not equal

Two criteria must be met in order for two objects to be considered equal. First both objects must produce the same hash value. If they do not produce the same hash value they can not be equal. Second, the comparator calls its Compare function with each of the two objects being compared as arguments. The comparators Compare function in turn calls the Compare member function defined for the particular class of objects in the collection.

However, the comparator's Compare function is only called if the comparator's GetHash function has returned the same value for both objects. If Compare returns a value that indicates equality, the objects are considered equal. Thus if you subclass, TComparator, you must override both GetHash and Compare in a way that meets these two criteria. You must assure that objects which you consider equal will return identical hash values. Also, the hashing algorithm selected is important as an algorithm which provides good distribution of objects which should not be considered equal. Calls to a potentially costly implementation of Compare can be replaced by a simple comparison of the two long values returned by the hash function for each of the elements being compared.

Defining a comparator to reverse sort order

Consider a program which requires a non-standard comparator for a collection. For example, you might want to use comparator to sort a collection of text elements in reverse lexical order. You could do this be creating a special class of comparator.

  class TReversedComparator : public TMOrderableCollectibleComparator<TStandardText> {
  public:
      virtual EComparisonResult   Compare(const TStandardText& leftObj,
                                          const TStandardText& rightObj) const;
      virtual HashResult          GetHash(const TStandardText& obj) const 
                                          { return obj.Hash(); }
  };

Whenever you subclass TComparator or one of its descendants you must override the definitions of Compare and GetHash. GetHash is implemented inline in this definition and returns the hash value defined by TStandardText::Hash for the specific elements being compared. Here's a way to implement compare so that it reverses the normal order for sorting a collection of TStandardText objects.

  EComparisonResult TReversedComparator::Compare(const TStandardText& leftObj,
                                                 const TStandardText& rightObj) const
  {
      qprintf("ENTER comparator\n");
      if (leftObj.IsLessThan(&rightObj))
          return kGreaterThan;
      if (leftObj.IsGreaterThan(&rightObj))
          return kLessThan;
      return kEqual;
  }

Note that TReversedComparator was not implemented as a Template class, though it could have been if you were required to use TReversedComparator with different types of MOrderableCollectible objects. Not making the comparator class a template simplifies its syntax when used in declarations.

    TReversedComparator* comparator = new TReversedComparator;

Attaching this comparator to a TSortedSequenceOf<TStandardText> will cause collection elements to be maintained internally in reversed lexical order.

Here's a complete program.

Sample program using reversed-order comparator

  // Copyright (c) 1994 Taligent, Inc. All Rights Reserved.
  
  #ifndef Taligent_STANDARDTEXT
  #include <StandardText.h>
  #endif
  
  #ifndef _H_STDIO
  #include <stdio.h>
  #endif
  
  void printCollection(TCollectionOf<TStandardText>* collection)
  {
      int i = 0;                          // tracks order of elements
      TIteratorOver<TStandardText>* iterator = collection->CreateIterator();
      for (TStandardText* text = iterator->First();
           text != NIL;
           text = iterator->Next(), i++) {
  
          ////////////////////////////////////////////////////
          // buffer must be large enough for
          // longest possible string + zero terminator
          ////////////////////////////////////////////////////
          static const int kBufferSize = 256;
          static char extractBuffer[kBufferSize];
  
          text->Extract(TTextRange::GetMaximumRange(),
                        extractBuffer,
                        kBufferSize - 1);
          printf("\n%3d> %s", i, extractBuffer);
      }
      printf("\n");
  
      delete iterator;
  }
  
  
  
  class TReversedComparator : public TMOrderableCollectibleComparator<TStandardText> {
  public:
      virtual EComparisonResult   Compare(const TStandardText& leftObj, const TStandardText& rightObj) const;
      virtual HashResult          GetHash(const TStandardText& obj) const { return obj.Hash(); }
  };
  
  
  EComparisonResult TReversedComparator::Compare(const TStandardText& leftObj,
                                                 const TStandardText& rightObj) const
  {
      if (leftObj.IsLessThan(&rightObj))
          return kGreaterThan;
      if (leftObj.IsGreaterThan(&rightObj))
          return kLessThan;
      return kEqual;
  }
  
  
  int main()
  {
      TReversedComparator* comparator = new TReversedComparator;
  
      TSortedSequenceOf<TStandardText>* collection =
          new TSortedSequenceOf<TStandardText>(comparator, NIL);
  
      collection->Add(new TStandardText("a"));
      collection->Add(new TStandardText("b"));
      collection->Add(new TStandardText("c"));
  
      collection->Add(new TStandardText("a"));
      collection->Add(new TStandardText("b"));
      collection->Add(new TStandardText("c"));
  
  
      printCollection(collection);
  
      delete collection;
      return 0;
  }

The output shows the reversed order of TStandardText objects within the collection.

  1> c
  2> b
  3> b
  4> a
  5> a

Selecting a good hash function can be important for the efficiency of comparisons. The ideal is to use a hash function that produces a unique integer value for each element in the collection. The ideal is rarely possible, but if it were, comparisons testing for equality only (==, !=) could be determined by hash value alone. The best compromise is to provide a hash function that generates as few duplicate integer values as possible. Good distribution of hash values allows sets, for example, to quickly determine that two items are not equal and thus avoid one call to the comparator's Compare function and two calls to an elements GetValue (or equivalent) function.

In the following slightly contrived example, consider the comparator class that lets you monitor calls to the comparator's Compare member function.

Class for monitoring compare operations

  ////////////////////////////////////////////////////////////
  // class TIntComparator (declaration)
  ////////////////////////////////////////////////////////////
  class TIntComparator : public TMCollectibleComparator<TCollectibleLong> {
  public:
      virtual EComparisonResult   Compare(const TCollectibleLong& leftObj,
                                  const TCollectibleLong& rightObj) const;
      virtual HashResult          GetHash(const TCollectibleLong& obj) const 
                                          { return obj.GetValue(); }
  };
  
  ////////////////////////////////////////////////////////////
  // class TIntComparator (definition)
  ////////////////////////////////////////////////////////////
  EComparisonResult TIntComparator::Compare(const TCollectibleLong& leftObj,
                                            const TCollectibleLong& rightObj) const
  {
      qprintf("ENTER comparator\n");
      if (leftObj.IsEqual(&rightObj))
          return kEqual;
      else
          return kNotEqual;
  }

The TIntComparator::Compare simply returns the value of one integer that is being compared against a second integer. This function provides the ideal hash for longs because there is a unique hash value for each of the possible values of a TCollectibleLong. Assuming we have a printCollection function that works with collections of TCollectibleLongs a main function might be built as follows.

  int main()
  {
      TIntComparator *comparator = new TIntComparator;
      TSetOf<TCollectibleLong>* collection =
          new TSetOf<TCollectibleLong>(comparator, NIL);
  
      collection->Add(new TCollectibleLong(0));
      collection->Add(new TCollectibleLong(1));
      collection->Add(new TCollectibleLong(2));
      collection->Add(new TCollectibleLong(1));
      collection->Add(new TCollectibleLong(1));
  
      printCollection(collection);
  
      delete collection;
      return 0;
  }

Most important is that three duplicate items are added to the set. The output from this program shows that the comparator's Compare function is called exactly twice.

  ENTER comparator
  ENTER comparator
  
    0> 0
    1> 1
    2> 2

This makes sense because only in the case of the second and third additions of TCollectibleLong(1) elements does GetHash produce values that collide. Only when hash values collide is your Compare function called to see if two elements are equal.

If the definition of the hash function were changed to always return a value of 0, you'd have the worst kind of hash function possible, as shown in the following definition for GetHash.

  class TIntComparator : public TMCollectibleComparator<TCollectibleLong> {
  public:
      ...
      virtual HashResult          GetHash(const TCollectibleLong& obj) const { return 0; }
  };

All elements will map to the same hash value and your TIntComparator::Compare must be called each time a new element is added to the set to check for duplicate members. With this change the above program will produce the following output.

  ENTER comparator
  ENTER comparator
  ENTER comparator
  ENTER comparator
  ENTER comparator
  ENTER comparator
  ENTER comparator
  
    0> 0
    1> 2
    2> 1

When possible avoid the overhead associated with calls to Compare by defining GetHash to produce an evenly distributed range of hash values for your collection elements.

[Contents] [Previous] [Next]

Click the icon to mail questions or corrections about this material to Taligent personnel.

Generated with WebMaker