This is the first draft, so there are probably all kinds of awful errors (though hopefully not in the code!). Enjoy, and then rip it to shreds so that I can improve it.


Part II - Randomized Binary Search Trees


In part I you learned that binary search trees are a powerful data structure for storage and retrieval of data. However, you also learned that basic binary search trees have exceptionally bad worst case performance. There are three common worst cases. The first corresponds to data being inserted in ascending sorted order:
Code:
Diagram 1:

  A
   \
    B
     \
      C
       \
        D
         \
          E
Data being inserted in descending sorted order:
Code:
Diagram 2:

          E
         /
        D
       /
      C
     /
    B
   /
  A
Or data being inserted in alternating order:
Code:
Diagram 3:

  A
   \
    E
   /
  B
   \
    D
   /
  C
None of these make for particularly efficient binary search trees; they are all equivalent to linked lists. The result is that the good O(log n) performance of binary search trees degenerates to O(n). Because binary search trees are often used for large data sets, this can cause severe performance problems. Unfortunately, the classic methods for maintaining an optimal tree are frighteningly complex. A simpler solution is to take advantage of the average case performance of a binary search tree.

The performance of a basic binary search tree is O(log n) on average. The average case occurs when the data being inserted is sufficiently random to create what is called a balanced tree, where at any given node that is not a leaf, there are two choices of direction:
Code:
Diagram 4:

    B
   / \
  A   D
     / \
    C   E
The simplest way to create an average case binary search tree is to permute the input randomly. However, sometimes we don't have access to all of the input at one time. In fact, usually we will be inserting items one at a time into the tree. This makes it very difficult to permute our data before inserting it into the tree.

Let us look at this a different way. An average case binary search tree can be created if every item in the data has an equal probability of being the root item. This suggests the use of the root insertion routines we wrote in the last tutorial. At every point in the insertion, make a random selection of root insertion or basic insertion. The end result will be an average case tree. This algorithm is sound, but in this tutorial we will be going with a variation on binary search trees and heaps called a treap. The treap is simple to write, results in the average case with a high probability, and has the added benefit that we have both a binary search tree and a heap (two useful data structures) at our disposal. A treap follows a few rules:
Code:
Binary search tree rules:

For a node u with a left child v and a right child w:
  v->data < u->data < w->data

Heap rules:

For a node u with a left child v and a right child w:
  v->priority < u->priority and w->priority < u->priority
By assigning priorities randomly (or by hashing the data) and then maintaining both the binary search tree condition and the heap condition, we have an excellent chance of getting a balanced binary search tree. The worst case is still O(n), but it is very rare. In cases where guaranteed balance isn't necessary, a treap will perform adequately and is quick to implement.

Exercise 1: Can you think of any other combinations of input that would create a degenerate tree?

Exercise 2: Verify that by meeting both the binary search tree conditions and the heap conditions, the result is an average case binary search tree.

Exercise 3: Determine the probability of a sufficiently degenerate tree using a treap.

Bonus Exercise: Write an insertion function that randomly uses root insertion or basic insertion at every node. This algorithm may be difficult because only an informal description is given above.


Insertion

Insertion into a treap is straightforward. Give the new item a random priority, perform a normal binary search tree insertion, and then walk back up the tree performing rotations until the heap condition is met. Because rotations preserve the binary search tree condition, the tree remains valid throughout. The recursive algorithm to do this is trivial:
Code:
Source 1:

treap_insert ( T, n, k ) begin
  if T == null then
    T = n
  else
    dir = k > T.data
    T.link[dir] = treap_insert ( T.link[dir], n, k )
    if T.priority < T.link[dir].priority then
      save = T.link[dir]
      T.link[dir] = save.link[!dir]
      save.link[!dir] = T
      T = save
    endif
  endif

  return T
end
To convert this to C/C++ code we will need to define a few types first. The node structure is little changed from the last tutorial:
Code:
Source 2:

struct pretree_node {
  struct pretree_node *link[2];
  int priority;
  int data;
} NullItem = { &NullItem, &NullItem, RAND_MAX, 0 }, *Null = &NullItem;
Notice the added priority member. Also notice that I have defined two variables of type pretree_node: an instance initialized to boundary values, and a pointer initialized to the address of that instance. Our treap will be using Null as a sentinel marking the invalid links, much like the null pointer did previously. The reason for this will become clear when we implement pretree_remove. For now, just remember that Null is used wherever NULL was used in the last tutorial.

The pretree type is the same as it has been (the size member has been removed for brevity):
Code:
Source 3:

struct pretree {
  struct pretree_node *header;
};
Now we can begin writing pretree_insert. Remember that recursive solutions will often need helper functions to initialize and finalize the algorithm. In this case we need only to assign the final result to the header:
Code:
Source 4:

void pretree_insert (
  struct pretree *tree,
  struct pretree_node *new_item,
  int search_key )
{
  tree->header = insert_subtree ( tree->header, new_item, search_key );
}
The search_key argument is used to find the appropriate location in the treap. Simpler implementations will use new_item->data, but in some cases this may not be sufficient. The new_item argument is an initialized pretree_node, meaning all of its members will be in a predictable state. In my test code I have used the following (naive) function to ensure this:
Code:
Source 5:

struct pretree_node *makenode ( int data )
{
  struct pretree_node *rn = malloc ( sizeof *rn );
  rn->data = data;
  rn->priority = rand();
  rn->link[0] = rn->link[1] = Null;
  return rn;
}
The function is then called like so:
Code:
Source 6:

pretree_insert ( &t, makenode ( i ), i );
While the code in my test scaffolding is far from production quality, it gives you an idea of how pretree_insert can be used. But enough preliminaries; on to the recursive function, insert_subtree:
Code:
Source 7:

struct pretree_node *insert_subtree (
  struct pretree_node *header,
  struct pretree_node *new_item,
  int search_key )
{
  if ( header == Null )
    header = new_item;
  else {
    int dir = search_key > header->data;
    header->link[dir] = insert_subtree ( header->link[dir], new_item, search_key );
    if ( header->priority < header->link[dir]->priority ) {
      struct pretree_node *save = header->link[dir];
      header->link[dir] = save->link[!dir];
      save->link[!dir] = header;
      header = save;
    }
  }

  return header;
}
This function is painfully simple. The only potentially confusing part is the use of dir to determine a right or left rotation. One can avoid this complexity by using two cases instead:
Code:
Source 8:

if ( search_key < header->data ) {
  header->link[0] = insert_subtree ( header->link[0], new_item, search_key );
  if ( header->priority < header->link[0]->priority ) {
    struct pretree_node *save = header->link[0];
    header->link[0] = save->link[1];
    save->link[1] = header;
    header = save;
  }
}
else {
  header->link[1] = insert_subtree ( header->link[1], new_item, search_key );
  if ( header->priority < header->link[1]->priority ) {
    struct pretree_node *save = header->link[1];
    header->link[1] = save->link[0];
    save->link[0] = header;
    header = save;
  }
}
Of course, that defeats the purpose of using the link array in pretree_node. Those familiar with boolean logic will have no trouble with the function.

The internal workings of this algorithm may not be immediately obvious to anyone not familiar with recursion. The nonrecursive solution is more verbose, but it provides insight into what is really going on. It also has a striking similarity to the root insertion function from the last tutorial:
Code:
Source 9:

int pretree_insert(
  struct pretree *tree,
  struct pretree_node *new_item,
  int search_key )
{
  struct pretree_node *walk;
  struct pretree_node *up[50];
  int dir;
  int top = 0;

  /*
   * Handle empty tree case
   */
  if ( tree->header == Null ) {
    tree->header = new_item;
    return 1;
  }
  /*
   * Search for an empty link
   * Return failure if a duplicate is found
   */
  walk = tree->header;
  for ( ; ; ) {
    if ( walk->data == search_key )
      return 0;
    dir = walk->data < search_key;
    if ( walk->link[dir] == Null )
      break;
    up[top++] = walk;
    walk = walk->link[dir];
  }
  /*
   * Insert the new item
   */
  walk->link[dir] = new_item;
  /*
   * Walk back up and rotate
   */
  while ( new_item != tree->header ) {
    if (walk->priority > new_item->priority)
      break;
    dir = new_item == walk->link[1];
    walk->link[dir] = new_item->link[!dir];
    new_item->link[!dir] = walk;
    /* Notify the rest of the tree of changes */
    if ( top > 0 && up[top - 1] != Null ) {
      dir = ( walk == up[top - 1]->link[1] );
      up[top - 1]->link[dir] = new_item;
    }
    /* Reseat header and stop at the top of the tree */
    if ( tree->header == walk ) {
      tree->header = new_item;
      break;
    }
    /* Move up and start over */
    walk = up[--top];
  }

  return 1;
}
A return value is used for error codes in this implementation. One should use them wherever possible. The only reason the recursive function neglects to return a useful error condition is that doing so would complicate the code without adding anything. The details of the loop that climbs up the tree performing rotations are left as an exercise. Pencil and paper are invaluable here.

Exercise 4: Rewrite the treap to use left and right pointers instead of the link array. Is the clarity gained worth the increase in code size?

Exercise 5: What possible use could Null be? Try to guess what the reason for using a sentinel over a null pointer is.

Exercise 6: Rewrite source 5 and 6 such that the code is more robust and better suited for production code.

Exercise 7: Trace the execution of source 7. Verify that it does what you expect.

Exercise 8: Rewrite source 4 and 7 to return an appropriate error code on any failure.

Exercise 9: Compare source 9 with source 19 of part I. Note the similarities.

Bonus Exercise: Is rand the best way to find a good priority for our treap? Determine a more effective way of calculating random priorities.


Removal

Removal of a node is not as important as insertion, and many texts omit the code for it. While I can sympathize, I prefer to give at least a little working code whenever it would be of benefit. The algorithm for removing from a treap is exactly the opposite of insertion. Find the item and perform rotations (walking the item down the tree) until it is a leaf, then remove it. The algorithm is short:
Code:
Source 10:

treap_remove ( T, o, k ) begin
  if T != Null then
    if k < T.data then
      T.link[0] = treap_remove ( T.link[0], o, k )
    else if T.data < k then
      T.link[1] = treap_remove ( T.link[1], o, k )
    else
      # Rotate the child with the larger priority up,
      # but never promote Null above a real child
      if T.link[1] == Null then
        dir = 0
      else if T.link[0] == Null then
        dir = 1
      else
        dir = T.link[0].priority < T.link[1].priority
      endif
      save = T.link[dir]
      T.link[dir] = save.link[!dir]
      save.link[!dir] = T
      T = save
      if T != Null then
        T = treap_remove ( T, o, k )
      else
        o = T.link[1] # Save the old item
        T.link[1] = Null
      endif
    endif
  endif

  return T
end
The search is simple: if the key is smaller or larger than the current item, move left or right accordingly. Once the item is found, perform the appropriate rotation and continue to recurse if the result is not Null; otherwise the item has been walked down to a leaf, so save it and remove it from the tree. The walk back up the recursive chain fixes any unraveled links.

Once again we need a helper function to prepare and finalize the recursive function:
Code:
Source 11:

void pretree_remove (
  struct pretree *tree,
  struct pretree_node **old_item,
  int search_key )
{
  tree->header = remove_subtree ( tree->header, old_item, search_key );
}
Very easy, just like for insertion. The old_item argument is a pointer to a pointer so that we can save the old item and client code can deal with it as the application demands. If our code released the node's memory itself, that might not be what the client wanted, so this function gives clients flexibility at the cost of a little more work on their part. Note that old_item is only assigned when the key is found, so clients should initialize it (to Null, for example) before calling.

The recursive function in C/C++ is a straight translation of the pseudocode:
Code:
Source 12:

struct pretree_node *remove_subtree (
  struct pretree_node *header,
  struct pretree_node **old_item,
  int search_key )
{
  if ( header != Null ) {
    if ( search_key < header->data )
      header->link[0] = remove_subtree ( header->link[0], old_item, search_key );
    else if ( header->data < search_key )
      header->link[1] = remove_subtree ( header->link[1], old_item, search_key );
    else {
      int dir;
      struct pretree_node *save;

      /*
       * Rotate the child with the larger priority up,
       * but never promote Null above a real child
       */
      if ( header->link[1] == Null )
        dir = 0;
      else if ( header->link[0] == Null )
        dir = 1;
      else
        dir = header->link[0]->priority < header->link[1]->priority;
      save = header->link[dir];
      header->link[dir] = save->link[!dir];
      save->link[!dir] = header;
      header = save;
      if ( header != Null )
        header = remove_subtree ( header, old_item, search_key );
      else {
        *old_item = header->link[1];
        header->link[1] = Null;
      }
    }
  }

  return header;
}
By studying this code, you can see why I chose to use a sentinel instead of a null pointer to mark the end of a path in the treap. If I had used NULL, this section of code
Code:
Source 13:

if ( header != Null )
  header = remove_subtree ( header, old_item, search_key );
else {
  *old_item = header->link[1];
  header->link[1] = Null;
}
would surely break, because header is dereferenced in the case where it is an invalid item. Once again, I encourage you to run the code on both paper and a computer so that you might fully understand how it works.

Exercise 10: Trace the execution of source 12. Verify that it performs as expected.

Exercise 11: Rewrite all source up to this point to use a null pointer instead of a sentinel node. Explain why I chose to use a sentinel for this tutorial.

Exercise 12: Rewrite source 12 to use a nonrecursive solution.

Exercise 13: Use a walk-down-rotation scheme for basic binary tree removal instead of deletion-by-copying. Explain whether or not this would be a suitable alternative to the deletion methods given in part I.


Conclusion

Binary search trees are a good start, but they have a nasty worst case that happens to be very common. By using a randomized algorithm to restructure the tree as data are inserted, we can avoid those worst cases. While I chose to describe only one variation of randomized binary search trees (the treap), there are many others. The similarity to the quicksort algorithm should not go unnoticed either. The basic quicksort is like the basic binary search tree: fast, but with a common worst case that degenerates the algorithm badly. With the benefit of randomization, both quicksort and binary search trees can minimize the probability of encountering a worst case situation.

In the next installment, we will discuss ways of guaranteeing near optimal balance of a binary search tree at all times.