![]() |
| | #1 |
| Code Goddess Join Date: Sep 2001
Posts: 9,661
| Binary Search Trees Part III ![]() Part III - Balanced Binary Search Trees Binary search trees are easy to implement and understand for beginning programmers. Unfortunately, in a production environment these trees are typically avoided for one simple reason: performance in the worst case. Let us assume that we have a basic recursive binary search tree with a pretty print routine: Code: #include <stdio.h>
#include <stdlib.h>
struct node {
int data;
struct node *link[2];
};
struct node *make_node ( int data )
{
struct node *tree = malloc ( sizeof *tree );
if ( tree == NULL )
return NULL;
tree->data = data;
tree->link[0] = tree->link[1] = NULL;
return tree;
}
struct node *insert_node ( struct node *tree, int data )
{
if ( tree == NULL )
tree = make_node ( data );
else {
int dir = data > tree->data;
tree->link[dir] = insert_node ( tree->link[dir], data );
}
return tree;
}
void tree_structure ( struct node *tree, int level )
{
int i;
if ( tree == NULL ) {
for ( i = 0; i < level; i++ )
printf ( "\t" );
printf ( "~\n" );
return;
}
tree_structure ( tree->link[1], level + 1 );
for ( i = 0; i < level; i++ )
printf ( "\t" );
printf ( "%d\n", tree->data );
tree_structure ( tree->link[0], level + 1 );
}
Degenerate Trees Any operation that involves searching on a binary search tree will typically perform in logarithmic time on average. On average this is good, and binary search trees serve their purpose wonderfully. However, problems arise when the average case turns into the worst case. Many logarithmic algorithms suffer from quadratic worst case performance, and basic binary search trees are no exception. If the data are inserted into a basic binary search tree in sorted order or alternating order, the result will be a tree with a height equal to the number of elements in the tree. As you learned earlier in this series, the height of a tree is the longest path from the root to a leaf. When the height is short, the tree branches nicely and searching is performed in logarithmic time: Code: 8 4 12 1 6 9 13 Code: 1
4
6
8
9
12
13
Fortunately, all is not lost. There are many ways to force a binary search tree to maintain a short height even when faced with sorted input. Unfortunately, these methods are generally quite complex and very few programmers enjoy writing or maintaining them. Luckily for you, I am one of those who enjoys it very much. :-) In this part of the binary search tree tutorial, we will cover balanced binary search trees. These trees use various clever techniques for maintaining a nice branching structure on the tree so that it will perform in logarithmic time, guaranteed. Because of the vast number of techniques for balancing a binary search tree, I cannot hope to fit all of them into a single article. Therefore, I will focus on three popular balanced trees: AVL trees, Red Black trees, and Splay trees. Guaranteed Logarithmic Performance Balanced binary search trees are designed with good performance in mind, regardless of the ordering of insertions and deletions. Balanced trees fall into one of three categories. The first we have already covered in part II, Randomized trees. These trees reduce the chances of a degenerate tree to virtually nothing by utilizing random numbers in operations that modify the tree structure. Randomized algorithms are becoming more popular because they are generally simple to write and maintain as well as extend. Their performance is rather difficult to analyze, but most of us just leave that to the brainy mathematicians and trust that the algorithms work properly. :-) The second category is by far the most common. Optimized algorithms guarantee good performance by working very hard to maintain the best possible case without hurting performance. Optimized algorithms are difficult to write and maintain, and a PHD in algorithm analysis is usually required to extend them without breaking something. ;-) The optimized balanced tree algorithms that we will be studying in this article are for AVL trees and Red Black trees. AVL trees are guaranteed to perform well because they are balanced according to height. If the height of a node's subtree grows higher than the other by more than one, the tree is restructured to fix it. Most of the time, the height of an AVL tree will be very close to log N, with a worst case of 1.44log N. This is as close to optimal as possible while still being practical. AVL trees are best used when the data is known to come in sorted order with high frequency, and any searches will be fairly random. Red Black trees are an abstraction of the infamous B-tree that is generally used for database applications. This abstraction has largely been forgotten because Red Black trees are believed to be simpler without it, but I feel that knowing the rationale behind the red black scheme helps to understand the concept. Red Black trees are not as optimal as AVL trees; they have a maximum height of 2 * log N. This is not necessarily a bad thing because Red Black trees do not work nearly as hard to maintain the strucutre. Thus, a Red Black tree may perform better than an AVL tree despite having a larger average height. Red Black trees are best suited to applications where the input is largely random with occasional spouts of ordered data. The third category consists of Amortized algorithms. These are algorithms that guarantee performance over a large number of operations. The Splay tree is what we will be covering. A splay tree requires that every time an item is searched for, it is lifted to the root of the tree. Splaying is not like the root insertion routine we wrote in Part I though; there are a few slippery areas. Most notable about splay trees is the extreme difficulty in analyzing them. This is because the tree structure is dynamic in ways that cannot be predicted since the splaying is performed based on search hits. While any one operation on a splay tree may be inefficient, it has been proven (through mathematical proofs that I will never understand) that the total running time of all operations put together is logarithmic. Hence the title of Amortized, the expensive operations are averaged with the more abundant less expensive operations to give excellent performance overall. Splay trees are best suited to applications where data is searched for in clusters or the same few items are searched for often. AVL Trees Concept The AVL tree is named after its creators, G.M. Adelson-Velsky and E.M. Landis. The general idea is that for every node, the height of its left subtree never differs by more than 1 by the height of its right subtree. When all nodes in a tree meet this requirement, the tree is balanced. The following is a diagram showing an AVL tree of ten nodes: Code: 8\
4| 12|
2| 6| 10| 14\
a b c d 9| 11| i 15|
e f g h j k
Note that when some new items are added, the tree does not go out of balance. If we added a node a, b, c, d, or i then no restructuring is necessary. If a new item is added anywhere else, the structure of the tree will need to be adjusted to accound for it. The tree structure will need to be fixed when a node with a balance factor of \ has its right subtree grow in height, or when a node with a balance factor of / has its left subtree grow in height. It isn't too difficult to see that the problem areas arise in four distinct cases, two for a right imbalance and two for a left imbalance. The first case is where a node's right subtree whose right subtree has grown in height. In the following diagram, A's right node B has increased its right height: Code: A\
h B\
h h+1
Code: B| A| h+1 h h The second case consists of a node whose right subtree has had its left subtree grow in height. In the following diagram, A's right node B has increased its left height through C: Code: A\
h B/
C h
h-1 h
Code: A C|
h C A B
---->
h-1 B h h h h
h h
The symmetric cases (where the imbalance is in A's left subtree) are identical. In our implementation we will take advantage of this to make the code shorter and easier to follow. Removal of an item is more complicated and can be performed several ways. By far the simplest is to use simple deletion by copying (as explained in Part I). However, the algorithm cannot end there because an imbalance may have occurred by shortening a node's subtree to the point where rebalancing is required. Therefore, the algorithm must then move down to the successor (or predecessor, depending on how the deletion is implemented) and walk back up the tree from there while changing balance factors and performing rotations if necessary. For node removal the principles are the same, consisting of single and double rotations with careful modification of the balance factors, so we will not cover the process until later. When we get to the implementation for remove_node, hopefully will become clear. :-) Exercises 1) Diagram the symmetric cases for rebalancing, where A's left subtree has increased in height. 2) Using the descriptions given, try to work out what changes need to be made to the balance factors of A and C when a double rotation is performed. 3) What is the time complexity of insertion and removal from an AVL tree? Balance Factors AVL trees theoretically only need two extra bits added to a node to handle the balance factors. This is the case when the balance factors are only allowed to hold three values, -1 when the left subtree is higher, +1 when the right subtree is higher, and 0 when both subtrees are of equal height. In our implementation these values will be slightly different, where 0 is used for a higher left subtree, 1 for a higher right subtree, and 2 for equal height. The choice of value for equal height for our implementation is arbitrary as long as it isn't either 0 or 1. The C structure for an AVL node can be written a number of ways, but the most common are using a signed char for the balance factor: Code: struct node {
int data;
signed char balance;
struct node *link[2];
};
Code: struct node {
int data;
int balance;
struct node *link[2];
};
Code: struct node {
int data;
int balance: 2;
struct node *link[2];
};
Some implementations of AVL trees will write the balance factors as weight counts. A weight count is the number of nodes in both subtrees. The AVL scheme can be maintained by taking the difference of the weight count for the left and right subtree and rebalancing if they differ by more than 1. This often results in less convoluted code than the strict -1, 0, +1 balance factor scheme, but some slippery areas require hackish feeling solutions and the trees can only grow so large before a larger data type is required for the weight count. As an example, consider an implementation with 16 bit integers where the node structure looks like this: Code: struct node {
int data;
int balance;
struct node *link[2];
};
Exercises 1) What other ways can you think of to represent balance factors? 2) Determine the most space efficient order for the fields in an AVL node using a bitfield for the balance factor. 3) Why are two bits the minimum for the AVL balance factor? 4) Try to guess why our implementation needs 0 and 1 for the left and right balance factors and why the equal height balance factor can be anything. Insertion Insertion into an AVL tree consists of three parts: Searching for any empty location to insert at, inserting a new node at that location, and rebalancing the tree. The search and insertion steps are identical to that of a basic binary search tree. However, before we go any further a bit of preliminary code is needed. First we need to modify the basic binary search tree node to allow for balance factors. This is a simple addition of the balance field: Code: struct node {
int data;
int balance;
struct node *link[2];
};
Another auxilliary function that we might find useful is a function designed to create a node: Code: #include <stdlib.h>
struct node *make_node ( int data )
{
struct node *tree = malloc ( sizeof *tree );
if ( tree == NULL )
return NULL;
tree->data = data;
tree->balance = 2;
tree->link[0] = tree->link[1] = NULL;
return tree;
}
Next, because it is a good habit to modularize commonly used operations we will write separate single_rotate and double_rotate functions. These functions will be used heavily in all balanced trees, so it would be a good idea to write them now as they do nothing special with AVL balance factors: Code: struct node *single_rotate ( struct node *tree, int dir )
{
struct node *save = tree->link[dir];
tree->link[dir] = save->link[!dir];
save->link[!dir] = tree;
return save;
}
struct node *double_rotate ( struct node *tree, int dir )
{
struct node *save = tree->link[dir]->link[!dir];
tree->link[dir]->link[!dir] = save->link[dir];
save->link[dir] = tree->link[dir];
tree->link[dir] = save;
save = tree->link[dir];
tree->link[dir] = save->link[!dir];
save->link[!dir] = tree;
return save;
}
Insertion into an AVL tree, as said previously, is identical to that of a basic binary search tree with the exception of the structural changes and maintenance of balance factors on the walk back up to the root. The only issue is determining when rebalancing is required. If we rebalance at every step up the tree, the AVL condition will not be properly maintained without a slew of special cases, so it makes sense to avoid rebalancing if a node is not inserted such as if it already exists in the tree (we will not implement this feature) or after rotations have completely fixed the structure of the tree. At this point no further rotations or balance factor changes are needed, so we need a way to notify the algorithm when (and especially when not) to rebalance itself. The solution is a status flag passed to the insertion function. Assuming that we have a function to handle rebalancing, a recursive implementation is immediate: Code: struct node *insert_node ( struct node *tree, int data, int *deeper )
{
if ( tree == NULL ) {
tree = make_node ( data );
*deeper = 1;
}
else {
int dir = data > tree->data;
tree->link[dir] = insert_node ( tree->link[dir], data, deeper );
if ( *deeper )
tree = rebalance_insert ( tree, dir, deeper );
}
return tree;
}
Using a stack instead of recursion is a viable alternative, but it buys us little with a balanced tree. There are some performance issues, but in this case clarity wins out over speed. Knuth describes a way to insert into an AVL tree without using a stack or recursion by saving the point at which the tree might need to be rebalanced at each insertion because AVL insertion requires at most either a single or a double rotation. This is an excellent idea and our implementation would have used it if not for the desire to give insert and deletion a similar feel. Insertion requires at most one rotation, but deletion could require up to log2(N) rotations. Because of that, the iterative deletion algorithm must use a stack of both visited nodes and directions taken. Our implementation uses recursion for both insertion and deletion, and the result is a similar solution in both cases. I believe that this will aid in understanding. The rebalance_insert function handles the two cases described in the AVL tree concept section. The dir variable is used to avoid the symmetric cases by merging them into one. If dir is 0 then a left rebalance is performed, otherwise dir is 1 and a right rebalance is performed. Nifty, isn't it? :-) Code: struct node *rebalance_insert ( struct node *tree, int dir, int *new_depth )
{
if ( tree->balance == dir ) {
if ( tree->link[dir]->balance == dir ) {
tree->link[dir]->balance = 2;
tree->balance = 2;
tree = single_rotate ( tree, dir );
}
else {
fix_balance_factors ( tree, dir );
tree = double_rotate ( tree, dir );
}
*new_depth = 0;
}
else if ( tree->balance == !dir ) {
tree->balance = 2;
*new_depth = 0;
}
else
tree->balance = dir;
return tree;
}
The second simplest case is when the node's shorter subtree has grown higher. In this case the tree has gotten more balanced and no rotations are necessary, so the balance factor is set to reflect equal balance and the status flag is cleared. Rebalancing is done for this insertion. The third case is when the node's taller subtree has grown higher. The only solution at this point is a rotation, and we now enter the two cases described in the concept section. If the node's dir link has a balance factor of dir then Case 1 applies and a single rotation is performed. The balance factors of the node and it's dir link are change to show that the tree has now become balanced. Both balance factors are of equal height as the diagram in the concept section shows. Case 2 applies if the node's dir link has a balance factor of equal or !dir. According to the diagram, A is tree, B is tree->link[dir], and C is tree->link[dir]->link[!dir]. This means that either of C's subtrees have grown higher, or C is the new node. We know that a double rotation is needed, but how to change the balance factors before the rotation remains a mystery because you already know that it depends on the present setting of the balance factors of C. If C's balance factor is dir then A's new balance factor is 2, signifying equal height and B's new balance factor is !dir. If C's balance factor is !dir then A's new balance factor is dir and B's new balance factor is 2. If C's balance factor is 2 then both A and B are given balance factors of 2 as well. After all of these cases, we make sure that C has a balance factor of 2 (remember the diagram). The code for this magic is simple even if exactly why those balance factors are needed is not: Code: void fix_balance_factors ( struct node *tree, int dir )
{
if ( tree->link[dir]->link[!dir]->balance == dir ) {
tree->balance = 2;
tree->link[dir]->balance = !dir;
}
else if ( tree->link[dir]->link[!dir]->balance == !dir ) {
tree->balance = dir;
tree->link[dir]->balance = 2;
}
else
tree->balance = tree->link[dir]->balance = 2;
tree->link[dir]->link[!dir]->balance = 2;
}
Exercises 1) fix_balance_factors maintains balance factors based on the balance factor of C. Why are these particular changes necessary? In what situations will each case be used? Will different balance factors still work? Which ones? 2) Write down a test run for various inputs on a piece of paper. Follow the code exactly and see what it does. Does the algorithm do what you expected? 3) Can this insertion algorithm be rewritten without the status flag? 4) Rewrite insert_node without recursion using what you learned in Part I. Is it more efficient? Is the performance gain worth the effort? 5) Rewrite insert_node without recursion or the use of a stack. Hint: Save the first node along the search path without an equal balance factor. This is the point where rebalancing may be needed. Deletion Removing a node from an AVL tree is more complicated than inserting a node. The problem is removing an internal node without causing an imbalance elsewhere in the tree that is not known. There are a few different ways of going about AVL deletion, but we will be using a recursive scheme similar to the insertion algorithm. Before we get started, we need a a helper function, just like with insertion. Code: void destroy_node ( struct node *tree )
{
free ( tree );
}
Like insert_node, remove_node will also make use of a status flag for the same reasons. We would not want to perform rotations or change balance factors without reason because that would ruin the already fragile AVL data structure. This is why AVL trees are seen as difficult, even the smallest change can destroy the algorithm because it relies so heavily on the proper setting of balance factors. As we did in Part I, we will use a deletion-by-copying algorithm, but unlike in Part I we will do it almost completely with recursion. The algorithm without any balancing mumbo jumbo looks like this: Code: struct node *remove_node ( struct node *tree, int data )
{
if ( tree == NULL )
;
else if ( data == tree->data ) {
if ( tree->link[0] != NULL && tree->link[1] != NULL ) {
struct node *heir = tree->link[0];
while ( heir->link[1] != NULL )
heir = heir->link[1];
tree->data = heir->data;
tree->link[0] = remove_node ( tree->link[0], tree->data );
}
else {
struct node *save = tree;
tree = tree->link[tree->link[0] == NULL];
destroy_node ( save );
}
}
else {
int dir = data > tree->data;
tree->link[dir] = remove_node ( tree->link[dir], data );
}
return tree;
}
To convert remove_node into an algorithm that can remove an item from an AVL tree and still maintain the AVL condition, we need to consider only two things: how to figure out when to rebalance and where to rebalance. The first question is simple because we can use the same solution as last time, use a status flag. The second problem is not as simple as insertion because we have two unique paths that may need rebalancing. Remember that deletion from an AVL tree may take up to log2(N) rotations, so we need to consider that every node from the predecessor to the root could need rebalancing. This means that we must add a rebalancing check for both the data path and the predecessor path. The second part to this question is where do we set the flag to true and where to false? Setting it to false is simple, the rebalance function will handle that and if the item is not found the remove_node should make sure that no rebalancing is performed. Where to set the flag to true is more difficult. At what point do we know for a fact that the tree has gotten shorter? When we actually remove a node, of course. Just like insertion where when we inserted a node we assumed that the tree grew in height, we can assume that the tree shrinks in height when we actually remove a node. Now, the only time we actually remove a node is when the predecessor is replaced with one of its children. Remember that deletion by copying does not actually remove the first node, it only replaces its data. With all of that in mind, we can come up with something workable on the assumption that rebalance_remove will handle the nitty gritty stuff: Code: struct node *remove_node ( struct node *tree, int data, int *shallower )
{
if ( tree == NULL )
*shallower = 0;
else if ( data == tree->data ) {
if ( tree->link[0] != NULL && tree->link[1] != NULL ) {
struct node *heir = tree->link[0];
while ( heir->link[1] != NULL )
heir = heir->link[1];
tree->data = heir->data;
tree->link[0] = remove_node ( tree->link[0], tree->data, shallower );
if ( *shallower )
tree = rebalance_remove ( tree, 0, shallower );
}
else {
struct node *save = tree;
tree = tree->link[tree->link[0] == NULL];
destroy_node ( save );
*shallower = 1;
}
}
else {
int dir = data > tree->data;
tree->link[dir] = remove_node ( tree->link[dir], data, shallower );
if ( *shallower )
tree = rebalance_remove ( tree, dir, shallower );
}
return tree;
}
Code: struct node *rebalance_remove ( struct node *tree, int dir, int *new_depth )
{
if ( tree->balance == dir )
tree->balance = 2;
else if ( tree->balance == !dir ) {
if ( tree->link[!dir]->balance == !dir ) {
tree->link[!dir]->balance = 2;
tree->balance = 2;
tree = single_rotate ( tree, !dir );
}
else if ( tree->link[!dir]->balance == 2 ) {
tree->link[!dir]->balance = dir;
tree = single_rotate ( tree, !dir );
}
else {
fix_balance_factors ( tree, !dir );
tree = double_rotate ( tree, !dir );
}
}
else {
tree->balance = !dir;
*new_depth = 0;
}
return tree;
}
The case where A's balance factor is !dir means that the shorter subtree has grown shorter and rotations are required. At this point there are three cases for rebalancing. If B's balance factor is also !dir then the familiar single rotation is performed and balance factors for A and B become 2. A new case where B's balance factor is 2 shows up. A single rotation will restore the AVL condition, but setting the balance factors of A and B to 2 will not work. In this case, only B's balance factor is changed to dir before the rotation is made. The final case, where B's balance factor is dir, requires a double rotation and a call to fix_balance_factors. The good news is that this case is identical to the equivalent case in insertion, so we can reuse fix_balance_factors just as we reused single_rotate and double_rotate. After these three cases, we do not reset the status flag because a rotation does not necessarily restore balance to an AVL tree during deletion. We will continue to propagate the rebalancing upward until a case is encountered where we know we can stop, or we have reached the root node and remove_node terminates. This is all that is required for AVL deletion. When you look at it the right way, the algorithm is very simple. Most of the simplicity comes from modularization, reuse, and the use of recursion. A non-recursive version of the algorithm requires a stack of both nodes along the path and directions taken, so a truly elegant non-recursive deletion algorithm isn't possible as it is with insertion. Exercises 1) Only one case in rebalance_remove clears the status flag, why does that case mean that rebalancing can stop? 2) AVL deletion adds a third rotation case where B's balance factor is 2, why must that case be treated separately from the usual single rotation case? 3) Work through several runs of input on paper using remove_node's algorithm as a guide. Does it work like you expected? 4) Rewrite remove_delete without using recursion. Is it simpler? Is it worth the effort? Red Black Trees Concept While Red Black trees are a data structure of their own these days, they were originally an abstraction. I feel that people have trouble with Red Black trees because the abstraction has largely been dismissed and the rationale behind the rules has been forgotten. So we will cover the underlying data structure that the Red Black tree is an abstraction of and then we will put it aside in favor of an implementation simply following the rules. This way not only is the data structure understood, but the meaning behind it is as well. A B-tree of order 4, also called a 2-3-4 tree or 2-4 tree, is a search tree with nodes that can have either 2, 3, or 4 links to other nodes, called (unimaginatively) 2-nodes, 3-nodes, and 4-nodes, respectively. Following this line of thinking, a binary tree consists solely of 2-nodes. Code: 2-node 3-node 4-node
1 1 2 1 2 3
~ ~ ~ ~ ~ ~ ~ ~ ~
Code: Insert 4:
1 2 3 2 2
---> --->
a b c d 1 3 1 3 4
a b c d a b c d e
Unfotunately, the code for a B-tree is very specialized. We would like to be able to use the familar binary search tree operations rather than writing all new operations for a B-tree. That's where the Red Black abstraction comes in. Rather than having explicit 2, 3, and 4-nodes, we can use only 2-nodes (thus a binary tree) and use a flag to tell us if the link is a logical link (making it a part of a 3 or 4-node), or a hard link (making it a link to the next 2, 3, or 4-node). The current accepted terminology gives logical links and hard links a different color, with logical links being red and hard links being black. The color of a node (R == red, B == black) follows the item. Code: 2-node 3-node 4-node
1B 1B 2B 2B
~ ~ ~ 2R or 1R ~ 1R 3R
~ ~ ~ ~ ~ ~ ~ ~
Rule 1: Every path in a Red Black tree must have the same number of black links. This makes sense because for the B-tree condition to be met, we need a way to ensure that all leaves are the same distance from the root. Because black links are "real" links in the abstraction and red links are "fake" links, we can ignore the red links and just make sure that every path in a Red Black tree has the same number of black links. This meets the requirement that all leaves in a B-tree are the same distance from the root. Rule 2: Two red links cannot appear consecutively on any path in a Red Black tree. This also makes sense when you consider that red links make 3-nodes and 4-nodes. Because the largest node we can use is a 4-node, and a 4-node is represented by a black node with two red links, having two red links consecutively would signify the need to either fix a broken 4-node, or perform a split. Either way the B-tree condition is violated and needs to be fixed. Rule 3: The root must be black. This rule is not necessary with general Red Black trees, but it is usually adhered to because it simplifies the algorithms. We will use it as a hard rule because it does not make much sense to allow a red root when using the B-tree abstraction. Tossing aside the abstraction, we have three rules that enforce the Red Black tree condition and help simplify the algorithms: Rule 1: Every path in a Red Black tree must have the same number of black nodes. Rule 2: Two red links cannot appear consecutively on any path in a Red Black tree. Rule 3: The root must be black. By following these rules in an implementation, we can guarantee logarithmic search times. Exercises 1) Why would a red root make some algorithms more difficult? 2) Devise as many cases as you can where two consecutive links would be red, then explain why they are a violation of the B-tree condition. Link Coloration In a Red Black tree, the color of a node is the balance factor, much like AVL trees. Because it is more difficult to actually give links a color, we will be giving each node a color. This can be done numerous ways, but the most common are an enumeration: Code: enum rb_color { RED, BLACK };
struct node {
int data;
enum rb_color color;
struct node *link[2];
};
Code: struct node {
int data;
int red;
struct node *link[2];
};
Exercises 1) Try to come up with as many ways to represent link coloration as you can. 2) Consider ways that the link coloration could be removed completely. Hint: Swap left and right links and test if the left link is greater than the right link. Insertion Insertion into a Red Black tree is fairly straightforward. The algorithm we will use is strikingly similar to the insertion algorithm for AVL trees. However, as for AVL trees we need to define some functions to help us first. We will start with the structure declaration for a Red Black node, and functions to create and destroy a Red Black node: Code: struct node {
int data;
int red;
struct node *link[2];
};
struct node *make_node ( int data )
{
struct node *tree = malloc ( sizeof *tree );
if ( tree == NULL )
return NULL;
tree->data = data;
tree->red = 1;
tree->link[0] = tree->link[1] = NULL;
return tree;
}
void destroy_node ( struct node *tree )
{
free ( tree );
}
Naturally, if the parent of the new node is red then we have violated rule 2 of the Red Black specification. Rather than try to watch for and fix violations of both rule 1 and rule 2, insertion will be careful to only violate and fix rule 2. Deletions will suffer from having to handle violations of rule 1 as well, but we will cross that bridge when we get to it. The next helper function tests a node to see if it is black. Because we will be treating null pointers as black nodes to avoid violations of rule 2, we need to check if the node is null before dereferencing it. The code is simple and removes a bit of clutter from the rebalancing algorithms: Code: int is_black ( struct node *tree )
{
return tree == NULL || !tree->red;
}
Code: struct node *insert_node ( struct node *tree, int data )
{
tree = insert_node_r ( tree, data );
if ( tree->red )
tree->red = 0;
return tree;
}
Code: struct node *insert_node_r ( struct node *tree, int data )
{
if ( tree == NULL )
tree = make_node ( data );
else {
int dir = data > tree->data;
tree->link[dir] = insert_node_r ( tree->link[dir], data );
tree = rebalance_insert ( tree, dir );
}
return tree;
}
Rebalancing in a Red Black tree is required if a red node has a red child. There are three possible outcomes depending on the orientation of the child and the color of the node's sibling. Here are the cases (ignoring symmetric cases): Code: Case 1:
1B 2B
0B 2R ---> 1R 3R
3R 0B
The next case is exactly the same except for the color of 0. If 0 is red then all that needs to be done is a simple color flip to restore the red black condition. The color flip is made by changing 0 and 2 to black and 1 to red: Code: Case 2:
1B 1R
0R 2R ---> 0B 2B
3R 3R
Code: Case 3:
1B 1B 3B
0B 2R ---> 0B 3R ---> 1R 2R
3R 2R 0B
Code: Case 4:
1B 1R
0R 2R ---> 0B 2B
3R 3R
Code: struct node *rebalance_insert ( struct node *tree, int dir )
{
struct node *child = tree->link[dir];
struct node *sibling = tree->link[!dir];
/* Same direction case */
if ( child->link[dir] != NULL ) {
if ( child->red && child->link[dir]->red ) {
if ( is_black ( sibling ) ) {
tree = single_rotate ( tree, dir );
tree->red = 0;
tree->link[0]->red = tree->link[1]->red = 1;
}
else {
tree->red = 1;
tree->link[0]->red = tree->link[1]->red = 0;
}
}
}
/* Opposing direction case */
if ( child->link[!dir] != NULL ) {
if ( child->red && child->link[!dir]->red ) {
if ( is_black ( sibling ) ) {
tree = double_rotate ( tree, dir );
tree->red = 0;
tree->link[0]->red = tree->link[1]->red = 1;
}
else {
tree->red = 1;
tree->link[0]->red = tree->link[1]->red = 0;
}
}
}
return tree;
}
Exercises 1) Rewrite rebalance_insert to avoid relying on conditionals. Note: This may require sweeping changes to insert_node_r. 2) Rewrite insert_node without recursion. 3) Verify that insertion into a Red Black tree never requires more than two rotations. Deletion Just as removing a node from an AVL tree was more complicated than insertion, so it is with Red Black trees. We will use a similar recursive removal algorithm as the one for AVL trees, with the added necessity of an auxilliary function for maintaining rule 3. On top of that, we will also need a status flag to tell the algorithm when to rebalance. We were saved from that with insertion, but with deletion it is much easier this way. The code from remove_node is simple: Code: struct node *remove_node ( struct node *tree, int data )
{
int new_height;
tree = remove_node_r ( tree, data, &new_height );
if ( tree != NULL && tree->red )
tree->red = 0;
return tree;
}
remove_node_r uses the same recursive deletion-by-copying algorithm that we used for AVL deletion. The only difference is that now we need to consider two cases when deleting the inorder predecessor. If the inorder predecessor is red then we can safely delete it without having to rebalance the tree. The reason for this is because removing a red node cannot violate either rule 1 or rule 2, so there is no reason to set the flag and call rebalance_remove. On the other hand, if the inorder predecessor is black then that will violate rule 1 by changing the black height of that path. In that case we need to set the status flag. But there's more! If the node that replaces the black node that we just removed is red, we can simply recolor it black and clear the status flag. Rebalancing accomplished without calling rebalance_remove! Here is the code: Code: struct node *remove_node_r ( struct node *tree, int data, int *new_height )
{
if ( tree == NULL )
*new_height = 0;
else if ( data == tree->data ) {
if ( tree->link[0] != NULL && tree->link[1] != NULL ) {
struct node *heir = tree->link[0];
while ( heir->link[1] != NULL )
heir = heir->link[1];
tree->data = heir->data;
tree->link[0] = remove_node_r ( tree->link[0], tree->data, new_height );
if ( *new_height )
tree = rebalance_remove ( tree, 0, new_height );
}
else {
struct node *save = tree;
*new_height = !tree->red;
tree = tree->link[tree->link[0] == NULL];
destroy_node ( save );
if ( tree != NULL && tree->red ) {
tree->red = 0;
*new_height = 0;
}
}
}
else {
int dir = data > tree->data;
tree->link[dir] = remove_node_r ( tree->link[dir], data, new_height );
if ( *new_height )
tree = rebalance_remove ( tree, dir, new_height );
}
return tree;
}
1) Save the damned. 2) Set or clear new_height based on the color of the doomed node. 3) Execute the node. 4) If the replacement is red, make it black and clear new_height. Other than that, the algorithm is identical to the deletion routine we wrote for AVL trees. The fun part is in rebalance_remove, where rotations are made and colors are flipped. The rebalancing of a Red Black deletion consists of four cases. The first case is if the other child of the node (tree->link[!dir] rather than tree->link[dir]) is red, we enable the other cases with a single rotation. As it turns out, rebalancing is much easier if the sibling is black. Code: Case 1:
1B 3B
0B 3R ---> 1R 4B
2B 4B 0B 2B
Code: Case 2 (A color of E means either red or black):
1E 1E
0B 3B ---> 0B 3R
2B 4B 2B 4B
Code: Case 3a:
1E 1E
0B 3B ---> 0B 2R
2R 3B
case 3b:
1E 2E
0B 2R --> 1B 3B
3B 0B
Code: struct node *rebalance_remove ( struct node *tree, int dir, int *new_height )
{
struct node *sibling = tree->link[!dir];
/* Make sibling black case */
if ( sibling->red ) {
sibling->red = 0;
tree->red = 1;
tree = single_rotate ( tree, !dir );
tree->link[dir]->red = 0;
tree->link[dir]->link[!dir]->red = 1;
}
else {
/* Zero rotation case */
if ( is_black ( sibling->link[0] ) && is_black ( sibling->link[1] ) )
sibling->red = 1;
else {
/* Double rotation case */
if ( is_black ( sibling->link[!dir] ) ) {
sibling->link[dir]->red = 0;
sibling->red = 1;
tree->link[!dir] = single_rotate ( tree->link[!dir], dir );
sibling = tree->link[!dir];
}
/* Single rotation case */
sibling->red = tree->red;
tree->red = 0;
sibling->link[!dir]->red = 0;
tree = single_rotate ( tree, !dir );
*new_height = 0;
}
}
return tree;
}
Exercises 1) Rewrite rebalance_remove to separate the single and double rotation cases. Is this an improvement? 2) Rewrite remove_node without using recursion. 3) Verify that deletion from a Red Black tree never requires more than three rotations. Splaying Splaying is the technique of moving a node to the root of a binary search tree by two levels at a time. There are three cases for splaying (ignoring symmetric cases): Code: Zig
A A
--->
B B
Zig-Zig
A A C
B ---> C ---> A B
C B
Zig-Zag
A B C
B ---> A C ---> B
C A
By splaying for every search in a binary search tree, we can guarantee amortized logarithmic performance. The most amazing thing about splaying is that it has a tendency to cut the height of most nodes on the access path by half! This makes binary search trees designed with splaying in mind balanced binary search trees. In fact, splay trees are often comparable in performance with AVL and Red Black trees. Once again, that is amazing considering the fact that splay trees maintain no balance factors. The analysis of a splay tree is far beyond me, so we'll just move on to a description of the two ways to splay, bottom up and top down. Bottom Up Splaying Bottom up splaying is very simple. Walk down the tree until you find the node that you have been looking for and then splay it back up. There are several boundary conditions that should be kept in mind though. The first boundary condition is that of splaying for a nonexistant item. As we have been using, a status flag is ideal for this task. If the search path gets to a null pointer, clear the flag, otherwise the item was found and we set the flag. The second boundary condition is walking back up we need to be at least the node's grandparent for splaying to work properly. Otherwise null pointers will be dereferenced and all manner of ghoulies show up. The solution to this is the same as we used for Red Black insertion, just make sure that there are three nodes to splay using conditionals. :-) The final case is when the last splay does not bring the item to the root of the tree. This could happen if there are not three nodes to splay near the root. When this is the case, the only possible reason is because the item is a child of the root. So a zig will handle that. Assuming we have a recursive splay_r function, we can write a splay function immediately: Code: struct node *splay ( struct node *tree, int data )
{
int splay = 0;
tree = splay_r ( tree, data, &splay );
/* Final Zig if necessary */
if ( splay && data != tree->data ) {
int dir = data > tree->data;
tree = single_rotate ( tree, dir );
}
return tree;
}
The splay_r function is just a simple binary search tree search. The only difference is that on the return trip, we splay the node if it was found: Code: struct node *splay_r ( struct node *tree, int data, int *splay )
{
if ( tree == NULL )
*splay = 0;
else if ( data == tree->data )
*splay = 1;
else {
int dir = data > tree->data;
tree->link[dir] = splay_r ( tree->link[dir], data, splay );
if ( *splay ) {
if ( tree->link[dir] != NULL ) {
struct node *zig_target = tree->link[dir]->link[dir];
struct node *zag_target = tree->link[dir]->link[!dir];
if ( zig_target != NULL && data == zig_target->data ) {
/* Zig-Zig */
tree = single_rotate ( tree, dir );
tree = single_rotate ( tree, dir );
}
else if ( zag_target != NULL && data == zag_target->data ) {
/* Zig-Zag */
tree->link[dir] = single_rotate ( tree->link[dir], !dir );
tree = single_rotate ( tree, dir );
}
}
}
}
return tree;
}
Nothing in this code should be problematic for you by now, so please study it and do a few test runs. Bottom up splaying is easy, but top down splaying, which we will study next, can be confusing. Exercises 1) Can you think of any way to remove some of the special cases that we had to handle? 2) Rewrite splay using a bottom up algorithm without using recursion. Top Down Splaying Top down splaying is best done iteratively. Bottom up splaying requires a downward search of the tree followed by an upward march while splaying. The problem is that we need to handle several special cases. Top down splaying removes those special cases at the cost of clarity. To perform top down splaying, we walk down the tree and for every node we use a second tree to hold temporary links. The left temporary holds nodes that are less than the root and the right temporary holds nodes that are greater than the root. The operations are as follows. Code: Zig
==========================
left left
~ ~
right right
~ A
a
root ---> root
A B
a B b c
b c
==========================
Zig-Zig
==========================
left left
~ ~
right right
~ B
C
c d
root ---> root
C A
B d a b
A c
a b
==========================
Zig-Zag
left left
~ A
a
right right
~ C
d
root ---> root
C B
A d b c
a B
b c
==========================
Code: struct node *splay ( struct node *tree, int data )
{
struct node N = { 0, { 0, 0 } };
struct node *l = &N, *r = &N;
struct node **save;
if ( tree == NULL )
return tree;
while ( data != tree->data ) {
int dir = data > tree->data;
if ( tree->link[dir] == NULL )
break;
if ( ( dir == 0 && data < tree->link[dir]->data )
|| ( dir != 0 && data > tree->link[dir]->data ) )
{
tree = single_rotate ( tree, dir );
if ( tree->link[dir] == NULL )
break;
}
save = dir ? &l : &r;
(*save)->link[dir] = tree;
*save = tree;
tree = tree->link[dir];
}
/* Reassemble */
l->link[1] = tree->link[0];
r->link[0] = tree->link[1];
tree->link[0] = N.link[1];
tree->link[1] = N.link[0];
return tree;
}
Exercises 1) Rewrite splay with all of the symmetric cases. Is the author crazy for writing the code without them? :-) 2) Execute the code with several example trees. Does it do what you expected? 3) Rewrite splay with a top down algorithm using recursion. Splay Trees Insertion Insertion into a splay tree is dreadfully simple if you already have the splaying function. The code is so simple I'll leave figuring it out as an exercise: Code: struct node *insert_node ( struct node *tree, int data )
{
struct node *walk = make_node ( data );
if ( walk == NULL || tree == NULL )
return walk;
tree = splay ( tree, data );
if ( data == tree->data ) {
destroy_node ( walk );
return tree;
}
else {
int dir = data > tree->data;
walk->link[dir] = tree->link[dir];
walk->link[!dir] = tree;
tree->link[dir] = NULL;
return walk;
}
}
Deletion Deletion from a splay tree is even simpler than insertion. Once again, the code is so simple that figuring it out is left as an exercise: Code: struct node *remove_node ( struct node *tree, int data )
{
struct node *save;
if ( tree == NULL )
return NULL;
tree = splay ( tree, data );
if ( data == tree->data ) {
if ( tree->link[0] == NULL )
save = tree->link[1];
else {
save = splay ( tree->link[0], data );
save->link[1] = tree->link[1];
}
destroy_node ( tree );
tree = save;
}
return tree;
}
Conclusion In this article we looked at several types of balanced binary search tree that guarantee logarithmic performance. You learned that the algorithms are not nearly as complicated as others may have told you, and you also saw how to implement insertion and deletion with all of them. Balanced binary search trees are very useful and I hope that this article was helpful in alleviating some of the mysteries. :-)
__________________ My best code is written with the delete key. |
| Prelude is offline |
| | #2 |
| Unleashed Join Date: Sep 2001
Posts: 1,755
| > comments Excellent work!
__________________ The world is waiting. I must leave you now. |
| Shadow is offline |
| | #3 |
| Registered User Join Date: Jan 2002 Location: Cardiff
Posts: 2,219
| Wow. That is very helpful, thanks. If you made a book with all this stuff in it I would definately buy it. |
| Brian is offline |
| | #4 | ||||||||||||||
| erstwhile Join Date: Jan 2002
Posts: 2,223
| I'm not qualified to comment on the technical aspects so I'll leave that to those who are. This is perhaps a bit premature since it's only a first draft but here's some typos/suggestions after a cursory proof-read. Feel free to do with them as you see fit and apologies in advance for any of my typos/errors/misconceptions: Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
![]() Quote:
Quote:
__________________ CProgramming FAQ Caution: this person may be a carrier of the misinformation virus. Last edited by Ken Fitlike; 09-28-2004 at 03:06 PM. Reason: editing | ||||||||||||||
| Ken Fitlike is offline |
| | #5 |
| Registered User Join Date: Oct 2003
Posts: 719
| Thanks Prelude. We all appreaciate your hard work. Are you going to put this in the FAQ thoug. That would probably be best don't you think?
__________________ When the concrete cases are understood the abstractions are readily made |
| caroundw5h is offline |
| | #6 |
| End Of Line Join Date: Apr 2002
Posts: 6,240
| >>Are you going to put this in the FAQ thoug When it's ready, it will go in here. But with such a big and complex article, its always good to get it proof read first. >>These ASCII drawings don't even come close to doing the algorithm justice Maybe we could do something about that when I put it in the FAQ.
__________________ When all else fails, read the instructions. If you're posting code, use code tags: [code] /* insert code here */ [/code] |
| Hammer is offline |
| | #7 |
| Code Goddess Join Date: Sep 2001
Posts: 9,661
| >Perhaps: ...implementation is arbitrarily as long... ? No, that would make the sentence incomprehensible. ![]() >was successfully added for simplicity's sake I don't like either, so I'll use "was succssfully added for the sake of simplicity". >A reference (to Knuth; book, paper, other publication?) would be nice. Done, good idea. >nitty-gritty would, arguably, be better. The dictionary agrees with you, so I'll change that. >Save the damned what? damned: 1) Condemned, especially to eternal punishment. >Well, I would spell colouration somewhat differently... I pick one style to avoid issues like this. Both are correct, so it's a matter of opinion. ![]() All of the other changes I'll make as well. Thanks for taking the time to proofread my hurried mistakes. ![]() >Maybe we could do something about that when I put it in the FAQ. I'll see if I can come up with something presentable.
__________________ My best code is written with the delete key. |
| Prelude is offline |
| | #8 | ||
| erstwhile Join Date: Jan 2002
Posts: 2,223
| Quote:
![]() Quote:
__________________ CProgramming FAQ Caution: this person may be a carrier of the misinformation virus. | ||
| Ken Fitlike is offline |
| | #9 |
| Registered User Join Date: Oct 2003
Posts: 719
| Hey does this mean there is going to be an up to date C and C++ tutorial?
__________________ When the concrete cases are understood the abstractions are readily made |
| caroundw5h is offline |
| | #10 |
| CS Author and Instructor Join Date: Sep 2002
Posts: 511
| Prelude, Nice tutorial Have you ever thought to publish a textbook? I am currently working on one in C# programming which is due out at the beginning of next year. This experience has been an eye opening one for me. You seem to have great style, may have to clean up some of the language... I sure you have looked, at what is out currently in C books. Most are primarily reference, few are actual textbooks. Anyway, I know the contacts at 3 of the publishers of CS books in the US... let me know if interested.
__________________ Mr. C: Author and Instructor |
| Mister C is offline |
| | #11 |
| Registered User Join Date: Oct 2003
Posts: 719
| Prelude if you write a book, I'll buy it. :d
__________________ When the concrete cases are understood the abstractions are readily made |
| caroundw5h is offline |
| | #12 | |||||
| Registered User Join Date: Aug 2003
Posts: 470
| Overall, I think the articles are well-writen. Here is are just a few of my comments. I'm by no means an expert at english or algorithms, so feel free to disagree. In most cases the meaning will be clear to someone famaliar to the material, but not so clear to someone who is new to it. Although BST trees are fairly simple, it's easier to read code using left and right links rather than struct node* link[2], especially if you gave us the code to delete_node. The generated code with these modifications might be a little bit slower, however. You might want to consider #define BST_LEFT 0 and #define BST_RIGHT 1. Quote:
Quote:
Quote:
Quote:
Quote:
"Red Black trees are an abstraction of the infamous B-tree that is generally used for database applications." I like "Red Black trees are an abstraction of the infamous B-tree, generally used for database applications" because the data applications part is not integral to the sentence. I also don't think "abstraction" is the right word. Something like "special case". I'll try to comment on the rest latter. | |||||
| okinrus is offline |
| | #13 |
| Never Exist Join Date: Jul 2004
Posts: 149
| without Prelude, what would this world be ?~~~~~
__________________ blow me ... ... |
| Hermitsky is offline |
| | #14 | |
| former member Join Date: Feb 2004
Posts: 472
| Quote:
__________________ My Tutorials : - Bad programming practices in : C - C\C++ Tips (constrcutive criticism is very welcome) - Brain Cell | |
| Brain Cell is offline |
| | #15 |
| Code Goddess Join Date: Sep 2001
Posts: 9,661
| >That's good but, you know, successfully might work better. That too. ![]() >Have you ever thought to publish a textbook? I've toyed with the idea. >it's easier to read code using left and right links rather than struct node* link[2] Not when most of the code is repeated for the left and right paths. I've found that code using an array and boolean index is easier to follow than explicit left and right links because it helps shorten routines to the point where they can be remembered all at once. Especially with advanced algorithms like balanced tree insertion and deletion, removing redundant code is a huge help in keeping the routine transparent. >You might want to consider #define BST_LEFT 0 and #define BST_RIGHT 1. That would defeat the point of using the array. The issue that I addressed was avoiding redundant code by using a boolean test to determine left or right and saving that value in the variable dir. It may look strange at first, but you get used to it quickly. ![]() >the average analysis of trees with both deletions and inputs has not been done. It's been proven that when insertions and deletions are random in a basic binary search tree, the height of a tree is roughly O(log N). But I will try to clarify that statement. >Don't you mean "Many average-case logarithmic algorithms" or some variant of this? Something like that, yes. ![]() >Should be something like "if the data is...". Singular is datum, plural is data. But according to the rule of common use, either "data is" or "data are" would be acceptable even though the latter is grammatically correct while the former is not. >The statement might be confused to mean that sorted order and alternating order are the >only possible cases to read a degerate case, which is not true Actually, it is true. The only way to get a degenerate tree is if only one path is available at each step along a search. Barring cases where different degenerate input is mixed together so that the tree is still degenerate, there are only two unique cases where this would happen (ignoring symmetric cases): Sorted input in either ascending or descending order, or input that would result in a zig-zag shaped tree (ie. alternating order). Mixing those together will give many variations of degenerate tree, but the two cases still remain. Granted, the statement does imply that only two contrived input orders will result in a worst case tree now that I read it again. I'll be more specific. ![]() >I like "These trees use various clever techniques for maintaining a nice branching structure, guaranteeing operations in logarithmic time." I like that too. ![]() >The sentence is ambiguous Hmm, how about: "However, despite a larger average height, Red Black trees can be more efficient than AVL trees because they spend less time maintaining a balanced structure". >I also don't think "abstraction" is the right word. Something like "special case". No, a special case is a 2-3-4 tree because it's a specialized B-tree. A Red Black binary search tree is most definitely an abstraction because it is a representation of the 2-3-4 tree using a more comfortable and simpler implementation than that of a B-tree. Maybe an abstraction of a special case?
__________________ My best code is written with the delete key. |
| Prelude is offline |
| Thread Tools | |
| Display Modes | |
|
Similar Threads | ||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Binary search needed | ItsMeHere | C Programming | 1 | 06-22-2006 07:26 AM |
| Binary trees search problem... | Umoniel | C Programming | 2 | 02-22-2004 02:29 PM |
| binary search tree | unregistered | C++ Programming | 4 | 12-11-2001 11:42 AM |
| Ornery binary search algorithm | Unregistered | C Programming | 3 | 11-17-2001 03:32 PM |
| Newbie questions regarding binary search trees | Ham | C++ Programming | 1 | 11-04-2001 07:48 AM |