Scapegoat Tree
A scapegoat tree is a self-balancing variant of the binary search tree. Unlike other variants like the red-black tree and the AVL tree, the scapegoat tree is an unencumbered data structure. That is, it doesn't need an extra storage per node in the tree. The low overhead and easy implementation makes the scapegoat tree a very attractive data structure. Scapegoat trees should be used in applications where insert and lookup operations dominate because that's where the scapegoat tree is most efficient.
The intuition behind the scapegoat tree is one we've all had to deal with. When something goes wrong, we need to blame a scapegoat. Once we've identified our scapegoat, they can be left to deal with the problem. After an insertion, the scapegoat finds a single node whose subtree is not balanced. That is the scapegoat. This scapegoat's subtree is then rebalanced.
Scapegoat trees are extremely flexible in their implementation. They can be optimized for insertions at the expense of deletions and lookups or vice versa. The programmer can tailor their implementation of the scapegoat tree to their specific application, making this data structure very attractive.
The process by which we find a scapegoat for insertion is simple. In the gif below, the tree starts off as balanced. When a new node, 12 in inserted, the tree becomes unbalanced. Starting from the newly inserted node, we examine the subtree root at that node to see if it is balanced. If it is, we go to the parent and repeat. Eventually we will either find a scapegoat whose subtree is unbalanced or we will finish at the root node and the tree is still balanced. In this case, node 8 has a subtree that is unbalanced, so that is our scapegoat.
Properties
The scapegoat tree is the first binary search tree to achieve its \(O(\log_2(n))\) complexity without storing extra information at every node. This saves large amounts of space, making it an attractive balanced binary search tree when space is limited. Alternatively, red-black trees store the color of each node and AVL trees store the height of each node from the bottom of the tree.
Scapegoat trees only store the following properties for the entire tree.
Property Function size
The number of nodes in the tree. root
Pointer to the root of the tree. max_size
The maximum size of the tree since its last full rebuild. And each node in the scapegoat tree only has the following properties.
Property Function key
The value of the node. left
Pointer to the left child of the node. right
Pointer to the right child of the node.
Operations
There are two operations that are handled uniquely by the scapegoat tree: insertion and deletion. Lookup and traversal of the tree is exactly the same in any balanced binary search tree.
Insertion
The insertion process starts out the same as with a binary search tree. The new node's appropriate place is found via binary search. This process takes \(O(\log_2(n))\) time.
Now, the tree needs to know if it needs to re-balance. It does this by looking up through the new node's ancestry to find the first node whose subtree is not balanced. Whether or not a tree is balanced is usually a factor of its weighting property \(\alpha\). For simplicity we'll just say that the number of nodes in the left and right subtrees cannot differ by more than 1. If the tree is going to unbalanced, it is guaranteed that an ancestor of the new node is unbalanced (because that's where an additional node was added to an already balanced tree). Climbing back up the tree to find the scapegoat takes \(O(\log_2(n))\) time.
The actual process of re-balancing the tree rooted at the scapegoat takes \(O(n)\) time. However, this is not a fair analysis because while the scapegoat could be the root of the tree, it could also be a node very deep in the tree. That process would be much faster because the vast majority of the tree is left alone. Therefore, the re-balancing process takes \(O(\log_2(n))\) amortized time.
This is because, before we rebuilt the subtree root at a given node \(i\), there were some number of insertions that did not cause a re-balance, say, \(\Omega(|i|)\) insertions. \(|i|\) is the size of the subtree rooted at \(i\). Each of those cost \(O(\log_2(n))\) time. Then, there was a single insertion that caused the tree to re-balance, taking \(O(|i|)\) time. So, if we amortize over all of the insertions, our equation becomes
\[\dfrac{\Omega(|i|) \cdot O(\log_2(n)) + O(|i|))}{\Omega(|i|)} = O(\log_2(n))\]
Deletion
Deletion is actually easier that insertion in scapegoat trees. It's here that the tree uses the property max_size
which the maximum size achieved since the last full re-balance. Every time we do a full re-balance, the tree simply sets max_size
to size
.
When we delete a node from the scapegoat tree, the tree will check to see if size
is less than or equal to max_size
. If this is the case, then the tree re-balances itself from the root. This process takes \(O(n)\) time but the amortized analysis is only \(O(\log_2(n))\), much like insertion.
Complexity
The complexity of the scapegoat tree is very much like other binary search trees. Detailed analysis of insertion and deletion is worked out above. The same analysis from other binary search trees is used for other operations.
See also: big O notation.
Average | |
Space** | \(O(n)\) |
Search | \(O(\log_2(n))\) |
Traversal | *\(O(n)\) |
Insert | *\(O(\log_2(n))\) |
Delete | *\(O(\log_2(n))\) |
*amortized analysis
**but better than other self-balancing BSTs
Python Implementation
The following python implementation is meant to shed some light on some of the processes of the scapegoat tree. The process of insertion is laid out, however, deletion has been omitted for brevity.
Note: In this implementation, some things have been changed from the base structure to make things clearer. Specifically, all nodes point to their parent to make the code more readable. However, this can easily be omitted by keeping track of all parents as you go down the tree for insertion in a stack.
Two classes that describe the Node
and the Scapegoat
tree, along with necessary operations, are below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 |
|
All of the properties of both classes are included in the class constructors, plus the parent pointer on line 6 for easier method writing.
The functions on line 59 and 64, isBalancedAtNode
and sizeOfSubtree
are the functions that will tell the insertion function where our scapegoat is. The sizeOfSubtree
recursively searches down a nodes' left and right paths to see how many nodes are under it. The isBalancedAtNode
function uses that information to see if the subtree rooted at a given node is balanced or not.
The insert
function on line 15 is what we're interested in here. In this function, we place a new node at the bottom of the tree in the appropriate location. Then, we will backtrack out of the tree, following our parent pointers, until we find our scapegoat. The function findScapegoat
on line 50 does this for us. When this function's while loop on line 53 fails, it is because the node we are inspecting does not have a balanced subtree beneath it. So, we know we need to re-balance this subtree.
The re-balancing function is on line 69. We first flatten the tree into a sorted list recursively. Now we can perform binary search on this list in order to create a new, balanced tree. There are no rotations involved like in AVL trees. We simply create a new tree that is guaranteed to be balanced. Then we hook up this new tree back into our original tree with the code on lines 44-48.
While the deletion function is not included, it follows the same logic and is a good exercise to implement. In deletion, the process is even easier because there is not search for a scapegoat. It is always the root node.