A scapegoat tree is a self-balancing variant of the binary search tree. Unlike other variants like the red-black tree and the AVL tree, the scapegoat tree is an unencumbered data structure. That is, it doesn't need an extra storage per node in the tree. The low overhead and easy implementation makes the scapegoat tree a very attractive data structure. Scapegoat trees should be used in applications where insert and lookup operations dominate because that's where the scapegoat tree is most efficient.

The intuition behind the scapegoat tree is one we've all had to deal with. When something goes wrong, we need to blame a scapegoat. Once we've identified our scapegoat, they can be left to deal with the problem. After an insertion, the scapegoat finds a single node whose subtree is not balanced. That is the scapegoat. This scapegoat's subtree is then rebalanced.

Scapegoat trees are extremely flexible in their implementation. They can be optimized for insertions at the expense of deletions and lookups or vice versa. The programmer can tailor their implementation of the scapegoat tree to their specific application, making this data structure very attractive.

The process by which we find a scapegoat for insertion is simple. In the gif below, the tree starts off as balanced. When a new node, 12 in inserted, the tree becomes unbalanced. Starting from the newly inserted node, we examine the subtree root at that node to see if it is balanced. If it is, we go to the parent and repeat. Eventually we will either find a scapegoat whose subtree is unbalanced or we will finish at the root node and the tree is still balanced. In this case, node 8 has a subtree that is unbalanced, so that is our scapegoat.

Finding the scapegoat node after insertion

Properties

The scapegoat tree is the first binary search tree to achieve its \(O(\log_2(n))\) complexity without storing extra information at every node. This saves large amounts of space, making it an attractive balanced binary search tree when space is limited. Alternatively, red-black trees store the color of each node and AVL trees store the height of each node from the bottom of the tree.

Scapegoat trees only store the following properties for the entire tree.

Property Function

size The number of nodes in the tree.

root Pointer to the root of the tree.

max_size The maximum size of the tree since its last full rebuild.

And each node in the scapegoat tree only has the following properties.

Property Function

key The value of the node.

left Pointer to the left child of the node.

right Pointer to the right child of the node.

Operations

There are two operations that are handled uniquely by the scapegoat tree: insertion and deletion. Lookup and traversal of the tree is exactly the same in any balanced binary search tree.

Insertion

The insertion process starts out the same as with a binary search tree. The new node's appropriate place is found via binary search. This process takes \(O(\log_2(n))\) time.

Now, the tree needs to know if it needs to re-balance. It does this by looking up through the new node's ancestry to find the first node whose subtree is not balanced. Whether or not a tree is balanced is usually a factor of its weighting property \(\alpha\). For simplicity we'll just say that the number of nodes in the left and right subtrees cannot differ by more than 1. If the tree is going to unbalanced, it is guaranteed that an ancestor of the new node is unbalanced (because that's where an additional node was added to an already balanced tree). Climbing back up the tree to find the scapegoat takes \(O(\log_2(n))\) time.

The actual process of re-balancing the tree rooted at the scapegoat takes \(O(n)\) time. However, this is not a fair analysis because while the scapegoat could be the root of the tree, it could also be a node very deep in the tree. That process would be much faster because the vast majority of the tree is left alone. Therefore, the re-balancing process takes \(O(\log_2(n))\) amortized time.

This is because, before we rebuilt the subtree root at a given node \(i\), there were some number of insertions that did not cause a re-balance, say, \(\Omega(|i|)\) insertions. \(|i|\) is the size of the subtree rooted at \(i\). Each of those cost \(O(\log_2(n))\) time. Then, there was a single insertion that caused the tree to re-balance, taking \(O(|i|)\) time. So, if we amortize over all of the insertions, our equation becomes

\[\dfrac{\Omega(|i|) \cdot O(\log_2(n)) + O(|i|))}{\Omega(|i|)} = O(\log_2(n))\]

Deletion

Deletion is actually easier that insertion in scapegoat trees. It's here that the tree uses the property max_size which the maximum size achieved since the last full re-balance. Every time we do a full re-balance, the tree simply sets max_size to size.

When we delete a node from the scapegoat tree, the tree will check to see if size is less than or equal to max_size. If this is the case, then the tree re-balances itself from the root. This process takes \(O(n)\) time but the amortized analysis is only \(O(\log_2(n))\), much like insertion.

Complexity

The complexity of the scapegoat tree is very much like other binary search trees. Detailed analysis of insertion and deletion is worked out above. The same analysis from other binary search trees is used for other operations.

See also: big O notation.

	Average
Space**	\(O(n)\)
Search	\(O(\log_2(n))\)
Traversal	*\(O(n)\)
Insert	*\(O(\log_2(n))\)
Delete	*\(O(\log_2(n))\)

*amortized analysis

**but better than other self-balancing BSTs

Python Implementation

The following python implementation is meant to shed some light on some of the processes of the scapegoat tree. The process of insertion is laid out, however, deletion has been omitted for brevity.

Note: In this implementation, some things have been changed from the base structure to make things clearer. Specifically, all nodes point to their parent to make the code more readable. However, this can easily be omitted by keeping track of all parents as you go down the tree for insertion in a stack.

Two classes that describe the Node and the Scapegoat tree, along with necessary operations, are below.

import math
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.parent = None

class Scapegoat:
    def __init__(self):
        self.root = None
        self.size = 0
        self.maxSize = 0

    def insert(self, key):
        node = Node(key)
        #Base Case - Nothing in the tree
        if self.root == None:
            self.root = node
            return
        #Search to find the node's correct place
        currentNode = self.root
        while currentNode != None:
            potentialParent = currentNode
            if node.key < currentNode.key:
                currentNode = currentNode.left
            else:
                currentNode = currentNode.right
        #Assign parents and siblings to the new node
        node.parent = potentialParent
        if node.key < node.parent.key:
            node.parent.left = node
        else:
            node.parent.right = node
        node.left = None
        node.right = None
        self.size += 1
        scapegoat = self.findScapegoat(node)
        if scapegoat == None:
            return
        tmp = self.rebalance(scapegoat)

        #Assign the correct pointers to and from scapegoat
        scapegoat.left = tmp.left
        scapegoat.right = tmp.right
        scapegoat.key = tmp.key
        scapegoat.left.parent = scapegoat
        scapegoat.right.parent = scapegoat

    def findScapegoat(self, node):
        if node == self.root:
            return None
        while self.isBalancedAtNode(node) == True:
            if node == self.root:
                return None
            node = node.parent
        return node

    def isBalancedAtNode(self, node):
        if abs(self.sizeOfSubtree(node.left) - self.sizeOfSubtree(node.right)) <= 1:
            return True
        return False

    def sizeOfSubtree(self, node):
        if node == None:
            return 0
        return 1 + self.sizeOfSubtree(node.left) + self.sizeOfSubtree(node.right)

    def rebalance(self, root):
        def flatten(node, nodes):
            if node == None:
                return
            flatten(node.left, nodes)
            nodes.append(node)
            flatten(node.right, nodes)

        def buildTreeFromSortedList(nodes, start, end):
            if start > end:
                return None
            mid = int(math.ceil(start + (end - start) / 2.0))
            node = Node(nodes[mid].key)
            node.left = buildTreeFromSortedList(nodes, start, mid-1)
            node.right = buildTreeFromSortedList(nodes, mid+1, end)
            return node

        nodes = []
        flatten(root, nodes)
        return buildTreeFromSortedList(nodes, 0, len(nodes)-1)

    def delete(self, key):
        pass

All of the properties of both classes are included in the class constructors, plus the parent pointer on line 6 for easier method writing.

The functions on line 59 and 64, isBalancedAtNode and sizeOfSubtree are the functions that will tell the insertion function where our scapegoat is. The sizeOfSubtree recursively searches down a nodes' left and right paths to see how many nodes are under it. The isBalancedAtNode function uses that information to see if the subtree rooted at a given node is balanced or not.

The insert function on line 15 is what we're interested in here. In this function, we place a new node at the bottom of the tree in the appropriate location. Then, we will backtrack out of the tree, following our parent pointers, until we find our scapegoat. The function findScapegoat on line 50 does this for us. When this function's while loop on line 53 fails, it is because the node we are inspecting does not have a balanced subtree beneath it. So, we know we need to re-balance this subtree.

The re-balancing function is on line 69. We first flatten the tree into a sorted list recursively. Now we can perform binary search on this list in order to create a new, balanced tree. There are no rotations involved like in AVL trees. We simply create a new tree that is guaranteed to be balanced. Then we hook up this new tree back into our original tree with the code on lines 44-48.

While the deletion function is not included, it follows the same logic and is a good exercise to implement. In deletion, the process is even easier because there is not search for a scapegoat. It is always the root node.

Property	Function
`size`	The number of nodes in the tree.
`root`	Pointer to the root of the tree.
`max_size`	The maximum size of the tree since its last full rebuild.

Property	Function
`key`	The value of the node.
`left`	Pointer to the left child of the node.
`right`	Pointer to the right child of the node.

Contents

Insertion

Deletion