Disjoint-Set Data Structure (Union-Find)
Union-find, as it is popularly called, is a data structure that partitions objects into disjoint sets and lets us check whether two objects belong to the same set.
The most popular use of the data structure is to check whether one node in a graph can be reached from another, e.g. in Kruskal's algorithm to avoid forming cycles.
Interface
This data structure is supposed to support two operations:
find(x): Returns some representative of the set to which x belongs.
union(x, y): Merges the sets containing x and y.
Often, it can be equipped with a constructor that organizes every object into its own set.
Quick Find
Here is a very simple (but not very efficient) way to achieve what we want. We keep an array that records which set each object is in. The interface is implemented as follows:
find(x): Return the value at position x in the array. This is just O(1).
union(x, y): Scan through the array and replace every entry equal to the set id of y with the set id of x. This is O(n).
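The original code listing was lost in extraction; here is a minimal sketch of Quick Find under the description above (class and variable names are illustrative):

```python
class QuickFind:
    def __init__(self, n):
        # Constructor: every object starts in its own set.
        self.ids = list(range(n))

    def find(self, x):
        # O(1): just read the set id stored at position x.
        return self.ids[x]

    def union(self, x, y):
        # O(n): rewrite every entry holding y's set id with x's set id.
        x_id, y_id = self.ids[x], self.ids[y]
        for i in range(len(self.ids)):
            if self.ids[i] == y_id:
                self.ids[i] = x_id
```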
Actually, we can do better than that. Let's see how.
Quick Union
This time, we will still use an array for storage, but we'll interpret it as a forest.
We'll keep an array called parents to keep track of each element's parent. Each set forms a tree, represented by its root node.
Here is an example of a forest where {1, 2, 5, 6, 7} form one set and {0, 3, 4} form another.
find(x): Recursively follow the parent of x until an element which is its own parent is encountered. Because this is a tree, if the unions were random enough this should do better, but the worst case is \(O(N)\) if the tree is very tall.
union(x, y): Find the root of x and make it point towards the root of y.
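Again, the listing did not survive extraction; a minimal Quick Union sketch along the lines above (names are illustrative) could look like this:

```python
class QuickUnion:
    def __init__(self, n):
        # Every element starts as the root of its own one-node tree.
        self.parents = list(range(n))

    def find(self, x):
        # Climb parent pointers until reaching a self-parented root.
        while self.parents[x] != x:
            x = self.parents[x]
        return x

    def union(self, x, y):
        # Attach x's root under y's root.
        self.parents[self.find(x)] = self.find(y)
```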
Weighting
The problem with the above data structure is that the trees might become too tall. This can be fixed by choosing carefully which tree goes under which.
Would it be a better idea to put Tree 1 under Tree 2 or Tree 2 under Tree 1?
Tree 2 has a height of 4 whereas Tree 1 has a height of 3. If we put Tree 2 under the root of Tree 1, we get a larger tree of height 5. However, putting Tree 1 under the root of Tree 2 still makes a tree of height 4.
In general, when we have two trees of heights \(m\) and \(n\) such that \(m \leq n,\) we should put the tree of height \(m\) under the root of the tree of height \(n,\) which yields a tree whose height is still \(n\) (or \(n+1\) when \(m = n\)).
To implement this, we keep an array size, where size[i] tracks the number of objects in the tree rooted at i.
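The weighted version's listing is also missing; here is a minimal sketch of Quick Union with union by size, as described above (illustrative naming):

```python
class WeightedQuickUnion:
    def __init__(self, n):
        self.parents = list(range(n))
        self.size = [1] * n  # size[i] = objects in the tree rooted at i

    def find(self, x):
        while self.parents[x] != x:
            x = self.parents[x]
        return x

    def union(self, x, y):
        root_x, root_y = self.find(x), self.find(y)
        if root_x == root_y:
            return
        # Always hang the smaller tree under the root of the larger one.
        if self.size[root_x] < self.size[root_y]:
            root_x, root_y = root_y, root_x
        self.parents[root_y] = root_x
        self.size[root_x] += self.size[root_y]
```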
Now, both find and union work in \(O (\log n)\).
A tree's height increases by at most one when another tree of greater or equal size is unioned with it.
Since that other tree is at least as large, the merged tree must have at least double the number of elements.
But there are only \(n\) elements, so the doubling can happen at most \(\log n\) times.
Thus, the maximum height of a tree is in \(O (\log n),\) which bounds the number of steps we need to reach the root.
Path Compression
Here is another idea: we're already touching all the nodes from x up to the root. Why not push them up the tree as we go?
That requires just one extra line of code in the find operation.
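Since the original listing is gone, here is a minimal sketch combining union by size with path compression; the recursive find contains the one extra line (naming is illustrative):

```python
class UnionFind:
    def __init__(self, n):
        self.parents = list(range(n))
        self.size = [1] * n

    def find(self, x):
        if self.parents[x] != x:
            # The extra line: point x directly at its root as recursion unwinds.
            self.parents[x] = self.find(self.parents[x])
        return self.parents[x]

    def union(self, x, y):
        root_x, root_y = self.find(x), self.find(y)
        if root_x == root_y:
            return
        if self.size[root_x] < self.size[root_y]:
            root_x, root_y = root_y, root_x
        self.parents[root_y] = root_x
        self.size[root_x] += self.size[root_y]
```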
In practice, this keeps the trees almost flat. In fact, it makes the operations work in \(O (\log ^* n)\) time, as proved by Hopcroft and Ullman.
\(\log ^* n\) is the number of times one needs to apply \(\log\) to \(n\) to get a value less than or equal to 1. In practice, one can think of it as almost \(O(1),\) since it exceeds 5 only once \(n\) exceeds \(2^{65536}.\)
The bounds were later improved by Tarjan to \(O\big(\alpha (n)\big),\) where \(\alpha\) is the inverse Ackermann function.