Disjoint-set Data Structure (Union-Find)

Union-find, as it is popularly called, is a data structure that partitions objects into disjoint sets and lets us check whether two objects belong to the same set.

The most popular use of this data structure is to check whether one node in a graph can be reached from another, e.g. in Kruskal's algorithm, where it is used to avoid forming cycles.

Interface

This data structure is supposed to support two operations:

find(x): Returns a representative of the set to which x belongs. Two objects are in the same set exactly when find returns the same representative for both.
union(x,y): Merges the sets containing x and y.

Often, it is also equipped with a constructor that places every object into its own set.

Quick Find

Here is a very simple (but not very efficient) way to achieve what we want. We keep an array that stores, for each object, the id of the set it is in. The interface is implemented as follows:

find(x): Return the value at position x in the array. This is just O(1).
union(x,y): Scan through the array and change every entry equal to the set id of x into the set id of y. This is O(n).

class UnionFind{ //Quick Find

  int *sets;
  int N;

public:

  UnionFind(int n){ //Set up a union-find data structure with n elements
    N = n;
    sets = new int[N];
    for (int iii = 0; iii < N; iii++)
      sets[iii] = iii;
  }

  int find(int x){
    return sets[x];
  }

  void merge(int x, int y){ //Named merge because union is a reserved keyword in C++
    int root_x = find(x);
    int root_y = find(y);
    for (int iii = 0; iii < N; iii++)
      if (sets[iii] == root_x)
        sets[iii] = root_y;
  }

};
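The introduction talks about checking whether two objects belong to the same set, but the class above has no such query. As a minimal sketch, assuming a compact `std::vector`-based variant of the same idea, it can be expressed as a comparison of two find results. The names `QuickFind` and `connected` are hypothetical and not part of the original class.

```cpp
#include <cassert>
#include <vector>

// Hypothetical compact variant of the quick-find class above.
struct QuickFind {
    std::vector<int> sets;  // sets[i] is the id of the set containing i

    explicit QuickFind(int n) : sets(n) {
        for (int i = 0; i < n; i++) sets[i] = i;  // every element starts alone
    }

    int find(int x) const { return sets[x]; }  // O(1) lookup

    void merge(int x, int y) {  // O(n): relabel the whole set of x
        int root_x = find(x), root_y = find(y);
        for (int& s : sets)
            if (s == root_x) s = root_y;
    }

    // Assumed helper: two elements share a set iff their ids match.
    bool connected(int x, int y) const { return find(x) == find(y); }
};
```

Each call to `merge` walks the whole array, so a sequence of n merges costs on the order of n² steps, which is the weakness the next section addresses.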

Actually, we can do better than that. Let's see how.

Quick Union

This time, we will still use an array for storage, but we'll imagine it to be a forest.

We'll keep an array called parent to keep track of which element is whose parent. Each set forms a tree and is represented by the root node of that tree.

For example, a forest might have {1,2,5,6,7} forming one set (one tree) and {0,3,4} forming another.
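To make the forest concrete, here is one possible parent array for those two sets. The exact tree shapes are an assumption (many different forests describe the same sets), and `example_parent` and `root_of` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <vector>

// Assumed layout; parent[i] == i marks a root.
//
//       1          0
//      / \         |
//     2   5        3
//        / \       |
//       6   7      4
//
// index:                               0  1  2  3  4  5  6  7
const std::vector<int> example_parent = {0, 1, 1, 0, 3, 1, 5, 5};

// Walks parent links from x until it reaches a root.
int root_of(const std::vector<int>& parent, int x) {
    while (parent[x] != x) x = parent[x];
    return x;
}
```

With this array, every element of {1,2,5,6,7} leads to root 1 and every element of {0,3,4} leads to root 0.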

find(x): Follow parent links starting from x until an element that is its own parent (the root) is reached. If the unions are random enough, the trees tend to stay shallow and this does better than quick find's linear scan, but the worst case is still \(O(N)\) when a tree becomes very tall.
union(x,y): Find the root of x and make it point to the root of y.

class UnionFind{ //Quick Union

  int *parent;
  int N;

public:

  UnionFind(int n){ //Set up a union-find data structure with n elements
    N = n;
    parent = new int[N];
    for (int iii = 0; iii < N; iii++)
      parent[iii] = iii;
  }

  int find(int x){
    int root = x;
    while (parent[root] != root)
      root = parent[root];
    return root;
  }

  void merge(int x, int y){
    int root_x = find(x);
    int root_y = find(y);
    parent[root_x] = root_y;
  }

};
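The \(O(N)\) worst case is easy to trigger. The sketch below, assuming a compact `std::vector`-based variant of the class above, merges the elements in a chain and measures how far element 0 ends up from its root. `QuickUnion::depth` and `chain_depth` are hypothetical helpers added for this illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical compact variant of the quick-union class above.
struct QuickUnion {
    std::vector<int> parent;

    explicit QuickUnion(int n) : parent(n) {
        for (int i = 0; i < n; i++) parent[i] = i;
    }

    int find(int x) const {
        while (parent[x] != x) x = parent[x];
        return x;
    }

    // Root of x is hung under root of y, with no regard to tree shape.
    void merge(int x, int y) { parent[find(x)] = find(y); }

    // Assumed helper: number of parent links between x and its root.
    int depth(int x) const {
        int hops = 0;
        while (parent[x] != x) { x = parent[x]; hops++; }
        return hops;
    }
};

// Merges 0-1, 1-2, ..., (n-2)-(n-1) and returns the depth of element 0.
int chain_depth(int n) {
    QuickUnion uf(n);
    for (int i = 0; i + 1 < n; i++) uf.merge(i, i + 1);
    return uf.depth(0);
}
```

Each merge hangs the current root under the next element, so the tree degenerates into a linked list and `chain_depth(n)` comes out as n - 1.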

Weighting

The problem with the above data structure is that the trees might become too tall. This problem can be fixed by deciding correctly which tree should go under which.

[Figure: Tree 1, of height 3]

[Figure: Tree 2, of height 4]

Would it be a better idea to put Tree 1 under Tree 2 or Tree 2 under Tree 1?

Tree 2 has a height of 4 whereas Tree 1 has a height of 3. If we put Tree 2 under the root of Tree 1, we get a larger tree of height 5. However, putting Tree 1 under the root of Tree 2 still makes a tree of height 4.

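In general, when we have two trees of heights \(m\) and \(n\) such that \(m < n,\) we should put the tree of height \(m\) under the root of the tree of height \(n\) and still get a tree of height \(n\). If \(m = n,\) the result has height \(n + 1\) whichever way we link them.

To implement this, we keep an array size[i] that stores the number of objects in the tree rooted at i. The code below compares sizes rather than heights; always putting the smaller tree under the larger one gives the same logarithmic bound on the height.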

class UnionFind{ //Quick Union with Weighting

  int *parent;
  int *size;
  int N;

public:

  UnionFind(int n){ //Set up a union-find data structure with n elements
    N = n;
    parent = new int[N];
    size = new int[N];
    for (int iii = 0; iii < N; iii++){
      parent[iii] = iii;
      size[iii] = 1;
    }
  }

  int find(int x){
    int root = x;
    while (parent[root] != root)
      root = parent[root];
    return root;
  }

  void merge(int x, int y){
    int root_x = find(x);
    int root_y = find(y);
    if (root_x == root_y) return; //Already in the same set, so leave the sizes untouched
    if (size[root_y] > size[root_x]){ //Make sure that the smaller tree goes under the larger tree
      parent[root_x] = root_y;
      size[root_y] += size[root_x];
    }
    else{
      parent[root_y] = root_x;
      size[root_x] += size[root_y];
    }
  }

};

Now, both find and union work in \(O (\log n)\).

To see why, look at a single node. Its depth grows by one only when the tree containing it is placed under another tree, and that happens only when the other tree is at least as large.

The merged tree therefore has at least twice as many elements as the tree the node was in before.

But there are only \(n\) elements, so this doubling can happen at most \(\log_2 n\) times for any node.

Thus, the maximum height of a tree is in \(O (\log n),\) which bounds the number of steps find needs to reach the root.
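The bound can be checked on the merge order that is hardest for weighting: repeatedly pairing up trees of equal size. The following is a sketch under that assumption, using a compact `std::vector`-based variant of the weighted class. `depth` and `max_depth_after_pairing` are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical compact variant of the weighted quick-union class above.
struct WeightedUnion {
    std::vector<int> parent, size;

    explicit WeightedUnion(int n) : parent(n), size(n, 1) {
        for (int i = 0; i < n; i++) parent[i] = i;
    }

    int find(int x) const {
        while (parent[x] != x) x = parent[x];
        return x;
    }

    void merge(int x, int y) {
        int rx = find(x), ry = find(y);
        if (rx == ry) return;                       // already together
        if (size[ry] > size[rx]) std::swap(rx, ry); // rx is now the larger root
        parent[ry] = rx;                            // smaller tree goes under larger
        size[rx] += size[ry];
    }

    // Assumed helper: number of parent links between x and its root.
    int depth(int x) const {
        int hops = 0;
        while (parent[x] != x) { x = parent[x]; hops++; }
        return hops;
    }
};

// Merges equal-sized neighbours round by round, then reports the deepest node.
int max_depth_after_pairing(int n) {
    WeightedUnion uf(n);
    for (int step = 1; step < n; step *= 2)
        for (int i = 0; i + step < n; i += 2 * step)
            uf.merge(i, i + step);
    int deepest = 0;
    for (int i = 0; i < n; i++) deepest = std::max(deepest, uf.depth(i));
    return deepest;
}
```

For n = 1024 there are ten rounds, each of which deepens the merged-in half by one level, so the deepest node sits exactly \(\log_2 1024 = 10\) links from the root.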

Path Compression

Here is another idea: we are already touching every node on the path from x up to the root, so we might as well move those nodes closer to the root as we go.

That requires just one extra line of code in the find operation: the commented line inside the while loop of find below.

class UnionFind{ //Quick Union with Weighting and Path Compression

  int *parent;
  int *size;
  int N;

public:

  UnionFind(int n){ //Set up a union-find data structure with n elements
    N = n;
    parent = new int[N];
    size = new int[N];
    for (int iii = 0; iii < N; iii++){
      parent[iii] = iii;
      size[iii] = 1;
    }
  }

  int find(int x){
    int root = x;
    while (parent[root] != root){
      parent[root] = parent[parent[root]]; //Push up the node by one level
      root = parent[root];
    }
    return root;
  }

  void merge(int x, int y){
    int root_x = find(x);
    int root_y = find(y);
    if (root_x == root_y) return; //Already in the same set, so leave the sizes untouched
    if (size[root_y] > size[root_x]){ //Make sure that the smaller tree goes under the larger tree
      parent[root_x] = root_y;
      size[root_y] += size[root_x];
    }
    else{
      parent[root_y] = root_x;
      size[root_x] += size[root_y];
    }
  }

};
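The effect of the extra line in find, a variant often called path halving, can be seen on a deliberately bad tree. The sketch below works on a bare parent array rather than the class, and `make_chain`, `depth` and `find_with_halving` are hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <vector>

// Builds the worst case: 0 is the root and every i > 0 points to i - 1.
std::vector<int> make_chain(int n) {
    std::vector<int> parent(n);
    for (int i = 0; i < n; i++) parent[i] = (i == 0) ? 0 : i - 1;
    return parent;
}

// Number of parent links between x and its root.
int depth(const std::vector<int>& parent, int x) {
    int hops = 0;
    while (parent[x] != x) { x = parent[x]; hops++; }
    return hops;
}

// Same loop as find in the class above: each visited node is re-pointed
// to its grandparent before the walk continues from there.
int find_with_halving(std::vector<int>& parent, int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}
```

On a chain of nine elements, element 8 starts eight links from the root. One call to `find_with_halving(parent, 8)` leaves it four links away, and a second call leaves it two links away, so repeated queries keep shortening the paths they traverse.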

In practice, this keeps the trees almost flat. In fact, Hopcroft and Ullman proved that it makes the operations run in \(O (\log ^* n)\) amortized time.

\(\log ^* n\) is the number of times one needs to apply \(\log\) to \(n\) to get a value less than or equal to 1. In practice, one can treat it as almost \(O(1),\) since \(\log ^* n\) does not exceed 5 until \(n\) exceeds \(2^{65536}.\)

Tarjan later improved the bound to \(O\big(\alpha (n)\big)\) amortized time per operation, where \(\alpha\) is the inverse Ackermann function.
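The definition translates directly into a loop. This is a small sketch assuming base-2 logarithms and floating-point input; `log_star` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Counts how many times log2 must be applied to n before the value is <= 1.
int log_star(double n) {
    int count = 0;
    while (n > 1.0) {
        n = std::log2(n);
        count++;
    }
    return count;
}
```

Even for an input as large as `1e300`, the loop runs only five times, which is why the factor is treated as a constant for any input that fits in memory.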
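To tie this back to the Kruskal's algorithm use case mentioned at the start, here is a minimal sketch of how the structure can be used to skip edges that would close a cycle. It assumes a compact `std::vector`-based variant with weighting and path compression whose merge reports whether anything changed. `Edge`, `DisjointSets` and `kruskal_weight` are hypothetical names, not taken from the classes above.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

struct Edge { int u, v, weight; };

// Hypothetical compact union-find with weighting and path compression.
struct DisjointSets {
    std::vector<int> parent, size;

    explicit DisjointSets(int n) : parent(n), size(n, 1) {
        std::iota(parent.begin(), parent.end(), 0);  // parent[i] = i
    }

    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // push the node up one level
            x = parent[x];
        }
        return x;
    }

    // Returns false when x and y were already in the same set.
    bool merge(int x, int y) {
        int rx = find(x), ry = find(y);
        if (rx == ry) return false;
        if (size[ry] > size[rx]) std::swap(rx, ry);
        parent[ry] = rx;
        size[rx] += size[ry];
        return true;
    }
};

// Total weight of a minimum spanning forest over vertices 0..n-1.
int kruskal_weight(int n, std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.weight < b.weight; });
    DisjointSets sets(n);
    int total = 0;
    for (const Edge& e : edges)
        if (sets.merge(e.u, e.v))  // endpoints were in different components
            total += e.weight;     // so this edge cannot close a cycle
    return total;
}
```

An edge whose endpoints already share a representative would connect two vertices that can reach each other, so taking it would form a cycle; `merge` returning false is exactly that test.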
