Disjoint Sets and the Maximal Tourism Problem
I recently solved a pretty interesting problem on HackerRank. The question is as follow:
You are given all of the bus routes that may connect different cities in a country. Some cities will not be able to access other cities by bus, and you may fly to one city and travel only by bus from that point forward. You are then asked to determine the maximum number of cities you could visit. They gave this illustration as an example:
In this example, the answer is easy: 4. You can see that a bus may connect a city to itself, other cities, or no bus routes may exist. Our goal is to calculate this number quickly for up to 100,000 cities/routes.
This problem, while seemingly complicated can be solved very quickly with the right approach. The approach I took was to use Disjoint sets with path compression. This is a graph theory approach and so it is worth noting that a basic graph has two parts, which you can see in the above graph: nodes are represented as circles and vertices are represented as lines connecting nodes. Vertices may or may not be directed. In our problem, they are not directed.
Disjoint Sets
A Disjoint set is a graph that has 3 essential functions: makeSet, merge/union, and findSet. The functions all work to make it easy to determine what subset any given node belongs to.
The makeSet simply adds a node to your space and has a result that looks like the following:
As you can see, we have added nodes to a space that are all pointing their “parent” pointer to themselves.
The merge or union function pairs points one nodes parent to another node:
In the diagram above, the node, specified first in the function, was chosen as the parent. This is potentially dangerous because this means that you can change the parent of an already merged node within a subset, which is usually not the purpose of a disjoint set.
There are various ways you can write the logic for how a parent is chosen. For example, each node might have ranks and the higher rank is considered the parent. You should also only choose to modify the parent pointer of the highest parent link. You could break nodes with equal ranks by choosing the one with a smaller integer payload. You could always increase the rank of the chosen parent by 1. This would be a safer method, but for the purpose of keeping our diagrams simple, we will stick to having the first node specified as the parent.
Also notice that Node 1 does not point to Node 2, this is because it doesn’t matter whether or not you can get to Node 2. It matters that Node 2 is a member of the subset “1”.
The find function serves to find the ultimate parent node where
find(this).id == this.id. For example, here are the results of the finds for our current graph:
This is how we arrive to the fact that Node 2 belongs to the subset 1, and that Node 4 belongs to the subset 3.
If we were to point 1 to 3 the find process would look like the following:
In the above case we find that all nodes belong to the subset 3, but it might take a long time to get to that conclusion. For example, suppose there were n vertices between 2 and 3. Now the process of determining it’s subset will always take O(n) time, and subsequent nodes along the way will take O(n-1), O(n-2), O(n-3), .. O(1) time. This can be a major overhead, which is why we can introduce path compression. Consider this graph below:
Now, suppose find(5) is called first. We find that it’s subset is 3, but while we recurse, we modify the parent of each node on the way to be 3. This means that finding the parent of Node 2, for example, will now be O(1) instead of O(n-1). This will improve performance dramatically as the graph is made more direct. This is okay because the integrity of how we get to 3 doesn’t matter, it only matters that we get to 3.
Maximal Tourism Sample Input
Now, let’s get back to the problem at hand. I encourage you to try to write a solution for the following input before moving forward:
Input:
8 5
1 2
7 4
7 3
5 8
1 3
Output:
5
The first line denotes how many nodes there are and how many vertices there are in that order. The subsequent lines describe which nodes are connected to which. This results in the following graph:
Starting from city 6, you can only visit 1 city, 5 or 8 return 2 cities you can visit, 4, 7, 3, 1, and 2 would allow for 5 cities. 5 is the largest so we output 5.
Maximal Tourism Solution
Hopefully, you had a chance to solve it, now I will share my solution to this problem by going through code snippets of how I created my Disjoint Set.
Node Class
class Node {
public:
long data;
Node *parent;
int rank;
};
Disjoint Set Class
This is the meat and potatoes of this solution. You will notice that this uses ranks to assign a parent, path compression to optimize as we find, and a wrapper findSet so that we can find the subset of a node using the number it is identified by.
class DisjointSet {
public:
class Node {
public:
long data;
Node *parent;
int rank;
};
long findSet(long data) {
//Calls the private findset that uses nodes
return findSet(nodeMap.find(data)->second)->data;
}
bool merge(long data1, long data2) {
//Get the actual nodes from our long to Node map.
Node *node1 = nodeMap.find(data1)->second;
Node *node2 = nodeMap.find(data2)->second;
Node *parent1 = findSet(node1);
Node *parent2 = findSet(node2);
if (parent1 == parent2) return false; //Shouldn't merge two nodes in the same subset
//Choose parent1 as the parent if it's rank is greater or equal to parent 2's rank.
if (parent1->rank >= parent2->rank) {
//raise parent1's rank if there is a tie in ranks.
parent1->rank = (parent1->rank == parent2->rank) ? parent1->rank + 1 : parent1->rank;
parent2->parent = parent1;
} else {
//We are choosing parent2 as the parent here.
parent1->parent = parent2;
}
return true;
}
//Add a new node to the space
void makeSet(long data) {
Node *node = new Node();
node->data = data;
node->parent = node;
node->rank = 0;
map<long, Node *>::iterator it = nodeMap.begin();
nodeMap.insert(it, pair<long, Node *>(data, node));
}
private:
map<long, Node *> nodeMap; //This keeps track of all of our nodes, so that we can easily iterate from 0 to n through hem
Node *findSet(Node *node) {
Node *parent = node->parent;
if (parent == node) return parent;
node->parent = findSet(node->parent); //Set the a nodes parent to the result of the find.
return node->parent;
}
};
The Solution
Finally here is the solution that creates an actual Disjoint Set, and finds the largest subset size:
#include <bits/stdc++.h>
using namespace std;
class DisjointSet {
public:
class Node {
public:
long data;
Node *parent;
int rank;
};
long findSet(long data) {
//Calls the private findset that uses nodes
return findSet(nodeMap.find(data)->second)->data;
}
bool merge(long data1, long data2) {
//Get the actual nodes from our long to Node map.
Node *node1 = nodeMap.find(data1)->second;
Node *node2 = nodeMap.find(data2)->second;
Node *parent1 = findSet(node1);
Node *parent2 = findSet(node2);
if (parent1 == parent2) return false; //Shouldn't merge two nodes in the same subset
//Choose parent1 as the parent if it's rank is greater or equal to parent 2's rank.
if (parent1->rank >= parent2->rank) {
//raise parent1's rank if there is a tie in ranks.
parent1->rank = (parent1->rank == parent2->rank) ? parent1->rank + 1 : parent1->rank;
parent2->parent = parent1;
} else {
//We are choosing parent2 as the parent here.
parent1->parent = parent2;
}
return true;
}
//Add a new node to the space
void makeSet(long data) {
Node *node = new Node();
node->data = data;
node->parent = node;
node->rank = 0;
map<long, Node *>::iterator it = nodeMap.begin();
nodeMap.insert(it, pair<long, Node *>(data, node));
}
private:
map<long, Node *> nodeMap; //This keeps track of all of our nodes, so that we can easily iterate from 0 to n through hem
Node *findSet(Node *node) {
Node *parent = node->parent;
if (parent == node) return parent;
node->parent = findSet(node->parent); //Set the a nodes parent to the result of the find.
return node->parent;
}
};
int main() {
int n;
int m;
DisjointSet d;
vector<long> answer;
cin >> n >> m; //Get the number of nodes and vertices.
//Create each node, and an answer associated with it's subset so far not counting itself.
for (int i = 0; i < n; i++) {
d.makeSet(i);
answer.push_back(0);
}
//Let's merge based on the vertices we are told exist.
for (int route_i = 0; route_i < m; route_i++) {
int u, v;
cin >> u >> v;
d.merge(u-1, v-1);
}
//Find the parent of each node, and increment the subset it belongs to by 1.
for(int i = 0; i < n; i++) {
answer[d.findSet(i)]++;
}
//Sort and get the last element which is the max
sort(answer.begin(), answer.end());
cout << answer[answer.size() - 1];
return 0;
}
Last Remarks
I hope this has been a helpful run through on how to use a disjoint set and a pretty cool competitive programming question around it. Please ask if there are any questions — until next time!