Comparing ancestry and closure_tree for your nested data structure

Sun, Sep 13, 2015

When it comes to implementing ActiveRecord nesting, there are a few popular implementations. In this post I will look closer at how Ancestry and Closure tree work and what the pros and cons are.

How ancestry works

Ancestry works by adding one column (by default named “ancestry”) that replaces the parent_id. Instead of just storing the parent_id, ancestry stores the whole ancestry path to the parent_id.

For example if you have this tree structure.

(id: 1) Page 1
(id: 2)   Page 1.1
(id: 3)   Page 1.2
(id: 4)     Page 1.2.1

Here is how it looks in the database

+----+------------+----------+
| id | name       | ancestry |
+----+------------+----------+
|  1 | Page 1     | NULL     |
|  2 | Page 1.1   | 1        |
|  3 | Page 1.2   | 1        |
|  4 | Page 1.2.1 | 1/3      |
+----+------------+----------+

This way you can always find the children of a node by selecting where ancestry = "{self.ancestry}/{self.id}". And you can find all the descendants of a node by selecting where (ancestry = "{self.ancestry}/{self.id}" OR ancestry LIKE "{self.ancestry}/{self.id}/%")

To select all ancestors of a node, you just select where id IN (ancestry_ids)

How closure_tree works

Closure tree is a little more complicated than Ancestry. In addition to using the regular parent_id on your model, it also uses a separate hierarchy table.

So if you have a Page model, you also need to have page_hierarchies table that has the following columns: ancestor_id, descendant_id, and generations.

What are all these columns for? Let’s figure that out by observing how it populates the table. Using the same structure

(id: 1) Page 1
(id: 2)   Page 1.1
(id: 3)   Page 1.2
(id: 4)     Page 1.2.1

closure_tree created the following rows in the hierarchy table:

+-------------+---------------+-------------+
| ancestor_id | descendant_id | generations |
+-------------+---------------+-------------+
|           1 |             1 |           0 |
|           1 |             2 |           1 |
|           1 |             3 |           1 |
|           1 |             4 |           2 |
|           2 |             2 |           0 |
|           3 |             3 |           0 |
|           3 |             4 |           1 |
|           4 |             4 |           0 |
+-------------+---------------+-------------+

It basically creates a single row for every descendant an ancestor has, and another row where the ancestor_id = descendant_id.

Because “Page 1” is the root ancestor, there is one row for itself and 3 more rows for each node under “Page 1”. This way you can find all the descendants by selecting where ancestor_id = self.id and joining with the main node table so you can do this in 1 select query. In the same way, you can find all the ancestor by selecting where descendant_id = self.id.

The generations column indicates how many generations the relation between the ancestor_id and the descendant_id. For example (id: 1) is the direct parent of (id: 2), therefore their relation has generations:1. All rows where ancestor_id = descendant_id has generations:0.

Benchmark

Write

Populating deeply nested data (25 level deep)

Ancestry: 0.083076
Closure tree: 0.215164

Populating shallow nested data

Ancestry: 0.192071
Closure tree: 0.490023

So in general closure_tree is more than 2 times slower at inserting data, because it has to do a lot more inserts to the hierarchy table.

Read

Let’s see how they do for selecting the ancestors and descendants

                              user     system      total        real
ancestry.ancestors        0.000000   0.000000   0.000000 (  0.009915)
closure_tree.ancestors    0.020000   0.010000   0.030000 (  0.030023)

                              user     system      total        real
ancestry.descendants      0.010000   0.000000   0.010000 (  0.014332)
closure_tree.descendants  0.020000   0.010000   0.030000 (  0.021242)

The select benchmark might be really dependent on db caching. Assuming the same db and caching mechanism, ancestry wins with a pretty good margin.

Update

Because each node stores information about its ancestors, moving a node means we need to update all the descendants to reflect the new ancestry tree.

The following benchmark move a root node with 14 nodes under it, and then move it back to root.

                              user     system      total        real
ancestry - move node      0.380000   0.130000   0.510000 (  0.639941)
closure_tree - move node  0.780000   0.150000   0.930000 (  1.802662)

Moving a root node with 25 nodes under it, and then move it back to root.

                              user     system      total        real
ancestry - move node      0.510000   0.170000   0.680000 (  0.799853)
closure_tree - move node  1.260000   0.180000   1.440000 (  2.733766)

Summary

Ancestry Pros:

Simpler data structure to understand
In general faster (inserting, selecting and moving nodes around)

Ancestry Cons:

Because the ancestry column is a varchar(255), there is a limitation on the tree depth (depending on what type of primary key you use). An INT in MySql has a max value of 2147483647 (10 digits), which means that you are limited to 255/11 = 23 ancestors. You have to make sure that your application does not allow creating more nested data than this limit.

closure_tree Pros:

If you have a huge existing nested data (using parent_id), you don’t have to run migration to add a new column (which could take a long time).
Unlimited level of nesting

closure_tree Cons:

Slower
Not ideal if you need to move nodes often

The source code for this benchmark is available on github.