# Comparing ancestry and closure_tree for your nested data structure

When it comes to implementing ActiveRecord nesting, there are a few popular implementations. In this post I will look closer at how Ancestry and Closure tree work and what the pros and cons are.

## How ancestry works

Ancestry works by adding one column (by default named “ancestry”) that replaces the parent_id. Instead of just storing the parent_id, ancestry stores the whole ancestry path to the parent_id.

For example if you have this tree structure.

(id: 1) Page 1
(id: 2)   Page 1.1
(id: 3)   Page 1.2
(id: 4)     Page 1.2.1


Here is how it looks in the database

+----+------------+----------+
| id | name       | ancestry |
+----+------------+----------+
|  1 | Page 1     | NULL     |
|  2 | Page 1.1   | 1        |
|  3 | Page 1.2   | 1        |
|  4 | Page 1.2.1 | 1/3      |
+----+------------+----------+


This way you can always find the children of a node by selecting where ancestry = "{self.ancestry}/{self.id}". And you can find all the descendants of a node by selecting where (ancestry = "{self.ancestry}/{self.id}" OR ancestry LIKE "{self.ancestry}/{self.id}/%")

To select all ancestors of a node, you just select where id IN (ancestry_ids)

## How closure_tree works

Closure tree is a little more complicated than Ancestry. In addition to using the regular parent_id on your model, it also uses a separate hierarchy table.

So if you have a Page model, you also need to have page_hierarchies table that has the following columns: ancestor_id, descendant_id, and generations.

What are all these columns for? Let’s figure that out by observing how it populates the table. Using the same structure

(id: 1) Page 1
(id: 2)   Page 1.1
(id: 3)   Page 1.2
(id: 4)     Page 1.2.1


closure_tree created the following rows in the hierarchy table:

+-------------+---------------+-------------+
| ancestor_id | descendant_id | generations |
+-------------+---------------+-------------+
|           1 |             1 |           0 |
|           1 |             2 |           1 |
|           1 |             3 |           1 |
|           1 |             4 |           2 |
|           2 |             2 |           0 |
|           3 |             3 |           0 |
|           3 |             4 |           1 |
|           4 |             4 |           0 |
+-------------+---------------+-------------+


It basically creates a single row for every descendant an ancestor has, and another row where the ancestor_id = descendant_id.

Because “Page 1” is the root ancestor, there is one row for itself and 3 more rows for each node under “Page 1”. This way you can find all the descendants by selecting where ancestor_id = self.id and joining with the main node table so you can do this in 1 select query. In the same way, you can find all the ancestor by selecting where descendant_id = self.id.

The generations column indicates how many generations the relation between the ancestor_id and the descendant_id. For example (id: 1) is the direct parent of (id: 2), therefore their relation has generations:1. All rows where ancestor_id = descendant_id has generations:0.

## Benchmark

### Write

Populating deeply nested data (25 level deep)

Ancestry: 0.083076
Closure tree: 0.215164


Populating shallow nested data

Ancestry: 0.192071
Closure tree: 0.490023


So in general closure_tree is more than 2 times slower at inserting data, because it has to do a lot more inserts to the hierarchy table.

Let’s see how they do for selecting the ancestors and descendants

                              user     system      total        real
ancestry.ancestors        0.000000   0.000000   0.000000 (  0.009915)
closure_tree.ancestors    0.020000   0.010000   0.030000 (  0.030023)

user     system      total        real
ancestry.descendants      0.010000   0.000000   0.010000 (  0.014332)
closure_tree.descendants  0.020000   0.010000   0.030000 (  0.021242)


The select benchmark might be really dependent on db caching. Assuming the same db and caching mechanism, ancestry wins with a pretty good margin.

### Update

Because each node stores information about its ancestors, moving a node means we need to update all the descendants to reflect the new ancestry tree.

The following benchmark move a root node with 14 nodes under it, and then move it back to root.

                              user     system      total        real
ancestry - move node      0.380000   0.130000   0.510000 (  0.639941)
closure_tree - move node  0.780000   0.150000   0.930000 (  1.802662)


Moving a root node with 25 nodes under it, and then move it back to root.

                              user     system      total        real
ancestry - move node      0.510000   0.170000   0.680000 (  0.799853)
closure_tree - move node  1.260000   0.180000   1.440000 (  2.733766)


## Summary

### Ancestry Pros:

• Simpler data structure to understand
• In general faster (inserting, selecting and moving nodes around)

### Ancestry Cons:

• Because the ancestry column is a varchar(255), there is a limitation on the tree depth (depending on what type of primary key you use). An INT in MySql has a max value of 2147483647 (10 digits), which means that you are limited to 25511 = 23 ancestors. You have to make sure that your application does not allow creating more nested data than this limit.

### closure_tree Pros:

• If you have a huge existing nested data (using parent_id), you don’t have to run migration to add a new column (which could take a long time).
• Unlimited level of nesting

### closure_tree Cons:

• Slower
• Not ideal if you need to move nodes often

The source code for this benchmark is available on github.