This was originally written in September 2018, but never published (for unknown reasons). I found it in the drafts section of the website today.
In this post, you will see me change approach and not use a secondary table in one instance, but use multiple secondary tables in another. In one case, you will also see my deliberations regarding which approach to take.
I am a fan of storing the raw data in a table. I want the original values available for future use. Sure, I often store calculated values. I sometimes refer to those as derived values.
For example, in the FreshPorts database, there is an element table, which is self-referencing, to implement a tree-like structure. This table represents a directory structure.
freshports.dev=# \d element Table "public.element" Column | Type | Collation | Nullable | Default ---------------------+--------------+-----------+----------+------------------------------------- id | integer | | not null | nextval('element_id_seq'::regclass) name | text | | not null | parent_id | integer | | | directory_file_flag | character(1) | | not null | status | character(1) | | not null | Indexes: "element_pkey" PRIMARY KEY, btree (id) "element_delete" btree (status) WHERE status = 'D'::bpchar "element_name" btree (name) "element_parent_id" btree (parent_id) Foreign-key constraints: "$1" FOREIGN KEY (parent_id) REFERENCES element(id) ON UPDATE RESTRICT ON DELETE CASCADE
The above output excludes the constraints, references, and triggers, but is meant to demonstrate that parent_id refers to another row in this table. By following the parent_id link, you can construct the directory pathname to a given location.
There is a function which derives the pathname:
freshports.dev=# select element_pathname(3884); element_pathname -------------------------------- /ports/head/astro/xephem/files (1 row) freshports.dev=#
This function was one of the earliest created for FreshPorts and it a WHILE loop to traverse the table. It runs quickly, often in 0.004 ms or so. However, sometimes you have to invoke this function many thousands of times or need to select a range of such values. This is why the element_pathname table was created:
freshports.dev=# \d element_pathname Table "public.element_pathname" Column | Type | Collation | Nullable | Default ------------+---------+-----------+----------+--------- element_id | integer | | not null | pathname | text | | not null | Indexes: "element_pathname_pathname" UNIQUE, btree (pathname) "element_pathname_element_id" btree (element_id) Foreign-key constraints: "element_pathname_element_id_fkey" FOREIGN KEY (element_id) REFERENCES element(id) ON DELETE CASCADE
The query is very straight forward and also very fast:
freshports.dev=# select E.id, EP.pathname FROM element E JOIN element_pathname EP ON E.id = EP.element_id WHERE E.id = 3884; id | pathname ------+-------------------------------- 3884 | /ports/head/astro/xephem/files (1 row)
This works for one element or many thousands of elements and it scales well.
In this post, I will discuss my plan for implementing the above solution for another situation which has recently arose. I hope to share my design decisions and solicit recommendations for improvement. My work on FreshPorts is sporadic at best, but this particular CONFLICTS issue has my interest now.
Background
If you are familiar with FreshPorts, FreeBSD ports, and CONFLICTS, you can skip this section.
FreshPorts displays information about the applications available via the FreeBSD ports infrastructure. FreshPorts obtains information from the ports tree mainly by the use of make -V. Various values are pulled using this method and stored in a database for later display upon the website.
The FreeBSD ports tree / package collection has a method for resolving conflicts. This feature allows a port maintainer to specify what, if any, packages conflict with a given port. The reasons why two ports might conflict is outside the scope of this article. For background reading:
A recent request to display CONFLICTS information has arrived. I’ve been thinking about it for nearly two weeks and this morning I decided upon an approach.
What are conflicts? Some applications conflict with other applications. For example, you usually do not install Bacula v7 and Bacula v9 on the same system. Yes, it can be done, but you usually do not. The FreeBSD ports system allows port maintainers to specify conflicts.
With respect to the FreeBSD ports, a CONFLICTS value is obtained from the Makefile via make -V:
# cd /usr/ports/sysutils/bacula-server # make -V CONFLICTS -V CONFLICTS_BUILD -V CONFLICTS_INSTALL bacula5-server-*
The two blank likes are significant because they represent empty values.
NOTE: grep is not an appropriate method for obtaining the values because the values may be contingent upon other configuration settings.
globs not regex
CONFLICTS can involve wildcards, such as the * shown above. It is critical to realize that these are globs, meant for matching disk filenames, as opposed to regular expressions.
From the documentation
# CONFLICTS - A list of package name patterns that the port conflicts # with, separated by blanks. The names may include shell # pattern meta-characters "*", "?", "[", "]", and "!". # Example: apache*-1.2* apache*-1.3.[012345] apache-*+ssl_*
For FreshPorts, we want to display the list of matching ports which conflict with FreshPorts. Following a discussion on the FreeBSD Ports mailing list, I fashioned a query which obtains the desired results:
SELECT distinct categoryport(id) FROM (SELECT P.id, P.package_name || '-' || CLP.port_version AS release FROM ports P JOIN commit_log_ports CLP on P.id = CLP.port_id) TMP WHERE TMP.release ~ 'bacula5-server-*';
The result is as expected:
sysutils/bacula5-server
My initial idea involved altering the CONFLICTS value to remove the regex portion and do an exact match on PKGNAME (i.e. P.package_name). That idea has since been discarded after reviewing the data. Some CONFLICTS values are straight forward (e.g. p5-Log-Any-Adapter-Syslog) whereas others get more complex (e.g. mariadb10[013-9]-client-*) and merely dropping the regex part will yield incorrect results).
Therefore, the search is necessary.
The starting query
The data I am searching for is package names and versions which match a given regex expression. In the port Makefile, these values are recorded in the PKGNAME and PORTREVISION values of the port Makefile. In the database, that data is stored in the package_name and port_revision columns of the ports table.
To construct the term upon which we are going to search, we need to concatenate the stored values: P.package_name || ‘-‘ || CLP.port_version
The resulting query for a given search term is:
SELECT distinct categoryport(id) FROM (SELECT P.id, P.package_name || '-' || CLP.port_version AS release FROM ports P JOIN commit_log_ports CLP on P.id = CLP.port_id) TMP WHERE TMP.release ~ 'bacula5-server-*';
For those interested, here is the query in detail:
freshports.dev=# explain analyse SELECT distinct categoryport(id) FROM (SELECT P.id, P.package_name || '-' || CLP.port_version AS release FROM ports P JOIN commit_log_ports CLP on P.id = CLP.port_id) TMP WHERE TMP.release ~ 'bacula5-server-*'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------- Unique (cost=40147.78..40170.86 rows=4616 width=32) (actual time=2261.668..2261.671 rows=1 loops=1) -> Sort (cost=40147.78..40159.32 rows=4616 width=32) (actual time=2261.667..2261.668 rows=10 loops=1) Sort Key: (categoryport(p.id)) Sort Method: quicksort Memory: 25kB -> Hash Join (cost=11858.09..39866.84 rows=4616 width=32) (actual time=2150.543..2261.648 rows=10 loops=1) Hash Cond: (clp.port_id = p.id) Join Filter: (((p.package_name || '-'::text) || clp.port_version) ~ 'bacula5-server-*'::text) Rows Removed by Join Filter: 923079 -> Seq Scan on commit_log_ports clp (cost=0.00..15113.27 rows=923227 width=10) (actual time=0.037..216.532 rows=923089 loops=1) -> Hash (cost=10913.82..10913.82 rows=51382 width=17) (actual time=80.327..80.327 rows=49699 loops=1) Buckets: 16384 Batches: 4 Memory Usage: 741kB -> Seq Scan on ports p (cost=0.00..10913.82 rows=51382 width=17) (actual time=0.018..60.782 rows=49699 loops=1) Planning time: 0.400 ms Execution time: 2261.699 ms (14 rows) freshports.dev=#
The data is derived from stored values and we must scan the entire ports table every time. There is a lot of data there and it takes time.
The server has about 60G RAM with about 16G RAM free as I ran the above query. Everything is on SSDs.
I bet someone will figure out a great way to optimize the above query, but I’m going to go another way. I’m going to store the calculated values in a table and I will keep it updated via triggers. By using triggers, the change will be completely transparent to the code which populates / maintains the database. That is the part I really like: no code changes.
The proposed table
This is the table which I originally considered. It is not what I implemented.
While reading this section, keep in mind that I was going back and forth in my mind as to whether or not to create a new table or update in place.
CREATE TABLE commit_log_ports_revisions ( commit_log_id INTEGER REFERENCES commit_log(id), port_id INTEGER REFERENCES ports(id), port_name_revision TEXT NOT NULL);
Why a new table? Putting the data in a new table avoids multiple updates to the same table when a change occurs. For example:
- the CONFLICTS field is updated
- the trigger is invoked
- the table is updated again
Instead, we will update a second table, not the original table.
But first, let’s look at the table from where we are getting this data:
freshports.dev=# \d commit_log_ports Table "public.commit_log_ports" Column | Type | Collation | Nullable | Default ---------------+----------+-----------+----------+--------- commit_log_id | integer | | not null | port_id | integer | | not null | needs_refresh | smallint | | not null | port_version | text | | | port_revision | text | | | port_epoch | text | | | Indexes: "commit_log_ports_pkey" PRIMARY KEY, btree (commit_log_id, port_id) "commit_log_ports_needs_refresh_idx" btree (needs_refresh) WHERE needs_refresh <> 0 "commit_log_ports_port_id" btree (port_id) "needs_refresh" btree (needs_refresh) Foreign-key constraints: "$1" FOREIGN KEY (commit_log_id) REFERENCES commit_log(id) ON UPDATE CASCADE ON DELETE CASCADE "$2" FOREIGN KEY (port_id) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE Triggers: commit_log_ports_insert AFTER INSERT ON commit_log_ports FOR EACH ROW EXECUTE PROCEDURE commit_log_ports_insert() freshports.dev=#
We could accomplish the same result by adding the port_name_revision column to the commit_log_ports table. That’s a good idea. I looked at the code. Here is the diff I came up with:
[dan@dev-ingress01:~/modules] $ diff -ruN ~/modules/commit_log_ports.pm ~/tmp/commit_log_ports.pm --- /usr/home/dan/modules/commit_log_ports.pm 2018-05-06 13:08:45.368996000 +0000 +++ /usr/home/dan/tmp/commit_log_ports.pm 2018-08-20 14:41:41.374169000 +0000 @@ -51,7 +51,7 @@ if (!defined($this->{saved})) { # we are inserting $sql = "insert into commit_log_ports - (commit_log_id, port_id, needs_refresh, port_version, port_revision) values + (commit_log_id, port_id, needs_refresh, port_version, port_revision, port_name_revision) values ($this->{commit_log_id}, $this->{port_id}, $this->{needs_refresh}, $quoted_port_version, $quoted_port_revision)"; } else { @@ -60,7 +60,8 @@ set needs_refresh = $this->{needs_refresh}, port_version = $quoted_port_version, port_revision = $quoted_port_revision, - port_epoch = $quoted_port_epoch + port_epoch = $quoted_port_epoch, + port_name_revision = " . $portname . ' ' . $quoted_port_version where commit_log_id = $this->{commit_log_id} and port_id = $this->{port_id}"; } [dan@dev-ingress01:~/modules] $
That seems like a trivial change to the code which maintains the commit_log_ports table.
What is stopping me from taking this simple approach?
That $portname value on line 19 is not available in the code. It does exist, it is not obtainable at that point.
However, the value is available in the database. Let’s see if I can easily derive the right value. Let’s start with the first SQL query I had above. From that, I came up with:
SELECT P.package_name FROM ports P WHERE P.id = 13990;
The answer comes back in about 0.2 ms, so it is reasonably fast.
I could imbed that SQL as an nested query, but I would more likely create a stored procedure / function for that. Oddly enough, there is no existing function to obtain the package name based on the port id. It seems that in 18+ years of FreshPorts development, this has never been needed.
The new stored procedure
I call it a stored procedure but the PostgreSQL terminology is function. I think the SP term comes from wanting to distinguish between system-provided functions and user-supplied functions (such as mine).
Here is what I created:
CREATE OR REPLACE FUNCTION PackageName(integer) RETURNS TEXT AS $$ SELECT P.package_name FROM ports P WHERE P.id = $1 $$ LANGUAGE SQL STABLE;
The function has a volatility classification of STABLE because it is a simple query and does not modify anything. Repeated calls with the same parameters will yield the same results.
It was here that I started abandoning the idea of a second table for this particular data because the data could be stored without a secondary update via a trigger.
The resulting diff
When I modify the code to use the newly created function, the diff becomes:
[dan@dev-ingress01:~/tmp] $ diff -ruN ~/modules/commit_log_ports.pm ~/tmp/commit_log_ports.pm --- /usr/home/dan/modules/commit_log_ports.pm 2018-05-06 13:08:45.368996000 +0000 +++ /usr/home/dan/tmp/commit_log_ports.pm 2018-08-20 16:53:15.769520000 +0000 @@ -51,16 +51,17 @@ if (!defined($this->{saved})) { # we are inserting $sql = "insert into commit_log_ports - (commit_log_id, port_id, needs_refresh, port_version, port_revision) values + (commit_log_id, port_id, needs_refresh, port_version, port_revision, port_name_revision) values ($this->{commit_log_id}, $this->{port_id}, $this->{needs_refresh}, - $quoted_port_version, $quoted_port_revision)"; + $quoted_port_version, $quoted_port_revision, PackageName($this->{port_id}) || '-' || '$quoted_port_version')"; } else { # we are updating $sql = "update commit_log_ports set needs_refresh = $this->{needs_refresh}, port_version = $quoted_port_version, port_revision = $quoted_port_revision, - port_epoch = $quoted_port_epoch + port_epoch = $quoted_port_epoch, + port_name_revision = PackageName($this->{port_id}) || '-' || '$quoted_port_version' where commit_log_id = $this->{commit_log_id} and port_id = $this->{port_id}"; } [dan@dev-ingress01:~/tmp] $
The table modification
Here is the addition of the new column to the existing table:
ALTER TABLE commit_log_ports ADD COLUMN port_name_revision TEXT;
Ideally, I would add a NOT NULL clause to this declaration, but that cannot be done until after all rows have a value for this new column.
Let’s set that value:
freshports.dev=# UPDATE commit_log_ports SET port_name_revision = PackageName(port_id)|| '-' || port_version; UPDATE 923098
After that, I discovered 1891 ports without a value for package_name, which led to 53378 rows in the commit_log_ports without a value for port_name_revision. Those issues, while important to resolve, are completely out of scope for this blog post.
That also means the NOT NULL clause will not be pursued.
In addition, the commit_log_ports_revisions table is no longer required.
Query update
Now that we have the new field, let’s try querying it. It is fast, even with a sequential scan of over 900,000 rows. Here is the results on a query not-previously run:
freshports.dev=# explain analyse SELECT DISTINCT CLP.port_id, CLP.port_name_revision FROM commit_log_ports CLP WHERE CLP.port_name_revision ~ 'mariadb10[013-9]-client-*'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------- Unique (cost=19482.03..19482.67 rows=85 width=22) (actual time=665.321..665.345 rows=35 loops=1) -> Sort (cost=19482.03..19482.24 rows=85 width=22) (actual time=665.320..665.324 rows=39 loops=1) Sort Key: port_id, port_name_revision Sort Method: quicksort Memory: 28kB -> Gather (cost=1000.00..19479.31 rows=85 width=22) (actual time=474.123..666.755 rows=39 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on commit_log_ports clp (cost=0.00..18470.81 rows=35 width=22) (actual time=450.999..662.266 rows=13 loops=3) Filter: (port_name_revision ~ 'mariadb10[013-9]-client-*'::text) Rows Removed by Filter: 307687 Planning time: 0.586 ms Execution time: 666.853 ms (12 rows) freshports.dev=#
We have gone from over 2 seconds to about 2/3 of a second.
Let’s try an index
This will create a B-Tree index, which, accordingly to the Index Types documentation, can be used by the query optimizer “for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant and is anchored to the beginning of the string”. This is our exact use case.
freshports.dev=# CREATE INDEX commit_log_ports_port_name_revision ON commit_log_ports(port_name_revision); CREATE INDEX
Let’s try the new query:
freshports.dev=# explain analyse SELECT DISTINCT CLP.port_id, CLP.port_name_revision FROM commit_log_ports CLP WHERE CLP.port_name_revision ~ 'mariadb10[013-9]-client-*'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------- Unique (cost=19482.04..19482.68 rows=85 width=22) (actual time=664.463..664.479 rows=35 loops=1) -> Sort (cost=19482.04..19482.25 rows=85 width=22) (actual time=664.462..664.466 rows=39 loops=1) Sort Key: port_id, port_name_revision Sort Method: quicksort Memory: 28kB -> Gather (cost=1000.00..19479.32 rows=85 width=22) (actual time=398.829..665.728 rows=39 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on commit_log_ports clp (cost=0.00..18470.82 rows=35 width=22) (actual time=455.528..661.375 rows=13 loops=3) Filter: (port_name_revision ~ 'mariadb10[013-9]-client-*'::text) Rows Removed by Filter: 307687 Planning time: 0.486 ms Execution time: 665.818 ms (12 rows) freshports.dev=#
That’s pretty much the same time. This query does bring back a rather wide selection:
port_id | port_name_revision ---------+--------------------------- 35630 | mariadb100-client-10.0.14 35630 | mariadb100-client-10.0.16 35630 | mariadb100-client-10.0.21 35630 | mariadb100-client-10.0.22 35630 | mariadb100-client-10.0.23 35630 | mariadb100-client-10.0.25 35630 | mariadb100-client-10.0.26 35630 | mariadb100-client-10.0.27 35630 | mariadb100-client-10.0.29 35630 | mariadb100-client-10.0.30 35630 | mariadb100-client-10.0.31 35630 | mariadb100-client-10.0.36 38043 | mariadb101-client-10.1.11 38043 | mariadb101-client-10.1.13 38043 | mariadb101-client-10.1.14 38043 | mariadb101-client-10.1.16 38043 | mariadb101-client-10.1.17 38043 | mariadb101-client-10.1.19 38043 | mariadb101-client-10.1.21 38043 | mariadb101-client-10.1.22 38043 | mariadb101-client-10.1.23 38043 | mariadb101-client-10.1.30 38043 | mariadb101-client-10.1.33 38043 | mariadb101-client-10.1.35 40632 | mariadb101-client-10.1.21 40632 | mariadb101-client-10.1.22 40634 | mariadb100-client-10.0.29 40634 | mariadb100-client-10.0.30 42923 | mariadb100-client-10.0.31 42933 | mariadb101-client-10.1.23 46761 | mariadb101-client-10.1.30 48631 | mariadb101-client-10.1.33 48882 | mariadb103-client-10.3.7 49667 | mariadb100-client-10.0.36 49670 | mariadb101-client-10.1.35 (35 rows)
This is from a total of 471 rows out of 900,000 which match mariadb%.
The query is fast if you use LIKE:
freshports.dev=# explain analyse SELECT CLP.port_id, CLP.port_name_revision FROM commit_log_ports CLP WHERE CLP.port_name_revision LIKE 'bacula5-server%'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using commit_log_ports_port_name_revision on commit_log_ports clp (cost=0.42..8.45 rows=85 width=22) (actual time=0.026..0.040 rows=10 loops=1) Index Cond: ((port_name_revision >= 'bacula5-server'::text) AND (port_name_revision < 'bacula5-serves'::text)) Filter: (port_name_revision ~~ 'bacula5-server%'::text) Planning time: 0.178 ms Execution time: 0.060 ms (5 rows) freshports.dev=#
But that is not the type of query we need at present. We have already decided to match based on the expressions supplied within the CONFLICTS fields.
DANGER WILL ROBINSON!
NOTE TO SELF: the proposed port_name_revision field should also be updated when the underlying element.name field is updated. The same applies to the element_pathname table.