May 292021
 

Today at about 1:00 pm Philadelphia time, the website went offline. It was still there, it just could not answer any queries.

In this post:

  • Website and database server running in AWS
  • FreeBSD 12.2 on a
  • t3.medium EC2 instance
  • PostgreSQL 12.5 running on a db.t2.large RDS instance

Initial look

Looking at load, it was acceptable:

$ w
 7:09PM  up 43 days, 20:48, 1 user, load averages: 0.12, 0.26, 0.30
USER       TTY      FROM                                      LOGIN@  IDLE WHAT
ec2-user   pts/0    [redacted]                                7:09PM     - w

There were a lot of php-fpm: pool www (php-fpm) processes running.

Looking in the logs, I found:

[root@nginx01 /var/log]# tail -F php-fpm.log 
[29-May-2021 17:41:34] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 17 idle, and 224 total children
[29-May-2021 17:41:41] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 19 idle, and 234 total children
[29-May-2021 17:41:43] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 17 idle, and 235 total children
[29-May-2021 17:41:44] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 18 idle, and 238 total children
[29-May-2021 17:41:45] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 18 idle, and 240 total children
[29-May-2021 17:41:46] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 19 idle, and 242 total children
[29-May-2021 17:41:51] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 18 idle, and 249 total children
[29-May-2021 17:41:55] WARNING: [pool www] server reached pm.max_children setting (250), consider raising it
[29-May-2021 18:02:26] WARNING: [pool www] child 12962 exited with code 70 after 1759.560106 seconds from start
[29-May-2021 18:02:26] NOTICE: [pool www] child 22023 started
^C

Yeah, that’s a lot.

Looking at the RDS statistics, it was busy at about 98% load I think, I did not take a screen shot then. But here is the screen shot in hindsight, showing the fast build up of the number of sessions, from near zero to 250 and staying there for 2.5 hours

250 database sessions

250 database sessions

From this:

  • select * from elementGetCaseInsensitive() is suspect
  • this happened suddenly

The query

Here is the query:

   SELECT tmp.*, EP.pathname FROM (
      SELECT id,
             name::text,
             directory_file_flag::text,
             status::text,
             case when IsCategory(id) IS NULL THEN FALSE ELSE TRUE END,
             case when IsPort(    id) IS NULL THEN FALSE ELSE TRUE END
        FROM element
       WHERE id = (SELECT element_id
        FROM element_pathname
       WHERE lower(pathname) = lower($1))
     ) AS tmp JOIN element_pathname EP on tmp.id = EP.element_id;

And an explain analyse:

freshports.org=> explain analyse select * from elementGetCaseInsensitive('/ports/head/sysutils/nut');
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=29669.04..29685.59 rows=1 width=137) (actual time=6396.216..6396.221 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Seq Scan on element_pathname  (cost=0.00..29668.18 rows=5748 width=4) (actual time=4123.990..6395.702 rows=1 loops=1)
           Filter: (lower(pathname) = '/ports/head/sysutils/nut'::text)
           Rows Removed by Filter: 1149267
   ->  Index Scan using element_pkey on element  (cost=0.43..8.45 rows=1 width=23) (actual time=6395.745..6395.746 rows=1 loops=1)
         Index Cond: (id = $0)
   ->  Index Scan using element_pathname_element_id on element_pathname ep  (cost=0.43..8.45 rows=1 width=56) (actual time=0.008..0.009 rows=1 loops=1)
         Index Cond: (element_id = $0)
 Planning Time: 0.392 ms
 Execution Time: 6396.252 ms
(11 rows)

That’s terrible. Look at that sequential scan on over 1.1 million rows.

I know how to improve this: a new index.

The alteration

This is how I altered the table, first, the original definition:

freshports.org-> \d element_pathname
            Table "public.element_pathname"
   Column   |  Type   | Collation | Nullable | Default 
------------+---------+-----------+----------+---------
 element_id | integer |           | not null | 
 pathname   | text    |           | not null | 
Indexes:
    "element_pathname_pathname" UNIQUE, btree (pathname)
    "element_pathname_element_id" btree (element_id)
Foreign-key constraints:
    "element_pathname_element_id_fkey" FOREIGN KEY (element_id) REFERENCES element(id) ON DELETE CASCADE

freshports.org-> 

This is the index I added:

freshports.org=> begin;
BEGIN
freshports.org=> create index element_pathname_pathname_lc on element_pathname (lower(pathname));
CREATE INDEX
freshports.org=> commit;

The new explain analyze

Now that query runs quickly:

freshports.org=> explain analyse select * from elementGetCaseInsensitive('/ports/head/sysutils/nut');
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=10438.60..10455.15 rows=1 width=137) (actual time=0.292..0.297 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Bitmap Heap Scan on element_pathname  (cost=265.08..10437.74 rows=5746 width=4) (actual time=0.042..0.043 rows=1 loops=1)
           Recheck Cond: (lower(pathname) = '/ports/head/sysutils/nut'::text)
           Heap Blocks: exact=1
           ->  Bitmap Index Scan on element_pathname_pathname_lc  (cost=0.00..263.65 rows=5746 width=0) (actual time=0.037..0.038 rows=1 loops=1)
                 Index Cond: (lower(pathname) = '/ports/head/sysutils/nut'::text)
   ->  Index Scan using element_pkey on element  (cost=0.43..8.45 rows=1 width=23) (actual time=0.053..0.054 rows=1 loops=1)
         Index Cond: (id = $0)
   ->  Index Scan using element_pathname_element_id on element_pathname ep  (cost=0.43..8.45 rows=1 width=56) (actual time=0.007..0.008 rows=1 loops=1)
         Index Cond: (element_id = $0)
 Planning Time: 0.604 ms
 Execution Time: 0.332 ms
(13 rows)

freshports.org=> 

With that change, the website soon returned to its usually snappiness.

The stats show the results

This screen shows how the number of session is back to normal. see also that the query in question is now listed down under 0.01 seconds. Good.

sessions down near 0

sessions down near 0

I should have known to add this index with that new function was created. I don’t see an explain analyze in that post.

Mar 092021
 

I had noticed a problem with my full text search solution when I was testing. If the input query contained a quote, an error resulted on the query. At the time, I attributed that to a poor input/text handling. Later, I realized I was wrong.

On the website, this PHP code is used:

$result = pg_query_params($this->dbh, $sql, array($this->Query));

Where $this->Query is what the user supplied on the search form. The pg_query_params documentation says

The primary advantage of pg_query_params() over pg_query() is that parameter values may be separated from the query string, thus avoiding the need for tedious and error-prone quoting and escaping.

So I don’t need to do anything special.

Here is some testing I tried while talking this over via IRC:

freshports.dev=# SELECT to_tsquery('english', $$Joe's stuff$$);
ERROR:  syntax error in tsquery: "Joe's stuff"

freshports.dev=# SELECT to_tsquery('english', $$Joes stuff$$);
ERROR:  syntax error in tsquery: "Joes stuff"

freshports.dev=# SELECT to_tsquery('english', $$Joe stuff$$);
ERROR:  syntax error in tsquery: "Joe stuff"

freshports.dev=# SELECT to_tsquery('english', $$Joestuff$$);
 to_tsquery 
------------
 'joestuff'
(1 row)

It turns out the solution is to use websearch_to_tsquery() instead. The documentation mentions:

Moreover, this function should never raise syntax errors, which makes it possible to use raw user-supplied input for search.

That sounds ideal, and it was. It completely fixes the issue.

freshports.dev=# SELECT websearch_to_tsquery('english', $$Joe's stuff$$);
 websearch_to_tsquery 
----------------------
 'joe' & 'stuff'
(1 row)

freshports.dev=# SELECT websearch_to_tsquery('english', $$Joes stuff$$);
 websearch_to_tsquery 
----------------------
 'joe' & 'stuff'
(1 row)

freshports.dev=# SELECT websearch_to_tsquery('english', $$Joe stuff$$);
 websearch_to_tsquery 
----------------------
 'joe' & 'stuff'
(1 row)

Great. Hope this helps.

Mar 092021
 

In yesterdays blog post, Replacing a column search with a full text search solution, I wanted to mention how adding a LIMIT 10 to a fast running query made it run slower. I didn’t remember that issue until this morning. Fortunately, the fix was low cost.

The query

This is the query I ran:

EXPLAIN ANALYSE
WITH short_list AS (
    SELECT
        port_id
    FROM
        generate_plist
    WHERE
        textsearchable_index_col @@ to_tsquery('share/examples/acme.sh/dnsapi/dns_1984hosting.sh')
)
  select CL.commit_date - SystemTimeAdjust() AS last_commit_date, 
         P.id, 
         E.name as port,
         C.name as category, 
...
         P.uses  
       FROM short_list, ports P
       LEFT OUTER JOIN ports_vulnerable    PV  ON PV.port_id       = P.id
       LEFT OUTER JOIN commit_log          CL  ON P.last_commit_id = CL.id,
       element_pathname EP,
       categories C, element E 
 WHERE P.id = short_list.port_id
   AND P.element_id = EP.element_id and EP.pathname like '/ports/head/%'
   AND P.category_id  = C.id
   AND P.element_id   = E.id  AND E.status = 'A';

The plan (as found at https://explain.depesz.com/s/Bkkg) was fast enough for me.

Adding LIMIT 10

This was used on the FreshPorts search page (not in production, but on the dev website). It needs a limit, so I added LIMIT 10.

The plan changed to a sequential scan of a large table. Boom, 28 seconds. That is unsatisfactory. The plan is at https://explain.depesz.com/s/oJZB.

The Fix

The fix, as pointed out by Thomas Hurst in his reply to my tweet centered around a CTE (Common Table Expressions). That initially confused me, because the query was already using a CTE. That’s the WITH short_list AS part of the query.

Then I read the AS MATERIALIZED part of his reply and started reading up on that. I changed the query to:

EXPLAIN ANALYSE
WITH short_list AS MATERIALIZED (
    SELECT
        port_id
    FROM
        generate_plist
    WHERE
        textsearchable_index_col @@ to_tsquery('share/examples/acme.sh/dnsapi/dns_1984hosting.sh')
)
  select CL.commit_date - SystemTimeAdjust() AS last_commit_date, 
         P.id, 
         E.name as port,
         C.name as category, 
...
         P.uses  
       FROM short_list, ports P
       LEFT OUTER JOIN ports_vulnerable    PV  ON PV.port_id       = P.id
       LEFT OUTER JOIN commit_log          CL  ON P.last_commit_id = CL.id,
       element_pathname EP,
       categories C, element E 
 WHERE P.id = short_list.port_id
   AND P.element_id = EP.element_id and EP.pathname like '/ports/head/%'
   AND P.category_id  = C.id
   AND P.element_id   = E.id  AND E.status = 'A'
LIMIT 10;

You can see the change on line 2: AS MATERIALIZED.

The plan (as found at https://explain.depesz.com/s/UBps) takes 1.4s and is back to using an index scan.

Further reading:

Mar 082021
 

One of the many useful features on FreshPorts is: what port[s] install[s] this file? That’s the pkg-plist search option. pkg-plist is a file which “lists all the files installed by the port”. However not all ports have a pkg-plist file because the list is compiled automatically. That is why the configure_plist table was created to replace the ports.pkg_plist column. The creation of that table broke the search function because it was then working on outdated data.

Last week, I blogged about how I created a new a new stored procedure for pulling back the pkg-plist information.

The new table

By new, I mean new to this search solution. The table looks like this:

freshports.dev=# \d generate_plist
                                Table "public.generate_plist"
     Column     |  Type   | Collation | Nullable |                  Default                   
----------------+---------+-----------+----------+--------------------------------------------
 id             | bigint  |           | not null | nextval('generate_plist_id_seq'::regclass)
 port_id        | integer |           | not null | 
 installed_file | text    |           | not null | 
Indexes:
    "generate_plist_installed_file_gin_idx" gin (to_tsvector('english'::regconfig, installed_file))
    "generate_plist_installed_file_idx" btree (installed_file)
    "generate_plist_pk" btree (id)
    "generate_plist_port_id_idx" btree (port_id)

freshports.dev=# 

For a given port, there will be N entries in this table, one for each item in the pkg-plist. You will notice some indexes on the installed_file column.

A simple search

With the old solution, we could search like this, but it took time:

AND P.pkg_plist ILIKE '%share/examples/acme.sh/dnsapi/dns_1984hosting.sh%'

Over 2.7 seconds in fact.

Let’s try looking in the generate_plist table instead. I was reading the Tables and Indexes section of the PostgreSQL Full Text Search and decided to start with this:

ALTER TABLE public.generate_plist
    ADD COLUMN textsearchable_index_col tsvector GENERATED ALWAYS AS (to_tsvector('english'::regconfig, installed_file)) STORED;

CREATE INDEX generate_plist_textsearch_idx
    ON public.generate_plist USING gin
    (textsearchable_index_col)
    TABLESPACE pg_default;

With that, the following search was pretty damn nice.

freshports.dev=# explain analyse    SELECT
        port_id
    FROM
        generate_plist
    WHERE
        textsearchable_index_col @@ to_tsquery('share/examples/acme.sh/dnsapi/dns_1984hosting.sh')
;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on generate_plist  (cost=367.55..2143.46 rows=426 width=4) (actual time=2.116..2.123 rows=2 loops=1)
   Recheck Cond: (textsearchable_index_col @@ to_tsquery('share/examples/acme.sh/dnsapi/dns_1984hosting.sh'::text))
   Heap Blocks: exact=2
   ->  Bitmap Index Scan on generate_plist_textsearch_idx  (cost=0.00..367.44 rows=426 width=0) (actual time=2.095..2.095 rows=2 loops=1)
         Index Cond: (textsearchable_index_col @@ to_tsquery('share/examples/acme.sh/dnsapi/dns_1984hosting.sh'::text))
 Planning Time: 0.434 ms
 Execution Time: 2.208 ms
(7 rows)

But this was … umm, what?

freshports.dev=# SELECT
        port_id
    FROM
        generate_plist
    WHERE
        textsearchable_index_col @@ to_tsquery('dns_1984hosting.sh')
;
 port_id 
---------
(0 rows)

freshports.dev=# 

Nothing found? What?

Full text search is not like that

Full text search is not like that. My search on the full path name would have worked just fine with the existing index:

freshports.dev=# explain analyse SELECT
        port_id
    FROM
        generate_plist
    WHERE
        installed_file = 'share/examples/acme.sh/dnsapi/dns_1984hosting.sh';
                                                                    QUERY PLAN                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using generate_plist_installed_file_idx on generate_plist  (cost=0.56..28.66 rows=6 width=4) (actual time=0.046..0.052 rows=2 loops=1)
   Index Cond: (installed_file = 'share/examples/acme.sh/dnsapi/dns_1984hosting.sh'::text)
 Planning Time: 0.222 ms
 Execution Time: 0.088 ms
(4 rows)

freshports.dev=# 

In fact, it’s faster.

But that’s not what is needed. Let’s look at the use case.

Searching for file names

We want to search for the full path or the file name. Or just dns_1984hosting.sh or dns_1984hosting.

With help from RhodiumToad on IRC, tsearch is all about matching words (technically lexemes), not arbitrary substrings. We need to think in terms of lexemes. For example:

freshports.dev=# select ts_debug('english', 'foo/bar');
                           ts_debug                           
--------------------------------------------------------------
 (file,"File or path name",foo/bar,{simple},simple,{foo/bar})
(1 row)

freshports.dev=# 

So let’s split the file name into distinct words. Let’s convert the / to space and add a new column for that:

ALTER TABLE public.generate_plist
    ADD COLUMN textsearchable_index_col2 tsvector GENERATED ALWAYS AS (to_tsvector('english'::regconfig, translate(installed_file, '/'::text, ' '::text))) STORED;

CREATE INDEX generate_plist_textsearch_idx2
    ON public.generate_plist USING gin
    (textsearchable_index_col2)
    TABLESPACE pg_default;

Let’s try searching now:

freshports.dev=# EXPLAIN ANALYSE
WITH short_list AS (
    SELECT
        port_id, installed_file
    FROM
        generate_plist
    WHERE
     textsearchable_index_col2 @@ to_tsquery('dns_1984hosting.sh')
)
  select P.id, element_pathname(P.element_id), SL.installed_file
  FROM ports P, short_list SL
 WHERE P.id = SL.port_id;
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=171.64..25352.59 rows=2981 width=88) (actual time=1.337..1.506 rows=2 loops=1)
   ->  Bitmap Heap Scan on generate_plist  (cost=171.35..12102.68 rows=2981 width=56) (actual time=1.031..1.040 rows=2 loops=1)
         Recheck Cond: (textsearchable_index_col2 @@ to_tsquery('dns_1984hosting.sh'::text))
         Heap Blocks: exact=2
         ->  Bitmap Index Scan on generate_plist_textsearch_idx2  (cost=0.00..170.61 rows=2981 width=0) (actual time=1.015..1.015 rows=2 loops=1)
               Index Cond: (textsearchable_index_col2 @@ to_tsquery('dns_1984hosting.sh'::text))
   ->  Index Scan using ports_pkey on ports p  (cost=0.29..4.19 rows=1 width=8) (actual time=0.014..0.014 rows=1 loops=2)
         Index Cond: (id = generate_plist.port_id)
 Planning Time: 1.064 ms
 Execution Time: 1.596 ms
(10 rows)

freshports.dev=# 

Umm, that’s fast.

But if we try the file without the .sh we get nothing:

freshports.dev=# WITH short_list AS (
    SELECT
        port_id, installed_file
    FROM
        generate_plist
    WHERE
     textsearchable_index_col2 @@ to_tsquery('dns_1984hosting')
)
  select P.id, element_pathname(P.element_id), SL.installed_file
  FROM ports P, short_list SL
 WHERE P.id = SL.port_id;
 id | element_pathname | installed_file 
----+------------------+----------------
(0 rows)

freshports.dev=# 

Another column for that

Humor me here while I add another column:

ALTER TABLE public.generate_plist
    ADD COLUMN textsearchable_index_col3 tsvector GENERATED ALWAYS AS (to_tsvector('english'::regconfig, translate(installed_file, '/.'::text, '  '::text))) STORED;

CREATE INDEX generate_plist_textsearch_idx3
    ON public.generate_plist USING gin
    (textsearchable_index_col3)
    TABLESPACE pg_default;

This translates both / and . to a space. Now we have:

freshports.dev=# WITH short_list AS (
    SELECT
        port_id, installed_file
    FROM
        generate_plist
    WHERE
     textsearchable_index_col3 @@ to_tsquery('dns_1984hosting')
)
  select P.id, element_pathname(P.element_id), SL.installed_file
  FROM ports P, short_list SL
 WHERE P.id = SL.port_id;
  id   |            element_pathname             |                  installed_file                  
-------+-----------------------------------------+--------------------------------------------------
 59654 | /ports/branches/2020Q3/security/acme.sh | share/examples/acme.sh/dnsapi/dns_1984hosting.sh
 43508 | /ports/head/security/acme.sh            | share/examples/acme.sh/dnsapi/dns_1984hosting.sh
(2 rows)

freshports.dev=# 

Which is still lightning fast.

Doing them all at once

Creating a query which uses all three columns is still performant:

freshports.dev=# explain analyse WITH short_list AS (
    SELECT
        DISTINCT port_id, installed_file
    FROM
        generate_plist
    WHERE
        textsearchable_index_col  @@ to_tsquery('bacula')
     OR textsearchable_index_col2 @@ to_tsquery('bacula')
     OR textsearchable_index_col3 @@ to_tsquery('bacula')
)
  select P.id, element_pathname(P.element_id), SL.installed_file
  FROM ports P, short_list SL
 WHERE P.id = SL.port_id;
                                                                                                       QUERY PLAN                                                                                                       
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=50810.64..53335.38 rows=7790 width=88) (actual time=273.782..520.103 rows=1872 loops=1)
   Hash Cond: (sl.port_id = p.id)
   ->  Subquery Scan on sl  (cost=35602.05..35757.85 rows=7790 width=56) (actual time=14.561..16.078 rows=1872 loops=1)
         ->  HashAggregate  (cost=35602.05..35679.95 rows=7790 width=56) (actual time=14.558..15.495 rows=1872 loops=1)
               Group Key: generate_plist.port_id, generate_plist.installed_file
               ->  Bitmap Heap Scan on generate_plist  (cost=1705.16..35563.02 rows=7806 width=56) (actual time=11.540..12.771 rows=1872 loops=1)
                     Recheck Cond: ((textsearchable_index_col @@ to_tsquery('bacula'::text)) OR (textsearchable_index_col2 @@ to_tsquery('bacula'::text)) OR (textsearchable_index_col3 @@ to_tsquery('bacula'::text)))
                     Heap Blocks: exact=116
                     ->  BitmapOr  (cost=1705.16..1705.16 rows=7808 width=0) (actual time=11.489..11.490 rows=0 loops=1)
                           ->  Bitmap Index Scan on generate_plist_textsearch_idx  (cost=0.00..415.44 rows=426 width=0) (actual time=2.994..2.995 rows=36 loops=1)
                                 Index Cond: (textsearchable_index_col @@ to_tsquery('bacula'::text))
                           ->  Bitmap Index Scan on generate_plist_textsearch_idx2  (cost=0.00..170.61 rows=2981 width=0) (actual time=1.373..1.374 rows=1857 loops=1)
                                 Index Cond: (textsearchable_index_col2 @@ to_tsquery('bacula'::text))
                           ->  Bitmap Index Scan on generate_plist_textsearch_idx3  (cost=0.00..1113.26 rows=4401 width=0) (actual time=7.117..7.117 rows=1872 loops=1)
                                 Index Cond: (textsearchable_index_col3 @@ to_tsquery('bacula'::text))
   ->  Hash  (cost=14174.04..14174.04 rows=63004 width=8) (actual time=258.282..258.283 rows=63065 loops=1)
         Buckets: 32768  Batches: 4  Memory Usage: 877kB
         ->  Seq Scan on ports p  (cost=0.00..14174.04 rows=63004 width=8) (actual time=0.089..219.485 rows=63065 loops=1)
 Planning Time: 1.227 ms
 Execution Time: 521.244 ms
(20 rows)

freshports.dev=# 

At what cost?

What does the table look like now:

freshports.dev=# \d generate_plist
                                                                          Table "public.generate_plist"
          Column           |   Type   | Collation | Nullable |                                                      Default                                                      
---------------------------+----------+-----------+----------+-------------------------------------------------------------------------------------------------------------------
 id                        | bigint   |           | not null | nextval('generate_plist_id_seq'::regclass)
 port_id                   | integer  |           | not null | 
 installed_file            | text     |           | not null | 
 textsearchable_index_col  | tsvector |           |          | generated always as (to_tsvector('english'::regconfig, installed_file)) stored
 textsearchable_index_col2 | tsvector |           |          | generated always as (to_tsvector('english'::regconfig, translate(installed_file, '/'::text, ' '::text))) stored
 textsearchable_index_col3 | tsvector |           |          | generated always as (to_tsvector('english'::regconfig, translate(installed_file, '/.'::text, '  '::text))) stored
Indexes:
    "generate_plist_installed_file_gin_idx" gin (to_tsvector('english'::regconfig, installed_file))
    "generate_plist_installed_file_idx" btree (installed_file)
    "generate_plist_pk" btree (id)
    "generate_plist_port_id_idx" btree (port_id)
    "generate_plist_textsearch_idx" gin (textsearchable_index_col)
    "generate_plist_textsearch_idx2" gin (textsearchable_index_col2)
    "generate_plist_textsearch_idx3" gin (textsearchable_index_col3)

How much space does this take:



freshports.dev=# select count(*) from generate_plist;
  count  
---------
 8478513
(1 row)


freshports.dev=# SELECT pg_table_size('generate_plist');
 pg_table_size 
---------------
    3578167296
(1 row)

freshports.dev=# select pg_total_relation_size('generate_plist');
 pg_total_relation_size 
------------------------
             6730162176
(1 row)

That’s 8.5 million rows taking up 3.3 GB but with indexes it’s 6.3 GB.

Disk is cheap. Time is expensive. I’d save the cost is worth the performance.

Mar 022021
 

In this post I will show you how I discovered that the backend was putting data into a new table (generate_plist) but the website was still using the ports.pkg_plist column. The result was stale data being shown on the website (for older ports) or no data at all (for newer ports).

How long? Since Oct 2017, because that’s when configure_plist showed up on the website. Nobody noticed until recently. I can only guess that I made the backend changes with intent to change the frontend later. I should have created myself a ticket.

Why was the change made? Primarily because all ports, especially Python based ports, have a pkg-plist file. These ports generate the list automatically at package-built time. Secondly, it also simplifies the search of pkg-plist.

Ironically, this bug also affects search, which is still using the old column.

Background

One set of data which FreshPorts shows is the list of files installed by a given package. This information can be provided via a file, called pkg-plist.

Here is one example for the comms/gnuradio port:

freshports.dev=# select pkg_plist from ports where id = 34724;
             pkg_plist              
------------------------------------
 include/libconcord.h              +
 lib/libconcord.a                  +
 lib/libconcord.so                 +
 lib/libconcord.so.4               +
 lib/libconcord.so.4.0.0           +
 share/mime/packages/libconcord.xml
(1 row)

freshports.dev=# 

NOTE: the above is one row of output. It is a collection of strings, separated by newlines.

If visit the website, you’ll see that list is wrong, because it’s outdated.

The generate-plist command is part of the ports infrastructure. FreshPorts gathers this data with this command and the resulting file.

$ make generate-plist
$ cat work/.PLIST.mktmp

A recent blog post goes into more detail. All that goes into the configure_plist table:

freshports.dev=# \d configure_plist
Did not find any relation named "configure_plist".
freshports.dev=# \d generate_plist
                                Table "public.generate_plist"
     Column     |  Type   | Collation | Nullable |                  Default                   
----------------+---------+-----------+----------+--------------------------------------------
 id             | bigint  |           | not null | nextval('generate_plist_id_seq'::regclass)
 port_id        | integer |           | not null | 
 installed_file | text    |           | not null | 
Indexes:
    "generate_plist_installed_file_gin_idx" gin (to_tsvector('english'::regconfig, installed_file))
    "generate_plist_installed_file_idx" btree (installed_file)
    "generate_plist_pk" btree (id)
    "generate_plist_port_id_idx" btree (port_id)

freshports.dev=# 

The original data

This is what the data looked like, as demonstrated by this query:

freshports.dev=# select pkg_plist, array_to_json(regexp_match(pkg_plist, 'lib/[[:alpha:]]*?\.so')) from ports where id = 34724;
             pkg_plist              |     array_to_json     
------------------------------------+-----------------------
 include/libconcord.h              +| ["lib/libconcord.so"]
 lib/libconcord.a                  +| 
 lib/libconcord.so                 +| 
 lib/libconcord.so.4               +| 
 lib/libconcord.so.4.0.0           +| 
 share/mime/packages/libconcord.xml | 
(1 row)

NOTE: This example user :alpha: but I later changed this to
:alnum: in my code.

NOTE: The above is one row.

The goal is to grab all the .so libraries.

The new layout

The new table, with refreshed data, looks like:

freshports.dev=# select * from generate_plist where port_id = 34724;
    id    | port_id |                       installed_file                       
----------+---------+------------------------------------------------------------
 27791797 |   34724 | @shared-mime-info share/mime
 27791798 |   34724 | /usr/local/share/licenses/libconcord-1.2_2/catalog.mk
 27791799 |   34724 | /usr/local/share/licenses/libconcord-1.2_2/LICENSE
 27791800 |   34724 | /usr/local/share/licenses/libconcord-1.2_2/GPLv2
 27791801 |   34724 | include/libconcord.h
 27791802 |   34724 | lib/libconcord.a
 27791803 |   34724 | lib/libconcord.so
 27791804 |   34724 | lib/libconcord.so.4
 27791805 |   34724 | lib/libconcord.so.4.0.0
 27791806 |   34724 | share/mime/packages/libconcord.xml
 27791807 |   34724 | @postexec /usr/sbin/service ldconfig restart > /dev/null
 27791808 |   34724 | @postunexec /usr/sbin/service ldconfig restart > /dev/null
(12 rows)

freshports.dev=# 

First go

NOTE: I have removed the array_to_json() call for now:

freshports.dev=# select installed_file, regexp_match(installed_file, 'lib/[[:alpha:]]*?\.so') 
from generate_plist where port_id = 34724;
                       installed_file                       |    regexp_match     
------------------------------------------------------------+---------------------
 @shared-mime-info share/mime                               | 
 /usr/local/share/licenses/libconcord-1.2_2/catalog.mk      | 
 /usr/local/share/licenses/libconcord-1.2_2/LICENSE         | 
 /usr/local/share/licenses/libconcord-1.2_2/GPLv2           | 
 include/libconcord.h                                       | 
 lib/libconcord.a                                           | 
 lib/libconcord.so                                          | {lib/libconcord.so}
 lib/libconcord.so.4                                        | {lib/libconcord.so}
 lib/libconcord.so.4.0.0                                    | {lib/libconcord.so}
 share/mime/packages/libconcord.xml                         | 
 @postexec /usr/sbin/service ldconfig restart > /dev/null   | 
 @postunexec /usr/sbin/service ldconfig restart > /dev/null | 
(12 rows)

That looks good so far. What I did not realize until later: each value of the regexp_match output is an array.

Let’s try this on a port with more libraries:

freshports.dev=# select distinct regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
   from generate_plist
  WHERE generate_plist.port_id = 9589 and regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so')  is not null;
        lib         
--------------------
 {lib/libhamlib.so}
 {lib/hamlibtcl.so}
 {lib/Hamlib.so}
(3 rows)

That still looks good.

Getting rid of duplicate values and ignoring null

I thought this was a simple way to get rid of the uninteresting values. I like using WITH. It allows you to take a very complex set of data and adjust it before performing additional operations upon it. What I’m doing here is not very complicated and I am sure it can be done without WITH.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 9589
)
select distinct lib from tmp where lib is not null;
        lib         
--------------------
 {lib/libhamlib.so}
 {lib/hamlibtcl.so}
 {lib/Hamlib.so}
(3 rows)

JSON time

In this section, there are three main changes.

Convert the results from an array to the first element of that array. We a dealing with a single file. We should get just one match.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 9589)
select distinct lib[1] from tmp where lib is not null;
       lib        
------------------
 lib/hamlibtcl.so
 lib/libhamlib.so
 lib/Hamlib.so
(3 rows)

Next, combine all the rows into one row, using array.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 9589)
select array(select distinct lib[1] from tmp where lib is not null);
                       array                       
---------------------------------------------------
 {lib/hamlibtcl.so,lib/libhamlib.so,lib/Hamlib.so}
(1 row)

The last step, change that to JSON output.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 9589)
select array_to_json(array(select distinct lib[1] from tmp where lib is not null));
                      array_to_json                      
---------------------------------------------------------
 ["lib/hamlibtcl.so","lib/libhamlib.so","lib/Hamlib.so"]
(1 row)

But…

I seems wasteful, or backward, to be converting to an array and then to JSON, but I think that’s the way it needs to be done for this data. I’m happy to have additional ideas if you have them.

EDIT: Magnus Hagander gave me the solution there. Use json_agg instead.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 14152)
select json_agg(distinct lib[1]) from tmp where lib is not null;
                                              json_agg                                               
-----------------------------------------------------------------------------------------------------
 ["lib/libfreeipmi.so", "lib/libipmiconsole.so", "lib/libipmidetect.so", "lib/libipmimonitoring.so"]
(1 row)

A more complex example

Let’s try sysutils/freeipmi which has more librarries:

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 14152)
select array_to_json(array(select distinct lib[1] from tmp where lib is not null));
                                          array_to_json                                           
--------------------------------------------------------------------------------------------------
 ["lib/libfreeipmi.so","lib/libipmidetect.so","lib/libipmiconsole.so","lib/libipmimonitoring.so"]
(1 row)

freshports.dev=# 

That seems to work.

Put it into a function

This is the function I created:

-- to convert the generate_plist table, multiple rows per port, into a
-- single row of JSON. Mostly used for Dependency line:
--
-- re https://github.com/FreshPorts/freshports/issues/216
--
CREATE OR REPLACE FUNCTION pkg_plist(bigint) returns json as $$
   WITH tmp AS (
       select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
         from generate_plist
        where generate_plist.port_id = $1
   )
   select array_to_json(array(select distinct lib[1] from tmp where lib is not null));
$$ LANGUAGE SQL STABLE;

The changed code

The original code was:

freshports.dev=# explain analyse
select pkg_plist, array_to_json(regexp_match(pkg_plist, 'lib/[[:alpha:]]*?\.so')) from ports where id = 34724;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Index Scan using ports_pkey on ports  (cost=0.29..8.31 rows=1 width=190) (actual time=0.092..0.096 rows=1 loops=1)
   Index Cond: (id = 34724)
 Planning Time: 0.305 ms
 Execution Time: 0.138 ms
(4 rows)
freshports.dev=# 

The new code is:

freshports.dev=# explain analyse
freshports.dev-# select pkg_plist, pkg_plist(id) from ports where id = 34724;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Index Scan using ports_pkey on ports  (cost=0.29..8.56 rows=1 width=190) (actual time=0.323..0.324 rows=1 loops=1)
   Index Cond: (id = 34724)
 Planning Time: 0.229 ms
 Execution Time: 0.340 ms
(4 rows)

That is an acceptable difference.

Thanks.

Jan 312021
 

In this post I’ll talk about how the code intially took the easy way out and how recent changes have tightened that up. In short, the code would raise N messages when just a single notification is sufficient.

Today the full vuxml database was reloaded into FreshPorts. This affects approximately 3454 ports and the marking of 60,436 commits (or revisions) as vulnerable. When you look at a port page, you can see the skulls which indicate a known vuln beside the commits in question (look under Commit History). That makes up for 565,576 unique combinations of commits, ports, and vuxml entries.

After all that is processed (over about 21.5 hours on the current server), we need to flush the cache of for each item that changed (ports, daily summaries, etc).

How cache clearing works

When a port is updated, a trigger on the ports table adds an entry to the cache_clearing_ports. A long running daemon monitors the table waiting for a LISTEN notification. When it arrives, the daemon reads the table and starts processing the entries, clearing the cache.

This is simplified though. Updates occur in many places and each invokes its own trigger. However, for this post, we will deal with only one.

current cache_clearing_ports table

This is the current cache_clearing_ports table:

freshports.org=# \d cache_clearing_ports
                                        Table "public.cache_clearing_ports"
   Column   |            Type             | Collation | Nullable |                     Default                      
------------+-----------------------------+-----------+----------+--------------------------------------------------
 id         | integer                     |           | not null | nextval('cache_clearing_ports_id_seq'::regclass)
 port_id    | integer                     |           | not null | 
 category   | text                        |           | not null | 
 port       | text                        |           | not null | 
 date_added | timestamp without time zone |           | not null | now()
Indexes:
    "cache_clearing_ports_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "cache_clearing_ports_port_id_fkey" FOREIGN KEY (port_id) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE

freshports.org=# 

As you can see, duplicate entries are OK. However, over a single commit, which is how the vuxml data is processed, this can result in a large number of duplicates.

How many duplicates?

Checking here, we have just over 1 million entries.

freshports.org=# select count(*) from cache_clearing_ports;
  count  
---------
 1005407
(1 row)

How unique are they?

Only 3500 ports:

freshports.org=# select count(distinct port_id) from commit_log_ports_vuxml;
 count 
-------
  3454
(1 row)

I recall when I was writing this: duplicates are no big deal. The cache gets cleared twice. So what?

It takes time to read through the cache_clearing_ports. A long time, when you have about 1 million entries.

The trigger

This is the existing trigger:

CREATE OR REPLACE FUNCTION ports_clear_cache() RETURNS TRIGGER AS $$
   DECLARE
      l_port      text;
      l_category  text;
   BEGIN
      IF TG_OP = 'UPDATE' THEN
            SELECT category, name
              INTO l_category, l_port
              FROM ports_all
             WHERE id = NEW.id;

            INSERT INTO cache_clearing_ports (port_id, category, port)
                 VALUES (NEW.id, l_category, l_port);

            NOTIFY port_updated;
      END IF;

      -- when a port changes, add an entry to the cache clearing table
      RETURN NEW;
   END
$$ LANGUAGE 'plpgsql';

  DROP TRIGGER IF EXISTS ports_clear_cache ON ports;
CREATE TRIGGER ports_clear_cache
    AFTER UPDATE ON ports
    FOR EACH ROW
    EXECUTE PROCEDURE ports_clear_cache();

This is how I propose to change the table. First, I’m creating a new identical table to play with:

freshports.dev=# create table cache_clearing_ports_test (like cache_clearing_ports INCLUDING ALL);
CREATE TABLE

Just what I need. This creates a new table, including all the constraints etc, so I could play with it without affecting the ongoing database operation in my dev server.

This is the DDL I would run:

ALTER TABLE public.cache_clearing_ports_test
    ADD CONSTRAINT cache_clearing_ports_test_port_id_idx UNIQUE (port_id);

COMMENT ON CONSTRAINT cache_clearing_ports_port_id_idx ON public.cache_clearing_ports_test
    IS 'Let''s store just one instance per port_id';

The DDL for the real table would be:

ALTER TABLE public.cache_clearing_ports
    ADD CONSTRAINT cache_clearing_ports_port_id_idx UNIQUE (port_id);

COMMENT ON CONSTRAINT cache_clearing_ports_port_id_idx ON public.cache_clearing_ports
    IS 'Let''s store just one instance per port_id';

This is the amended test table:

freshports.dev=# \d cache_clearing_ports_test
                                      Table "public.cache_clearing_ports_test"
   Column   |            Type             | Collation | Nullable |                     Default                      
------------+-----------------------------+-----------+----------+--------------------------------------------------
 id         | integer                     |           | not null | nextval('cache_clearing_ports_id_seq'::regclass)
 port_id    | integer                     |           | not null | 
 category   | text                        |           | not null | 
 port       | text                        |           | not null | 
 date_added | timestamp without time zone |           | not null | now()
Indexes:
    "cache_clearing_ports_test_pkey" PRIMARY KEY, btree (id)
    "cache_clearing_ports_port_id_idx" UNIQUE CONSTRAINT, btree (port_id)

freshports.dev=# 

Next step, fix the trigger, which will produce duplicate errors if left as it is.

The new trigger

The INSERT command has a conflict_target clause, which is very useful for this situation.

CREATE OR REPLACE FUNCTION ports_clear_cache() RETURNS TRIGGER AS $$
   DECLARE
      l_port      text;
      l_category  text;
   BEGIN
      IF TG_OP = 'UPDATE' THEN
            SELECT category, name
              INTO l_category, l_port
              FROM ports_all
             WHERE id = NEW.id;

            INSERT INTO cache_clearing_ports (port_id, category, port)
                 VALUES (NEW.id, l_category, l_port)
            ON CONFLICT ON CONSTRAINT cache_clearing_ports_port_id_idx DO NOTHING;

            NOTIFY port_updated;
      END IF;

      -- when a port changes, add an entry to the cache clearing table
      RETURN NEW;
   END
$$ LANGUAGE 'plpgsql';

What this says:

  • On the INSERT
  • IF there is a conflict on the unique port_id index
  • THEN do nothing

There are many other things you can do ON CONFLICT, but in my case, doing nothing is what I want.

But wait! There’s more!

I went one step farther. Why NOTIFY if we don’t have to.

CREATE OR REPLACE FUNCTION ports_clear_cache() RETURNS TRIGGER AS $$
   DECLARE
      l_port      text;
      l_category  text;
      l_row_count bigint;
   BEGIN
      IF TG_OP = 'UPDATE' THEN
            SELECT category, name
              INTO l_category, l_port
              FROM ports_all
             WHERE id = NEW.id;

            INSERT INTO cache_clearing_ports (port_id, category, port)
                 VALUES (NEW.id, l_category, l_port)
            ON CONFLICT ON CONSTRAINT cache_clearing_ports_port_id_idx DO NOTHING;

            GET DIAGNOSTICS l_row_count = ROW_COUNT;
            IF l_row_count > 0 THEN
                  NOTIFY port_updated;
            END IF;
      END IF;

      -- when a port changes, add an entry to the cache clearing table
      RETURN NEW;
   END
$$ LANGUAGE 'plpgsql';

In this solution, the code checks to see if anything was inserted. If so, it issues the NOTIFY.

Aug 112020
 

This idea came to me today from swills:

stupid question, would it be possible to make the urls on freshports case insensitive?

so that like https://www.freshports.org/x11/libx11 would go to https://www.freshports.org/x11/libX11

After confirming this was for manually typed URLs (as opposed to generated links), I started looking into it. This is what I found.

In this post, we’re going to look at some old PHP database code and the PostgreSQL stored procedures behind them. I’ll show you the multi-line stored procedure now in use and how a single query will now do the same thing. This change isn’t some clever coding or another solution. The database has evolved and the procedures did not have to. Until now.

Going from sysutils/py-mqttwarn to a database entry

I will base all my links to code on the 1.50.5 release.

If the Nginx webserver cannot follow the URL to a file on disk, it invokes www/–/index.php via this bit of declartion:

  location / {
    try_files $uri $uri/ @missing;
  }

  location @missing {
    rewrite ^ /--/index.php break;

    fastcgi_split_path_info ^(.+\.php)(/.+)$;
    fastcgi_pass unix:/var/run/php-fpm.sock;
    fastcgi_index index.php;
    fastcgi_param SCRIPT_FILENAME $request_filename;
    include fastcgi_params;
  }

Line 2 says: try finding the URL using files on disk and if you file, go do what @missing points to.

Line 6 say: invoke /–/index.php for things you don’t find in a file.

Why /–/index.php? I wanted to put this file somewhere that it would be highly unlikely to collide with anything in the FreeBSD repo (with regards to directories and file names).

The code checks for other special stuff ((badges, scripts, API, etc) before invoking freshports_Parse404URI() on line 148. I will discuss that soon, but first: if nothing else is picked up by /–/index.php, then it’s time to invoke rewrite/missing.php

freshports_Parse404URI() is declared in rewrite/functions.php and the code relevant to our journey is on line 70:

if ($ElementRecord->FetchByName('/ports/head/' . $pathname)) {

Ignoring that hardcoded pathname for now, this function gets passed something like this:

/ports/head/sysutils/py-mqttwarn

There might be more on the path: branch name, commit number, etc.

FetchByName() is located on line 38 of classes/element_record

Now we are getting into the oldest code of FreshPorts. I’m sure this stuff hasn’t changed in over 15 years.

The database function elementGet() is invoked which looks like this:

CREATE OR REPLACE FUNCTION elementGet (text) RETURNS SETOF element_type AS $$
   SELECT id,
          name::text,
          directory_file_flag::text,
          status::text,
          case when IsCategory(Pathname_ID($1)) IS NULL THEN FALSE ELSE TRUE END,
          case when IsPort(    Pathname_ID($1)) IS NULL THEN FALSE ELSE TRUE END,
          element_pathname(id)
     FROM element
    WHERE id = PathName_ID($1);
$$ LANGUAGE SQL STABLE;

The key in this function is PathName_ID() which is used three times and uniquely identifies a single row in the element table.

Element table

The element table is an implementation of a filesystem using a self-referential solution (i.e. it is recursive).

The table (without constraints, references, and triggers) looks like this:

freshports.dev=# \d element
                                     Table "public.element"
       Column        |     Type     | Collation | Nullable |               Default               
---------------------+--------------+-----------+----------+-------------------------------------
 id                  | integer      |           | not null | nextval('element_id_seq'::regclass)
 name                | text         |           | not null | 
 parent_id           | integer      |           |          | 
 directory_file_flag | character(1) |           | not null | 
 status              | character(1) |           | not null | 
Indexes:
    "element_pkey" PRIMARY KEY, btree (id)
    "element_delete" btree (status) WHERE status = 'D'::bpchar
    "element_name" btree (name)
    "element_parent_id" btree (parent_id)

An entry in the top level directory will have parent_id = NULL.

Pathname_ID function

Pathname_ID() is a function which progresses along a supplied pathname from left to right following the each entry from the element table.

CREATE OR REPLACE FUNCTION Pathname_ID(text) returns int4 AS $$

   DECLARE
      in_element_pathname ALIAS FOR $1;

      debug_str         text;
      element_id        int4;
      element_parent_id int4;
      element_name      text;
      element_pathname  text;
      position          int4;
      slash             text;

begin
  slash            := '/';
  element_pathname := in_element_pathname;
  element_id       := -1;

  IF substr(element_pathname,1,1) = slash THEN
    element_pathname := substr(element_pathname, 2);
  END IF;

  position := strpos(element_pathname, slash);
  IF position = 0 THEN
    element_name     := element_pathname;
    element_pathname := '';
  ELSE
    element_name     := substr(element_pathname, 1, position-1);
    element_pathname := substr(element_pathname, position+1);
  END IF;

  select id 
    into element_parent_id
    from element 
   where element.name = element_name
     and parent_id is null;

  debug_str := debug_str || '1 = ' || element_name;

  WHILE(char_length(element_pathname) > 0) LOOP
    position := strpos(element_pathname, slash);
    if position = 0 then
       element_name     := element_pathname;
       element_pathname := '';
    else
       element_name     := substr(element_pathname, 1, position-1);
       element_pathname := substr(element_pathname, position+1);
    end if;

    debug_str := debug_str || '1 = ' || element_name;
    select id 
      into element_parent_id
      from element 
     where element.name = element_name 
       and parent_id    = element_parent_id;

    debug_str := debug_str || ' 2 = ' || element_name;
    debug_str := debug_str || ' 3 = ' || element_parent_id;
  end loop;

  return element_parent_id;
END;
$$  LANGUAGE 'plpgsql' STABLE;

The key lines are 35 and 54. To meet the goal of a case insensitive match, those operators would need to be changed from = to a case insensitive comparison, such as:

where lower(element.name) = lower(element_name)

I do not want to alter this code which is so far down the tree. Instead, I will create a new function which will take a different approach. It will also take advantage of new database features which did not exist 8 years ago, let along 15+ years ago.

Testing new queries

I did some playing around with queries today, looking to find the element_id for a given port, in our example sysutils/py-mqttwarn.

I started off with this, show me things named x11:

freshports.stage=# select *, element_pathname(id) from element where lower(name) = lower('X11');
   id    | name | parent_id | directory_file_flag | status |                    element_pathname                    
---------+------+-----------+---------------------+--------+--------------------------------------------------------
   43426 | x11  |     33542 | F                   | A      | /ports/head/misc/kdeutils11/x11
   52955 | x11  |      4760 | F                   | A      | /ports/head/misc/kdeutils2/x11
   58887 | x11  |     50378 | F                   | A      | /ports/head/textproc/xml-i18n-tools/x11
   58962 | x11  |      3582 | F                   | A      | /ports/head/www/gtkhtml/x11
   83338 | x11  |     77344 | D                   | A      | /doc/head/en_US.ISO8859-1/books/handbook/x11
  104943 | x11  |    104814 | D                   | A      | /doc/head/fr_FR.ISO8859-1/books/handbook/x11
   99879 | x11  |     86040 | D                   | A      | /doc/head/ja_JP.eucJP/books/handbook/x11
  200645 | x11  |    200550 | D                   | A      | /doc/head/nl_NL.ISO8859-1/books/handbook/x11
  110618 | x11  |     77691 | D                   | A      | /doc/head/de_DE.ISO8859-1/books/handbook/x11
  112293 | x11  |    112148 | D                   | A      | /doc/head/it_IT.ISO8859-15/books/handbook/x11
  113146 | X11  |     77283 | D                   | A      | /base/head/release/scripts/X11
  200746 | x11  |    200655 | D                   | A      | /doc/head/zh_CN.GB2312/books/handbook/x11
  171234 | x11  |    110131 | D                   | A      | /doc/head/ru_RU.KOI8-R/books/handbook/x11
  271292 | x11  |    271177 | D                   | A      | /doc/head/mn_MN.UTF-8/books/handbook/x11
  200939 | x11  |    110120 | D                   | A      | /doc/head/es_ES.ISO8859-1/books/handbook/x11
  234867 | x11  |    234757 | D                   | A      | /doc/head/zh_TW.Big5/books/handbook/x11
  255132 | x11  |    255022 | D                   | A      | /doc/head/pl_PL.ISO8859-2/books/handbook/x11
  288692 | x11  |    288608 | D                   | A      | /doc/head/el_GR.ISO8859-7/books/handbook/x11
  300947 | x11  |    300813 | D                   | A      | /doc/head/hu_HU.ISO8859-2/books/handbook/x11
  433144 | x11  |    433041 | D                   | A      | /ru_RU.KOI8-R/books/handbook/x11
  433549 | x11  |    433544 | D                   | A      | /mn_MN.UTF-8/books/handbook/x11
  434309 | x11  |    433663 | D                   | A      | /ja_JP.eucJP/books/handbook/x11
  439412 | x11  |    432543 | D                   | A      | /en_US.ISO8859-1/books/handbook/x11
  439459 | x11  |    433298 | D                   | A      | /nl_NL.ISO8859-1/books/handbook/x11
  457198 | x11  |    111733 | D                   | A      | /doc/head/pt_BR.ISO8859-1/books/handbook/x11
     231 | x11  |    464087 | D                   | A      | /ports/head/x11
  465674 | x11  |    465603 | D                   | A      | /doc/translations/nl_NL.ISO8859-1/books/handbook/x11
  501886 | x11  |    465481 | D                   | A      | /doc/translations/en_US.ISO8859-1/books/handbook/x11
  537516 | x11  |    537439 | D                   | A      | /doc/head/zh_CN.UTF-8/books/handbook/x11
  663947 | x11  |    663946 | D                   | A      | /doc/release/10.2.0/en_US.ISO8859-1/books/handbook/x11
  587542 | x11  |    587471 | D                   | D      | /doc/head/zh_TW.UTF-8/books/handbook/x11
  745067 | x11  |    743628 | D                   | A      | /ports/branches/2016Q4/x11
  770464 | x11  |    763349 | D                   | A      | /ports/branches/2017Q1/x11
  792524 | x11  |    790218 | D                   | A      | /ports/branches/2017Q2/x11
  812565 | x11  |    812271 | D                   | A      | /ports/branches/2017Q3/x11
  829995 | x11  |    829562 | D                   | A      | /ports/branches/2017Q4/x11
  870434 | x11  |    867105 | D                   | A      | /ports/branches/2018Q2/x11
  882372 | x11  |    881009 | D                   | A      | /ports/branches/2018Q3/x11
  901665 | x11  |    901631 | D                   | A      | /ports/branches/2018Q4/x11
  914724 | x11  |    914354 | D                   | A      | /ports/branches/2019Q1/x11
  936002 | x11  |    935538 | D                   | A      | /ports/branches/2019Q2/x11
  954616 | x11  |    949697 | D                   | A      | /ports/branches/2019Q3/x11
  971674 | x11  |    961357 | D                   | A      | /ports/branches/2019Q4/x11
  978479 | x11  |    978429 | D                   | A      | /ports/branches/2020Q1/x11
 1014222 | x11  |   1012899 | D                   | A      | /ports/branches/2020Q2/x11
 1055981 | x11  |   1053714 | D                   | A      | /ports/branches/2020Q3/x11
(46 rows)

Based on line 29, show me a child of that element named libX11:

freshports.stage=# select *, element_pathname(id) from element where lower(name) = lower('libX11') and parent_id = 231;
   id   |  name  | parent_id | directory_file_flag | status |    element_pathname    
--------+--------+-----------+---------------------+--------+------------------------
 163623 | libX11 |       231 | D                   | A      | /ports/head/x11/libX11
(1 row)

freshports.stage=# 

That’s the one.

Using this approach, I came up with few queries, which I later abandoned for another solution.

The abandoned queries

I tried this join of four tables:

freshports.stage=# explain analyse
freshports.stage-# SELECT E4.id, element_pathname(E4.id)
freshports.stage-#   FROM element E1, element E2, element E3, element E4
freshports.stage-#  WHERE E1.parent_id IS NULL
freshports.stage-#    AND E2.parent_id = E1.id
freshports.stage-#    AND E3.parent_id = E2.id
freshports.stage-#    AND E4.parent_id = E3.id
freshports.stage-#    AND E2.name = 'head'
freshports.stage-#    AND E1.name = 'ports'
freshports.stage-#    AND lower(E3.name) = lower('sysutils')
freshports.stage-#    AND lower(E4.name) = lower('py-mqttwarn');
                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=20.05..74.95 rows=1 width=36) (actual time=1.833..2.299 rows=1 loops=1)
   ->  Nested Loop  (cost=19.62..67.62 rows=1 width=4) (actual time=0.126..0.138 rows=1 loops=1)
         ->  Nested Loop  (cost=19.19..27.23 rows=1 width=4) (actual time=0.061..0.063 rows=1 loops=1)
               ->  Bitmap Heap Scan on element e1  (cost=9.76..13.77 rows=1 width=4) (actual time=0.034..0.035 rows=1 loops=1)
                     Recheck Cond: ((name = 'ports'::text) AND (parent_id IS NULL))
                     Heap Blocks: exact=1
                     ->  BitmapAnd  (cost=9.76..9.76 rows=1 width=0) (actual time=0.030..0.030 rows=0 loops=1)
                           ->  Bitmap Index Scan on element_name  (cost=0.00..4.56 rows=17 width=0) (actual time=0.016..0.016 rows=38 loops=1)
                                 Index Cond: (name = 'ports'::text)
                           ->  Bitmap Index Scan on element_parent_id  (cost=0.00..4.95 rows=70 width=0) (actual time=0.010..0.010 rows=100 loops=1)
                                 Index Cond: (parent_id IS NULL)
               ->  Bitmap Heap Scan on element e2  (cost=9.43..13.45 rows=1 width=8) (actual time=0.023..0.024 rows=1 loops=1)
                     Recheck Cond: ((name = 'head'::text) AND (parent_id = e1.id))
                     Heap Blocks: exact=1
                     ->  BitmapAnd  (cost=9.43..9.43 rows=1 width=0) (actual time=0.020..0.020 rows=0 loops=1)
                           ->  Bitmap Index Scan on element_name  (cost=0.00..4.56 rows=17 width=0) (actual time=0.009..0.009 rows=15 loops=1)
                                 Index Cond: (name = 'head'::text)
                           ->  Bitmap Index Scan on element_parent_id  (cost=0.00..4.62 rows=26 width=0) (actual time=0.008..0.008 rows=5 loops=1)
                                 Index Cond: (parent_id = e1.id)
         ->  Index Scan using element_parent_id on element e3  (cost=0.43..40.38 rows=1 width=8) (actual time=0.064..0.074 rows=1 loops=1)
               Index Cond: (parent_id = e2.id)
               Filter: (lower(name) = 'sysutils'::text)
               Rows Removed by Filter: 88
   ->  Index Scan using element_parent_id on element e4  (cost=0.43..7.07 rows=1 width=8) (actual time=1.540..1.994 rows=1 loops=1)
         Index Cond: (parent_id = e3.id)
         Filter: (lower(name) = 'py-mqttwarn'::text)
         Rows Removed by Filter: 2380
 Planning Time: 0.689 ms
 Execution Time: 2.352 ms
(29 rows)

freshports.stage=# 

I tried rewriting with a JOIN JOIN approach, but that’s really just the previous query:

freshports.stage=# -- find a port which matches this
freshports.stage=# -- case insensitive
freshports.stage=# 
freshports.stage=# 
freshports.stage=# explain analyse
freshports.stage-# SELECT E4.id, element_pathname(E4.id)
freshports.stage-#   FROM element E1
freshports.stage-#   JOIN element E2 ON E1.parent_id IS NULL AND E1.name = 'ports'
freshports.stage-#                  AND E2.parent_id = E1.id AND E2.name = 'head'
freshports.stage-#   JOIN element E3 ON E3.parent_id = E2.id AND lower(E3.name) = lower('sysutils')
freshports.stage-#   JOIN element E4 ON E4.parent_id = E3.id AND lower(E4.name) = lower('bacula9-server');
                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=20.05..74.95 rows=1 width=36) (actual time=3.729..4.788 rows=1 loops=1)
   ->  Nested Loop  (cost=19.62..67.62 rows=1 width=4) (actual time=0.278..0.309 rows=1 loops=1)
         ->  Nested Loop  (cost=19.19..27.23 rows=1 width=4) (actual time=0.129..0.132 rows=1 loops=1)
               ->  Bitmap Heap Scan on element e1  (cost=9.76..13.77 rows=1 width=4) (actual time=0.072..0.073 rows=1 loops=1)
                     Recheck Cond: ((name = 'ports'::text) AND (parent_id IS NULL))
                     Heap Blocks: exact=1
                     ->  BitmapAnd  (cost=9.76..9.76 rows=1 width=0) (actual time=0.063..0.063 rows=0 loops=1)
                           ->  Bitmap Index Scan on element_name  (cost=0.00..4.56 rows=17 width=0) (actual time=0.032..0.033 rows=38 loops=1)
                                 Index Cond: (name = 'ports'::text)
                           ->  Bitmap Index Scan on element_parent_id  (cost=0.00..4.95 rows=70 width=0) (actual time=0.023..0.023 rows=100 loops=1)
                                 Index Cond: (parent_id IS NULL)
               ->  Bitmap Heap Scan on element e2  (cost=9.43..13.45 rows=1 width=8) (actual time=0.047..0.049 rows=1 loops=1)
                     Recheck Cond: ((name = 'head'::text) AND (parent_id = e1.id))
                     Heap Blocks: exact=1
                     ->  BitmapAnd  (cost=9.43..9.43 rows=1 width=0) (actual time=0.041..0.041 rows=0 loops=1)
                           ->  Bitmap Index Scan on element_name  (cost=0.00..4.56 rows=17 width=0) (actual time=0.019..0.019 rows=15 loops=1)
                                 Index Cond: (name = 'head'::text)
                           ->  Bitmap Index Scan on element_parent_id  (cost=0.00..4.62 rows=26 width=0) (actual time=0.016..0.016 rows=5 loops=1)
                                 Index Cond: (parent_id = e1.id)
         ->  Index Scan using element_parent_id on element e3  (cost=0.43..40.38 rows=1 width=8) (actual time=0.147..0.173 rows=1 loops=1)
               Index Cond: (parent_id = e2.id)
               Filter: (lower(name) = 'sysutils'::text)
               Rows Removed by Filter: 88
   ->  Index Scan using element_parent_id on element e4  (cost=0.43..7.07 rows=1 width=8) (actual time=3.097..4.125 rows=1 loops=1)
         Index Cond: (parent_id = e3.id)
         Filter: (lower(name) = 'bacula9-server'::text)
         Rows Removed by Filter: 2380
 Planning Time: 1.608 ms
 Execution Time: 4.898 ms
(29 rows)

The next query involves passing the parent id down to the next level:

freshports.stage=# -- find a port which matches this
freshports.stage=# -- case insensitive
freshports.stage=# 
freshports.stage=# explain analyse
freshports.stage-# SELECT E4.ID, element_pathname(E4.id)
freshports.stage-#   FROM element E4
freshports.stage-#  WHERE lower(E4.name) = lower('py37-mqttwarn') AND E4.parent_id = (
freshports.stage(# SELECT E3.ID
freshports.stage(#   FROM element E3
freshports.stage(#  WHERE lower(E3.name) = lower('sysutils') AND E3.parent_id = (
freshports.stage(# SELECT E2.ID
freshports.stage(#   FROM element E2
freshports.stage(#  WHERE E2.name = 'head' AND E2.parent_id = (
freshports.stage(# SELECT E1.id
freshports.stage(#   FROM element E1 
freshports.stage(#  WHERE E1.parent_id IS NULL AND E1.name = 'ports')));
                                                                        QUERY PLAN                                                                         
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using element_parent_id on element e4  (cost=68.93..110.05 rows=1 width=36) (actual time=1.902..1.902 rows=0 loops=1)
   Index Cond: (parent_id = $2)
   Filter: (lower(name) = 'py37-mqttwarn'::text)
   Rows Removed by Filter: 2381
   InitPlan 3 (returns $2)
     ->  Index Scan using element_parent_id on element e3  (cost=27.64..68.51 rows=1 width=4) (actual time=0.121..0.134 rows=1 loops=1)
           Index Cond: (parent_id = $1)
           Filter: (lower(name) = 'sysutils'::text)
           Rows Removed by Filter: 88
           InitPlan 2 (returns $1)
             ->  Bitmap Heap Scan on element e2  (cost=23.20..27.21 rows=1 width=4) (actual time=0.061..0.061 rows=1 loops=1)
                   Recheck Cond: ((name = 'head'::text) AND (parent_id = $0))
                   Heap Blocks: exact=1
                   InitPlan 1 (returns $0)
                     ->  Bitmap Heap Scan on element e1  (cost=9.76..13.77 rows=1 width=4) (actual time=0.032..0.032 rows=1 loops=1)
                           Recheck Cond: ((name = 'ports'::text) AND (parent_id IS NULL))
                           Heap Blocks: exact=1
                           ->  BitmapAnd  (cost=9.76..9.76 rows=1 width=0) (actual time=0.027..0.027 rows=0 loops=1)
                                 ->  Bitmap Index Scan on element_name  (cost=0.00..4.56 rows=17 width=0) (actual time=0.011..0.011 rows=38 loops=1)
                                       Index Cond: (name = 'ports'::text)
                                 ->  Bitmap Index Scan on element_parent_id  (cost=0.00..4.95 rows=70 width=0) (actual time=0.013..0.013 rows=100 loops=1)
                                       Index Cond: (parent_id IS NULL)
                   ->  BitmapAnd  (cost=9.43..9.43 rows=1 width=0) (actual time=0.058..0.058 rows=0 loops=1)
                         ->  Bitmap Index Scan on element_name  (cost=0.00..4.56 rows=17 width=0) (actual time=0.016..0.016 rows=15 loops=1)
                               Index Cond: (name = 'head'::text)
                         ->  Bitmap Index Scan on element_parent_id  (cost=0.00..4.62 rows=26 width=0) (actual time=0.040..0.040 rows=5 loops=1)
                               Index Cond: (parent_id = $0)
 Planning Time: 0.235 ms
 Execution Time: 1.945 ms

The chosen query

The chosen query uses a relatively new table: element_pathname

freshports.stage=# \d element_pathname
            Table "public.element_pathname"
   Column   |  Type   | Collation | Nullable | Default 
------------+---------+-----------+----------+---------
 element_id | integer |           | not null | 
 pathname   | text    |           | not null | 
Indexes:
    "element_pathname_pathname" UNIQUE, btree (pathname)
    "element_pathname_element_id" btree (element_id)
Foreign-key constraints:
    "element_pathname_element_id_fkey" FOREIGN KEY (element_id) REFERENCES element(id) ON DELETE CASCADE

freshports.stage=# 

This table is maintained via triggers on the element table.

The same query here is:

freshports.stage=# explain analyse select *
FROM element_pathname where lower(pathname) = '/ports/head/sysutils/py-mqttwarn';
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..19555.74 rows=5361 width=54) (actual time=158.979..160.785 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on element_pathname  (cost=0.00..18019.64 rows=2234 width=54) (actual time=145.503..156.289 rows=0 loops=3)
         Filter: (lower(pathname) = '/ports/head/sysutils/py-mqttwarn'::text)
         Rows Removed by Filter: 351214
 Planning Time: 0.483 ms
 Execution Time: 160.830 ms
(8 rows)

freshports.stage=#

Our query will always be on lower case, so let’s create an index on lower case. I am curious about the size of this index, and it added only 77MB to the database.

freshports.stage=# select pg_size_pretty(pg_indexes_size('element_pathname'));
 pg_size_pretty 
----------------
 107 MB
(1 row)

freshports.stage=# create index element_pathname_pathname_lc on element_pathname ((lower(pathname)));
CREATE INDEX
freshports.stage=# select pg_size_pretty(pg_indexes_size('element_pathname'));
 pg_size_pretty 
----------------
 184 MB
(1 row)

freshports.stage=# select count(*) from element_pathname;
  count  
---------
 1053643
(1 row)

freshports.stage=# 

Now let’s try that query again:

freshports.stage=# explain analyse select *                                                                          
FROM element_pathname where lower(pathname) = '/ports/head/sysutils/py-mqttwarn';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on element_pathname  (cost=241.38..9537.09 rows=5268 width=54) (actual time=0.133..0.133 rows=1 loops=1)
   Recheck Cond: (lower(pathname) = '/ports/head/sysutils/py-mqttwarn'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on element_pathname_pathname_lc  (cost=0.00..240.06 rows=5268 width=0) (actual time=0.108..0.108 rows=1 loops=1)
         Index Cond: (lower(pathname) = '/ports/head/sysutils/py-mqttwarn'::text)
 Planning Time: 0.087 ms
 Execution Time: 0.155 ms
(7 rows)

freshports.stage=# 

Now we’re using an index scan on that newly created lower case index.

Let’s do a few more queries to see how they go:

freshports.stage=# explain analyse select *
FROM element_pathname where lower(pathname) = '/ports/head/sysutils/bacula-server';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on element_pathname  (cost=241.38..9537.09 rows=5268 width=54) (actual time=0.050..0.051 rows=1 loops=1)
   Recheck Cond: (lower(pathname) = '/ports/head/sysutils/bacula-server'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on element_pathname_pathname_lc  (cost=0.00..240.06 rows=5268 width=0) (actual time=0.037..0.037 rows=1 loops=1)
         Index Cond: (lower(pathname) = '/ports/head/sysutils/bacula-server'::text)
 Planning Time: 0.073 ms
 Execution Time: 0.067 ms
(7 rows)
freshports.stage=# explain analyse select *
FROM element_pathname where lower(pathname) = '/ports/head/sysutils/bacula9-server';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on element_pathname  (cost=241.38..9537.09 rows=5268 width=54) (actual time=0.065..0.066 rows=1 loops=1)
   Recheck Cond: (lower(pathname) = '/ports/head/sysutils/bacula9-server'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on element_pathname_pathname_lc  (cost=0.00..240.06 rows=5268 width=0) (actual time=0.049..0.050 rows=1 loops=1)
         Index Cond: (lower(pathname) = '/ports/head/sysutils/bacula9-server'::text)
 Planning Time: 0.086 ms
 Execution Time: 0.088 ms
(7 rows)

freshports.stage=# 
freshports.stage=# explain analyse select *
FROM element_pathname where lower(pathname) = '/ports/head/lang/gcc10-devel';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on element_pathname  (cost=241.38..9537.09 rows=5268 width=54) (actual time=0.132..0.132 rows=1 loops=1)
   Recheck Cond: (lower(pathname) = '/ports/head/lang/gcc10-devel'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on element_pathname_pathname_lc  (cost=0.00..240.06 rows=5268 width=0) (actual time=0.116..0.117 rows=1 loops=1)
         Index Cond: (lower(pathname) = '/ports/head/lang/gcc10-devel'::text)
 Planning Time: 0.090 ms
 Execution Time: 0.155 ms
(7 rows)

freshports.stage=# 
freshports.stage=# explain analyse select *
FROM element_pathname where lower(pathname) = '/ports/head/lang/perl5.32';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on element_pathname  (cost=241.38..9537.09 rows=5268 width=54) (actual time=0.059..0.060 rows=1 loops=1)
   Recheck Cond: (lower(pathname) = '/ports/head/lang/perl5.32'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on element_pathname_pathname_lc  (cost=0.00..240.06 rows=5268 width=0) (actual time=0.044..0.044 rows=1 loops=1)
         Index Cond: (lower(pathname) = '/ports/head/lang/perl5.32'::text)
 Planning Time: 0.085 ms
 Execution Time: 0.081 ms
(7 rows)

Yes, I think this will do nicely.

Things change

There is code and tables in here which date back to late 1999 or early 2000. Since then, the database has grown from 15 tables in Nov 2001 to 80 tables today.

I’m not sure when this svg was created or how many tables it represents, but I know its a few years old now.

Database and code evolve over time and it is sometimes interesteing to see the paths they take.

Apr 162020
 

This post is another in the packages series. It documents how the data is transformed from the raw form (pacakges_raw) into normalized data (packages).

The Packages – how is this data stored? post may be useful reading.

The primary purpose of this blog post is documentation of the above mentioned transformation process. Along the way I found a performance enhancement owing to a deferred constraint. That’s why this post is tagged with PostgreSQL.

Some data background

When data is copied into the package_raw table, two columns are left empty:

  1. abi_id
  2. port_id

Those values can be determined based on columns which are supplied:

  1. abi
  2. package_origin

Here is some sample data:

freshports.dev=# SELECT * FROM packages_raw WHERE abi = 'FreeBSD:13:i386' AND package_set = 'latest' ORDER BY package_origin LIMIT 10;
   id    |         package_origin         |   package_name   | package_version |       abi       | abi_id | port_id | package_set 
---------+--------------------------------+------------------+-----------------+-----------------+--------+---------+-------------
 2749964 | accessibility/accerciser       | accerciser       | 3.22.0          | FreeBSD:13:i386 |        |         | latest
 2749124 | accessibility/at-spi2-atk      | at-spi2-atk      | 2.26.2          | FreeBSD:13:i386 |        |         | latest
 2749119 | accessibility/at-spi2-core     | at-spi2-core     | 2.28.0          | FreeBSD:13:i386 |        |         | latest
 2749003 | accessibility/atk              | atk              | 2.28.1          | FreeBSD:13:i386 |        |         | latest
 2748993 | accessibility/atkmm            | atkmm            | 2.24.2_4        | FreeBSD:13:i386 |        |         | latest
 2747796 | accessibility/caribou          | caribou          | 0.4.21_2        | FreeBSD:13:i386 |        |         | latest
 2760503 | accessibility/eflite           | eflite           | 0.4.1           | FreeBSD:13:i386 |        |         | latest
 2744976 | accessibility/kdeaccessibility | kdeaccessibility | 19.12.3         | FreeBSD:13:i386 |        |         | latest
 2756892 | accessibility/kmag             | kmag             | 19.12.3         | FreeBSD:13:i386 |        |         | latest
 2756793 | accessibility/kmousetool       | kmousetool       | 19.12.3_1       | FreeBSD:13:i386 |        |         | latest
(10 rows)

freshports.dev=# 

This date will eventually be copied into the packages table, but we need to populate the two empty columns first.

What should the data look like?

My first goal was to create a result set resembled the desired outcome.

        SELECT P.id,
               abi.name,
               abi.id AS abi_id,
               P.package_origin,
               PO.port_id,
               P.package_name,
               P.package_version,
               P.package_set
          FROM packages_raw P 
          JOIN abi on P.abi = abi.name
          JOIN ports_origin PO on PO.port_origin = P.package_origin
         WHERE P.abi         = 'FreeBSD:13:i386'
           AND P.package_set = 'latest'
      ORDER BY package_origin 
         LIMIT 10;
   id    |      name       | abi_id |         package_origin         | port_id |   package_name   | package_version | package_set 
---------+-----------------+--------+--------------------------------+---------+------------------+-----------------+-------------
 2749964 | FreeBSD:13:i386 |     10 | accessibility/accerciser       |   21760 | accerciser       | 3.22.0          | latest
 2749124 | FreeBSD:13:i386 |     10 | accessibility/at-spi2-atk      |   26871 | at-spi2-atk      | 2.26.2          | latest
 2749119 | FreeBSD:13:i386 |     10 | accessibility/at-spi2-core     |   26868 | at-spi2-core     | 2.28.0          | latest
 2749003 | FreeBSD:13:i386 |     10 | accessibility/atk              |   11601 | atk              | 2.28.1          | latest
 2748993 | FreeBSD:13:i386 |     10 | accessibility/atkmm            |   27818 | atkmm            | 2.24.2_4        | latest
 2747796 | FreeBSD:13:i386 |     10 | accessibility/caribou          |   35704 | caribou          | 0.4.21_2        | latest
 2760503 | FreeBSD:13:i386 |     10 | accessibility/eflite           |   22197 | eflite           | 0.4.1           | latest
 2744976 | FreeBSD:13:i386 |     10 | accessibility/kdeaccessibility |   11726 | kdeaccessibility | 19.12.3         | latest
 2756892 | FreeBSD:13:i386 |     10 | accessibility/kmag             |   31155 | kmag             | 19.12.3         | latest
 2756793 | FreeBSD:13:i386 |     10 | accessibility/kmousetool       |   31156 | kmousetool       | 19.12.3_1       | latest
(10 rows)

freshports.dev=# 

The ports_orgin is a new table created specifically for this query. Until now, there has been no need to obtain a port_id based upon its origin. This table will be updated via triggers upon the ports table. That bit of work has been deferred for now.

freshports.dev=# \d ports_origin
              Table "public.ports_origin"
   Column    |  Type   | Collation | Nullable | Default 
-------------+---------+-----------+----------+---------
 port_id     | integer |           | not null | 
 port_origin | text    |           | not null | 
Indexes:
    "port_origin_pkey" PRIMARY KEY, btree (port_id)
    "ports_origin_port_id_idx" btree (port_id)
    "ports_origin_port_origin_idx" btree (port_origin)
Foreign-key constraints:
    "ports_origin_port_id_fk" FOREIGN KEY (port_id) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE NOT VALID

The original test

The original test took 18 seconds to populate about 30,000 rows:

explain analyse
UPDATE packages_raw PR
   SET abi_id  = pkg.abi_id,
       port_id = pkg.port_id
  FROM (SELECT P.id,
               abi.name,
               abi.id AS abi_id,
               P.package_origin,
               PO.port_id,
               P.package_name,
               P.package_version,
               P.package_set
          FROM packages_raw P 
          JOIN abi on P.abi = abi.name
          JOIN ports_origin PO on PO.port_origin = P.package_origin
         WHERE P.abi        = 'FreeBSD:13:i386'
           AND P.package_set = 'latest') AS pkg
 WHERE PR.id = pkg.id;

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on packages_raw pr  (cost=11144.47..40827.87 rows=30175 width=115) (actual time=11081.844..11081.844 rows=0 loops=1)
   ->  Hash Join  (cost=11144.47..40827.87 rows=30175 width=115) (actual time=2880.504..4680.566 rows=38905 loops=1)
         Hash Cond: (p.package_origin = po.port_origin)
         ->  Nested Loop  (cost=9051.81..36627.53 rows=22050 width=113) (actual time=2659.534..3858.975 rows=30492 loops=1)
               ->  Seq Scan on abi  (cost=0.00..1.04 rows=1 width=42) (actual time=0.344..0.351 rows=1 loops=1)
                     Filter: (name = 'FreeBSD:13:i386'::text)
                     Rows Removed by Filter: 8
               ->  Hash Join  (cost=9051.81..36405.99 rows=22050 width=120) (actual time=2659.181..3844.350 rows=30492 loops=1)
                     Hash Cond: (pr.id = p.id)
                     ->  Seq Scan on packages_raw pr  (cost=0.00..12452.32 rows=524132 width=77) (actual time=0.370..2365.531 rows=459733 loops=1)
                     ->  Hash  (cost=8560.18..8560.18 rows=22050 width=51) (actual time=143.257..143.258 rows=30492 loops=1)
                           Buckets: 16384  Batches: 4  Memory Usage: 767kB
                           ->  Bitmap Heap Scan on packages_raw p  (cost=1018.44..8560.18 rows=22050 width=51) (actual time=86.264..117.885 rows=30492 loops=1)
                                 Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
                                 Heap Blocks: exact=416
                                 ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..1012.92 rows=22050 width=0) (actual time=85.918..85.918 rows=66670 loops=1)
                                       Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
         ->  Hash  (cost=974.07..974.07 rows=57807 width=30) (actual time=220.029..220.029 rows=57807 loops=1)
               Buckets: 16384  Batches: 8  Memory Usage: 576kB
               ->  Seq Scan on ports_origin po  (cost=0.00..974.07 rows=57807 width=30) (actual time=0.458..172.482 rows=57807 loops=1)
 Planning Time: 7.125 ms
 Trigger for constraint packages_raw_abi_id_fk: time=1151.252 calls=30474
 Trigger for constraint packages_raw_port_id_fk: time=6532.501 calls=30474
 Execution Time: 18782.602 ms

See https://explain.depesz.com/s/6d9Z for more detail.

I noticed all the time spent checking foreign keys while updating 30474 rows. I figured I could save perhaps most of that nearly 7 seconds spent on constraint.

I thought we could defer that. I changed the packages_raw_port_id_fk constraint to be DEFERRABLE INITIALLY DEFERRED. I also added ON UPDATE CASCADE and ON DELETE CASCADE which should have been there from the start (but that’s not relevant to the change in performance).

Test number 2

On the second try, we get:

Update on packages_raw pr  (cost=12937.84..38416.01 rows=20524 width=115) (actual time=12221.244..12221.244 rows=0 loops=1)
   ->  Hash Join  (cost=12937.84..38416.01 rows=20524 width=115) (actual time=2772.498..6044.554 rows=38905 loops=1)
         Hash Cond: (p.package_origin = po.port_origin)
         ->  Nested Loop  (cost=10845.18..34762.92 rows=14998 width=113) (actual time=2552.187..5468.500 rows=30492 loops=1)
               ->  Seq Scan on abi  (cost=0.00..1.04 rows=1 width=42) (actual time=0.335..0.338 rows=1 loops=1)
                     Filter: (name = 'FreeBSD:13:i386'::text)
                     Rows Removed by Filter: 8
               ->  Hash Join  (cost=10845.18..34611.91 rows=14998 width=120) (actual time=2551.846..5455.815 rows=30492 loops=1)
                     Hash Cond: (pr.id = p.id)
                     ->  Seq Scan on packages_raw pr  (cost=0.00..12835.28 rows=384928 width=77) (actual time=0.337..2901.118 rows=459733 loops=1)
                     ->  Hash  (cost=10510.70..10510.70 rows=14998 width=51) (actual time=171.243..171.244 rows=30492 loops=1)
                           Buckets: 16384 (originally 16384)  Batches: 4 (originally 2)  Memory Usage: 897kB
                           ->  Bitmap Heap Scan on packages_raw p  (cost=978.15..10510.70 rows=14998 width=51) (actual time=100.492..143.901 rows=30492 loops=1)
                                 Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
                                 Heap Blocks: exact=416
                                 ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..974.40 rows=14998 width=0) (actual time=100.172..100.173 rows=60966 loops=1)
                                       Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
         ->  Hash  (cost=974.07..974.07 rows=57807 width=30) (actual time=219.460..219.460 rows=57807 loops=1)
               Buckets: 16384  Batches: 8  Memory Usage: 576kB
               ->  Seq Scan on ports_origin po  (cost=0.00..974.07 rows=57807 width=30) (actual time=0.459..172.605 rows=57807 loops=1)
 Planning Time: 9.476 ms
 Trigger for constraint packages_raw_abi_id_fk: time=363.506 calls=30474
 Execution Time: 12589.008 ms

Roughly 5 seconds of savings. See https://explain.depesz.com/s/Y9gj.

OK, let’s do the same for abi_id.

3rd time

After setting packages_raw_abi_id_fk to DEFERRABLE INITIALLY DEFERRED, the 3rd test results were:

Update on packages_raw pr  (cost=12937.84..38416.01 rows=20524 width=115) (actual time=8798.389..8798.389 rows=0 loops=1)
   ->  Hash Join  (cost=12937.84..38416.01 rows=20524 width=115) (actual time=2841.848..4960.062 rows=38905 loops=1)
         Hash Cond: (p.package_origin = po.port_origin)
         ->  Nested Loop  (cost=10845.18..34762.92 rows=14998 width=113) (actual time=2620.392..3970.915 rows=30492 loops=1)
               ->  Seq Scan on abi  (cost=0.00..1.04 rows=1 width=42) (actual time=0.343..0.350 rows=1 loops=1)
                     Filter: (name = 'FreeBSD:13:i386'::text)
                     Rows Removed by Filter: 8
               ->  Hash Join  (cost=10845.18..34611.91 rows=14998 width=120) (actual time=2620.040..3956.973 rows=30492 loops=1)
                     Hash Cond: (pr.id = p.id)
                     ->  Seq Scan on packages_raw pr  (cost=0.00..12835.28 rows=384928 width=77) (actual time=0.340..2796.802 rows=459733 loops=1)
                     ->  Hash  (cost=10510.70..10510.70 rows=14998 width=51) (actual time=167.195..167.196 rows=30492 loops=1)
                           Buckets: 16384 (originally 16384)  Batches: 4 (originally 2)  Memory Usage: 897kB
                           ->  Bitmap Heap Scan on packages_raw p  (cost=978.15..10510.70 rows=14998 width=51) (actual time=110.897..140.914 rows=30492 loops=1)
                                 Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
                                 Heap Blocks: exact=416
                                 ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..974.40 rows=14998 width=0) (actual time=110.419..110.420 rows=91440 loops=1)
                                       Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
         ->  Hash  (cost=974.07..974.07 rows=57807 width=30) (actual time=220.662..220.662 rows=57807 loops=1)
               Buckets: 16384  Batches: 8  Memory Usage: 576kB
               ->  Seq Scan on ports_origin po  (cost=0.00..974.07 rows=57807 width=30) (actual time=0.446..173.184 rows=57807 loops=1)
 Planning Time: 4.560 ms
 Execution Time: 8800.805 ms

That’s nearly 10 seconds faster. Sure, there might be some caching involved but the time spent checking foreign keys is greatly reduced.

See https://explain.depesz.com/s/XXnj.

Next, I noticed the sequential scan on ports_origin. Let’s see about that.

Yeah, that self-join

The self-join was noticed by xocolatl as being unnecessary. Let’s remove that to get:

UPDATE packages_raw PR
   SET abi_id  = abi.id,
       port_id = PO.port_id
  FROM abi, ports_origin PO
 WHERE PR.abi         = 'FreeBSD:13:i386'
   AND PR.package_set = 'latest'
   AND PR.abi         = abi.name
   AND PO.port_origin = PR.package_origin;

                                                                                QUERY PLAN                                                                                
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on packages_raw pr  (cost=3081.65..12658.10 rows=21940 width=109) (actual time=6183.437..6183.437 rows=0 loops=1)
   ->  Hash Join  (cost=3081.65..12658.10 rows=21940 width=109) (actual time=418.091..845.851 rows=38905 loops=1)
         Hash Cond: (pr.package_origin = po.port_origin)
         ->  Nested Loop  (cost=989.00..9048.43 rows=16056 width=87) (actual time=174.079..268.378 rows=30492 loops=1)
               ->  Seq Scan on abi  (cost=0.00..1.04 rows=1 width=42) (actual time=0.030..0.036 rows=1 loops=1)
                     Filter: (name = 'FreeBSD:13:i386'::text)
                     Rows Removed by Filter: 8
               ->  Bitmap Heap Scan on packages_raw pr  (cost=989.00..8886.84 rows=16056 width=77) (actual time=174.032..256.085 rows=30492 loops=1)
                     Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
                     Heap Blocks: exact=416
                     ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..984.98 rows=16056 width=0) (actual time=173.458..173.458 rows=121903 loops=1)
                           Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
         ->  Hash  (cost=974.07..974.07 rows=57807 width=30) (actual time=243.062..243.062 rows=57807 loops=1)
               Buckets: 16384  Batches: 8  Memory Usage: 576kB
               ->  Seq Scan on ports_origin po  (cost=0.00..974.07 rows=57807 width=30) (actual time=0.451..195.454 rows=57807 loops=1)
 Planning Time: 2.922 ms
 Execution Time: 6186.449 ms
(17 rows)

Well, isn’t that better?

See https://explain.depesz.com/s/cnyw for details.

That sequential scan on ports_origin takes up 195ms. Nothing to worry about.

Things to follow up on

This query doesn’t find values for all the FreeBSD:13:i386 * latest combination. Some are not matched.

freshports.dev=# SELECT * FROM packages_raw WHERE port_id IS NULL AND abi = 'FreeBSD:13:i386' AND package_set = 'latest' ORDER BY package_origin;
   id    |            package_origin            |          package_name          | package_version |       abi       | abi_id | port_id | package_set 
---------+--------------------------------------+--------------------------------+-----------------+-----------------+--------+---------+-------------
 2745970 | databases/rubygem-peek-mysql2-rails5 | rubygem-peek-mysql2-rails5     | 1.2.0           | FreeBSD:13:i386 |        |         | latest
 2740242 | devel/p5-File-ShareDir-Tarball       | p5-File-ShareDir-Tarball       | 0.2.2           | FreeBSD:13:i386 |        |         | latest
 2741383 | devel/p5-MooX-Singleton              | p5-MooX-Singleton              | 1.20            | FreeBSD:13:i386 |        |         | latest
 2740542 | devel/p5-MooseX-MungeHas             | p5-MooseX-MungeHas             | 0.011           | FreeBSD:13:i386 |        |         | latest
 2733133 | devel/py-aspy.yaml                   | py27-aspy.yaml                 | 1.3.0           | FreeBSD:13:i386 |        |         | latest
 2741040 | devel/py-aspy.yaml                   | py37-aspy.yaml                 | 1.3.0           | FreeBSD:13:i386 |        |         | latest
 2743867 | devel/py-nodeenv                     | py27-nodeenv                   | 1.3.3           | FreeBSD:13:i386 |        |         | latest
 2744182 | devel/py-nodeenv                     | py37-nodeenv                   | 1.3.3           | FreeBSD:13:i386 |        |         | latest
 2746002 | devel/rubygem-peek-gc-rails5         | rubygem-peek-gc-rails5         | 0.0.2           | FreeBSD:13:i386 |        |         | latest
 2745932 | devel/rubygem-peek-rails5            | rubygem-peek-rails5            | 1.0.1           | FreeBSD:13:i386 |        |         | latest
 2745897 | devel/rubygem-peek-rblineprof-rails5 | rubygem-peek-rblineprof-rails5 | 0.2.0           | FreeBSD:13:i386 |        |         | latest
 2745876 | devel/rubygem-peek-redis-rails5      | rubygem-peek-redis-rails5      | 1.2.0           | FreeBSD:13:i386 |        |         | latest
 2749944 | security/picosha2                    | picosha2                       | 1.0.0.23        | FreeBSD:13:i386 |        |         | latest
 2749043 | sysutils/atf-rk3399                  | atf-rk3399                     | v2.1            | FreeBSD:13:i386 |        |         | latest
 2746664 | sysutils/py-leviathan                | py27-leviathan                 | 0.1.1           | FreeBSD:13:i386 |        |         | latest
 2746885 | sysutils/py-leviathan                | py37-leviathan                 | 0.1.1           | FreeBSD:13:i386 |        |         | latest
 2737873 | sysutils/swapusage                   | swapusage                      | 1.0.0           | FreeBSD:13:i386 |        |         | latest
 2744052 | www/rubygem-gon-rails5               | rubygem-gon-rails5             | 6.2.1           | FreeBSD:13:i386 |        |         | latest
(18 rows)

This is a development database and those ports do not exist on this host. It could also be an out-of-date ports_origin table, but I haven’t checked that yet. All of those ports were added to the FreeBSD repo on these dates:

  1. February 5 2019
  2. April 15 2019
  3. May 26-29 2019

They are present in the production database. Nothing to be done here.

However, the system will have to flag such entries for later followup.

Apr 142020
 

This is the latest in a series of posts dealing with displaying what packages are available for a given FreeBSD port.

The Packages – how is this data stored? post may help you follow along. It outlines the data and the tables used in this post.

Last night I managed to get the right data out for ports which have multiple flavors. I used a few procedures each of one calling another, eventually getting the results.

That effort has now been replaced with a single query.

In this post, we will be looking at the www/py-django-storages port, which has three flavors.

The query

This query came to me from RhodiumToad, who has a wide breadth of knowledge and is liberal with his help. What you see below has been adjusted since then, but it closely represents his delivery.

WITH 
  pkg AS (SELECT * FROM packages WHERE port_id = 28303)
SELECT abi.name AS abi, 
       pn.package_name,
       pkg.package_set,
       pkg.package_version 
  FROM abi 
       CROSS JOIN (SELECT DISTINCT package_name FROM pkg) pn
       LEFT JOIN pkg on (pkg.abi_id = abi.id AND pkg.package_name = pn.package_name)
 ORDER BY package_name, abi, package_set;

This blog post is mainly to document the above and go through it step by step. I’ll need this later because I know I won’t remember it. I also want to learn more about this query and apply it to future development.

I’m going to step through the query, outlining what each part does.

WITH pkg

A WITH query “provides a way to write auxiliary statements for use in a larger query”.

WITH 
  pkg AS (SELECT * FROM packages WHERE port_id = 28303)
SELECT abi.name AS abi, 
       pn.package_name,
       pkg.package_set,
       pkg.package_version 
  FROM abi 
       CROSS JOIN (SELECT DISTINCT package_name FROM pkg) pn
       LEFT JOIN pkg on (pkg.abi_id = abi.id AND pkg.package_name = pn.package_name)
 ORDER BY package_name, abi, package_set;

I think of it as a temporary table for this query. It is defining the data which will be used by the rest of the query.

I will refer to this later as the pkg table.

This is what we get:

freshports.dev=# SELECT * FROM packages WHERE port_id = 28303 ORDER BY package_name;
   id    | abi_id | port_id | package_version |     package_name     | branch_id | branch_name | package_set 
---------+--------+---------+-----------------+----------------------+-----------+-------------+-------------
 4373305 |      1 |   28303 | 1.9.1           | py27-django-storages |           |             | quarterly
 4369003 |      1 |   28303 | 1.9.1           | py27-django-storages |           |             | latest
 4340907 |      7 |   28303 | 1.9.1           | py27-django-storages |           |             | quarterly
 4392074 |     10 |   28303 | 1.9.1           | py27-django-storages |           |             | latest
 4388153 |     11 |   28303 | 1.9.1           | py27-django-storages |           |             | latest
 4346130 |      6 |   28303 | 1.9.1           | py27-django-storages |           |             | latest
 4382725 |      2 |   28303 | 1.8             | py27-django-storages |           |             | latest
 4349958 |      6 |   28303 | 1.9.1           | py27-django-storages |           |             | quarterly
 4353741 |      4 |   28303 | 1.9.1           | py27-django-storages |           |             | latest
 4380869 |      8 |   28303 | 1.9.1           | py27-django-storages |           |             | quarterly
 4376671 |      8 |   28303 | 1.9.1           | py27-django-storages |           |             | latest
 4357905 |      4 |   28303 | 1.9.1           | py27-django-storages |           |             | quarterly
 4339390 |      7 |   28303 | 1.5.1           | py27-django-storages |           |             | latest
 4361606 |      9 |   28303 | 1.5.1           | py27-django-storages |           |             | latest
 4363708 |      9 |   28303 | 1.9.1           | py27-django-storages |           |             | quarterly
 4339245 |      7 |   28303 | 1.5.1           | py36-django-storages |           |             | latest
 4361337 |      9 |   28303 | 1.5.1           | py36-django-storages |           |             | latest
 4369021 |      1 |   28303 | 1.9.1           | py37-django-storages |           |             | latest
 4373197 |      1 |   28303 | 1.9.1           | py37-django-storages |           |             | quarterly
 4365567 |      9 |   28303 | 1.9.1           | py37-django-storages |           |             | quarterly
 4357806 |      4 |   28303 | 1.9.1           | py37-django-storages |           |             | quarterly
 4376807 |      8 |   28303 | 1.9.1           | py37-django-storages |           |             | latest
 4380779 |      8 |   28303 | 1.9.1           | py37-django-storages |           |             | quarterly
 4353810 |      4 |   28303 | 1.9.1           | py37-django-storages |           |             | latest
 4349931 |      6 |   28303 | 1.9.1           | py37-django-storages |           |             | quarterly
 4383395 |      2 |   28303 | 1.8             | py37-django-storages |           |             | latest
 4388136 |     11 |   28303 | 1.9.1           | py37-django-storages |           |             | latest
 4346104 |      6 |   28303 | 1.9.1           | py37-django-storages |           |             | latest
 4392033 |     10 |   28303 | 1.9.1           | py37-django-storages |           |             | latest
 4342889 |      7 |   28303 | 1.9.1           | py37-django-storages |           |             | quarterly
(30 rows)

freshports.dev=# 

I sorted the data just to make it easier to follow.

The above represents all the packages available for www/py-django-storages from the FreeBSD repos.

How many different builds do we have for each package_name?

  SELECT package_name, 
         count(package_name) 
    FROM packages 
   WHERE port_id = 28303 
GROUP BY package_name 
ORDER BY package_name;

That query produces this result:

     package_name     | count 
----------------------+-------
 py27-django-storages |    15
 py36-django-storages |     2
 py37-django-storages |    13
(3 rows)

Those numbers will become relevant later.

SELECT DISTINCT

SELECT DISTINCT eliminates duplicate rows from the result. Don’t do what I initially did and use brackets around your first item. Consider SELECT DISTINCT to be a single keyword. It is not a function.

WITH 
  pkg AS (SELECT * FROM packages WHERE port_id = 28303)
SELECT abi.name AS abi, 
       pn.package_name,
       pkg.package_set,
       pkg.package_version 
  FROM abi 
       CROSS JOIN (SELECT DISTINCT package_name FROM pkg) pn
       LEFT JOIN pkg on (pkg.abi_id = abi.id AND pkg.package_name = pn.package_name)
 ORDER BY package_name, abi, package_set;

We need results for all packages on all ABI. To get there, we need a list of the packages first.

Combining that SELECT DISTINCT into the WITH clause, we get:

WITH pkg AS
  (SELECT * FROM packages WHERE port_id = 28303)
SELECT DISTINCT package_name FROM pkg;

     package_name     
----------------------
 py27-django-storages
 py36-django-storages
 py37-django-storages
(3 rows)

This is the pkg table.

All ABI * package combinations

We want to show people that a package is not available on a given ABI. We also want all tables to have the same number of rows so that scanning across the page compares like-to-like. For that, we want all combinations please.

abi CROSS JOIN (SELECT DISTINCT package_name FROM pkg) pn

Here is a list of all packages beside all ABI:

WITH pkg AS
  (SELECT * FROM packages WHERE port_id = 28303)
SELECT * FROM abi CROSS JOIN (SELECT DISTINCT package_name FROM pkg) pn

 id |        name        | active |     package_name     
----+--------------------+--------+----------------------
  1 | FreeBSD:12:amd64   | t      | py27-django-storages
  2 | FreeBSD:13:aarch64 | t      | py27-django-storages
  4 | FreeBSD:11:i386    | t      | py27-django-storages
  6 | FreeBSD:11:amd64   | t      | py27-django-storages
  7 | FreeBSD:11:aarch64 | t      | py27-django-storages
  8 | FreeBSD:12:i386    | t      | py27-django-storages
  9 | FreeBSD:12:aarch64 | t      | py27-django-storages
 10 | FreeBSD:13:i386    | t      | py27-django-storages
 11 | FreeBSD:13:amd64   | t      | py27-django-storages
  1 | FreeBSD:12:amd64   | t      | py36-django-storages
  2 | FreeBSD:13:aarch64 | t      | py36-django-storages
  4 | FreeBSD:11:i386    | t      | py36-django-storages
  6 | FreeBSD:11:amd64   | t      | py36-django-storages
  7 | FreeBSD:11:aarch64 | t      | py36-django-storages
  8 | FreeBSD:12:i386    | t      | py36-django-storages
  9 | FreeBSD:12:aarch64 | t      | py36-django-storages
 10 | FreeBSD:13:i386    | t      | py36-django-storages
 11 | FreeBSD:13:amd64   | t      | py36-django-storages
  1 | FreeBSD:12:amd64   | t      | py37-django-storages
  2 | FreeBSD:13:aarch64 | t      | py37-django-storages
  4 | FreeBSD:11:i386    | t      | py37-django-storages
  6 | FreeBSD:11:amd64   | t      | py37-django-storages
  7 | FreeBSD:11:aarch64 | t      | py37-django-storages
  8 | FreeBSD:12:i386    | t      | py37-django-storages
  9 | FreeBSD:12:aarch64 | t      | py37-django-storages
 10 | FreeBSD:13:i386    | t      | py37-django-storages
 11 | FreeBSD:13:amd64   | t      | py37-django-storages
(27 rows)

With 9 rows in the ABI table and 3 distinct package names we get 27 rows.

I will refer to this as the abi_x_package_name data.

LEFT JOIN

We have a list of all the

Now we do a LEFT JOIN. That gets us all the abi_x_package_name combinations we need with, or without, the corresponding package information.

LEFT JOIN pkg on (pkg.abi_id = abi.id AND pkg.package_name = pn.package_name)

This takes the entries from pkg and joins it with everything from the abi table.

This takes us to the entire query now:

WITH 
  pkg AS (SELECT * FROM packages WHERE port_id = 28303)
SELECT abi.name AS abi, 
       pn.package_name,
       pkg.package_set,
       pkg.package_version 
  FROM abi 
       CROSS JOIN (SELECT DISTINCT package_name FROM pkg) pn
       LEFT JOIN pkg on (pkg.abi_id = abi.id AND pkg.package_name = pn.package_name)
 ORDER BY package_name, abi, package_set;

        abi         | abi_id |     package_name     | package_set | package_version 
--------------------+--------+----------------------+-------------+-----------------
 FreeBSD:11:aarch64 |      7 | py27-django-storages | latest      | 1.5.1
 FreeBSD:11:aarch64 |      7 | py27-django-storages | quarterly   | 1.9.1
 FreeBSD:11:amd64   |      6 | py27-django-storages | latest      | 1.9.1
 FreeBSD:11:amd64   |      6 | py27-django-storages | quarterly   | 1.9.1
 FreeBSD:11:i386    |      4 | py27-django-storages | latest      | 1.9.1
 FreeBSD:11:i386    |      4 | py27-django-storages | quarterly   | 1.9.1
 FreeBSD:12:aarch64 |      9 | py27-django-storages | latest      | 1.5.1
 FreeBSD:12:aarch64 |      9 | py27-django-storages | quarterly   | 1.9.1
 FreeBSD:12:amd64   |      1 | py27-django-storages | latest      | 1.9.1
 FreeBSD:12:amd64   |      1 | py27-django-storages | quarterly   | 1.9.1
 FreeBSD:12:i386    |      8 | py27-django-storages | latest      | 1.9.1
 FreeBSD:12:i386    |      8 | py27-django-storages | quarterly   | 1.9.1
 FreeBSD:13:aarch64 |      2 | py27-django-storages | latest      | 1.8
 FreeBSD:13:amd64   |     11 | py27-django-storages | latest      | 1.9.1
 FreeBSD:13:i386    |     10 | py27-django-storages | latest      | 1.9.1
 FreeBSD:11:aarch64 |      7 | py36-django-storages | latest      | 1.5.1
 FreeBSD:11:amd64   |      6 | py36-django-storages |             | 
 FreeBSD:11:i386    |      4 | py36-django-storages |             | 
 FreeBSD:12:aarch64 |      9 | py36-django-storages | latest      | 1.5.1
 FreeBSD:12:amd64   |      1 | py36-django-storages |             | 
 FreeBSD:12:i386    |      8 | py36-django-storages |             | 
 FreeBSD:13:aarch64 |      2 | py36-django-storages |             | 
 FreeBSD:13:amd64   |     11 | py36-django-storages |             | 
 FreeBSD:13:i386    |     10 | py36-django-storages |             | 
 FreeBSD:11:aarch64 |      7 | py37-django-storages | quarterly   | 1.9.1
 FreeBSD:11:amd64   |      6 | py37-django-storages | latest      | 1.9.1
 FreeBSD:11:amd64   |      6 | py37-django-storages | quarterly   | 1.9.1
 FreeBSD:11:i386    |      4 | py37-django-storages | latest      | 1.9.1
 FreeBSD:11:i386    |      4 | py37-django-storages | quarterly   | 1.9.1
 FreeBSD:12:aarch64 |      9 | py37-django-storages | quarterly   | 1.9.1
 FreeBSD:12:amd64   |      1 | py37-django-storages | latest      | 1.9.1
 FreeBSD:12:amd64   |      1 | py37-django-storages | quarterly   | 1.9.1
 FreeBSD:12:i386    |      8 | py37-django-storages | latest      | 1.9.1
 FreeBSD:12:i386    |      8 | py37-django-storages | quarterly   | 1.9.1
 FreeBSD:13:aarch64 |      2 | py37-django-storages | latest      | 1.8
 FreeBSD:13:amd64   |     11 | py37-django-storages | latest      | 1.9.1
 FreeBSD:13:i386    |     10 | py37-django-storages | latest      | 1.9.1
(37 rows)

freshports.dev=#

From here, we start pivoting the data so we have latest and quarterly on the same line.

Pivot

By now, we have gone through the entire query. Next is the pivot.

WITH
  pkg AS (SELECT * FROM packages WHERE port_id = 28303)
SELECT abi.name AS abi,
       pn.package_name,
       max(pkg.package_version) FILTER (WHERE pkg.package_set = 'latest')
         AS package_version_latest,
       max(pkg.package_version) FILTER (WHERE pkg.package_set = 'quarterly')
         AS package_version_quarterly
  FROM abi
       CROSS JOIN (SELECT DISTINCT package_name FROM pkg) pn
       LEFT JOIN pkg ON (pkg.abi_id = abi.id AND pkg.package_name = pn.package_name)
 GROUP BY pn.package_name, abi.name
 ORDER BY pn.package_name, abi.name;

The filter controls which input rows are fed to the aggregate function max. I could have use min(), or any other similar function here because there is only one value which will match it. The GROUP BY clause is what controls this. For each package_name and abi.name combination, there will be 1 or two lines.

The output is the expected 27 lines of information:

        abi         |     package_name     | package_version_latest | package_version_quarterly 
--------------------+----------------------+------------------------+---------------------------
 FreeBSD:11:aarch64 | py27-django-storages | 1.5.1                  | 1.9.1
 FreeBSD:11:amd64   | py27-django-storages | 1.9.1                  | 1.9.1
 FreeBSD:11:i386    | py27-django-storages | 1.9.1                  | 1.9.1
 FreeBSD:12:aarch64 | py27-django-storages | 1.5.1                  | 1.9.1
 FreeBSD:12:amd64   | py27-django-storages | 1.9.1                  | 1.9.1
 FreeBSD:12:i386    | py27-django-storages | 1.9.1                  | 1.9.1
 FreeBSD:13:aarch64 | py27-django-storages | 1.8                    | 
 FreeBSD:13:amd64   | py27-django-storages | 1.9.1                  | 
 FreeBSD:13:i386    | py27-django-storages | 1.9.1                  | 
 FreeBSD:11:aarch64 | py36-django-storages | 1.5.1                  | 
 FreeBSD:11:amd64   | py36-django-storages |                        | 
 FreeBSD:11:i386    | py36-django-storages |                        | 
 FreeBSD:12:aarch64 | py36-django-storages | 1.5.1                  | 
 FreeBSD:12:amd64   | py36-django-storages |                        | 
 FreeBSD:12:i386    | py36-django-storages |                        | 
 FreeBSD:13:aarch64 | py36-django-storages |                        | 
 FreeBSD:13:amd64   | py36-django-storages |                        | 
 FreeBSD:13:i386    | py36-django-storages |                        | 
 FreeBSD:11:aarch64 | py37-django-storages |                        | 1.9.1
 FreeBSD:11:amd64   | py37-django-storages | 1.9.1                  | 1.9.1
 FreeBSD:11:i386    | py37-django-storages | 1.9.1                  | 1.9.1
 FreeBSD:12:aarch64 | py37-django-storages |                        | 1.9.1
 FreeBSD:12:amd64   | py37-django-storages | 1.9.1                  | 1.9.1
 FreeBSD:12:i386    | py37-django-storages | 1.9.1                  | 1.9.1
 FreeBSD:13:aarch64 | py37-django-storages | 1.8                    | 
 FreeBSD:13:amd64   | py37-django-storages | 1.9.1                  | 
 FreeBSD:13:i386    | py37-django-storages | 1.9.1                  | 
(27 rows)

Why 27? 3 package names, 9 ABI.

Much better

This is a much better approach than the procedures I used before. I like it.

Mar 292020
 

When I started this, I thought it was be straight forward and take only a few hours at most. It turned into multiple hours over two days.

This is part two of a two-part post. The first part dealt with changing a database column from time to timestamp.

In this post:

  1. PostgreSQL 12.2
  2. Nagios 3.5
  3. FreeBSD 12.1

The queries

The two queries I want to run are:

SELECT count(*) FROM cache_clearing_dates WHERE date_added > now() - interval '1 hour';
SELECT count(*) FROM cache_clearing_ports WHERE date_added > now() - interval '1 hour';

The script

Combining those two into a script, I came up with this, after a bit of trial and error:

#!/bin/sh

# This script checks the cache_clearing_ports and cache_clearing_dates in
# a freshports database

# arguments
# * server   - IP address or hostname of the PostgreSQL database server
# * database - database name
# * user     - database user
#
# port is always the default value: 5432
#
# The password must be in ~.pgpass for the Unix user which runs this script.
#
# For Nagios, that defaults to ~nagios/.pgpass
#
# See https://www.postgresql.org/docs/12/libpq-pgpass.html
#

#
# arguments used to create simple output of a single value
# and to never issue a password prompt
#
psql="/usr/local/bin/psql --no-align --tuples-only --no-password "

server=$1
database=$2
user=$3

# we return zero, all good, by default
result=0

# the two database queries we will run
query_dates="SELECT count(*) FROM cache_clearing_dates WHERE date_added < now() - interval '1 hour';"
query_ports="SELECT count(*) FROM cache_clearing_ports WHERE date_added < now() - interval '1 hour';"

# how is the dates table?

dates=`$psql -c "$query_ports" --host=$server $database $user`
if [ $? == 0 ] ; then
# uncomment for debugging
#  echo all ok with dates query
else
  # when we hit an error, we do not care about the rest of the checks
  echo error on dates query:$? $dates
  exit 2
fi

# how is the ports table?

ports=`$psql -c "$query_dates" --host=$server $database $user`
if [ $? == 0 ] ; then
# uncomment for debugging
#  echo all ok with ports query
else
  # when we hit an error, we don't care about the rest of the checks
  echo error on ports query: $? $ports
  exit 2
fi

# check the results

if [ "$ports" != "0" ] ; then
  echo There are $ports ports entries in the cache
  result=2
fi

if [ "$dates" != "0" ] ; then
  echo There are $dates dates entries in the cache
  result=2
fi

if [ $result == 0 ] ; then
  echo All OK
fi

exit $result

How is this script invoked?

I run this script via net-mgmt/nrpe3. In general, I run that tool on every host and jail. The line within the configuration file looks like this:

$ grep check_freshports_cache_clearing /usr/local/etc/nrpe.cfg
command[check_freshports_cache_clearing]=/usr/local/libexec/nagios-custom/check_freshports_cache_clearing pg03.int.example.org freshports.dev nagios

I store my personal Nagios scripts in a non-default location: /usr/local/libexec/nagios-custom

Usually, Nagios scripts are in /usr/local/libexec/nagios

Why? Mostly so I can keep them all in one place, and there is never any chance of file name collision.

Script parameters

Please note the following about these parameters:

  • very simple
  • rather inflexible
  • they all must be provided
  • they must appear in this order

The parameters are:

  1. pg03.int.example.org : server – IP address or hostname of the PostgreSQL database server
  2. freshports.dev : database – database name
  3. nagios : database user

Do not confuse the nagios database user with the nagios unix user. They are two separate things.

The nagios user above is part of the PostgreSQL database system. I will show you how I created that later.

The nagios unix user is created by the Nagios port. Installing Nagios or nrpe is outside the scope of this post.

Password

I am a big fan of never putting sensitive information on the command line, such as passwords. Even if nobody else is ever on your system, don’t do it.

The solution here is the .pgpass as documented by the PostgreSQL project.

The script is run by nrpe which runs as the nagios user.

$ ps auwwx | grep nrpe
nagios     85203  0.0  0.0 15812  6368  -  IsJ  01:54   0:01.48 /usr/local/sbin/nrpe3 -c /usr/local/etc/nrpe.cfg -d
dan        15654  0.0  0.0 11324  2648  3  S+J  18:10   0:00.00 grep nrpe

It is because the script runs as the nagios user that the .pgpass files resides at ~nagios/.pgpass. It must be chown nagios and chmod 0600. My file looks like this, with contents slightly modified for public display:

$ sudo cat ~nagios/.pgpass
# hostname:port:database:username:password
pg03.int.example.org:5432:freshports.dev:nagios:notmypassword

The file looks like this:

$ ls -l  ~nagios/.pgpass
-rw-------  1 nagios  nagios  111 Mar 28 19:30 /var/spool/nagios/.pgpass

Why /var/spool/nagios?

Because of this:

$ grep nagios /etc/passwd 
nagios:*:181:181:Nagios pseudo-user:/var/spool/nagios:/usr/sbin/nologin

That is why. That is the home directory for the nagios user as defined in the UIDs file of the FreeBSD ports tree.

Creating the database user & permissions

These are the commands I used to create the nagios database user and the permissions required to query the tables.

First, I create a new group, or role. The group has SELECT rights on the table.

CREATE ROLE reporting;
GRANT SELECT ON cache_clearing_dates TO GROUP reporting;
GRANT SELECT ON cache_clearing_ports TO GROUP reporting;

Here, I create the nagios database user and add them to the role I just created.

CREATE ROLE nagios WITH LOGIN PASSWORD 'notmypassword';
GRANT reporting TO nagios;

Of note:

  • WITH LOGIN – allows this user to login. A role having the LOGIN attribute can be thought of as a user.
  • PASSWORD ‘notmypassword’ – specifies the password for this user. It also appears in the .pgpassfile for the user which will be running this script.

Testing the script

When testing the script from the command line, I did it this way:

[dan@dev-ingress01:~] $ /usr/local/libexec/nagios-custom/check_freshports_cache_clearing pg03.int.example.org freshports.dev nagios
All OK

I ran this script as my regular user and I also created a ~/.pgpass file for myself, which looks just like the one for the nagios unix user.

Things to check

If you can’t get this to work, the things to check are:

  • pg_hba.conf file for the PostgreSQL database server
  • permissions on the database tables
  • permissions on the .pgpass files in question
  • make sure it runs fine from the command line before trying nrpe

Monitoring is key

Monitoring is key to early detection of problems. When something should always be in a given state, it should be monitored to ensure it is in that state. Having a monitoring system which will provide early detection of anomalies helps me fix problems early, rather than later.

I try to monitor everything. Connections to expected hosts. Services running. Plenty of disk space.

If you’re not monitoring yet, start slowly and build it up. That time will be well spent.

Hope this helps.