Dec 182022
 

One of the primary uses for FreshPorts is searching various characteristics of the ports. One highly used search feature is pkg-messages (background at pkg-message). A problem which has been vexing me for about 2 months was the lack of results on prod, but results which worked on dev. The issue was raised on GitHub. Initially, it was DDL related (adding a column to the ports table fixed that on test and stage). However, the problem persisted on prod.

Here is an example, which appears later when we get deeper into it:

SELECT DISTINCT id as port_id
  FROM ports
 WHERE pkgmessage_textsearchable  @@ websearch_to_tsquery($1) OR
       pkgmessage_textsearchable2 @@ websearch_to_tsquery($1)

On prod, nothing. On all other systems, we get a few hundred rows. Why?

The key difference between all these databases: the working databases were self-hosted (PostgreSQL 12.13 on FreeBSD 13.1); the problem database was hosted on RDS in AWS (PostgreSQL 12.11 on something…).

It was because of this difference that I feared RDS didn’t support what I was doing. I was wrong.

The problem was configuration. The problem wasn’t that it didn’t work on prod, but why did it work everywhere else?

I’ll give some background as the fields involved in the search, the queries, and how we fixed it. I learned something new yesterday and that’s what is prompting this blog post, so I don’t forget it.

In short, the column was defined with ‘english‘ as the regconfig, but the search query was using the database default value, which was ‘english‘ on some hosts and ‘simple‘ on the problem host.

If you want to skip to why, read So why did it work and not work?

Background

All three databases should contain the same data. That theory was tested by doing queries via ilike which would find the information in the raw data, when the full text search did not.

I first started using this full text search approach back in March 2021 (it was a similar approach to the problem issue, but not the precise one in question). That search was “which port installs this file” – which works without having the port installed. Perhaps your application is calling for that file and you don’t know what installs it. FreshPorts can find it for you. Details in Replacing a column search with a full text search solution.

After reviewing that post, it seems all of my tsvector solutions may require adjustment.

The table

There are two columns on the ports table, one which splits up path names, and now which does not. This table listing is greatly redacted to keep the information to a minimum.

freshports.devgit=# \d ports
                                                                                    Table "public.ports"
           Column           |           Type           | Collation | Nullable |                                                   Default                                                   
 id                         | integer                  |           | not null | nextval('ports_id_seq'::regclass)
 element_id                 | integer                  |           | not null | 
...
 pkgmessage                 | text                     |           |          | 
...
 pkgmessage_textsearchable  | tsvector                 |           |          | generated always as (to_tsvector('english'::regconfig, pkgmessage)) stored
 pkgmessage_textsearchable2 | tsvector                 |           |          | generated always as (to_tsvector('english'::regconfig, translate(pkgmessage, '/'::text, ' '::text))) stored
Indexes:
...
    "ports_pkgmessage_textsearchable2_idx" gin (pkgmessage_textsearchable2)
    "ports_pkgmessage_textsearchable_idx" gin (pkgmessage_textsearchable)
...

The original data is stored in pkgmessage and searches are preformed over the pkgmessage_textsearchable andpkgmessage_textsearchable2 columns.

The query

The main part of the query looks like this:

SELECT DISTINCT id as port_id
  FROM ports
 WHERE pkgmessage_textsearchable  @@ websearch_to_tsquery($1) OR
       pkgmessage_textsearchable2 @@ websearch_to_tsquery($1)

On the development, test, and staging hosts, we get results like this:

freshports.devgit=# SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage_textsearchable2  @@ websearch_to_tsquery('example');
 port_id |                           element_pathname                           
---------+----------------------------------------------------------------------
  100421 | /ports/branches/2022Q1/dns/dnsmasq
  100428 | /ports/branches/2022Q1/emulators/xsystem35
   14686 | /ports/head/sysutils/lmon
... etc

I am searching only over one column here, but that doesn’t affect the problem in question.

Running the same query on production yields no results:

freshports.org=> SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage_textsearchable2  @@ websearch_to_tsquery('example');
 port_id | element_pathname 
---------+------------------
(0 rows)


freshports.org=> 

Why? Let’s check the raw data.

freshports.org=> SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage ilike '%example%';
 port_id |                           element_pathname                           
---------+----------------------------------------------------------------------
   34126 | /ports/head/security/pond
   74559 | /ports/branches/2015Q3/emulators/linux_base-c6
   60310 | /ports/branches/2020Q4/www/gitlab-ce
   38345 | /ports/head/www/gitlab
... etc

Yes, there are items which match. Granted, that word might not stand on its own. So let’s check with leading and trailing spaces:

freshports.org=> SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage ilike '% example %';
 port_id |                element_pathname                
---------+------------------------------------------------
   60310 | /ports/branches/2020Q4/www/gitlab-ce
   38345 | /ports/head/www/gitlab
   71388 | /ports/branches/2022Q2/math/saga
   51088 | /ports/branches/2018Q4/emulators/qemu
   64192 | /ports/branches/main/editors/lazarus-qt5
   22159 | /ports/head/devel/boost_build
... etc

So, where to next?

Social media

I posted on social media and I received a few ideas via private messages.

My first idea was reindex table concurrently ports – that didn’t help.

Next, we checked Collate and Ctype settings. Both matched:

prod, has problem


                                     List of databases
      Name      |  Owner   | Encoding  |   Collate   |    Ctype    |   Access privileges   
----------------+----------+-----------+-------------+-------------+-----------------------
 freshports.org | postgres | SQL_ASCII | C           | C           | 


dev, no issues:


[pg02 dan ~] % psql -l
                                        List of databases
            Name             |    Owner     | Encoding  | Collate | Ctype |   Access privileges   
-----------------------------+--------------+-----------+---------+-------+-----------------------
 freshports.devgit           | postgres     | SQL_ASCII | C       | C     | 

All the same. Nothing there.

tsvector values

We started looking at tsvector next. Here are the values from both columns.

From Controlling Text Search:

to_tsvector parses a textual document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document. The document is processed according to the specified or default text search configuration.

The following is the results of invoking to_tsvector on the pkg-message file for security/pond as found above in one of the result sets.

It shows us that ‘example’ is the 27th word in the document. However, notice that it is ‘exampl`, without the trailing ‘e’.

freshports.org=> select pkgmessage_textsearchable from ports where id = 34126;
                                pkgmessage_textsearchable                                                                                                              
----------------------------------------------------------------------------------------------------------
 '/tmp':32 '1':10 '2':21 '3':33 '5':17 'client':6,36 'directori':25 'exampl':27 'follow':13,38 'instruct':40 'line':14
 'load':19 'loader.conf':16 'mount':22,28 'place':11 'pond':3,35,41 'requir':8 'run':34 'tmpfs':9,18,24,30,31 
 'use':2 'yes':20
(1 row)


freshports.org=> select pkgmessage_textsearchable2 from ports where id = 34126;
                              pkgmessage_textsearchable2                                                                                                             
---------------------------------------------------------------------------------------------------------
 '1':10 '2':21 '3':33 '5':17 'client':6,36 'directori':25 'exampl':27 'follow':13,38 'instruct':40 'line':14 
 'load':19 'loader.conf':16 'mount':22,28 'place':11 'pond':3,35,41 'requir':8 'run':34 'tmp':32 
 'tmpfs':9,18,24,30,31 'use':2 'yes':20
(1 row)


freshports.org=> 


In both columns (the collection of lexemes), we have the same 'exampl'.

Let's look at a database which runs this query with expected results (which have been adjusted to readable without scrolling).


freshports.devgit=# select pkgmessage_textsearchable2 from ports where id = 34126;
                           pkgmessage_textsearchable2                                                                                                             
---------------------------------------------------------------------------------------------------------
 '1':10 '2':21 '3':33 '5':17 'client':6,36 'directori':25 'exampl':27 'follow':13,38 'instruct':40 'line':14
 'load':19 'loader.conf':16 'mount':22,28 'place':11 'pond':3,35,41 'requir':8 'run':34 'tmp':32 
 'tmpfs':9,18,24,30,31 'use':2 'yes':20
(1 row)


freshports.devgit=# select pkgmessage_textsearchable from ports where id = 34126;
                             pkgmessage_textsearchable                                                                                                              
----------------------------------------------------------------------------------------------------------
 '/tmp':32 '1':10 '2':21 '3':33 '5':17 'client':6,36 'directori':25 'exampl':27 'follow':13,38 'instruct':40 'line':14
 'load':19 'loader.conf':16 'mount':22,28 'place':11 'pond':3,35,41 'requir':8 'run':34 'tmpfs':9,18,24,30,31 'use':2 
 'yes':20
(1 row)


freshports.devgit=# 

So look at that. I see exampl (without the trailing e) – now that’s interesting.

Let’s try searching for the shortened word.

freshports.org=> SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage_textsearchable2  @@ websearch_to_tsquery('exampl');
 port_id |                           element_pathname                           
---------+----------------------------------------------------------------------
   34126 | /ports/head/security/pond
   74559 | /ports/branches/2015Q3/emulators/linux_base-c6
   60310 | /ports/branches/2020Q4/www/gitlab-ce
   38345 | /ports/head/www/gitlab
   46842 | /ports/branches/2018Q1/mail/postfix-sasl
...

Hah! It’s finding that. Good. Progress. And not finished yet. Why does it work on one host, but not the other, given we have the same lexemes.

What’s the default_text_search_config setting on each host?

The configuration parameter default_text_search_config specifies the name of the default configuration, which is the one used by text search functions if an explicit configuration parameter is omitted. It can be set in postgresql.conf, or set for an individual session using the SET command. (taken directly from that documentation URL).

Also from the docs: “A text search configuration specifies all options necessary to transform a document into a tsvector: the parser to use to break text into tokens, and the dictionaries to use to transform each token into a lexeme.”

What is that setting on my hosts? On the problem host:

freshports.org=> show  default_text_search_config ;
 default_text_search_config 
----------------------------
 pg_catalog.simple

On the working hosts:

freshports.devgit=# show  default_text_search_config;
 default_text_search_config 
----------------------------
 pg_catalog.english
(1 row)

freshports.devgit=# 

There is a difference, and this turns out to be the cause of the problem.

However, as pointed out by Tom Lane’s reply to my post on the PostgreSQL General mailing list, that setting is not going to play well with my column definitions, both of with use this:

...to_tsvector('english'::regconfig...

Because ‘because “english” and “simple” will stem words differently.’ – OK, let’s look.

In short, the

Example tests

Here are the tests Tom provided:

regression=# select websearch_to_tsquery('english', 'example');
 websearch_to_tsquery 
----------------------
 'exampl'
(1 row)


regression=# select websearch_to_tsquery('simple', 'example');
 websearch_to_tsquery 
----------------------
 'example'
(1 row)

From that, it is clear that I should be using simple, not english.

Where are my settings set?

Settings can come from a configuration item or from the default value for the system.

The problem host:

freshports.org=> select setting, source from pg_settings where name = 'default_text_search_config';
      setting      | source  
-------------------+---------
 pg_catalog.simple | default
(1 row)

The working hosts:

freshports.devgit=# select setting, source from pg_settings where name = 'default_text_search_config';
      setting       |       source       
--------------------+--------------------
 pg_catalog.english | configuration file
(1 row)

Proof of concept

Let’s create a new column and index, in prod, and see how that goes.

Here we go, on the problem database, create a new field, based on simple, not english.

ALTER TABLE public.ports
    ADD COLUMN pkgmessage_textsearchable3 tsvector generated 
     always as (to_tsvector('simple'::regconfig, translate(pkgmessage, '/'::text, ' '::text))) stored;

Index it:

CREATE INDEX ports_pkgmessage_textsearchable3_idx
    ON public.ports USING gin
    (pkgmessage_textsearchable3)
    TABLESPACE pg_default;
CREATE INDEX

Try the problem query:

freshports.org=> SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage_textsearchable3  @@ websearch_to_tsquery('example');
 port_id |                           element_pathname                           
---------+----------------------------------------------------------------------
   34126 | /ports/head/security/pond
   74559 | /ports/branches/2015Q3/emulators/linux_base-c6
   60310 | /ports/branches/2020Q4/www/gitlab-ce
   38345 | /ports/head/www/gitlab
   46842 | /ports/branches/2018Q1/mail/postfix-sasl
   51532 | /ports/branches/2019Q1/graphics/drm-legacy-kmod
... etc

Success. Thank you Mr Lane.

Looking at the values

Let’s have a close look at the tsvector values (I have wrapped the output for readability):

freshports.org=> select pkgmessage_textsearchable3 from ports where id = 34126;
                                                                                                                                                            pkgmessage_textsearchable3                                                                                                                                                            
-------------------------------------------------------------------------------
 '1':10 '2':21 '3':33 '5':17 'a':5,23 'and':37 'as':4 'client':6,36 'directory':25 
 'example':27 'follow':38 'following':13 'for':26 'in':15 'instructions':40 'line':14 
 'load':19 'loader.conf':16 'mount':22,28 'place':11 'pond':3,35,41 'require':8 'run':34 
 't':29 'the':12,39 'tmp':32 'tmpfs':9,18,24,30,31 'to':1 'use':2 'yes':20 'you':7
(1 row)

So why did it work and not work?

On both hosts, the column was being indexed using ‘english’, which converted ‘example’ to the value ‘exampl’.

In my query, my search parameter was also being converted, using the configuration defined on that host.

In production, the search criteria would be converted like this:

freshports.org=> select websearch_to_tsquery('example');
 websearch_to_tsquery 
----------------------
 'example'
(1 row)

That is, no change, and no match to the tsvector lexemes.

On the working hosts, this happened:

freshports.devgit=# select websearch_to_tsquery('example');
 websearch_to_tsquery 
----------------------
 'exampl'
(1 row)

So, the same converted value, which matches the tsvector lexemes which means the search succeeds.

In short, the column was defined as ‘english’ and we were searching using ‘simple’.

Fixing the original query

I could fix the original query by supplying regconfig when invoking to_tsvector. Here is the original problem query, slightly altered:

freshports.org=> SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage_textsearchable2  @@ websearch_to_tsquery('english', 'example');
 port_id |                           element_pathname                           
---------+----------------------------------------------------------------------
   34126 | /ports/head/security/pond
   74559 | /ports/branches/2015Q3/emulators/linux_base-c6
   60310 | /ports/branches/2020Q4/www/gitlab-ce
   38345 | /ports/head/www/gitlab
   46842 | /ports/branches/2018Q1/mail/postfix-sasl
   56061 | /ports/head/games/homura
   51532 | /ports/branches/2019Q1/graphics/drm-legacy-kmod
...etc

The original query used websearch_to_tsquery(‘example’);. That is why it didn’t work on prod. Because the default regconfig (‘simple’) didn’t match the column (‘english'(.

This query worked on the other hosts, because the default regconfig (‘english’) matched the column (‘english’);

Why store the tsvector values?

I was asked: why do you have tsvector columns in the table? As opposed to just having an index over a tsvector of the actual text column.

I think it was a progression thing. I started following a tutorial or something.

Instead of having an index over the column pkgmessage_textsearchable, as I have now, I could create an index such as to_tsvector(‘english’::regconfig, pkgmessage).

Perhaps that is for future work.

Edit 2022-12-19 – I tried a proof-of-concept. It failed. Initially. Then I added a new index.

See https://explain.depesz.com/s/eb6t – I’ll add more details here as I complete this task later.

edit 2022-12-23 – the index conversion was documented in Moving from storing the tsvector values to a tsvector index.

What’s next

Next, I have to decide if I want to modify my queries to explicitly specify regconfig or rely upon the configuration setting. What are the pros and cons of each? Please let me know your thoughts via the comments.

Aug 192022
 

A long-standing bug, perhaps dating back to April 2003, has been fixed. It was an optimization in a trigger: if the values in the trigger have not changed, let’s not run the trigger.

Turns out, that was a bad idea.

The table

The ports_categories table documents the connections between a port and the categories in which it resides. There must be one physical category. By physical, I mean a category for which a directory exists on disk in the /usr/ports directory. This is sometimes referred to as the primary category. The primary/physical category must always be listed first in the CATEGORIES= directive of the port’s Makefile. See also Categorization in the FreeBSD Porters Handbook.

There are four open issues as I write this. The missing ports_categories entries #129 issue is most relevant to what I’m trying to solve today.

How I know

A system check is run on on regular basis and I get an email looking like this:

This is a list of ports that do not have entries in the ports_categories table
This can be fixed with this query:

begin;  insert into ports_categories select id, category_id from ports_active PA WHERE NOT EXISTS 
(SELECT * from ports_categories PC where PC.port_id = PA.id and PC.category_id = PA.category_id);

This is a list of the ports in question:

 id  | category_id |       name       |   category   |             port              |             element_pathname              
-----+-------------+------------------+--------------+-------------------------------+-------------------------------------------
 500 |           9 | open-motif-devel | x11-toolkits | x11-toolkits/open-motif-devel | /ports/head/x11-toolkits/open-motif-devel
(1 row)

Checking the data

Looking at the history of commits for that port, I notice the most recent commit resurrected this port after it was deleted in 2002.

I now suspect the resurrection process on FreshPorts. Perhaps the updates to ports_categories are not being done.

Let’s look at the data:

[pg02 dan ~] % psql freshports.devgit
psql (12.11)
Type "help" for help.

freshports.devgit=# select category_id, categories from ports where id = 500;
 category_id |  categories  
-------------+--------------
           9 | x11-toolkits
(1 row)

freshports.devgit=# select * from ports_categories where port_id = 500;
 port_id | category_id 
---------+-------------
(0 rows)

freshports.devgit=# 

From there, I can see category_id (derived value) and categories (directly copied from the Makefile are populated. It seems the only information missed is the entry from the ports_categories table.

The files

Here are the files/logs involved with that commit. They are archived by commit date.

[dev-ingress01 dan ~freshports/message-queues/recent] % cd ../archive/2022_08/2022_08_16
[dev-ingress01 dan ~freshports/message-queues/archive/2022_08/2022_08_16] % ls *0395dfd3904676e2e652bfd7850c4a48ec5cebba*
2022.08.16.03.33.52.000000.0395dfd3904676e2e652bfd7850c4a48ec5cebba.log  2022.08.16.03.33.52.000000.0395dfd3904676e2e652bfd7850c4a48ec5cebba.xml
[dev-ingress01 dan ~freshports/message-queues/archive/2022_08/2022_08_16] %

Can it be reproduced?

I have an idea. Let’s rerun the commit and see if the situation persists. If so, we have a reproducible event.

This is the delete:

freshports.devgit=# begin; delete from commit_log where message_id = '0395dfd3904676e2e652bfd7850c4a48ec5cebba';
BEGIN
DELETE 1
freshports.devgit=# commit;
COMMIT
freshports.devgit=# 

This is the injection of the commit into the processing queue:

[dev-ingress01 dan ~freshports/message-queues/archive/2022_08/2022_08_16] % sudo mv -i 2022.08.16.03.33.52.000000.0395dfd3904676e2e652bfd7850c4a48ec5cebba.xml ~ingress/message-queues/incoming/ 

The files are now back in the recent queue and this is how we know processing is complete.

[dev-ingress01 dan ~freshports/message-queues/recent] % ls *0395dfd3904676e2e652bfd7850c4a48ec5cebba*
2022.08.16.03.33.52.000000.0395dfd3904676e2e652bfd7850c4a48ec5cebba.log  2022.08.16.03.33.52.000000.0395dfd3904676e2e652bfd7850c4a48ec5cebba.xml

Checking the tables, we have the same results. The situation can be reproduced.

freshports.devgit=# select category_id, categories from ports where id = 500;
 category_id |  categories  
-------------+--------------
           9 | x11-toolkits
(1 row)

freshports.devgit=# select * from ports_categories where port_id = 500;
 port_id | category_id 
---------+-------------
(0 rows)

freshports.devgit=# 

I looked at the code, and found nothing updating the ports_categories table. It must be part of the database itself (i.e. a trigger).

Looking at the source code for the relational integrity stuff, I found this function:

CREATE OR REPLACE FUNCTION ports_categories_set() RETURNS TRIGGER AS $$
   DECLARE
                CategoryCount   integer;
                CategoryNames   text;
                Category        text;
                CategoryID      int8;
                UpdateNeeded    boolean;
   BEGIN
                UpdateNeeded := TRUE;

                IF TG_OP = 'UPDATE' THEN
                        IF old.categories = new.categories THEN
                                UpdateNeeded := FALSE;
                        END IF;
                END IF;

                IF UpdateNeeded THEN

                        DELETE 
                          FROM ports_categories
                         WHERE port_id = new.id;

                        IF new.categories IS NOT NULL THEN

                                CategoryCount := 0;
                                CategoryNames := new.categories;
                                LOOP
...

The highlighted lines 12-14 are an optimization. I think it’s wrong. Presumably the table would be correct if no changes are occurring. However, I see the faulty logic and it’s not in the function above.

That code dates back to at least April 2003.

Background

When the ports_categories table was first added to the database, it needed to be populated. Looking at commit logs, this was sometime pre 2006-12-17 because the oldest log entry says:

r4012 | dan | 2006-12-17 12:06:34 +0000 (Sun, 17 Dec 2006) | 2 lines

Merge from FreshPorts2 to HEAD

I could not find any earlier references. In that populate script, I find:

#
# get a list of ports to update
#

$sql = "
  SELECT ports_active.id         AS id, 
         ports_active.categories AS categories,
         ports_active.name       AS name,
         ports_active.category   AS category
    FROM ports_active
ORDER BY ports_active.id";

The ports_active is a view of the ports table which excludes deleted ports.

That’s the problem right there ^

Why were deleted ports excluded? It is awkward to extract values from deleted ports. The Makefile may be out of data for the current FreshPorts code. Sure, the code could versioned and could cope with older Makefile – that approach is common with APIs. No, I’m not going to do that.

Let’s comment out that optimization from the trigger and see if that fixes the situation.

Checking

Not shown: editing the trigger, updating the database, deleting the commit, and injecting the XML file back into the processing queue.

Here’s the new situation:

freshports.devgit=# select * from ports_categories where port_id = 500;
 port_id | category_id 
---------+-------------
     500 |           9
(1 row)

freshports.devgit=# select category_id, categories from ports where id = 500;
 category_id |  categories  
-------------+--------------
           9 | x11-toolkits
(1 row)

That is correct. The ports_categories now contains the expected values. Good.

I still have to wait for the top of the hour when the system check will be invoked, but right now, I’m confident this has solved an elusive bug.

EDIT: Several hours later, the alerts are no longer occurring on the dev database. Time to push these change through to test, stage, and then prod.

May 292021
 

Today at about 1:00 pm Philadelphia time, the website went offline. It was still there, it just could not answer any queries.

In this post:

  • Website and database server running in AWS
  • FreeBSD 12.2 on a
  • t3.medium EC2 instance
  • PostgreSQL 12.5 running on a db.t2.large RDS instance

Initial look

Looking at load, it was acceptable:

$ w
 7:09PM  up 43 days, 20:48, 1 user, load averages: 0.12, 0.26, 0.30
USER       TTY      FROM                                      LOGIN@  IDLE WHAT
ec2-user   pts/0    [redacted]                                7:09PM     - w

There were a lot of php-fpm: pool www (php-fpm) processes running.

Looking in the logs, I found:

[root@nginx01 /var/log]# tail -F php-fpm.log 
[29-May-2021 17:41:34] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 17 idle, and 224 total children
[29-May-2021 17:41:41] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 19 idle, and 234 total children
[29-May-2021 17:41:43] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 17 idle, and 235 total children
[29-May-2021 17:41:44] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 18 idle, and 238 total children
[29-May-2021 17:41:45] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 18 idle, and 240 total children
[29-May-2021 17:41:46] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 19 idle, and 242 total children
[29-May-2021 17:41:51] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 18 idle, and 249 total children
[29-May-2021 17:41:55] WARNING: [pool www] server reached pm.max_children setting (250), consider raising it
[29-May-2021 18:02:26] WARNING: [pool www] child 12962 exited with code 70 after 1759.560106 seconds from start
[29-May-2021 18:02:26] NOTICE: [pool www] child 22023 started
^C

Yeah, that’s a lot.

Looking at the RDS statistics, it was busy at about 98% load I think, I did not take a screen shot then. But here is the screen shot in hindsight, showing the fast build up of the number of sessions, from near zero to 250 and staying there for 2.5 hours

250 database sessions

250 database sessions

From this:

  • select * from elementGetCaseInsensitive() is suspect
  • this happened suddenly

The query

Here is the query:

   SELECT tmp.*, EP.pathname FROM (
      SELECT id,
             name::text,
             directory_file_flag::text,
             status::text,
             case when IsCategory(id) IS NULL THEN FALSE ELSE TRUE END,
             case when IsPort(    id) IS NULL THEN FALSE ELSE TRUE END
        FROM element
       WHERE id = (SELECT element_id
        FROM element_pathname
       WHERE lower(pathname) = lower($1))
     ) AS tmp JOIN element_pathname EP on tmp.id = EP.element_id;

And an explain analyse:

freshports.org=> explain analyse select * from elementGetCaseInsensitive('/ports/head/sysutils/nut');
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=29669.04..29685.59 rows=1 width=137) (actual time=6396.216..6396.221 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Seq Scan on element_pathname  (cost=0.00..29668.18 rows=5748 width=4) (actual time=4123.990..6395.702 rows=1 loops=1)
           Filter: (lower(pathname) = '/ports/head/sysutils/nut'::text)
           Rows Removed by Filter: 1149267
   ->  Index Scan using element_pkey on element  (cost=0.43..8.45 rows=1 width=23) (actual time=6395.745..6395.746 rows=1 loops=1)
         Index Cond: (id = $0)
   ->  Index Scan using element_pathname_element_id on element_pathname ep  (cost=0.43..8.45 rows=1 width=56) (actual time=0.008..0.009 rows=1 loops=1)
         Index Cond: (element_id = $0)
 Planning Time: 0.392 ms
 Execution Time: 6396.252 ms
(11 rows)

That’s terrible. Look at that sequential scan on over 1.1 million rows.

I know how to improve this: a new index.

The alteration

This is how I altered the table, first, the original definition:

freshports.org-> \d element_pathname
            Table "public.element_pathname"
   Column   |  Type   | Collation | Nullable | Default 
------------+---------+-----------+----------+---------
 element_id | integer |           | not null | 
 pathname   | text    |           | not null | 
Indexes:
    "element_pathname_pathname" UNIQUE, btree (pathname)
    "element_pathname_element_id" btree (element_id)
Foreign-key constraints:
    "element_pathname_element_id_fkey" FOREIGN KEY (element_id) REFERENCES element(id) ON DELETE CASCADE

freshports.org-> 

This is the index I added:

freshports.org=> begin;
BEGIN
freshports.org=> create index element_pathname_pathname_lc on element_pathname (lower(pathname));
CREATE INDEX
freshports.org=> commit;

The new explain analyze

Now that query runs quickly:

freshports.org=> explain analyse select * from elementGetCaseInsensitive('/ports/head/sysutils/nut');
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=10438.60..10455.15 rows=1 width=137) (actual time=0.292..0.297 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Bitmap Heap Scan on element_pathname  (cost=265.08..10437.74 rows=5746 width=4) (actual time=0.042..0.043 rows=1 loops=1)
           Recheck Cond: (lower(pathname) = '/ports/head/sysutils/nut'::text)
           Heap Blocks: exact=1
           ->  Bitmap Index Scan on element_pathname_pathname_lc  (cost=0.00..263.65 rows=5746 width=0) (actual time=0.037..0.038 rows=1 loops=1)
                 Index Cond: (lower(pathname) = '/ports/head/sysutils/nut'::text)
   ->  Index Scan using element_pkey on element  (cost=0.43..8.45 rows=1 width=23) (actual time=0.053..0.054 rows=1 loops=1)
         Index Cond: (id = $0)
   ->  Index Scan using element_pathname_element_id on element_pathname ep  (cost=0.43..8.45 rows=1 width=56) (actual time=0.007..0.008 rows=1 loops=1)
         Index Cond: (element_id = $0)
 Planning Time: 0.604 ms
 Execution Time: 0.332 ms
(13 rows)

freshports.org=> 

With that change, the website soon returned to its usually snappiness.

The stats show the results

This screen shows how the number of session is back to normal. see also that the query in question is now listed down under 0.01 seconds. Good.

sessions down near 0

sessions down near 0

I should have known to add this index with that new function was created. I don’t see an explain analyze in that post.

Mar 092021
 

I had noticed a problem with my full text search solution when I was testing. If the input query contained a quote, an error resulted on the query. At the time, I attributed that to a poor input/text handling. Later, I realized I was wrong.

On the website, this PHP code is used:

$result = pg_query_params($this->dbh, $sql, array($this->Query));

Where $this->Query is what the user supplied on the search form. The pg_query_params documentation says

The primary advantage of pg_query_params() over pg_query() is that parameter values may be separated from the query string, thus avoiding the need for tedious and error-prone quoting and escaping.

So I don’t need to do anything special.

Here is some testing I tried while talking this over via IRC:

freshports.dev=# SELECT to_tsquery('english', $$Joe's stuff$$);
ERROR:  syntax error in tsquery: "Joe's stuff"

freshports.dev=# SELECT to_tsquery('english', $$Joes stuff$$);
ERROR:  syntax error in tsquery: "Joes stuff"

freshports.dev=# SELECT to_tsquery('english', $$Joe stuff$$);
ERROR:  syntax error in tsquery: "Joe stuff"

freshports.dev=# SELECT to_tsquery('english', $$Joestuff$$);
 to_tsquery 
------------
 'joestuff'
(1 row)

It turns out the solution is to use websearch_to_tsquery() instead. The documentation mentions:

Moreover, this function should never raise syntax errors, which makes it possible to use raw user-supplied input for search.

That sounds ideal, and it was. It completely fixes the issue.

freshports.dev=# SELECT websearch_to_tsquery('english', $$Joe's stuff$$);
 websearch_to_tsquery 
----------------------
 'joe' & 'stuff'
(1 row)

freshports.dev=# SELECT websearch_to_tsquery('english', $$Joes stuff$$);
 websearch_to_tsquery 
----------------------
 'joe' & 'stuff'
(1 row)

freshports.dev=# SELECT websearch_to_tsquery('english', $$Joe stuff$$);
 websearch_to_tsquery 
----------------------
 'joe' & 'stuff'
(1 row)

Great. Hope this helps.

Mar 092021
 

In yesterdays blog post, Replacing a column search with a full text search solution, I wanted to mention how adding a LIMIT 10 to a fast running query made it run slower. I didn’t remember that issue until this morning. Fortunately, the fix was low cost.

The query

This is the query I ran:

EXPLAIN ANALYSE
WITH short_list AS (
    SELECT
        port_id
    FROM
        generate_plist
    WHERE
        textsearchable_index_col @@ to_tsquery('share/examples/acme.sh/dnsapi/dns_1984hosting.sh')
)
  select CL.commit_date - SystemTimeAdjust() AS last_commit_date, 
         P.id, 
         E.name as port,
         C.name as category, 
...
         P.uses  
       FROM short_list, ports P
       LEFT OUTER JOIN ports_vulnerable    PV  ON PV.port_id       = P.id
       LEFT OUTER JOIN commit_log          CL  ON P.last_commit_id = CL.id,
       element_pathname EP,
       categories C, element E 
 WHERE P.id = short_list.port_id
   AND P.element_id = EP.element_id and EP.pathname like '/ports/head/%'
   AND P.category_id  = C.id
   AND P.element_id   = E.id  AND E.status = 'A';

The plan (as found at https://explain.depesz.com/s/Bkkg) was fast enough for me.

Adding LIMIT 10

This was used on the FreshPorts search page (not in production, but on the dev website). It needs a limit, so I added LIMIT 10.

The plan changed to a sequential scan of a large table. Boom, 28 seconds. That is unsatisfactory. The plan is at https://explain.depesz.com/s/oJZB.

The Fix

The fix, as pointed out by Thomas Hurst in his reply to my tweet centered around a CTE (Common Table Expressions). That initially confused me, because the query was already using a CTE. That’s the WITH short_list AS part of the query.

Then I read the AS MATERIALIZED part of his reply and started reading up on that. I changed the query to:

EXPLAIN ANALYSE
WITH short_list AS MATERIALIZED (
    SELECT
        port_id
    FROM
        generate_plist
    WHERE
        textsearchable_index_col @@ to_tsquery('share/examples/acme.sh/dnsapi/dns_1984hosting.sh')
)
  select CL.commit_date - SystemTimeAdjust() AS last_commit_date, 
         P.id, 
         E.name as port,
         C.name as category, 
...
         P.uses  
       FROM short_list, ports P
       LEFT OUTER JOIN ports_vulnerable    PV  ON PV.port_id       = P.id
       LEFT OUTER JOIN commit_log          CL  ON P.last_commit_id = CL.id,
       element_pathname EP,
       categories C, element E 
 WHERE P.id = short_list.port_id
   AND P.element_id = EP.element_id and EP.pathname like '/ports/head/%'
   AND P.category_id  = C.id
   AND P.element_id   = E.id  AND E.status = 'A'
LIMIT 10;

You can see the change on line 2: AS MATERIALIZED.

The plan (as found at https://explain.depesz.com/s/UBps) takes 1.4s and is back to using an index scan.

Further reading:

Mar 082021
 

One of the many useful features on FreshPorts is: what port[s] install[s] this file? That’s the pkg-plist search option. pkg-plist is a file which “lists all the files installed by the port”. However not all ports have a pkg-plist file because the list is compiled automatically. That is why the configure_plist table was created to replace the ports.pkg_plist column. The creation of that table broke the search function because it was then working on outdated data.

Last week, I blogged about how I created a new a new stored procedure for pulling back the pkg-plist information.

The new table

By new, I mean new to this search solution. The table looks like this:

freshports.dev=# \d generate_plist
                                Table "public.generate_plist"
     Column     |  Type   | Collation | Nullable |                  Default                   
----------------+---------+-----------+----------+--------------------------------------------
 id             | bigint  |           | not null | nextval('generate_plist_id_seq'::regclass)
 port_id        | integer |           | not null | 
 installed_file | text    |           | not null | 
Indexes:
    "generate_plist_installed_file_gin_idx" gin (to_tsvector('english'::regconfig, installed_file))
    "generate_plist_installed_file_idx" btree (installed_file)
    "generate_plist_pk" btree (id)
    "generate_plist_port_id_idx" btree (port_id)

freshports.dev=# 

For a given port, there will be N entries in this table, one for each item in the pkg-plist. You will notice some indexes on the installed_file column.

A simple search

With the old solution, we could search like this, but it took time:

AND P.pkg_plist ILIKE '%share/examples/acme.sh/dnsapi/dns_1984hosting.sh%'

Over 2.7 seconds in fact.

Let’s try looking in the generate_plist table instead. I was reading the Tables and Indexes section of the PostgreSQL Full Text Search and decided to start with this:

ALTER TABLE public.generate_plist
    ADD COLUMN textsearchable_index_col tsvector GENERATED ALWAYS AS (to_tsvector('english'::regconfig, installed_file)) STORED;

CREATE INDEX generate_plist_textsearch_idx
    ON public.generate_plist USING gin
    (textsearchable_index_col)
    TABLESPACE pg_default;

With that, the following search was pretty damn nice.

freshports.dev=# explain analyse    SELECT
        port_id
    FROM
        generate_plist
    WHERE
        textsearchable_index_col @@ to_tsquery('share/examples/acme.sh/dnsapi/dns_1984hosting.sh')
;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on generate_plist  (cost=367.55..2143.46 rows=426 width=4) (actual time=2.116..2.123 rows=2 loops=1)
   Recheck Cond: (textsearchable_index_col @@ to_tsquery('share/examples/acme.sh/dnsapi/dns_1984hosting.sh'::text))
   Heap Blocks: exact=2
   ->  Bitmap Index Scan on generate_plist_textsearch_idx  (cost=0.00..367.44 rows=426 width=0) (actual time=2.095..2.095 rows=2 loops=1)
         Index Cond: (textsearchable_index_col @@ to_tsquery('share/examples/acme.sh/dnsapi/dns_1984hosting.sh'::text))
 Planning Time: 0.434 ms
 Execution Time: 2.208 ms
(7 rows)

But this was … umm, what?

freshports.dev=# SELECT
        port_id
    FROM
        generate_plist
    WHERE
        textsearchable_index_col @@ to_tsquery('dns_1984hosting.sh')
;
 port_id 
---------
(0 rows)

freshports.dev=# 

Nothing found? What?

Full text search is not like that

Full text search is not like that. My search on the full path name would have worked just fine with the existing index:

freshports.dev=# explain analyse SELECT
        port_id
    FROM
        generate_plist
    WHERE
        installed_file = 'share/examples/acme.sh/dnsapi/dns_1984hosting.sh';
                                                                    QUERY PLAN                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using generate_plist_installed_file_idx on generate_plist  (cost=0.56..28.66 rows=6 width=4) (actual time=0.046..0.052 rows=2 loops=1)
   Index Cond: (installed_file = 'share/examples/acme.sh/dnsapi/dns_1984hosting.sh'::text)
 Planning Time: 0.222 ms
 Execution Time: 0.088 ms
(4 rows)

freshports.dev=# 

In fact, it’s faster.

But that’s not what is needed. Let’s look at the use case.

Searching for file names

We want to search for the full path or the file name. Or just dns_1984hosting.sh or dns_1984hosting.

With help from RhodiumToad on IRC, tsearch is all about matching words (technically lexemes), not arbitrary substrings. We need to think in terms of lexemes. For example:

freshports.dev=# select ts_debug('english', 'foo/bar');
                           ts_debug                           
--------------------------------------------------------------
 (file,"File or path name",foo/bar,{simple},simple,{foo/bar})
(1 row)

freshports.dev=# 

So let’s split the file name into distinct words. Let’s convert the / to space and add a new column for that:

ALTER TABLE public.generate_plist
    ADD COLUMN textsearchable_index_col2 tsvector GENERATED ALWAYS AS (to_tsvector('english'::regconfig, translate(installed_file, '/'::text, ' '::text))) STORED;

CREATE INDEX generate_plist_textsearch_idx2
    ON public.generate_plist USING gin
    (textsearchable_index_col2)
    TABLESPACE pg_default;

Let’s try searching now:

freshports.dev=# EXPLAIN ANALYSE
WITH short_list AS (
    SELECT
        port_id, installed_file
    FROM
        generate_plist
    WHERE
     textsearchable_index_col2 @@ to_tsquery('dns_1984hosting.sh')
)
  select P.id, element_pathname(P.element_id), SL.installed_file
  FROM ports P, short_list SL
 WHERE P.id = SL.port_id;
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=171.64..25352.59 rows=2981 width=88) (actual time=1.337..1.506 rows=2 loops=1)
   ->  Bitmap Heap Scan on generate_plist  (cost=171.35..12102.68 rows=2981 width=56) (actual time=1.031..1.040 rows=2 loops=1)
         Recheck Cond: (textsearchable_index_col2 @@ to_tsquery('dns_1984hosting.sh'::text))
         Heap Blocks: exact=2
         ->  Bitmap Index Scan on generate_plist_textsearch_idx2  (cost=0.00..170.61 rows=2981 width=0) (actual time=1.015..1.015 rows=2 loops=1)
               Index Cond: (textsearchable_index_col2 @@ to_tsquery('dns_1984hosting.sh'::text))
   ->  Index Scan using ports_pkey on ports p  (cost=0.29..4.19 rows=1 width=8) (actual time=0.014..0.014 rows=1 loops=2)
         Index Cond: (id = generate_plist.port_id)
 Planning Time: 1.064 ms
 Execution Time: 1.596 ms
(10 rows)

freshports.dev=# 

Umm, that’s fast.

But if we try the file without the .sh we get nothing:

freshports.dev=# WITH short_list AS (
    SELECT
        port_id, installed_file
    FROM
        generate_plist
    WHERE
     textsearchable_index_col2 @@ to_tsquery('dns_1984hosting')
)
  select P.id, element_pathname(P.element_id), SL.installed_file
  FROM ports P, short_list SL
 WHERE P.id = SL.port_id;
 id | element_pathname | installed_file 
----+------------------+----------------
(0 rows)

freshports.dev=# 

Another column for that

Humor me here while I add another column:

ALTER TABLE public.generate_plist
    ADD COLUMN textsearchable_index_col3 tsvector GENERATED ALWAYS AS (to_tsvector('english'::regconfig, translate(installed_file, '/.'::text, '  '::text))) STORED;

CREATE INDEX generate_plist_textsearch_idx3
    ON public.generate_plist USING gin
    (textsearchable_index_col3)
    TABLESPACE pg_default;

This translates both / and . to a space. Now we have:

freshports.dev=# WITH short_list AS (
    SELECT
        port_id, installed_file
    FROM
        generate_plist
    WHERE
     textsearchable_index_col3 @@ to_tsquery('dns_1984hosting')
)
  select P.id, element_pathname(P.element_id), SL.installed_file
  FROM ports P, short_list SL
 WHERE P.id = SL.port_id;
  id   |            element_pathname             |                  installed_file                  
-------+-----------------------------------------+--------------------------------------------------
 59654 | /ports/branches/2020Q3/security/acme.sh | share/examples/acme.sh/dnsapi/dns_1984hosting.sh
 43508 | /ports/head/security/acme.sh            | share/examples/acme.sh/dnsapi/dns_1984hosting.sh
(2 rows)

freshports.dev=# 

Which is still lightning fast.

Doing them all at once

Creating a query which uses all three columns is still performant:

freshports.dev=# explain analyse WITH short_list AS (
    SELECT
        DISTINCT port_id, installed_file
    FROM
        generate_plist
    WHERE
        textsearchable_index_col  @@ to_tsquery('bacula')
     OR textsearchable_index_col2 @@ to_tsquery('bacula')
     OR textsearchable_index_col3 @@ to_tsquery('bacula')
)
  select P.id, element_pathname(P.element_id), SL.installed_file
  FROM ports P, short_list SL
 WHERE P.id = SL.port_id;
                                                                                                       QUERY PLAN                                                                                                       
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=50810.64..53335.38 rows=7790 width=88) (actual time=273.782..520.103 rows=1872 loops=1)
   Hash Cond: (sl.port_id = p.id)
   ->  Subquery Scan on sl  (cost=35602.05..35757.85 rows=7790 width=56) (actual time=14.561..16.078 rows=1872 loops=1)
         ->  HashAggregate  (cost=35602.05..35679.95 rows=7790 width=56) (actual time=14.558..15.495 rows=1872 loops=1)
               Group Key: generate_plist.port_id, generate_plist.installed_file
               ->  Bitmap Heap Scan on generate_plist  (cost=1705.16..35563.02 rows=7806 width=56) (actual time=11.540..12.771 rows=1872 loops=1)
                     Recheck Cond: ((textsearchable_index_col @@ to_tsquery('bacula'::text)) OR (textsearchable_index_col2 @@ to_tsquery('bacula'::text)) OR (textsearchable_index_col3 @@ to_tsquery('bacula'::text)))
                     Heap Blocks: exact=116
                     ->  BitmapOr  (cost=1705.16..1705.16 rows=7808 width=0) (actual time=11.489..11.490 rows=0 loops=1)
                           ->  Bitmap Index Scan on generate_plist_textsearch_idx  (cost=0.00..415.44 rows=426 width=0) (actual time=2.994..2.995 rows=36 loops=1)
                                 Index Cond: (textsearchable_index_col @@ to_tsquery('bacula'::text))
                           ->  Bitmap Index Scan on generate_plist_textsearch_idx2  (cost=0.00..170.61 rows=2981 width=0) (actual time=1.373..1.374 rows=1857 loops=1)
                                 Index Cond: (textsearchable_index_col2 @@ to_tsquery('bacula'::text))
                           ->  Bitmap Index Scan on generate_plist_textsearch_idx3  (cost=0.00..1113.26 rows=4401 width=0) (actual time=7.117..7.117 rows=1872 loops=1)
                                 Index Cond: (textsearchable_index_col3 @@ to_tsquery('bacula'::text))
   ->  Hash  (cost=14174.04..14174.04 rows=63004 width=8) (actual time=258.282..258.283 rows=63065 loops=1)
         Buckets: 32768  Batches: 4  Memory Usage: 877kB
         ->  Seq Scan on ports p  (cost=0.00..14174.04 rows=63004 width=8) (actual time=0.089..219.485 rows=63065 loops=1)
 Planning Time: 1.227 ms
 Execution Time: 521.244 ms
(20 rows)

freshports.dev=# 

At what cost?

What does the table look like now:

freshports.dev=# \d generate_plist
                                                                          Table "public.generate_plist"
          Column           |   Type   | Collation | Nullable |                                                      Default                                                      
---------------------------+----------+-----------+----------+-------------------------------------------------------------------------------------------------------------------
 id                        | bigint   |           | not null | nextval('generate_plist_id_seq'::regclass)
 port_id                   | integer  |           | not null | 
 installed_file            | text     |           | not null | 
 textsearchable_index_col  | tsvector |           |          | generated always as (to_tsvector('english'::regconfig, installed_file)) stored
 textsearchable_index_col2 | tsvector |           |          | generated always as (to_tsvector('english'::regconfig, translate(installed_file, '/'::text, ' '::text))) stored
 textsearchable_index_col3 | tsvector |           |          | generated always as (to_tsvector('english'::regconfig, translate(installed_file, '/.'::text, '  '::text))) stored
Indexes:
    "generate_plist_installed_file_gin_idx" gin (to_tsvector('english'::regconfig, installed_file))
    "generate_plist_installed_file_idx" btree (installed_file)
    "generate_plist_pk" btree (id)
    "generate_plist_port_id_idx" btree (port_id)
    "generate_plist_textsearch_idx" gin (textsearchable_index_col)
    "generate_plist_textsearch_idx2" gin (textsearchable_index_col2)
    "generate_plist_textsearch_idx3" gin (textsearchable_index_col3)

How much space does this take:



freshports.dev=# select count(*) from generate_plist;
  count  
---------
 8478513
(1 row)


freshports.dev=# SELECT pg_table_size('generate_plist');
 pg_table_size 
---------------
    3578167296
(1 row)

freshports.dev=# select pg_total_relation_size('generate_plist');
 pg_total_relation_size 
------------------------
             6730162176
(1 row)

That’s 8.5 million rows taking up 3.3 GB but with indexes it’s 6.3 GB.

Disk is cheap. Time is expensive. I say the cost is worth the performance.

Mar 022021
 

In this post I will show you how I discovered that the backend was putting data into a new table (generate_plist) but the website was still using the ports.pkg_plist column. The result was stale data being shown on the website (for older ports) or no data at all (for newer ports).

How long? Since Oct 2017, because that’s when configure_plist showed up on the website. Nobody noticed until recently. I can only guess that I made the backend changes with intent to change the frontend later. I should have created myself a ticket.

Why was the change made? Primarily because all ports, especially Python based ports, have a pkg-plist file. These ports generate the list automatically at package-built time. Secondly, it also simplifies the search of pkg-plist.

Ironically, this bug also affects search, which is still using the old column.

Background

One set of data which FreshPorts shows is the list of files installed by a given package. This information can be provided via a file, called pkg-plist.

Here is one example for the comms/gnuradio port:

freshports.dev=# select pkg_plist from ports where id = 34724;
             pkg_plist              
------------------------------------
 include/libconcord.h              +
 lib/libconcord.a                  +
 lib/libconcord.so                 +
 lib/libconcord.so.4               +
 lib/libconcord.so.4.0.0           +
 share/mime/packages/libconcord.xml
(1 row)

freshports.dev=# 

NOTE: the above is one row of output. It is a collection of strings, separated by newlines.

If visit the website, you’ll see that list is wrong, because it’s outdated.

The generate-plist command is part of the ports infrastructure. FreshPorts gathers this data with this command and the resulting file.

$ make generate-plist
$ cat work/.PLIST.mktmp

A recent blog post goes into more detail. All that goes into the configure_plist table:

freshports.dev=# \d configure_plist
Did not find any relation named "configure_plist".
freshports.dev=# \d generate_plist
                                Table "public.generate_plist"
     Column     |  Type   | Collation | Nullable |                  Default                   
----------------+---------+-----------+----------+--------------------------------------------
 id             | bigint  |           | not null | nextval('generate_plist_id_seq'::regclass)
 port_id        | integer |           | not null | 
 installed_file | text    |           | not null | 
Indexes:
    "generate_plist_installed_file_gin_idx" gin (to_tsvector('english'::regconfig, installed_file))
    "generate_plist_installed_file_idx" btree (installed_file)
    "generate_plist_pk" btree (id)
    "generate_plist_port_id_idx" btree (port_id)

freshports.dev=# 

The original data

This is what the data looked like, as demonstrated by this query:

freshports.dev=# select pkg_plist, array_to_json(regexp_match(pkg_plist, 'lib/[[:alpha:]]*?\.so')) from ports where id = 34724;
             pkg_plist              |     array_to_json     
------------------------------------+-----------------------
 include/libconcord.h              +| ["lib/libconcord.so"]
 lib/libconcord.a                  +| 
 lib/libconcord.so                 +| 
 lib/libconcord.so.4               +| 
 lib/libconcord.so.4.0.0           +| 
 share/mime/packages/libconcord.xml | 
(1 row)

NOTE: This example user :alpha: but I later changed this to
:alnum: in my code.

NOTE: The above is one row.

The goal is to grab all the .so libraries.

The new layout

The new table, with refreshed data, looks like:

freshports.dev=# select * from generate_plist where port_id = 34724;
    id    | port_id |                       installed_file                       
----------+---------+------------------------------------------------------------
 27791797 |   34724 | @shared-mime-info share/mime
 27791798 |   34724 | /usr/local/share/licenses/libconcord-1.2_2/catalog.mk
 27791799 |   34724 | /usr/local/share/licenses/libconcord-1.2_2/LICENSE
 27791800 |   34724 | /usr/local/share/licenses/libconcord-1.2_2/GPLv2
 27791801 |   34724 | include/libconcord.h
 27791802 |   34724 | lib/libconcord.a
 27791803 |   34724 | lib/libconcord.so
 27791804 |   34724 | lib/libconcord.so.4
 27791805 |   34724 | lib/libconcord.so.4.0.0
 27791806 |   34724 | share/mime/packages/libconcord.xml
 27791807 |   34724 | @postexec /usr/sbin/service ldconfig restart > /dev/null
 27791808 |   34724 | @postunexec /usr/sbin/service ldconfig restart > /dev/null
(12 rows)

freshports.dev=# 

First go

NOTE: I have removed the array_to_json() call for now:

freshports.dev=# select installed_file, regexp_match(installed_file, 'lib/[[:alpha:]]*?\.so') 
from generate_plist where port_id = 34724;
                       installed_file                       |    regexp_match     
------------------------------------------------------------+---------------------
 @shared-mime-info share/mime                               | 
 /usr/local/share/licenses/libconcord-1.2_2/catalog.mk      | 
 /usr/local/share/licenses/libconcord-1.2_2/LICENSE         | 
 /usr/local/share/licenses/libconcord-1.2_2/GPLv2           | 
 include/libconcord.h                                       | 
 lib/libconcord.a                                           | 
 lib/libconcord.so                                          | {lib/libconcord.so}
 lib/libconcord.so.4                                        | {lib/libconcord.so}
 lib/libconcord.so.4.0.0                                    | {lib/libconcord.so}
 share/mime/packages/libconcord.xml                         | 
 @postexec /usr/sbin/service ldconfig restart > /dev/null   | 
 @postunexec /usr/sbin/service ldconfig restart > /dev/null | 
(12 rows)

That looks good so far. What I did not realize until later: each value of the regexp_match output is an array.

Let’s try this on a port with more libraries:

freshports.dev=# select distinct regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
   from generate_plist
  WHERE generate_plist.port_id = 9589 and regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so')  is not null;
        lib         
--------------------
 {lib/libhamlib.so}
 {lib/hamlibtcl.so}
 {lib/Hamlib.so}
(3 rows)

That still looks good.

Getting rid of duplicate values and ignoring null

I thought this was a simple way to get rid of the uninteresting values. I like using WITH. It allows you to take a very complex set of data and adjust it before performing additional operations upon it. What I’m doing here is not very complicated and I am sure it can be done without WITH.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 9589
)
select distinct lib from tmp where lib is not null;
        lib         
--------------------
 {lib/libhamlib.so}
 {lib/hamlibtcl.so}
 {lib/Hamlib.so}
(3 rows)

JSON time

In this section, there are three main changes.

Convert the results from an array to the first element of that array. We a dealing with a single file. We should get just one match.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 9589)
select distinct lib[1] from tmp where lib is not null;
       lib        
------------------
 lib/hamlibtcl.so
 lib/libhamlib.so
 lib/Hamlib.so
(3 rows)

Next, combine all the rows into one row, using array.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 9589)
select array(select distinct lib[1] from tmp where lib is not null);
                       array                       
---------------------------------------------------
 {lib/hamlibtcl.so,lib/libhamlib.so,lib/Hamlib.so}
(1 row)

The last step, change that to JSON output.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 9589)
select array_to_json(array(select distinct lib[1] from tmp where lib is not null));
                      array_to_json                      
---------------------------------------------------------
 ["lib/hamlibtcl.so","lib/libhamlib.so","lib/Hamlib.so"]
(1 row)

But…

I seems wasteful, or backward, to be converting to an array and then to JSON, but I think that’s the way it needs to be done for this data. I’m happy to have additional ideas if you have them.

EDIT: Magnus Hagander gave me the solution there. Use json_agg instead.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 14152)
select json_agg(distinct lib[1]) from tmp where lib is not null;
                                              json_agg                                               
-----------------------------------------------------------------------------------------------------
 ["lib/libfreeipmi.so", "lib/libipmiconsole.so", "lib/libipmidetect.so", "lib/libipmimonitoring.so"]
(1 row)

A more complex example

Let’s try sysutils/freeipmi which has more librarries:

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 14152)
select array_to_json(array(select distinct lib[1] from tmp where lib is not null));
                                          array_to_json                                           
--------------------------------------------------------------------------------------------------
 ["lib/libfreeipmi.so","lib/libipmidetect.so","lib/libipmiconsole.so","lib/libipmimonitoring.so"]
(1 row)

freshports.dev=# 

That seems to work.

Put it into a function

This is the function I created:

-- to convert the generate_plist table, multiple rows per port, into a
-- single row of JSON. Mostly used for Dependency line:
--
-- re https://github.com/FreshPorts/freshports/issues/216
--
CREATE OR REPLACE FUNCTION pkg_plist(bigint) returns json as $$
   WITH tmp AS (
       select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
         from generate_plist
        where generate_plist.port_id = $1
   )
   select array_to_json(array(select distinct lib[1] from tmp where lib is not null));
$$ LANGUAGE SQL STABLE;

The changed code

The original code was:

freshports.dev=# explain analyse
select pkg_plist, array_to_json(regexp_match(pkg_plist, 'lib/[[:alpha:]]*?\.so')) from ports where id = 34724;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Index Scan using ports_pkey on ports  (cost=0.29..8.31 rows=1 width=190) (actual time=0.092..0.096 rows=1 loops=1)
   Index Cond: (id = 34724)
 Planning Time: 0.305 ms
 Execution Time: 0.138 ms
(4 rows)
freshports.dev=# 

The new code is:

freshports.dev=# explain analyse
freshports.dev-# select pkg_plist, pkg_plist(id) from ports where id = 34724;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Index Scan using ports_pkey on ports  (cost=0.29..8.56 rows=1 width=190) (actual time=0.323..0.324 rows=1 loops=1)
   Index Cond: (id = 34724)
 Planning Time: 0.229 ms
 Execution Time: 0.340 ms
(4 rows)

That is an acceptable difference.

Thanks.

Jan 312021
 

In this post I’ll talk about how the code intially took the easy way out and how recent changes have tightened that up. In short, the code would raise N messages when just a single notification is sufficient.

Today the full vuxml database was reloaded into FreshPorts. This affects approximately 3454 ports and the marking of 60,436 commits (or revisions) as vulnerable. When you look at a port page, you can see the skulls which indicate a known vuln beside the commits in question (look under Commit History). That makes up for 565,576 unique combinations of commits, ports, and vuxml entries.

After all that is processed (over about 21.5 hours on the current server), we need to flush the cache of for each item that changed (ports, daily summaries, etc).

How cache clearing works

When a port is updated, a trigger on the ports table adds an entry to the cache_clearing_ports. A long running daemon monitors the table waiting for a LISTEN notification. When it arrives, the daemon reads the table and starts processing the entries, clearing the cache.

This is simplified though. Updates occur in many places and each invokes its own trigger. However, for this post, we will deal with only one.

current cache_clearing_ports table

This is the current cache_clearing_ports table:

freshports.org=# \d cache_clearing_ports
                                        Table "public.cache_clearing_ports"
   Column   |            Type             | Collation | Nullable |                     Default                      
------------+-----------------------------+-----------+----------+--------------------------------------------------
 id         | integer                     |           | not null | nextval('cache_clearing_ports_id_seq'::regclass)
 port_id    | integer                     |           | not null | 
 category   | text                        |           | not null | 
 port       | text                        |           | not null | 
 date_added | timestamp without time zone |           | not null | now()
Indexes:
    "cache_clearing_ports_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "cache_clearing_ports_port_id_fkey" FOREIGN KEY (port_id) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE

freshports.org=# 

As you can see, duplicate entries are OK. However, over a single commit, which is how the vuxml data is processed, this can result in a large number of duplicates.

How many duplicates?

Checking here, we have just over 1 million entries.

freshports.org=# select count(*) from cache_clearing_ports;
  count  
---------
 1005407
(1 row)

How unique are they?

Only 3500 ports:

freshports.org=# select count(distinct port_id) from commit_log_ports_vuxml;
 count 
-------
  3454
(1 row)

I recall when I was writing this: duplicates are no big deal. The cache gets cleared twice. So what?

It takes time to read through the cache_clearing_ports. A long time, when you have about 1 million entries.

The trigger

This is the existing trigger:

CREATE OR REPLACE FUNCTION ports_clear_cache() RETURNS TRIGGER AS $$
   DECLARE
      l_port      text;
      l_category  text;
   BEGIN
      IF TG_OP = 'UPDATE' THEN
            SELECT category, name
              INTO l_category, l_port
              FROM ports_all
             WHERE id = NEW.id;

            INSERT INTO cache_clearing_ports (port_id, category, port)
                 VALUES (NEW.id, l_category, l_port);

            NOTIFY port_updated;
      END IF;

      -- when a port changes, add an entry to the cache clearing table
      RETURN NEW;
   END
$$ LANGUAGE 'plpgsql';

  DROP TRIGGER IF EXISTS ports_clear_cache ON ports;
CREATE TRIGGER ports_clear_cache
    AFTER UPDATE ON ports
    FOR EACH ROW
    EXECUTE PROCEDURE ports_clear_cache();

This is how I propose to change the table. First, I’m creating a new identical table to play with:

freshports.dev=# create table cache_clearing_ports_test (like cache_clearing_ports INCLUDING ALL);
CREATE TABLE

Just what I need. This creates a new table, including all the constraints etc, so I could play with it without affecting the ongoing database operation in my dev server.

This is the DDL I would run:

ALTER TABLE public.cache_clearing_ports_test
    ADD CONSTRAINT cache_clearing_ports_test_port_id_idx UNIQUE (port_id);

COMMENT ON CONSTRAINT cache_clearing_ports_port_id_idx ON public.cache_clearing_ports_test
    IS 'Let''s store just one instance per port_id';

The DDL for the real table would be:

ALTER TABLE public.cache_clearing_ports
    ADD CONSTRAINT cache_clearing_ports_port_id_idx UNIQUE (port_id);

COMMENT ON CONSTRAINT cache_clearing_ports_port_id_idx ON public.cache_clearing_ports
    IS 'Let''s store just one instance per port_id';

This is the amended test table:

freshports.dev=# \d cache_clearing_ports_test
                                      Table "public.cache_clearing_ports_test"
   Column   |            Type             | Collation | Nullable |                     Default                      
------------+-----------------------------+-----------+----------+--------------------------------------------------
 id         | integer                     |           | not null | nextval('cache_clearing_ports_id_seq'::regclass)
 port_id    | integer                     |           | not null | 
 category   | text                        |           | not null | 
 port       | text                        |           | not null | 
 date_added | timestamp without time zone |           | not null | now()
Indexes:
    "cache_clearing_ports_test_pkey" PRIMARY KEY, btree (id)
    "cache_clearing_ports_port_id_idx" UNIQUE CONSTRAINT, btree (port_id)

freshports.dev=# 

Next step, fix the trigger, which will produce duplicate errors if left as it is.

The new trigger

The INSERT command has a conflict_target clause, which is very useful for this situation.

CREATE OR REPLACE FUNCTION ports_clear_cache() RETURNS TRIGGER AS $$
   DECLARE
      l_port      text;
      l_category  text;
   BEGIN
      IF TG_OP = 'UPDATE' THEN
            SELECT category, name
              INTO l_category, l_port
              FROM ports_all
             WHERE id = NEW.id;

            INSERT INTO cache_clearing_ports (port_id, category, port)
                 VALUES (NEW.id, l_category, l_port)
            ON CONFLICT ON CONSTRAINT cache_clearing_ports_port_id_idx DO NOTHING;

            NOTIFY port_updated;
      END IF;

      -- when a port changes, add an entry to the cache clearing table
      RETURN NEW;
   END
$$ LANGUAGE 'plpgsql';

What this says:

  • On the INSERT
  • IF there is a conflict on the unique port_id index
  • THEN do nothing

There are many other things you can do ON CONFLICT, but in my case, doing nothing is what I want.

But wait! There’s more!

I went one step farther. Why NOTIFY if we don’t have to.

CREATE OR REPLACE FUNCTION ports_clear_cache() RETURNS TRIGGER AS $$
   DECLARE
      l_port      text;
      l_category  text;
      l_row_count bigint;
   BEGIN
      IF TG_OP = 'UPDATE' THEN
            SELECT category, name
              INTO l_category, l_port
              FROM ports_all
             WHERE id = NEW.id;

            INSERT INTO cache_clearing_ports (port_id, category, port)
                 VALUES (NEW.id, l_category, l_port)
            ON CONFLICT ON CONSTRAINT cache_clearing_ports_port_id_idx DO NOTHING;

            GET DIAGNOSTICS l_row_count = ROW_COUNT;
            IF l_row_count > 0 THEN
                  NOTIFY port_updated;
            END IF;
      END IF;

      -- when a port changes, add an entry to the cache clearing table
      RETURN NEW;
   END
$$ LANGUAGE 'plpgsql';

In this solution, the code checks to see if anything was inserted. If so, it issues the NOTIFY.

Aug 112020
 

This idea came to me today from swills:

stupid question, would it be possible to make the urls on freshports case insensitive?

so that like https://www.freshports.org/x11/libx11 would go to https://www.freshports.org/x11/libX11

After confirming this was for manually typed URLs (as opposed to generated links), I started looking into it. This is what I found.

In this post, we’re going to look at some old PHP database code and the PostgreSQL stored procedures behind them. I’ll show you the multi-line stored procedure now in use and how a single query will now do the same thing. This change isn’t some clever coding or another solution. The database has evolved and the procedures did not have to. Until now.

Going from sysutils/py-mqttwarn to a database entry

I will base all my links to code on the 1.50.5 release.

If the Nginx webserver cannot follow the URL to a file on disk, it invokes www/–/index.php via this bit of declartion:

  location / {
    try_files $uri $uri/ @missing;
  }

  location @missing {
    rewrite ^ /--/index.php break;

    fastcgi_split_path_info ^(.+\.php)(/.+)$;
    fastcgi_pass unix:/var/run/php-fpm.sock;
    fastcgi_index index.php;
    fastcgi_param SCRIPT_FILENAME $request_filename;
    include fastcgi_params;
  }

Line 2 says: try finding the URL using files on disk and if you file, go do what @missing points to.

Line 6 say: invoke /–/index.php for things you don’t find in a file.

Why /–/index.php? I wanted to put this file somewhere that it would be highly unlikely to collide with anything in the FreeBSD repo (with regards to directories and file names).

The code checks for other special stuff ((badges, scripts, API, etc) before invoking freshports_Parse404URI() on line 148. I will discuss that soon, but first: if nothing else is picked up by /–/index.php, then it’s time to invoke rewrite/missing.php

freshports_Parse404URI() is declared in rewrite/functions.php and the code relevant to our journey is on line 70:

if ($ElementRecord->FetchByName('/ports/head/' . $pathname)) {

Ignoring that hardcoded pathname for now, this function gets passed something like this:

/ports/head/sysutils/py-mqttwarn

There might be more on the path: branch name, commit number, etc.

FetchByName() is located on line 38 of classes/element_record

Now we are getting into the oldest code of FreshPorts. I’m sure this stuff hasn’t changed in over 15 years.

The database function elementGet() is invoked which looks like this:

CREATE OR REPLACE FUNCTION elementGet (text) RETURNS SETOF element_type AS $$
   SELECT id,
          name::text,
          directory_file_flag::text,
          status::text,
          case when IsCategory(Pathname_ID($1)) IS NULL THEN FALSE ELSE TRUE END,
          case when IsPort(    Pathname_ID($1)) IS NULL THEN FALSE ELSE TRUE END,
          element_pathname(id)
     FROM element
    WHERE id = PathName_ID($1);
$$ LANGUAGE SQL STABLE;

The key in this function is PathName_ID() which is used three times and uniquely identifies a single row in the element table.

Element table

The element table is an implementation of a filesystem using a self-referential solution (i.e. it is recursive).

The table (without constraints, references, and triggers) looks like this:

freshports.dev=# \d element
                                     Table "public.element"
       Column        |     Type     | Collation | Nullable |               Default               
---------------------+--------------+-----------+----------+-------------------------------------
 id                  | integer      |           | not null | nextval('element_id_seq'::regclass)
 name                | text         |           | not null | 
 parent_id           | integer      |           |          | 
 directory_file_flag | character(1) |           | not null | 
 status              | character(1) |           | not null | 
Indexes:
    "element_pkey" PRIMARY KEY, btree (id)
    "element_delete" btree (status) WHERE status = 'D'::bpchar
    "element_name" btree (name)
    "element_parent_id" btree (parent_id)

An entry in the top level directory will have parent_id = NULL.

Pathname_ID function

Pathname_ID() is a function which progresses along a supplied pathname from left to right following the each entry from the element table.

CREATE OR REPLACE FUNCTION Pathname_ID(text) returns int4 AS $$

   DECLARE
      in_element_pathname ALIAS FOR $1;

      debug_str         text;
      element_id        int4;
      element_parent_id int4;
      element_name      text;
      element_pathname  text;
      position          int4;
      slash             text;

begin
  slash            := '/';
  element_pathname := in_element_pathname;
  element_id       := -1;

  IF substr(element_pathname,1,1) = slash THEN
    element_pathname := substr(element_pathname, 2);
  END IF;

  position := strpos(element_pathname, slash);
  IF position = 0 THEN
    element_name     := element_pathname;
    element_pathname := '';
  ELSE
    element_name     := substr(element_pathname, 1, position-1);
    element_pathname := substr(element_pathname, position+1);
  END IF;

  select id 
    into element_parent_id
    from element 
   where element.name = element_name
     and parent_id is null;

  debug_str := debug_str || '1 = ' || element_name;

  WHILE(char_length(element_pathname) > 0) LOOP
    position := strpos(element_pathname, slash);
    if position = 0 then
       element_name     := element_pathname;
       element_pathname := '';
    else
       element_name     := substr(element_pathname, 1, position-1);
       element_pathname := substr(element_pathname, position+1);
    end if;

    debug_str := debug_str || '1 = ' || element_name;
    select id 
      into element_parent_id
      from element 
     where element.name = element_name 
       and parent_id    = element_parent_id;

    debug_str := debug_str || ' 2 = ' || element_name;
    debug_str := debug_str || ' 3 = ' || element_parent_id;
  end loop;

  return element_parent_id;
END;
$$  LANGUAGE 'plpgsql' STABLE;

The key lines are 35 and 54. To meet the goal of a case insensitive match, those operators would need to be changed from = to a case insensitive comparison, such as:

where lower(element.name) = lower(element_name)

I do not want to alter this code which is so far down the tree. Instead, I will create a new function which will take a different approach. It will also take advantage of new database features which did not exist 8 years ago, let along 15+ years ago.

Testing new queries

I did some playing around with queries today, looking to find the element_id for a given port, in our example sysutils/py-mqttwarn.

I started off with this, show me things named x11:

freshports.stage=# select *, element_pathname(id) from element where lower(name) = lower('X11');
   id    | name | parent_id | directory_file_flag | status |                    element_pathname                    
---------+------+-----------+---------------------+--------+--------------------------------------------------------
   43426 | x11  |     33542 | F                   | A      | /ports/head/misc/kdeutils11/x11
   52955 | x11  |      4760 | F                   | A      | /ports/head/misc/kdeutils2/x11
   58887 | x11  |     50378 | F                   | A      | /ports/head/textproc/xml-i18n-tools/x11
   58962 | x11  |      3582 | F                   | A      | /ports/head/www/gtkhtml/x11
   83338 | x11  |     77344 | D                   | A      | /doc/head/en_US.ISO8859-1/books/handbook/x11
  104943 | x11  |    104814 | D                   | A      | /doc/head/fr_FR.ISO8859-1/books/handbook/x11
   99879 | x11  |     86040 | D                   | A      | /doc/head/ja_JP.eucJP/books/handbook/x11
  200645 | x11  |    200550 | D                   | A      | /doc/head/nl_NL.ISO8859-1/books/handbook/x11
  110618 | x11  |     77691 | D                   | A      | /doc/head/de_DE.ISO8859-1/books/handbook/x11
  112293 | x11  |    112148 | D                   | A      | /doc/head/it_IT.ISO8859-15/books/handbook/x11
  113146 | X11  |     77283 | D                   | A      | /base/head/release/scripts/X11
  200746 | x11  |    200655 | D                   | A      | /doc/head/zh_CN.GB2312/books/handbook/x11
  171234 | x11  |    110131 | D                   | A      | /doc/head/ru_RU.KOI8-R/books/handbook/x11
  271292 | x11  |    271177 | D                   | A      | /doc/head/mn_MN.UTF-8/books/handbook/x11
  200939 | x11  |    110120 | D                   | A      | /doc/head/es_ES.ISO8859-1/books/handbook/x11
  234867 | x11  |    234757 | D                   | A      | /doc/head/zh_TW.Big5/books/handbook/x11
  255132 | x11  |    255022 | D                   | A      | /doc/head/pl_PL.ISO8859-2/books/handbook/x11
  288692 | x11  |    288608 | D                   | A      | /doc/head/el_GR.ISO8859-7/books/handbook/x11
  300947 | x11  |    300813 | D                   | A      | /doc/head/hu_HU.ISO8859-2/books/handbook/x11
  433144 | x11  |    433041 | D                   | A      | /ru_RU.KOI8-R/books/handbook/x11
  433549 | x11  |    433544 | D                   | A      | /mn_MN.UTF-8/books/handbook/x11
  434309 | x11  |    433663 | D                   | A      | /ja_JP.eucJP/books/handbook/x11
  439412 | x11  |    432543 | D                   | A      | /en_US.ISO8859-1/books/handbook/x11
  439459 | x11  |    433298 | D                   | A      | /nl_NL.ISO8859-1/books/handbook/x11
  457198 | x11  |    111733 | D                   | A      | /doc/head/pt_BR.ISO8859-1/books/handbook/x11
     231 | x11  |    464087 | D                   | A      | /ports/head/x11
  465674 | x11  |    465603 | D                   | A      | /doc/translations/nl_NL.ISO8859-1/books/handbook/x11
  501886 | x11  |    465481 | D                   | A      | /doc/translations/en_US.ISO8859-1/books/handbook/x11
  537516 | x11  |    537439 | D                   | A      | /doc/head/zh_CN.UTF-8/books/handbook/x11
  663947 | x11  |    663946 | D                   | A      | /doc/release/10.2.0/en_US.ISO8859-1/books/handbook/x11
  587542 | x11  |    587471 | D                   | D      | /doc/head/zh_TW.UTF-8/books/handbook/x11
  745067 | x11  |    743628 | D                   | A      | /ports/branches/2016Q4/x11
  770464 | x11  |    763349 | D                   | A      | /ports/branches/2017Q1/x11
  792524 | x11  |    790218 | D                   | A      | /ports/branches/2017Q2/x11
  812565 | x11  |    812271 | D                   | A      | /ports/branches/2017Q3/x11
  829995 | x11  |    829562 | D                   | A      | /ports/branches/2017Q4/x11
  870434 | x11  |    867105 | D                   | A      | /ports/branches/2018Q2/x11
  882372 | x11  |    881009 | D                   | A      | /ports/branches/2018Q3/x11
  901665 | x11  |    901631 | D                   | A      | /ports/branches/2018Q4/x11
  914724 | x11  |    914354 | D                   | A      | /ports/branches/2019Q1/x11
  936002 | x11  |    935538 | D                   | A      | /ports/branches/2019Q2/x11
  954616 | x11  |    949697 | D                   | A      | /ports/branches/2019Q3/x11
  971674 | x11  |    961357 | D                   | A      | /ports/branches/2019Q4/x11
  978479 | x11  |    978429 | D                   | A      | /ports/branches/2020Q1/x11
 1014222 | x11  |   1012899 | D                   | A      | /ports/branches/2020Q2/x11
 1055981 | x11  |   1053714 | D                   | A      | /ports/branches/2020Q3/x11
(46 rows)

Based on line 29, show me a child of that element named libX11:

freshports.stage=# select *, element_pathname(id) from element where lower(name) = lower('libX11') and parent_id = 231;
   id   |  name  | parent_id | directory_file_flag | status |    element_pathname    
--------+--------+-----------+---------------------+--------+------------------------
 163623 | libX11 |       231 | D                   | A      | /ports/head/x11/libX11
(1 row)

freshports.stage=# 

That’s the one.

Using this approach, I came up with few queries, which I later abandoned for another solution.

The abandoned queries

I tried this join of four tables:

freshports.stage=# explain analyse
freshports.stage-# SELECT E4.id, element_pathname(E4.id)
freshports.stage-#   FROM element E1, element E2, element E3, element E4
freshports.stage-#  WHERE E1.parent_id IS NULL
freshports.stage-#    AND E2.parent_id = E1.id
freshports.stage-#    AND E3.parent_id = E2.id
freshports.stage-#    AND E4.parent_id = E3.id
freshports.stage-#    AND E2.name = 'head'
freshports.stage-#    AND E1.name = 'ports'
freshports.stage-#    AND lower(E3.name) = lower('sysutils')
freshports.stage-#    AND lower(E4.name) = lower('py-mqttwarn');
                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=20.05..74.95 rows=1 width=36) (actual time=1.833..2.299 rows=1 loops=1)
   ->  Nested Loop  (cost=19.62..67.62 rows=1 width=4) (actual time=0.126..0.138 rows=1 loops=1)
         ->  Nested Loop  (cost=19.19..27.23 rows=1 width=4) (actual time=0.061..0.063 rows=1 loops=1)
               ->  Bitmap Heap Scan on element e1  (cost=9.76..13.77 rows=1 width=4) (actual time=0.034..0.035 rows=1 loops=1)
                     Recheck Cond: ((name = 'ports'::text) AND (parent_id IS NULL))
                     Heap Blocks: exact=1
                     ->  BitmapAnd  (cost=9.76..9.76 rows=1 width=0) (actual time=0.030..0.030 rows=0 loops=1)
                           ->  Bitmap Index Scan on element_name  (cost=0.00..4.56 rows=17 width=0) (actual time=0.016..0.016 rows=38 loops=1)
                                 Index Cond: (name = 'ports'::text)
                           ->  Bitmap Index Scan on element_parent_id  (cost=0.00..4.95 rows=70 width=0) (actual time=0.010..0.010 rows=100 loops=1)
                                 Index Cond: (parent_id IS NULL)
               ->  Bitmap Heap Scan on element e2  (cost=9.43..13.45 rows=1 width=8) (actual time=0.023..0.024 rows=1 loops=1)
                     Recheck Cond: ((name = 'head'::text) AND (parent_id = e1.id))
                     Heap Blocks: exact=1
                     ->  BitmapAnd  (cost=9.43..9.43 rows=1 width=0) (actual time=0.020..0.020 rows=0 loops=1)
                           ->  Bitmap Index Scan on element_name  (cost=0.00..4.56 rows=17 width=0) (actual time=0.009..0.009 rows=15 loops=1)
                                 Index Cond: (name = 'head'::text)
                           ->  Bitmap Index Scan on element_parent_id  (cost=0.00..4.62 rows=26 width=0) (actual time=0.008..0.008 rows=5 loops=1)
                                 Index Cond: (parent_id = e1.id)
         ->  Index Scan using element_parent_id on element e3  (cost=0.43..40.38 rows=1 width=8) (actual time=0.064..0.074 rows=1 loops=1)
               Index Cond: (parent_id = e2.id)
               Filter: (lower(name) = 'sysutils'::text)
               Rows Removed by Filter: 88
   ->  Index Scan using element_parent_id on element e4  (cost=0.43..7.07 rows=1 width=8) (actual time=1.540..1.994 rows=1 loops=1)
         Index Cond: (parent_id = e3.id)
         Filter: (lower(name) = 'py-mqttwarn'::text)
         Rows Removed by Filter: 2380
 Planning Time: 0.689 ms
 Execution Time: 2.352 ms
(29 rows)

freshports.stage=# 

I tried rewriting with a JOIN JOIN approach, but that’s really just the previous query:

freshports.stage=# -- find a port which matches this
freshports.stage=# -- case insensitive
freshports.stage=# 
freshports.stage=# 
freshports.stage=# explain analyse
freshports.stage-# SELECT E4.id, element_pathname(E4.id)
freshports.stage-#   FROM element E1
freshports.stage-#   JOIN element E2 ON E1.parent_id IS NULL AND E1.name = 'ports'
freshports.stage-#                  AND E2.parent_id = E1.id AND E2.name = 'head'
freshports.stage-#   JOIN element E3 ON E3.parent_id = E2.id AND lower(E3.name) = lower('sysutils')
freshports.stage-#   JOIN element E4 ON E4.parent_id = E3.id AND lower(E4.name) = lower('bacula9-server');
                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=20.05..74.95 rows=1 width=36) (actual time=3.729..4.788 rows=1 loops=1)
   ->  Nested Loop  (cost=19.62..67.62 rows=1 width=4) (actual time=0.278..0.309 rows=1 loops=1)
         ->  Nested Loop  (cost=19.19..27.23 rows=1 width=4) (actual time=0.129..0.132 rows=1 loops=1)
               ->  Bitmap Heap Scan on element e1  (cost=9.76..13.77 rows=1 width=4) (actual time=0.072..0.073 rows=1 loops=1)
                     Recheck Cond: ((name = 'ports'::text) AND (parent_id IS NULL))
                     Heap Blocks: exact=1
                     ->  BitmapAnd  (cost=9.76..9.76 rows=1 width=0) (actual time=0.063..0.063 rows=0 loops=1)
                           ->  Bitmap Index Scan on element_name  (cost=0.00..4.56 rows=17 width=0) (actual time=0.032..0.033 rows=38 loops=1)
                                 Index Cond: (name = 'ports'::text)
                           ->  Bitmap Index Scan on element_parent_id  (cost=0.00..4.95 rows=70 width=0) (actual time=0.023..0.023 rows=100 loops=1)
                                 Index Cond: (parent_id IS NULL)
               ->  Bitmap Heap Scan on element e2  (cost=9.43..13.45 rows=1 width=8) (actual time=0.047..0.049 rows=1 loops=1)
                     Recheck Cond: ((name = 'head'::text) AND (parent_id = e1.id))
                     Heap Blocks: exact=1
                     ->  BitmapAnd  (cost=9.43..9.43 rows=1 width=0) (actual time=0.041..0.041 rows=0 loops=1)
                           ->  Bitmap Index Scan on element_name  (cost=0.00..4.56 rows=17 width=0) (actual time=0.019..0.019 rows=15 loops=1)
                                 Index Cond: (name = 'head'::text)
                           ->  Bitmap Index Scan on element_parent_id  (cost=0.00..4.62 rows=26 width=0) (actual time=0.016..0.016 rows=5 loops=1)
                                 Index Cond: (parent_id = e1.id)
         ->  Index Scan using element_parent_id on element e3  (cost=0.43..40.38 rows=1 width=8) (actual time=0.147..0.173 rows=1 loops=1)
               Index Cond: (parent_id = e2.id)
               Filter: (lower(name) = 'sysutils'::text)
               Rows Removed by Filter: 88
   ->  Index Scan using element_parent_id on element e4  (cost=0.43..7.07 rows=1 width=8) (actual time=3.097..4.125 rows=1 loops=1)
         Index Cond: (parent_id = e3.id)
         Filter: (lower(name) = 'bacula9-server'::text)
         Rows Removed by Filter: 2380
 Planning Time: 1.608 ms
 Execution Time: 4.898 ms
(29 rows)

The next query involves passing the parent id down to the next level:

freshports.stage=# -- find a port which matches this
freshports.stage=# -- case insensitive
freshports.stage=# 
freshports.stage=# explain analyse
freshports.stage-# SELECT E4.ID, element_pathname(E4.id)
freshports.stage-#   FROM element E4
freshports.stage-#  WHERE lower(E4.name) = lower('py37-mqttwarn') AND E4.parent_id = (
freshports.stage(# SELECT E3.ID
freshports.stage(#   FROM element E3
freshports.stage(#  WHERE lower(E3.name) = lower('sysutils') AND E3.parent_id = (
freshports.stage(# SELECT E2.ID
freshports.stage(#   FROM element E2
freshports.stage(#  WHERE E2.name = 'head' AND E2.parent_id = (
freshports.stage(# SELECT E1.id
freshports.stage(#   FROM element E1 
freshports.stage(#  WHERE E1.parent_id IS NULL AND E1.name = 'ports')));
                                                                        QUERY PLAN                                                                         
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using element_parent_id on element e4  (cost=68.93..110.05 rows=1 width=36) (actual time=1.902..1.902 rows=0 loops=1)
   Index Cond: (parent_id = $2)
   Filter: (lower(name) = 'py37-mqttwarn'::text)
   Rows Removed by Filter: 2381
   InitPlan 3 (returns $2)
     ->  Index Scan using element_parent_id on element e3  (cost=27.64..68.51 rows=1 width=4) (actual time=0.121..0.134 rows=1 loops=1)
           Index Cond: (parent_id = $1)
           Filter: (lower(name) = 'sysutils'::text)
           Rows Removed by Filter: 88
           InitPlan 2 (returns $1)
             ->  Bitmap Heap Scan on element e2  (cost=23.20..27.21 rows=1 width=4) (actual time=0.061..0.061 rows=1 loops=1)
                   Recheck Cond: ((name = 'head'::text) AND (parent_id = $0))
                   Heap Blocks: exact=1
                   InitPlan 1 (returns $0)
                     ->  Bitmap Heap Scan on element e1  (cost=9.76..13.77 rows=1 width=4) (actual time=0.032..0.032 rows=1 loops=1)
                           Recheck Cond: ((name = 'ports'::text) AND (parent_id IS NULL))
                           Heap Blocks: exact=1
                           ->  BitmapAnd  (cost=9.76..9.76 rows=1 width=0) (actual time=0.027..0.027 rows=0 loops=1)
                                 ->  Bitmap Index Scan on element_name  (cost=0.00..4.56 rows=17 width=0) (actual time=0.011..0.011 rows=38 loops=1)
                                       Index Cond: (name = 'ports'::text)
                                 ->  Bitmap Index Scan on element_parent_id  (cost=0.00..4.95 rows=70 width=0) (actual time=0.013..0.013 rows=100 loops=1)
                                       Index Cond: (parent_id IS NULL)
                   ->  BitmapAnd  (cost=9.43..9.43 rows=1 width=0) (actual time=0.058..0.058 rows=0 loops=1)
                         ->  Bitmap Index Scan on element_name  (cost=0.00..4.56 rows=17 width=0) (actual time=0.016..0.016 rows=15 loops=1)
                               Index Cond: (name = 'head'::text)
                         ->  Bitmap Index Scan on element_parent_id  (cost=0.00..4.62 rows=26 width=0) (actual time=0.040..0.040 rows=5 loops=1)
                               Index Cond: (parent_id = $0)
 Planning Time: 0.235 ms
 Execution Time: 1.945 ms

The chosen query

The chosen query uses a relatively new table: element_pathname

freshports.stage=# \d element_pathname
            Table "public.element_pathname"
   Column   |  Type   | Collation | Nullable | Default 
------------+---------+-----------+----------+---------
 element_id | integer |           | not null | 
 pathname   | text    |           | not null | 
Indexes:
    "element_pathname_pathname" UNIQUE, btree (pathname)
    "element_pathname_element_id" btree (element_id)
Foreign-key constraints:
    "element_pathname_element_id_fkey" FOREIGN KEY (element_id) REFERENCES element(id) ON DELETE CASCADE

freshports.stage=# 

This table is maintained via triggers on the element table.

The same query here is:

freshports.stage=# explain analyse select *
FROM element_pathname where lower(pathname) = '/ports/head/sysutils/py-mqttwarn';
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..19555.74 rows=5361 width=54) (actual time=158.979..160.785 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on element_pathname  (cost=0.00..18019.64 rows=2234 width=54) (actual time=145.503..156.289 rows=0 loops=3)
         Filter: (lower(pathname) = '/ports/head/sysutils/py-mqttwarn'::text)
         Rows Removed by Filter: 351214
 Planning Time: 0.483 ms
 Execution Time: 160.830 ms
(8 rows)

freshports.stage=#

Our query will always be on lower case, so let’s create an index on lower case. I am curious about the size of this index, and it added only 77MB to the database.

freshports.stage=# select pg_size_pretty(pg_indexes_size('element_pathname'));
 pg_size_pretty 
----------------
 107 MB
(1 row)

freshports.stage=# create index element_pathname_pathname_lc on element_pathname ((lower(pathname)));
CREATE INDEX
freshports.stage=# select pg_size_pretty(pg_indexes_size('element_pathname'));
 pg_size_pretty 
----------------
 184 MB
(1 row)

freshports.stage=# select count(*) from element_pathname;
  count  
---------
 1053643
(1 row)

freshports.stage=# 

Now let’s try that query again:

freshports.stage=# explain analyse select *                                                                          
FROM element_pathname where lower(pathname) = '/ports/head/sysutils/py-mqttwarn';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on element_pathname  (cost=241.38..9537.09 rows=5268 width=54) (actual time=0.133..0.133 rows=1 loops=1)
   Recheck Cond: (lower(pathname) = '/ports/head/sysutils/py-mqttwarn'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on element_pathname_pathname_lc  (cost=0.00..240.06 rows=5268 width=0) (actual time=0.108..0.108 rows=1 loops=1)
         Index Cond: (lower(pathname) = '/ports/head/sysutils/py-mqttwarn'::text)
 Planning Time: 0.087 ms
 Execution Time: 0.155 ms
(7 rows)

freshports.stage=# 

Now we’re using an index scan on that newly created lower case index.

Let’s do a few more queries to see how they go:

freshports.stage=# explain analyse select *
FROM element_pathname where lower(pathname) = '/ports/head/sysutils/bacula-server';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on element_pathname  (cost=241.38..9537.09 rows=5268 width=54) (actual time=0.050..0.051 rows=1 loops=1)
   Recheck Cond: (lower(pathname) = '/ports/head/sysutils/bacula-server'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on element_pathname_pathname_lc  (cost=0.00..240.06 rows=5268 width=0) (actual time=0.037..0.037 rows=1 loops=1)
         Index Cond: (lower(pathname) = '/ports/head/sysutils/bacula-server'::text)
 Planning Time: 0.073 ms
 Execution Time: 0.067 ms
(7 rows)
freshports.stage=# explain analyse select *
FROM element_pathname where lower(pathname) = '/ports/head/sysutils/bacula9-server';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on element_pathname  (cost=241.38..9537.09 rows=5268 width=54) (actual time=0.065..0.066 rows=1 loops=1)
   Recheck Cond: (lower(pathname) = '/ports/head/sysutils/bacula9-server'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on element_pathname_pathname_lc  (cost=0.00..240.06 rows=5268 width=0) (actual time=0.049..0.050 rows=1 loops=1)
         Index Cond: (lower(pathname) = '/ports/head/sysutils/bacula9-server'::text)
 Planning Time: 0.086 ms
 Execution Time: 0.088 ms
(7 rows)

freshports.stage=# 
freshports.stage=# explain analyse select *
FROM element_pathname where lower(pathname) = '/ports/head/lang/gcc10-devel';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on element_pathname  (cost=241.38..9537.09 rows=5268 width=54) (actual time=0.132..0.132 rows=1 loops=1)
   Recheck Cond: (lower(pathname) = '/ports/head/lang/gcc10-devel'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on element_pathname_pathname_lc  (cost=0.00..240.06 rows=5268 width=0) (actual time=0.116..0.117 rows=1 loops=1)
         Index Cond: (lower(pathname) = '/ports/head/lang/gcc10-devel'::text)
 Planning Time: 0.090 ms
 Execution Time: 0.155 ms
(7 rows)

freshports.stage=# 
freshports.stage=# explain analyse select *
FROM element_pathname where lower(pathname) = '/ports/head/lang/perl5.32';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on element_pathname  (cost=241.38..9537.09 rows=5268 width=54) (actual time=0.059..0.060 rows=1 loops=1)
   Recheck Cond: (lower(pathname) = '/ports/head/lang/perl5.32'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on element_pathname_pathname_lc  (cost=0.00..240.06 rows=5268 width=0) (actual time=0.044..0.044 rows=1 loops=1)
         Index Cond: (lower(pathname) = '/ports/head/lang/perl5.32'::text)
 Planning Time: 0.085 ms
 Execution Time: 0.081 ms
(7 rows)

Yes, I think this will do nicely.

Things change

There is code and tables in here which date back to late 1999 or early 2000. Since then, the database has grown from 15 tables in Nov 2001 to 80 tables today.

I’m not sure when this svg was created or how many tables it represents, but I know its a few years old now.

Database and code evolve over time and it is sometimes interesteing to see the paths they take.

Apr 162020
 

This post is another in the packages series. It documents how the data is transformed from the raw form (pacakges_raw) into normalized data (packages).

The Packages – how is this data stored? post may be useful reading.

The primary purpose of this blog post is documentation of the above mentioned transformation process. Along the way I found a performance enhancement owing to a deferred constraint. That’s why this post is tagged with PostgreSQL.

Some data background

When data is copied into the package_raw table, two columns are left empty:

  1. abi_id
  2. port_id

Those values can be determined based on columns which are supplied:

  1. abi
  2. package_origin

Here is some sample data:

freshports.dev=# SELECT * FROM packages_raw WHERE abi = 'FreeBSD:13:i386' AND package_set = 'latest' ORDER BY package_origin LIMIT 10;
   id    |         package_origin         |   package_name   | package_version |       abi       | abi_id | port_id | package_set 
---------+--------------------------------+------------------+-----------------+-----------------+--------+---------+-------------
 2749964 | accessibility/accerciser       | accerciser       | 3.22.0          | FreeBSD:13:i386 |        |         | latest
 2749124 | accessibility/at-spi2-atk      | at-spi2-atk      | 2.26.2          | FreeBSD:13:i386 |        |         | latest
 2749119 | accessibility/at-spi2-core     | at-spi2-core     | 2.28.0          | FreeBSD:13:i386 |        |         | latest
 2749003 | accessibility/atk              | atk              | 2.28.1          | FreeBSD:13:i386 |        |         | latest
 2748993 | accessibility/atkmm            | atkmm            | 2.24.2_4        | FreeBSD:13:i386 |        |         | latest
 2747796 | accessibility/caribou          | caribou          | 0.4.21_2        | FreeBSD:13:i386 |        |         | latest
 2760503 | accessibility/eflite           | eflite           | 0.4.1           | FreeBSD:13:i386 |        |         | latest
 2744976 | accessibility/kdeaccessibility | kdeaccessibility | 19.12.3         | FreeBSD:13:i386 |        |         | latest
 2756892 | accessibility/kmag             | kmag             | 19.12.3         | FreeBSD:13:i386 |        |         | latest
 2756793 | accessibility/kmousetool       | kmousetool       | 19.12.3_1       | FreeBSD:13:i386 |        |         | latest
(10 rows)

freshports.dev=# 

This date will eventually be copied into the packages table, but we need to populate the two empty columns first.

What should the data look like?

My first goal was to create a result set resembled the desired outcome.

        SELECT P.id,
               abi.name,
               abi.id AS abi_id,
               P.package_origin,
               PO.port_id,
               P.package_name,
               P.package_version,
               P.package_set
          FROM packages_raw P 
          JOIN abi on P.abi = abi.name
          JOIN ports_origin PO on PO.port_origin = P.package_origin
         WHERE P.abi         = 'FreeBSD:13:i386'
           AND P.package_set = 'latest'
      ORDER BY package_origin 
         LIMIT 10;
   id    |      name       | abi_id |         package_origin         | port_id |   package_name   | package_version | package_set 
---------+-----------------+--------+--------------------------------+---------+------------------+-----------------+-------------
 2749964 | FreeBSD:13:i386 |     10 | accessibility/accerciser       |   21760 | accerciser       | 3.22.0          | latest
 2749124 | FreeBSD:13:i386 |     10 | accessibility/at-spi2-atk      |   26871 | at-spi2-atk      | 2.26.2          | latest
 2749119 | FreeBSD:13:i386 |     10 | accessibility/at-spi2-core     |   26868 | at-spi2-core     | 2.28.0          | latest
 2749003 | FreeBSD:13:i386 |     10 | accessibility/atk              |   11601 | atk              | 2.28.1          | latest
 2748993 | FreeBSD:13:i386 |     10 | accessibility/atkmm            |   27818 | atkmm            | 2.24.2_4        | latest
 2747796 | FreeBSD:13:i386 |     10 | accessibility/caribou          |   35704 | caribou          | 0.4.21_2        | latest
 2760503 | FreeBSD:13:i386 |     10 | accessibility/eflite           |   22197 | eflite           | 0.4.1           | latest
 2744976 | FreeBSD:13:i386 |     10 | accessibility/kdeaccessibility |   11726 | kdeaccessibility | 19.12.3         | latest
 2756892 | FreeBSD:13:i386 |     10 | accessibility/kmag             |   31155 | kmag             | 19.12.3         | latest
 2756793 | FreeBSD:13:i386 |     10 | accessibility/kmousetool       |   31156 | kmousetool       | 19.12.3_1       | latest
(10 rows)

freshports.dev=# 

The ports_orgin is a new table created specifically for this query. Until now, there has been no need to obtain a port_id based upon its origin. This table will be updated via triggers upon the ports table. That bit of work has been deferred for now.

freshports.dev=# \d ports_origin
              Table "public.ports_origin"
   Column    |  Type   | Collation | Nullable | Default 
-------------+---------+-----------+----------+---------
 port_id     | integer |           | not null | 
 port_origin | text    |           | not null | 
Indexes:
    "port_origin_pkey" PRIMARY KEY, btree (port_id)
    "ports_origin_port_id_idx" btree (port_id)
    "ports_origin_port_origin_idx" btree (port_origin)
Foreign-key constraints:
    "ports_origin_port_id_fk" FOREIGN KEY (port_id) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE NOT VALID

The original test

The original test took 18 seconds to populate about 30,000 rows:

explain analyse
UPDATE packages_raw PR
   SET abi_id  = pkg.abi_id,
       port_id = pkg.port_id
  FROM (SELECT P.id,
               abi.name,
               abi.id AS abi_id,
               P.package_origin,
               PO.port_id,
               P.package_name,
               P.package_version,
               P.package_set
          FROM packages_raw P 
          JOIN abi on P.abi = abi.name
          JOIN ports_origin PO on PO.port_origin = P.package_origin
         WHERE P.abi        = 'FreeBSD:13:i386'
           AND P.package_set = 'latest') AS pkg
 WHERE PR.id = pkg.id;

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on packages_raw pr  (cost=11144.47..40827.87 rows=30175 width=115) (actual time=11081.844..11081.844 rows=0 loops=1)
   ->  Hash Join  (cost=11144.47..40827.87 rows=30175 width=115) (actual time=2880.504..4680.566 rows=38905 loops=1)
         Hash Cond: (p.package_origin = po.port_origin)
         ->  Nested Loop  (cost=9051.81..36627.53 rows=22050 width=113) (actual time=2659.534..3858.975 rows=30492 loops=1)
               ->  Seq Scan on abi  (cost=0.00..1.04 rows=1 width=42) (actual time=0.344..0.351 rows=1 loops=1)
                     Filter: (name = 'FreeBSD:13:i386'::text)
                     Rows Removed by Filter: 8
               ->  Hash Join  (cost=9051.81..36405.99 rows=22050 width=120) (actual time=2659.181..3844.350 rows=30492 loops=1)
                     Hash Cond: (pr.id = p.id)
                     ->  Seq Scan on packages_raw pr  (cost=0.00..12452.32 rows=524132 width=77) (actual time=0.370..2365.531 rows=459733 loops=1)
                     ->  Hash  (cost=8560.18..8560.18 rows=22050 width=51) (actual time=143.257..143.258 rows=30492 loops=1)
                           Buckets: 16384  Batches: 4  Memory Usage: 767kB
                           ->  Bitmap Heap Scan on packages_raw p  (cost=1018.44..8560.18 rows=22050 width=51) (actual time=86.264..117.885 rows=30492 loops=1)
                                 Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
                                 Heap Blocks: exact=416
                                 ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..1012.92 rows=22050 width=0) (actual time=85.918..85.918 rows=66670 loops=1)
                                       Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
         ->  Hash  (cost=974.07..974.07 rows=57807 width=30) (actual time=220.029..220.029 rows=57807 loops=1)
               Buckets: 16384  Batches: 8  Memory Usage: 576kB
               ->  Seq Scan on ports_origin po  (cost=0.00..974.07 rows=57807 width=30) (actual time=0.458..172.482 rows=57807 loops=1)
 Planning Time: 7.125 ms
 Trigger for constraint packages_raw_abi_id_fk: time=1151.252 calls=30474
 Trigger for constraint packages_raw_port_id_fk: time=6532.501 calls=30474
 Execution Time: 18782.602 ms

See https://explain.depesz.com/s/6d9Z for more detail.

I noticed all the time spent checking foreign keys while updating 30474 rows. I figured I could save perhaps most of that nearly 7 seconds spent on constraint.

I thought we could defer that. I changed the packages_raw_port_id_fk constraint to be DEFERRABLE INITIALLY DEFERRED. I also added ON UPDATE CASCADE and ON DELETE CASCADE which should have been there from the start (but that’s not relevant to the change in performance).

Test number 2

On the second try, we get:

Update on packages_raw pr  (cost=12937.84..38416.01 rows=20524 width=115) (actual time=12221.244..12221.244 rows=0 loops=1)
   ->  Hash Join  (cost=12937.84..38416.01 rows=20524 width=115) (actual time=2772.498..6044.554 rows=38905 loops=1)
         Hash Cond: (p.package_origin = po.port_origin)
         ->  Nested Loop  (cost=10845.18..34762.92 rows=14998 width=113) (actual time=2552.187..5468.500 rows=30492 loops=1)
               ->  Seq Scan on abi  (cost=0.00..1.04 rows=1 width=42) (actual time=0.335..0.338 rows=1 loops=1)
                     Filter: (name = 'FreeBSD:13:i386'::text)
                     Rows Removed by Filter: 8
               ->  Hash Join  (cost=10845.18..34611.91 rows=14998 width=120) (actual time=2551.846..5455.815 rows=30492 loops=1)
                     Hash Cond: (pr.id = p.id)
                     ->  Seq Scan on packages_raw pr  (cost=0.00..12835.28 rows=384928 width=77) (actual time=0.337..2901.118 rows=459733 loops=1)
                     ->  Hash  (cost=10510.70..10510.70 rows=14998 width=51) (actual time=171.243..171.244 rows=30492 loops=1)
                           Buckets: 16384 (originally 16384)  Batches: 4 (originally 2)  Memory Usage: 897kB
                           ->  Bitmap Heap Scan on packages_raw p  (cost=978.15..10510.70 rows=14998 width=51) (actual time=100.492..143.901 rows=30492 loops=1)
                                 Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
                                 Heap Blocks: exact=416
                                 ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..974.40 rows=14998 width=0) (actual time=100.172..100.173 rows=60966 loops=1)
                                       Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
         ->  Hash  (cost=974.07..974.07 rows=57807 width=30) (actual time=219.460..219.460 rows=57807 loops=1)
               Buckets: 16384  Batches: 8  Memory Usage: 576kB
               ->  Seq Scan on ports_origin po  (cost=0.00..974.07 rows=57807 width=30) (actual time=0.459..172.605 rows=57807 loops=1)
 Planning Time: 9.476 ms
 Trigger for constraint packages_raw_abi_id_fk: time=363.506 calls=30474
 Execution Time: 12589.008 ms

Roughly 5 seconds of savings. See https://explain.depesz.com/s/Y9gj.

OK, let’s do the same for abi_id.

3rd time

After setting packages_raw_abi_id_fk to DEFERRABLE INITIALLY DEFERRED, the 3rd test results were:

Update on packages_raw pr  (cost=12937.84..38416.01 rows=20524 width=115) (actual time=8798.389..8798.389 rows=0 loops=1)
   ->  Hash Join  (cost=12937.84..38416.01 rows=20524 width=115) (actual time=2841.848..4960.062 rows=38905 loops=1)
         Hash Cond: (p.package_origin = po.port_origin)
         ->  Nested Loop  (cost=10845.18..34762.92 rows=14998 width=113) (actual time=2620.392..3970.915 rows=30492 loops=1)
               ->  Seq Scan on abi  (cost=0.00..1.04 rows=1 width=42) (actual time=0.343..0.350 rows=1 loops=1)
                     Filter: (name = 'FreeBSD:13:i386'::text)
                     Rows Removed by Filter: 8
               ->  Hash Join  (cost=10845.18..34611.91 rows=14998 width=120) (actual time=2620.040..3956.973 rows=30492 loops=1)
                     Hash Cond: (pr.id = p.id)
                     ->  Seq Scan on packages_raw pr  (cost=0.00..12835.28 rows=384928 width=77) (actual time=0.340..2796.802 rows=459733 loops=1)
                     ->  Hash  (cost=10510.70..10510.70 rows=14998 width=51) (actual time=167.195..167.196 rows=30492 loops=1)
                           Buckets: 16384 (originally 16384)  Batches: 4 (originally 2)  Memory Usage: 897kB
                           ->  Bitmap Heap Scan on packages_raw p  (cost=978.15..10510.70 rows=14998 width=51) (actual time=110.897..140.914 rows=30492 loops=1)
                                 Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
                                 Heap Blocks: exact=416
                                 ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..974.40 rows=14998 width=0) (actual time=110.419..110.420 rows=91440 loops=1)
                                       Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
         ->  Hash  (cost=974.07..974.07 rows=57807 width=30) (actual time=220.662..220.662 rows=57807 loops=1)
               Buckets: 16384  Batches: 8  Memory Usage: 576kB
               ->  Seq Scan on ports_origin po  (cost=0.00..974.07 rows=57807 width=30) (actual time=0.446..173.184 rows=57807 loops=1)
 Planning Time: 4.560 ms
 Execution Time: 8800.805 ms

That’s nearly 10 seconds faster. Sure, there might be some caching involved but the time spent checking foreign keys is greatly reduced.

See https://explain.depesz.com/s/XXnj.

Next, I noticed the sequential scan on ports_origin. Let’s see about that.

Yeah, that self-join

The self-join was noticed by xocolatl as being unnecessary. Let’s remove that to get:

UPDATE packages_raw PR
   SET abi_id  = abi.id,
       port_id = PO.port_id
  FROM abi, ports_origin PO
 WHERE PR.abi         = 'FreeBSD:13:i386'
   AND PR.package_set = 'latest'
   AND PR.abi         = abi.name
   AND PO.port_origin = PR.package_origin;

                                                                                QUERY PLAN                                                                                
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on packages_raw pr  (cost=3081.65..12658.10 rows=21940 width=109) (actual time=6183.437..6183.437 rows=0 loops=1)
   ->  Hash Join  (cost=3081.65..12658.10 rows=21940 width=109) (actual time=418.091..845.851 rows=38905 loops=1)
         Hash Cond: (pr.package_origin = po.port_origin)
         ->  Nested Loop  (cost=989.00..9048.43 rows=16056 width=87) (actual time=174.079..268.378 rows=30492 loops=1)
               ->  Seq Scan on abi  (cost=0.00..1.04 rows=1 width=42) (actual time=0.030..0.036 rows=1 loops=1)
                     Filter: (name = 'FreeBSD:13:i386'::text)
                     Rows Removed by Filter: 8
               ->  Bitmap Heap Scan on packages_raw pr  (cost=989.00..8886.84 rows=16056 width=77) (actual time=174.032..256.085 rows=30492 loops=1)
                     Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
                     Heap Blocks: exact=416
                     ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..984.98 rows=16056 width=0) (actual time=173.458..173.458 rows=121903 loops=1)
                           Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
         ->  Hash  (cost=974.07..974.07 rows=57807 width=30) (actual time=243.062..243.062 rows=57807 loops=1)
               Buckets: 16384  Batches: 8  Memory Usage: 576kB
               ->  Seq Scan on ports_origin po  (cost=0.00..974.07 rows=57807 width=30) (actual time=0.451..195.454 rows=57807 loops=1)
 Planning Time: 2.922 ms
 Execution Time: 6186.449 ms
(17 rows)

Well, isn’t that better?

See https://explain.depesz.com/s/cnyw for details.

That sequential scan on ports_origin takes up 195ms. Nothing to worry about.

Things to follow up on

This query doesn’t find values for all the FreeBSD:13:i386 * latest combination. Some are not matched.

freshports.dev=# SELECT * FROM packages_raw WHERE port_id IS NULL AND abi = 'FreeBSD:13:i386' AND package_set = 'latest' ORDER BY package_origin;
   id    |            package_origin            |          package_name          | package_version |       abi       | abi_id | port_id | package_set 
---------+--------------------------------------+--------------------------------+-----------------+-----------------+--------+---------+-------------
 2745970 | databases/rubygem-peek-mysql2-rails5 | rubygem-peek-mysql2-rails5     | 1.2.0           | FreeBSD:13:i386 |        |         | latest
 2740242 | devel/p5-File-ShareDir-Tarball       | p5-File-ShareDir-Tarball       | 0.2.2           | FreeBSD:13:i386 |        |         | latest
 2741383 | devel/p5-MooX-Singleton              | p5-MooX-Singleton              | 1.20            | FreeBSD:13:i386 |        |         | latest
 2740542 | devel/p5-MooseX-MungeHas             | p5-MooseX-MungeHas             | 0.011           | FreeBSD:13:i386 |        |         | latest
 2733133 | devel/py-aspy.yaml                   | py27-aspy.yaml                 | 1.3.0           | FreeBSD:13:i386 |        |         | latest
 2741040 | devel/py-aspy.yaml                   | py37-aspy.yaml                 | 1.3.0           | FreeBSD:13:i386 |        |         | latest
 2743867 | devel/py-nodeenv                     | py27-nodeenv                   | 1.3.3           | FreeBSD:13:i386 |        |         | latest
 2744182 | devel/py-nodeenv                     | py37-nodeenv                   | 1.3.3           | FreeBSD:13:i386 |        |         | latest
 2746002 | devel/rubygem-peek-gc-rails5         | rubygem-peek-gc-rails5         | 0.0.2           | FreeBSD:13:i386 |        |         | latest
 2745932 | devel/rubygem-peek-rails5            | rubygem-peek-rails5            | 1.0.1           | FreeBSD:13:i386 |        |         | latest
 2745897 | devel/rubygem-peek-rblineprof-rails5 | rubygem-peek-rblineprof-rails5 | 0.2.0           | FreeBSD:13:i386 |        |         | latest
 2745876 | devel/rubygem-peek-redis-rails5      | rubygem-peek-redis-rails5      | 1.2.0           | FreeBSD:13:i386 |        |         | latest
 2749944 | security/picosha2                    | picosha2                       | 1.0.0.23        | FreeBSD:13:i386 |        |         | latest
 2749043 | sysutils/atf-rk3399                  | atf-rk3399                     | v2.1            | FreeBSD:13:i386 |        |         | latest
 2746664 | sysutils/py-leviathan                | py27-leviathan                 | 0.1.1           | FreeBSD:13:i386 |        |         | latest
 2746885 | sysutils/py-leviathan                | py37-leviathan                 | 0.1.1           | FreeBSD:13:i386 |        |         | latest
 2737873 | sysutils/swapusage                   | swapusage                      | 1.0.0           | FreeBSD:13:i386 |        |         | latest
 2744052 | www/rubygem-gon-rails5               | rubygem-gon-rails5             | 6.2.1           | FreeBSD:13:i386 |        |         | latest
(18 rows)

This is a development database and those ports do not exist on this host. It could also be an out-of-date ports_origin table, but I haven’t checked that yet. All of those ports were added to the FreeBSD repo on these dates:

  1. February 5 2019
  2. April 15 2019
  3. May 26-29 2019

They are present in the production database. Nothing to be done here.

However, the system will have to flag such entries for later followup.