Converting a column-based expression to a row based solution

In this post I will show you how I discovered that the backend was putting data into a new table (generate_plist) but the website was still using the ports.pkg_plist column. The result was stale data being shown on the website (for older ports) or no data at all (for newer ports).

How long? Since Oct 2017, because that’s when configure_plist showed up on the website. Nobody noticed until recently. I can only guess that I made the backend changes with intent to change the frontend later. I should have created myself a ticket.

Why was the change made? Primarily because all ports, especially Python based ports, have a pkg-plist file. These ports generate the list automatically at package-built time. Secondly, it also simplifies the search of pkg-plist.

Ironically, this bug also affects search, which is still using the old column.

Background

One set of data which FreshPorts shows is the list of files installed by a given package. This information can be provided via a file, called pkg-plist.

Here is one example for the comms/gnuradio port:

freshports.dev=# select pkg_plist from ports where id = 34724;
             pkg_plist              
------------------------------------
 include/libconcord.h              +
 lib/libconcord.a                  +
 lib/libconcord.so                 +
 lib/libconcord.so.4               +
 lib/libconcord.so.4.0.0           +
 share/mime/packages/libconcord.xml
(1 row)

freshports.dev=# 

NOTE: the above is one row of output. It is a collection of strings, separated by newlines.

If visit the website, you’ll see that list is wrong, because it’s outdated.

The generate-plist command is part of the ports infrastructure. FreshPorts gathers this data with this command and the resulting file.

$ make generate-plist
$ cat work/.PLIST.mktmp

A recent blog post goes into more detail. All that goes into the configure_plist table:

freshports.dev=# \d configure_plist
Did not find any relation named "configure_plist".
freshports.dev=# \d generate_plist
                                Table "public.generate_plist"
     Column     |  Type   | Collation | Nullable |                  Default                   
----------------+---------+-----------+----------+--------------------------------------------
 id             | bigint  |           | not null | nextval('generate_plist_id_seq'::regclass)
 port_id        | integer |           | not null | 
 installed_file | text    |           | not null | 
Indexes:
    "generate_plist_installed_file_gin_idx" gin (to_tsvector('english'::regconfig, installed_file))
    "generate_plist_installed_file_idx" btree (installed_file)
    "generate_plist_pk" btree (id)
    "generate_plist_port_id_idx" btree (port_id)

freshports.dev=# 

The original data

This is what the data looked like, as demonstrated by this query:

freshports.dev=# select pkg_plist, array_to_json(regexp_match(pkg_plist, 'lib/[[:alpha:]]*?\.so')) from ports where id = 34724;
             pkg_plist              |     array_to_json     
------------------------------------+-----------------------
 include/libconcord.h              +| ["lib/libconcord.so"]
 lib/libconcord.a                  +| 
 lib/libconcord.so                 +| 
 lib/libconcord.so.4               +| 
 lib/libconcord.so.4.0.0           +| 
 share/mime/packages/libconcord.xml | 
(1 row)

NOTE: This example user :alpha: but I later changed this to
:alnum: in my code.

NOTE: The above is one row.

The goal is to grab all the .so libraries.

The new layout

The new table, with refreshed data, looks like:

freshports.dev=# select * from generate_plist where port_id = 34724;
    id    | port_id |                       installed_file                       
----------+---------+------------------------------------------------------------
 27791797 |   34724 | @shared-mime-info share/mime
 27791798 |   34724 | /usr/local/share/licenses/libconcord-1.2_2/catalog.mk
 27791799 |   34724 | /usr/local/share/licenses/libconcord-1.2_2/LICENSE
 27791800 |   34724 | /usr/local/share/licenses/libconcord-1.2_2/GPLv2
 27791801 |   34724 | include/libconcord.h
 27791802 |   34724 | lib/libconcord.a
 27791803 |   34724 | lib/libconcord.so
 27791804 |   34724 | lib/libconcord.so.4
 27791805 |   34724 | lib/libconcord.so.4.0.0
 27791806 |   34724 | share/mime/packages/libconcord.xml
 27791807 |   34724 | @postexec /usr/sbin/service ldconfig restart > /dev/null
 27791808 |   34724 | @postunexec /usr/sbin/service ldconfig restart > /dev/null
(12 rows)

freshports.dev=# 

First go

NOTE: I have removed the array_to_json() call for now:

freshports.dev=# select installed_file, regexp_match(installed_file, 'lib/[[:alpha:]]*?\.so') 
from generate_plist where port_id = 34724;
                       installed_file                       |    regexp_match     
------------------------------------------------------------+---------------------
 @shared-mime-info share/mime                               | 
 /usr/local/share/licenses/libconcord-1.2_2/catalog.mk      | 
 /usr/local/share/licenses/libconcord-1.2_2/LICENSE         | 
 /usr/local/share/licenses/libconcord-1.2_2/GPLv2           | 
 include/libconcord.h                                       | 
 lib/libconcord.a                                           | 
 lib/libconcord.so                                          | {lib/libconcord.so}
 lib/libconcord.so.4                                        | {lib/libconcord.so}
 lib/libconcord.so.4.0.0                                    | {lib/libconcord.so}
 share/mime/packages/libconcord.xml                         | 
 @postexec /usr/sbin/service ldconfig restart > /dev/null   | 
 @postunexec /usr/sbin/service ldconfig restart > /dev/null | 
(12 rows)

That looks good so far. What I did not realize until later: each value of the regexp_match output is an array.

Let’s try this on a port with more libraries:

freshports.dev=# select distinct regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
   from generate_plist
  WHERE generate_plist.port_id = 9589 and regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so')  is not null;
        lib         
--------------------
 {lib/libhamlib.so}
 {lib/hamlibtcl.so}
 {lib/Hamlib.so}
(3 rows)

That still looks good.

Getting rid of duplicate values and ignoring null

I thought this was a simple way to get rid of the uninteresting values. I like using WITH. It allows you to take a very complex set of data and adjust it before performing additional operations upon it. What I’m doing here is not very complicated and I am sure it can be done without WITH.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 9589
)
select distinct lib from tmp where lib is not null;
        lib         
--------------------
 {lib/libhamlib.so}
 {lib/hamlibtcl.so}
 {lib/Hamlib.so}
(3 rows)

JSON time

In this section, there are three main changes.

Convert the results from an array to the first element of that array. We a dealing with a single file. We should get just one match.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 9589)
select distinct lib[1] from tmp where lib is not null;
       lib        
------------------
 lib/hamlibtcl.so
 lib/libhamlib.so
 lib/Hamlib.so
(3 rows)

Next, combine all the rows into one row, using array.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 9589)
select array(select distinct lib[1] from tmp where lib is not null);
                       array                       
---------------------------------------------------
 {lib/hamlibtcl.so,lib/libhamlib.so,lib/Hamlib.so}
(1 row)

The last step, change that to JSON output.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 9589)
select array_to_json(array(select distinct lib[1] from tmp where lib is not null));
                      array_to_json                      
---------------------------------------------------------
 ["lib/hamlibtcl.so","lib/libhamlib.so","lib/Hamlib.so"]
(1 row)

But…

I seems wasteful, or backward, to be converting to an array and then to JSON, but I think that’s the way it needs to be done for this data. I’m happy to have additional ideas if you have them.

EDIT: Magnus Hagander gave me the solution there. Use json_agg instead.

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 14152)
select json_agg(distinct lib[1]) from tmp where lib is not null;
                                              json_agg                                               
-----------------------------------------------------------------------------------------------------
 ["lib/libfreeipmi.so", "lib/libipmiconsole.so", "lib/libipmidetect.so", "lib/libipmimonitoring.so"]
(1 row)

A more complex example

Let’s try sysutils/freeipmi which has more librarries:

freshports.dev=# WITH tmp AS (
    select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
      from generate_plist
     where generate_plist.port_id = 14152)
select array_to_json(array(select distinct lib[1] from tmp where lib is not null));
                                          array_to_json                                           
--------------------------------------------------------------------------------------------------
 ["lib/libfreeipmi.so","lib/libipmidetect.so","lib/libipmiconsole.so","lib/libipmimonitoring.so"]
(1 row)

freshports.dev=# 

That seems to work.

Put it into a function

This is the function I created:

-- to convert the generate_plist table, multiple rows per port, into a
-- single row of JSON. Mostly used for Dependency line:
--
-- re https://github.com/FreshPorts/freshports/issues/216
--
CREATE OR REPLACE FUNCTION pkg_plist(bigint) returns json as $$
   WITH tmp AS (
       select regexp_match(installed_file, 'lib/[[:alnum:]]*?\.so') as lib 
         from generate_plist
        where generate_plist.port_id = $1
   )
   select array_to_json(array(select distinct lib[1] from tmp where lib is not null));
$$ LANGUAGE SQL STABLE;

The changed code

The original code was:

freshports.dev=# explain analyse
select pkg_plist, array_to_json(regexp_match(pkg_plist, 'lib/[[:alpha:]]*?\.so')) from ports where id = 34724;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Index Scan using ports_pkey on ports  (cost=0.29..8.31 rows=1 width=190) (actual time=0.092..0.096 rows=1 loops=1)
   Index Cond: (id = 34724)
 Planning Time: 0.305 ms
 Execution Time: 0.138 ms
(4 rows)
freshports.dev=# 

The new code is:

freshports.dev=# explain analyse
freshports.dev-# select pkg_plist, pkg_plist(id) from ports where id = 34724;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Index Scan using ports_pkey on ports  (cost=0.29..8.56 rows=1 width=190) (actual time=0.323..0.324 rows=1 loops=1)
   Index Cond: (id = 34724)
 Planning Time: 0.229 ms
 Execution Time: 0.340 ms
(4 rows)

That is an acceptable difference.

Thanks.

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive

Leave a Comment

Scroll to Top