Apr 162020
 

This post is another in the packages series. It documents how the data is transformed from the raw form (pacakges_raw) into normalized data (packages).

The Packages – how is this data stored? post may be useful reading.

The primary purpose of this blog post is documentation of the above mentioned transformation process. Along the way I found a performance enhancement owing to a deferred constraint. That’s why this post is tagged with PostgreSQL.

Some data background

When data is copied into the package_raw table, two columns are left empty:

  1. abi_id
  2. port_id

Those values can be determined based on columns which are supplied:

  1. abi
  2. package_origin

Here is some sample data:

freshports.dev=# SELECT * FROM packages_raw WHERE abi = 'FreeBSD:13:i386' AND package_set = 'latest' ORDER BY package_origin LIMIT 10;
   id    |         package_origin         |   package_name   | package_version |       abi       | abi_id | port_id | package_set 
---------+--------------------------------+------------------+-----------------+-----------------+--------+---------+-------------
 2749964 | accessibility/accerciser       | accerciser       | 3.22.0          | FreeBSD:13:i386 |        |         | latest
 2749124 | accessibility/at-spi2-atk      | at-spi2-atk      | 2.26.2          | FreeBSD:13:i386 |        |         | latest
 2749119 | accessibility/at-spi2-core     | at-spi2-core     | 2.28.0          | FreeBSD:13:i386 |        |         | latest
 2749003 | accessibility/atk              | atk              | 2.28.1          | FreeBSD:13:i386 |        |         | latest
 2748993 | accessibility/atkmm            | atkmm            | 2.24.2_4        | FreeBSD:13:i386 |        |         | latest
 2747796 | accessibility/caribou          | caribou          | 0.4.21_2        | FreeBSD:13:i386 |        |         | latest
 2760503 | accessibility/eflite           | eflite           | 0.4.1           | FreeBSD:13:i386 |        |         | latest
 2744976 | accessibility/kdeaccessibility | kdeaccessibility | 19.12.3         | FreeBSD:13:i386 |        |         | latest
 2756892 | accessibility/kmag             | kmag             | 19.12.3         | FreeBSD:13:i386 |        |         | latest
 2756793 | accessibility/kmousetool       | kmousetool       | 19.12.3_1       | FreeBSD:13:i386 |        |         | latest
(10 rows)

freshports.dev=# 

This date will eventually be copied into the packages table, but we need to populate the two empty columns first.

What should the data look like?

My first goal was to create a result set resembled the desired outcome.

        SELECT P.id,
               abi.name,
               abi.id AS abi_id,
               P.package_origin,
               PO.port_id,
               P.package_name,
               P.package_version,
               P.package_set
          FROM packages_raw P 
          JOIN abi on P.abi = abi.name
          JOIN ports_origin PO on PO.port_origin = P.package_origin
         WHERE P.abi         = 'FreeBSD:13:i386'
           AND P.package_set = 'latest'
      ORDER BY package_origin 
         LIMIT 10;
   id    |      name       | abi_id |         package_origin         | port_id |   package_name   | package_version | package_set 
---------+-----------------+--------+--------------------------------+---------+------------------+-----------------+-------------
 2749964 | FreeBSD:13:i386 |     10 | accessibility/accerciser       |   21760 | accerciser       | 3.22.0          | latest
 2749124 | FreeBSD:13:i386 |     10 | accessibility/at-spi2-atk      |   26871 | at-spi2-atk      | 2.26.2          | latest
 2749119 | FreeBSD:13:i386 |     10 | accessibility/at-spi2-core     |   26868 | at-spi2-core     | 2.28.0          | latest
 2749003 | FreeBSD:13:i386 |     10 | accessibility/atk              |   11601 | atk              | 2.28.1          | latest
 2748993 | FreeBSD:13:i386 |     10 | accessibility/atkmm            |   27818 | atkmm            | 2.24.2_4        | latest
 2747796 | FreeBSD:13:i386 |     10 | accessibility/caribou          |   35704 | caribou          | 0.4.21_2        | latest
 2760503 | FreeBSD:13:i386 |     10 | accessibility/eflite           |   22197 | eflite           | 0.4.1           | latest
 2744976 | FreeBSD:13:i386 |     10 | accessibility/kdeaccessibility |   11726 | kdeaccessibility | 19.12.3         | latest
 2756892 | FreeBSD:13:i386 |     10 | accessibility/kmag             |   31155 | kmag             | 19.12.3         | latest
 2756793 | FreeBSD:13:i386 |     10 | accessibility/kmousetool       |   31156 | kmousetool       | 19.12.3_1       | latest
(10 rows)

freshports.dev=# 

The ports_orgin is a new table created specifically for this query. Until now, there has been no need to obtain a port_id based upon its origin. This table will be updated via triggers upon the ports table. That bit of work has been deferred for now.

freshports.dev=# \d ports_origin
              Table "public.ports_origin"
   Column    |  Type   | Collation | Nullable | Default 
-------------+---------+-----------+----------+---------
 port_id     | integer |           | not null | 
 port_origin | text    |           | not null | 
Indexes:
    "port_origin_pkey" PRIMARY KEY, btree (port_id)
    "ports_origin_port_id_idx" btree (port_id)
    "ports_origin_port_origin_idx" btree (port_origin)
Foreign-key constraints:
    "ports_origin_port_id_fk" FOREIGN KEY (port_id) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE NOT VALID

The original test

The original test took 18 seconds to populate about 30,000 rows:

explain analyse
UPDATE packages_raw PR
   SET abi_id  = pkg.abi_id,
       port_id = pkg.port_id
  FROM (SELECT P.id,
               abi.name,
               abi.id AS abi_id,
               P.package_origin,
               PO.port_id,
               P.package_name,
               P.package_version,
               P.package_set
          FROM packages_raw P 
          JOIN abi on P.abi = abi.name
          JOIN ports_origin PO on PO.port_origin = P.package_origin
         WHERE P.abi        = 'FreeBSD:13:i386'
           AND P.package_set = 'latest') AS pkg
 WHERE PR.id = pkg.id;

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on packages_raw pr  (cost=11144.47..40827.87 rows=30175 width=115) (actual time=11081.844..11081.844 rows=0 loops=1)
   ->  Hash Join  (cost=11144.47..40827.87 rows=30175 width=115) (actual time=2880.504..4680.566 rows=38905 loops=1)
         Hash Cond: (p.package_origin = po.port_origin)
         ->  Nested Loop  (cost=9051.81..36627.53 rows=22050 width=113) (actual time=2659.534..3858.975 rows=30492 loops=1)
               ->  Seq Scan on abi  (cost=0.00..1.04 rows=1 width=42) (actual time=0.344..0.351 rows=1 loops=1)
                     Filter: (name = 'FreeBSD:13:i386'::text)
                     Rows Removed by Filter: 8
               ->  Hash Join  (cost=9051.81..36405.99 rows=22050 width=120) (actual time=2659.181..3844.350 rows=30492 loops=1)
                     Hash Cond: (pr.id = p.id)
                     ->  Seq Scan on packages_raw pr  (cost=0.00..12452.32 rows=524132 width=77) (actual time=0.370..2365.531 rows=459733 loops=1)
                     ->  Hash  (cost=8560.18..8560.18 rows=22050 width=51) (actual time=143.257..143.258 rows=30492 loops=1)
                           Buckets: 16384  Batches: 4  Memory Usage: 767kB
                           ->  Bitmap Heap Scan on packages_raw p  (cost=1018.44..8560.18 rows=22050 width=51) (actual time=86.264..117.885 rows=30492 loops=1)
                                 Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
                                 Heap Blocks: exact=416
                                 ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..1012.92 rows=22050 width=0) (actual time=85.918..85.918 rows=66670 loops=1)
                                       Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
         ->  Hash  (cost=974.07..974.07 rows=57807 width=30) (actual time=220.029..220.029 rows=57807 loops=1)
               Buckets: 16384  Batches: 8  Memory Usage: 576kB
               ->  Seq Scan on ports_origin po  (cost=0.00..974.07 rows=57807 width=30) (actual time=0.458..172.482 rows=57807 loops=1)
 Planning Time: 7.125 ms
 Trigger for constraint packages_raw_abi_id_fk: time=1151.252 calls=30474
 Trigger for constraint packages_raw_port_id_fk: time=6532.501 calls=30474
 Execution Time: 18782.602 ms

See https://explain.depesz.com/s/6d9Z for more detail.

I noticed all the time spent checking foreign keys while updating 30474 rows. I figured I could save perhaps most of that nearly 7 seconds spent on constraint.

I thought we could defer that. I changed the packages_raw_port_id_fk constraint to be DEFERRABLE INITIALLY DEFERRED. I also added ON UPDATE CASCADE and ON DELETE CASCADE which should have been there from the start (but that’s not relevant to the change in performance).

Test number 2

On the second try, we get:

Update on packages_raw pr  (cost=12937.84..38416.01 rows=20524 width=115) (actual time=12221.244..12221.244 rows=0 loops=1)
   ->  Hash Join  (cost=12937.84..38416.01 rows=20524 width=115) (actual time=2772.498..6044.554 rows=38905 loops=1)
         Hash Cond: (p.package_origin = po.port_origin)
         ->  Nested Loop  (cost=10845.18..34762.92 rows=14998 width=113) (actual time=2552.187..5468.500 rows=30492 loops=1)
               ->  Seq Scan on abi  (cost=0.00..1.04 rows=1 width=42) (actual time=0.335..0.338 rows=1 loops=1)
                     Filter: (name = 'FreeBSD:13:i386'::text)
                     Rows Removed by Filter: 8
               ->  Hash Join  (cost=10845.18..34611.91 rows=14998 width=120) (actual time=2551.846..5455.815 rows=30492 loops=1)
                     Hash Cond: (pr.id = p.id)
                     ->  Seq Scan on packages_raw pr  (cost=0.00..12835.28 rows=384928 width=77) (actual time=0.337..2901.118 rows=459733 loops=1)
                     ->  Hash  (cost=10510.70..10510.70 rows=14998 width=51) (actual time=171.243..171.244 rows=30492 loops=1)
                           Buckets: 16384 (originally 16384)  Batches: 4 (originally 2)  Memory Usage: 897kB
                           ->  Bitmap Heap Scan on packages_raw p  (cost=978.15..10510.70 rows=14998 width=51) (actual time=100.492..143.901 rows=30492 loops=1)
                                 Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
                                 Heap Blocks: exact=416
                                 ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..974.40 rows=14998 width=0) (actual time=100.172..100.173 rows=60966 loops=1)
                                       Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
         ->  Hash  (cost=974.07..974.07 rows=57807 width=30) (actual time=219.460..219.460 rows=57807 loops=1)
               Buckets: 16384  Batches: 8  Memory Usage: 576kB
               ->  Seq Scan on ports_origin po  (cost=0.00..974.07 rows=57807 width=30) (actual time=0.459..172.605 rows=57807 loops=1)
 Planning Time: 9.476 ms
 Trigger for constraint packages_raw_abi_id_fk: time=363.506 calls=30474
 Execution Time: 12589.008 ms

Roughly 5 seconds of savings. See https://explain.depesz.com/s/Y9gj.

OK, let’s do the same for abi_id.

3rd time

After setting packages_raw_abi_id_fk to DEFERRABLE INITIALLY DEFERRED, the 3rd test results were:

Update on packages_raw pr  (cost=12937.84..38416.01 rows=20524 width=115) (actual time=8798.389..8798.389 rows=0 loops=1)
   ->  Hash Join  (cost=12937.84..38416.01 rows=20524 width=115) (actual time=2841.848..4960.062 rows=38905 loops=1)
         Hash Cond: (p.package_origin = po.port_origin)
         ->  Nested Loop  (cost=10845.18..34762.92 rows=14998 width=113) (actual time=2620.392..3970.915 rows=30492 loops=1)
               ->  Seq Scan on abi  (cost=0.00..1.04 rows=1 width=42) (actual time=0.343..0.350 rows=1 loops=1)
                     Filter: (name = 'FreeBSD:13:i386'::text)
                     Rows Removed by Filter: 8
               ->  Hash Join  (cost=10845.18..34611.91 rows=14998 width=120) (actual time=2620.040..3956.973 rows=30492 loops=1)
                     Hash Cond: (pr.id = p.id)
                     ->  Seq Scan on packages_raw pr  (cost=0.00..12835.28 rows=384928 width=77) (actual time=0.340..2796.802 rows=459733 loops=1)
                     ->  Hash  (cost=10510.70..10510.70 rows=14998 width=51) (actual time=167.195..167.196 rows=30492 loops=1)
                           Buckets: 16384 (originally 16384)  Batches: 4 (originally 2)  Memory Usage: 897kB
                           ->  Bitmap Heap Scan on packages_raw p  (cost=978.15..10510.70 rows=14998 width=51) (actual time=110.897..140.914 rows=30492 loops=1)
                                 Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
                                 Heap Blocks: exact=416
                                 ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..974.40 rows=14998 width=0) (actual time=110.419..110.420 rows=91440 loops=1)
                                       Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
         ->  Hash  (cost=974.07..974.07 rows=57807 width=30) (actual time=220.662..220.662 rows=57807 loops=1)
               Buckets: 16384  Batches: 8  Memory Usage: 576kB
               ->  Seq Scan on ports_origin po  (cost=0.00..974.07 rows=57807 width=30) (actual time=0.446..173.184 rows=57807 loops=1)
 Planning Time: 4.560 ms
 Execution Time: 8800.805 ms

That’s nearly 10 seconds faster. Sure, there might be some caching involved but the time spent checking foreign keys is greatly reduced.

See https://explain.depesz.com/s/XXnj.

Next, I noticed the sequential scan on ports_origin. Let’s see about that.

Yeah, that self-join

The self-join was noticed by xocolatl as being unnecessary. Let’s remove that to get:

UPDATE packages_raw PR
   SET abi_id  = abi.id,
       port_id = PO.port_id
  FROM abi, ports_origin PO
 WHERE PR.abi         = 'FreeBSD:13:i386'
   AND PR.package_set = 'latest'
   AND PR.abi         = abi.name
   AND PO.port_origin = PR.package_origin;

                                                                                QUERY PLAN                                                                                
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on packages_raw pr  (cost=3081.65..12658.10 rows=21940 width=109) (actual time=6183.437..6183.437 rows=0 loops=1)
   ->  Hash Join  (cost=3081.65..12658.10 rows=21940 width=109) (actual time=418.091..845.851 rows=38905 loops=1)
         Hash Cond: (pr.package_origin = po.port_origin)
         ->  Nested Loop  (cost=989.00..9048.43 rows=16056 width=87) (actual time=174.079..268.378 rows=30492 loops=1)
               ->  Seq Scan on abi  (cost=0.00..1.04 rows=1 width=42) (actual time=0.030..0.036 rows=1 loops=1)
                     Filter: (name = 'FreeBSD:13:i386'::text)
                     Rows Removed by Filter: 8
               ->  Bitmap Heap Scan on packages_raw pr  (cost=989.00..8886.84 rows=16056 width=77) (actual time=174.032..256.085 rows=30492 loops=1)
                     Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
                     Heap Blocks: exact=416
                     ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..984.98 rows=16056 width=0) (actual time=173.458..173.458 rows=121903 loops=1)
                           Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets))
         ->  Hash  (cost=974.07..974.07 rows=57807 width=30) (actual time=243.062..243.062 rows=57807 loops=1)
               Buckets: 16384  Batches: 8  Memory Usage: 576kB
               ->  Seq Scan on ports_origin po  (cost=0.00..974.07 rows=57807 width=30) (actual time=0.451..195.454 rows=57807 loops=1)
 Planning Time: 2.922 ms
 Execution Time: 6186.449 ms
(17 rows)

Well, isn’t that better?

See https://explain.depesz.com/s/cnyw for details.

That sequential scan on ports_origin takes up 195ms. Nothing to worry about.

Things to follow up on

This query doesn’t find values for all the FreeBSD:13:i386 * latest combination. Some are not matched.

freshports.dev=# SELECT * FROM packages_raw WHERE port_id IS NULL AND abi = 'FreeBSD:13:i386' AND package_set = 'latest' ORDER BY package_origin;
   id    |            package_origin            |          package_name          | package_version |       abi       | abi_id | port_id | package_set 
---------+--------------------------------------+--------------------------------+-----------------+-----------------+--------+---------+-------------
 2745970 | databases/rubygem-peek-mysql2-rails5 | rubygem-peek-mysql2-rails5     | 1.2.0           | FreeBSD:13:i386 |        |         | latest
 2740242 | devel/p5-File-ShareDir-Tarball       | p5-File-ShareDir-Tarball       | 0.2.2           | FreeBSD:13:i386 |        |         | latest
 2741383 | devel/p5-MooX-Singleton              | p5-MooX-Singleton              | 1.20            | FreeBSD:13:i386 |        |         | latest
 2740542 | devel/p5-MooseX-MungeHas             | p5-MooseX-MungeHas             | 0.011           | FreeBSD:13:i386 |        |         | latest
 2733133 | devel/py-aspy.yaml                   | py27-aspy.yaml                 | 1.3.0           | FreeBSD:13:i386 |        |         | latest
 2741040 | devel/py-aspy.yaml                   | py37-aspy.yaml                 | 1.3.0           | FreeBSD:13:i386 |        |         | latest
 2743867 | devel/py-nodeenv                     | py27-nodeenv                   | 1.3.3           | FreeBSD:13:i386 |        |         | latest
 2744182 | devel/py-nodeenv                     | py37-nodeenv                   | 1.3.3           | FreeBSD:13:i386 |        |         | latest
 2746002 | devel/rubygem-peek-gc-rails5         | rubygem-peek-gc-rails5         | 0.0.2           | FreeBSD:13:i386 |        |         | latest
 2745932 | devel/rubygem-peek-rails5            | rubygem-peek-rails5            | 1.0.1           | FreeBSD:13:i386 |        |         | latest
 2745897 | devel/rubygem-peek-rblineprof-rails5 | rubygem-peek-rblineprof-rails5 | 0.2.0           | FreeBSD:13:i386 |        |         | latest
 2745876 | devel/rubygem-peek-redis-rails5      | rubygem-peek-redis-rails5      | 1.2.0           | FreeBSD:13:i386 |        |         | latest
 2749944 | security/picosha2                    | picosha2                       | 1.0.0.23        | FreeBSD:13:i386 |        |         | latest
 2749043 | sysutils/atf-rk3399                  | atf-rk3399                     | v2.1            | FreeBSD:13:i386 |        |         | latest
 2746664 | sysutils/py-leviathan                | py27-leviathan                 | 0.1.1           | FreeBSD:13:i386 |        |         | latest
 2746885 | sysutils/py-leviathan                | py37-leviathan                 | 0.1.1           | FreeBSD:13:i386 |        |         | latest
 2737873 | sysutils/swapusage                   | swapusage                      | 1.0.0           | FreeBSD:13:i386 |        |         | latest
 2744052 | www/rubygem-gon-rails5               | rubygem-gon-rails5             | 6.2.1           | FreeBSD:13:i386 |        |         | latest
(18 rows)

This is a development database and those ports do not exist on this host. It could also be an out-of-date ports_origin table, but I haven’t checked that yet. All of those ports were added to the FreeBSD repo on these dates:

  1. February 5 2019
  2. April 15 2019
  3. May 26-29 2019

They are present in the production database. Nothing to be done here.

However, the system will have to flag such entries for later followup.

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive