This post is another in the packages series. It documents how the data is transformed from the raw form (packages_raw) into normalized data (packages).
The Packages – how is this data stored? post may be useful reading.
The primary purpose of this blog post is documentation of the above mentioned transformation process. Along the way I found a performance enhancement owing to a deferred constraint. That’s why this post is tagged with PostgreSQL.
Some data background
When data is copied into the package_raw table, two columns are left empty:
- abi_id
- port_id
Those values can be determined based on columns which are supplied:
- abi
- package_origin
Here is some sample data:
freshports.dev=# SELECT * FROM packages_raw WHERE abi = 'FreeBSD:13:i386' AND package_set = 'latest' ORDER BY package_origin LIMIT 10; id | package_origin | package_name | package_version | abi | abi_id | port_id | package_set ---------+--------------------------------+------------------+-----------------+-----------------+--------+---------+------------- 2749964 | accessibility/accerciser | accerciser | 3.22.0 | FreeBSD:13:i386 | | | latest 2749124 | accessibility/at-spi2-atk | at-spi2-atk | 2.26.2 | FreeBSD:13:i386 | | | latest 2749119 | accessibility/at-spi2-core | at-spi2-core | 2.28.0 | FreeBSD:13:i386 | | | latest 2749003 | accessibility/atk | atk | 2.28.1 | FreeBSD:13:i386 | | | latest 2748993 | accessibility/atkmm | atkmm | 2.24.2_4 | FreeBSD:13:i386 | | | latest 2747796 | accessibility/caribou | caribou | 0.4.21_2 | FreeBSD:13:i386 | | | latest 2760503 | accessibility/eflite | eflite | 0.4.1 | FreeBSD:13:i386 | | | latest 2744976 | accessibility/kdeaccessibility | kdeaccessibility | 19.12.3 | FreeBSD:13:i386 | | | latest 2756892 | accessibility/kmag | kmag | 19.12.3 | FreeBSD:13:i386 | | | latest 2756793 | accessibility/kmousetool | kmousetool | 19.12.3_1 | FreeBSD:13:i386 | | | latest (10 rows) freshports.dev=#
This date will eventually be copied into the packages table, but we need to populate the two empty columns first.
What should the data look like?
My first goal was to create a result set resembled the desired outcome.
SELECT P.id, abi.name, abi.id AS abi_id, P.package_origin, PO.port_id, P.package_name, P.package_version, P.package_set FROM packages_raw P JOIN abi on P.abi = abi.name JOIN ports_origin PO on PO.port_origin = P.package_origin WHERE P.abi = 'FreeBSD:13:i386' AND P.package_set = 'latest' ORDER BY package_origin LIMIT 10; id | name | abi_id | package_origin | port_id | package_name | package_version | package_set ---------+-----------------+--------+--------------------------------+---------+------------------+-----------------+------------- 2749964 | FreeBSD:13:i386 | 10 | accessibility/accerciser | 21760 | accerciser | 3.22.0 | latest 2749124 | FreeBSD:13:i386 | 10 | accessibility/at-spi2-atk | 26871 | at-spi2-atk | 2.26.2 | latest 2749119 | FreeBSD:13:i386 | 10 | accessibility/at-spi2-core | 26868 | at-spi2-core | 2.28.0 | latest 2749003 | FreeBSD:13:i386 | 10 | accessibility/atk | 11601 | atk | 2.28.1 | latest 2748993 | FreeBSD:13:i386 | 10 | accessibility/atkmm | 27818 | atkmm | 2.24.2_4 | latest 2747796 | FreeBSD:13:i386 | 10 | accessibility/caribou | 35704 | caribou | 0.4.21_2 | latest 2760503 | FreeBSD:13:i386 | 10 | accessibility/eflite | 22197 | eflite | 0.4.1 | latest 2744976 | FreeBSD:13:i386 | 10 | accessibility/kdeaccessibility | 11726 | kdeaccessibility | 19.12.3 | latest 2756892 | FreeBSD:13:i386 | 10 | accessibility/kmag | 31155 | kmag | 19.12.3 | latest 2756793 | FreeBSD:13:i386 | 10 | accessibility/kmousetool | 31156 | kmousetool | 19.12.3_1 | latest (10 rows) freshports.dev=#
The ports_orgin is a new table created specifically for this query. Until now, there has been no need to obtain a port_id based upon its origin. This table will be updated via triggers upon the ports table. That bit of work has been deferred for now.
freshports.dev=# \d ports_origin Table "public.ports_origin" Column | Type | Collation | Nullable | Default -------------+---------+-----------+----------+--------- port_id | integer | | not null | port_origin | text | | not null | Indexes: "port_origin_pkey" PRIMARY KEY, btree (port_id) "ports_origin_port_id_idx" btree (port_id) "ports_origin_port_origin_idx" btree (port_origin) Foreign-key constraints: "ports_origin_port_id_fk" FOREIGN KEY (port_id) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE NOT VALID
The original test
The original test took 18 seconds to populate about 30,000 rows:
explain analyse UPDATE packages_raw PR SET abi_id = pkg.abi_id, port_id = pkg.port_id FROM (SELECT P.id, abi.name, abi.id AS abi_id, P.package_origin, PO.port_id, P.package_name, P.package_version, P.package_set FROM packages_raw P JOIN abi on P.abi = abi.name JOIN ports_origin PO on PO.port_origin = P.package_origin WHERE P.abi = 'FreeBSD:13:i386' AND P.package_set = 'latest') AS pkg WHERE PR.id = pkg.id; ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Update on packages_raw pr (cost=11144.47..40827.87 rows=30175 width=115) (actual time=11081.844..11081.844 rows=0 loops=1) -> Hash Join (cost=11144.47..40827.87 rows=30175 width=115) (actual time=2880.504..4680.566 rows=38905 loops=1) Hash Cond: (p.package_origin = po.port_origin) -> Nested Loop (cost=9051.81..36627.53 rows=22050 width=113) (actual time=2659.534..3858.975 rows=30492 loops=1) -> Seq Scan on abi (cost=0.00..1.04 rows=1 width=42) (actual time=0.344..0.351 rows=1 loops=1) Filter: (name = 'FreeBSD:13:i386'::text) Rows Removed by Filter: 8 -> Hash Join (cost=9051.81..36405.99 rows=22050 width=120) (actual time=2659.181..3844.350 rows=30492 loops=1) Hash Cond: (pr.id = p.id) -> Seq Scan on packages_raw pr (cost=0.00..12452.32 rows=524132 width=77) (actual time=0.370..2365.531 rows=459733 loops=1) -> Hash (cost=8560.18..8560.18 rows=22050 width=51) (actual time=143.257..143.258 rows=30492 loops=1) Buckets: 16384 Batches: 4 Memory Usage: 767kB -> Bitmap Heap Scan on packages_raw p (cost=1018.44..8560.18 rows=22050 width=51) (actual time=86.264..117.885 rows=30492 loops=1) Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets)) Heap Blocks: exact=416 -> Bitmap Index Scan on packages_raw_abi_package_set_idx (cost=0.00..1012.92 rows=22050 width=0) (actual time=85.918..85.918 rows=66670 loops=1) Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets)) -> Hash (cost=974.07..974.07 rows=57807 width=30) (actual time=220.029..220.029 rows=57807 loops=1) Buckets: 16384 Batches: 8 Memory Usage: 576kB -> Seq Scan on ports_origin po (cost=0.00..974.07 rows=57807 width=30) (actual time=0.458..172.482 rows=57807 loops=1) Planning Time: 7.125 ms Trigger for constraint packages_raw_abi_id_fk: time=1151.252 calls=30474 Trigger for constraint packages_raw_port_id_fk: time=6532.501 calls=30474 Execution Time: 18782.602 ms
See https://explain.depesz.com/s/6d9Z for more detail.
I noticed all the time spent checking foreign keys while updating 30474 rows. I figured I could save perhaps most of that nearly 7 seconds spent on constraint.
I thought we could defer that. I changed the packages_raw_port_id_fk constraint to be DEFERRABLE INITIALLY DEFERRED. I also added ON UPDATE CASCADE and ON DELETE CASCADE which should have been there from the start (but that’s not relevant to the change in performance).
Test number 2
On the second try, we get:
Update on packages_raw pr (cost=12937.84..38416.01 rows=20524 width=115) (actual time=12221.244..12221.244 rows=0 loops=1) -> Hash Join (cost=12937.84..38416.01 rows=20524 width=115) (actual time=2772.498..6044.554 rows=38905 loops=1) Hash Cond: (p.package_origin = po.port_origin) -> Nested Loop (cost=10845.18..34762.92 rows=14998 width=113) (actual time=2552.187..5468.500 rows=30492 loops=1) -> Seq Scan on abi (cost=0.00..1.04 rows=1 width=42) (actual time=0.335..0.338 rows=1 loops=1) Filter: (name = 'FreeBSD:13:i386'::text) Rows Removed by Filter: 8 -> Hash Join (cost=10845.18..34611.91 rows=14998 width=120) (actual time=2551.846..5455.815 rows=30492 loops=1) Hash Cond: (pr.id = p.id) -> Seq Scan on packages_raw pr (cost=0.00..12835.28 rows=384928 width=77) (actual time=0.337..2901.118 rows=459733 loops=1) -> Hash (cost=10510.70..10510.70 rows=14998 width=51) (actual time=171.243..171.244 rows=30492 loops=1) Buckets: 16384 (originally 16384) Batches: 4 (originally 2) Memory Usage: 897kB -> Bitmap Heap Scan on packages_raw p (cost=978.15..10510.70 rows=14998 width=51) (actual time=100.492..143.901 rows=30492 loops=1) Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets)) Heap Blocks: exact=416 -> Bitmap Index Scan on packages_raw_abi_package_set_idx (cost=0.00..974.40 rows=14998 width=0) (actual time=100.172..100.173 rows=60966 loops=1) Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets)) -> Hash (cost=974.07..974.07 rows=57807 width=30) (actual time=219.460..219.460 rows=57807 loops=1) Buckets: 16384 Batches: 8 Memory Usage: 576kB -> Seq Scan on ports_origin po (cost=0.00..974.07 rows=57807 width=30) (actual time=0.459..172.605 rows=57807 loops=1) Planning Time: 9.476 ms Trigger for constraint packages_raw_abi_id_fk: time=363.506 calls=30474 Execution Time: 12589.008 ms
Roughly 5 seconds of savings. See https://explain.depesz.com/s/Y9gj.
OK, let’s do the same for abi_id.
3rd time
After setting packages_raw_abi_id_fk to DEFERRABLE INITIALLY DEFERRED, the 3rd test results were:
Update on packages_raw pr (cost=12937.84..38416.01 rows=20524 width=115) (actual time=8798.389..8798.389 rows=0 loops=1) -> Hash Join (cost=12937.84..38416.01 rows=20524 width=115) (actual time=2841.848..4960.062 rows=38905 loops=1) Hash Cond: (p.package_origin = po.port_origin) -> Nested Loop (cost=10845.18..34762.92 rows=14998 width=113) (actual time=2620.392..3970.915 rows=30492 loops=1) -> Seq Scan on abi (cost=0.00..1.04 rows=1 width=42) (actual time=0.343..0.350 rows=1 loops=1) Filter: (name = 'FreeBSD:13:i386'::text) Rows Removed by Filter: 8 -> Hash Join (cost=10845.18..34611.91 rows=14998 width=120) (actual time=2620.040..3956.973 rows=30492 loops=1) Hash Cond: (pr.id = p.id) -> Seq Scan on packages_raw pr (cost=0.00..12835.28 rows=384928 width=77) (actual time=0.340..2796.802 rows=459733 loops=1) -> Hash (cost=10510.70..10510.70 rows=14998 width=51) (actual time=167.195..167.196 rows=30492 loops=1) Buckets: 16384 (originally 16384) Batches: 4 (originally 2) Memory Usage: 897kB -> Bitmap Heap Scan on packages_raw p (cost=978.15..10510.70 rows=14998 width=51) (actual time=110.897..140.914 rows=30492 loops=1) Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets)) Heap Blocks: exact=416 -> Bitmap Index Scan on packages_raw_abi_package_set_idx (cost=0.00..974.40 rows=14998 width=0) (actual time=110.419..110.420 rows=91440 loops=1) Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets)) -> Hash (cost=974.07..974.07 rows=57807 width=30) (actual time=220.662..220.662 rows=57807 loops=1) Buckets: 16384 Batches: 8 Memory Usage: 576kB -> Seq Scan on ports_origin po (cost=0.00..974.07 rows=57807 width=30) (actual time=0.446..173.184 rows=57807 loops=1) Planning Time: 4.560 ms Execution Time: 8800.805 ms
That’s nearly 10 seconds faster. Sure, there might be some caching involved but the time spent checking foreign keys is greatly reduced.
See https://explain.depesz.com/s/XXnj.
Next, I noticed the sequential scan on ports_origin. Let’s see about that.
Yeah, that self-join
The self-join was noticed by xocolatl as being unnecessary. Let’s remove that to get:
UPDATE packages_raw PR SET abi_id = abi.id, port_id = PO.port_id FROM abi, ports_origin PO WHERE PR.abi = 'FreeBSD:13:i386' AND PR.package_set = 'latest' AND PR.abi = abi.name AND PO.port_origin = PR.package_origin; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Update on packages_raw pr (cost=3081.65..12658.10 rows=21940 width=109) (actual time=6183.437..6183.437 rows=0 loops=1) -> Hash Join (cost=3081.65..12658.10 rows=21940 width=109) (actual time=418.091..845.851 rows=38905 loops=1) Hash Cond: (pr.package_origin = po.port_origin) -> Nested Loop (cost=989.00..9048.43 rows=16056 width=87) (actual time=174.079..268.378 rows=30492 loops=1) -> Seq Scan on abi (cost=0.00..1.04 rows=1 width=42) (actual time=0.030..0.036 rows=1 loops=1) Filter: (name = 'FreeBSD:13:i386'::text) Rows Removed by Filter: 8 -> Bitmap Heap Scan on packages_raw pr (cost=989.00..8886.84 rows=16056 width=77) (actual time=174.032..256.085 rows=30492 loops=1) Recheck Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets)) Heap Blocks: exact=416 -> Bitmap Index Scan on packages_raw_abi_package_set_idx (cost=0.00..984.98 rows=16056 width=0) (actual time=173.458..173.458 rows=121903 loops=1) Index Cond: ((abi = 'FreeBSD:13:i386'::text) AND (package_set = 'latest'::package_sets)) -> Hash (cost=974.07..974.07 rows=57807 width=30) (actual time=243.062..243.062 rows=57807 loops=1) Buckets: 16384 Batches: 8 Memory Usage: 576kB -> Seq Scan on ports_origin po (cost=0.00..974.07 rows=57807 width=30) (actual time=0.451..195.454 rows=57807 loops=1) Planning Time: 2.922 ms Execution Time: 6186.449 ms (17 rows)
Well, isn’t that better?
See https://explain.depesz.com/s/cnyw for details.
That sequential scan on ports_origin takes up 195ms. Nothing to worry about.
Things to follow up on
This query doesn’t find values for all the FreeBSD:13:i386 * latest combination. Some are not matched.
freshports.dev=# SELECT * FROM packages_raw WHERE port_id IS NULL AND abi = 'FreeBSD:13:i386' AND package_set = 'latest' ORDER BY package_origin; id | package_origin | package_name | package_version | abi | abi_id | port_id | package_set ---------+--------------------------------------+--------------------------------+-----------------+-----------------+--------+---------+------------- 2745970 | databases/rubygem-peek-mysql2-rails5 | rubygem-peek-mysql2-rails5 | 1.2.0 | FreeBSD:13:i386 | | | latest 2740242 | devel/p5-File-ShareDir-Tarball | p5-File-ShareDir-Tarball | 0.2.2 | FreeBSD:13:i386 | | | latest 2741383 | devel/p5-MooX-Singleton | p5-MooX-Singleton | 1.20 | FreeBSD:13:i386 | | | latest 2740542 | devel/p5-MooseX-MungeHas | p5-MooseX-MungeHas | 0.011 | FreeBSD:13:i386 | | | latest 2733133 | devel/py-aspy.yaml | py27-aspy.yaml | 1.3.0 | FreeBSD:13:i386 | | | latest 2741040 | devel/py-aspy.yaml | py37-aspy.yaml | 1.3.0 | FreeBSD:13:i386 | | | latest 2743867 | devel/py-nodeenv | py27-nodeenv | 1.3.3 | FreeBSD:13:i386 | | | latest 2744182 | devel/py-nodeenv | py37-nodeenv | 1.3.3 | FreeBSD:13:i386 | | | latest 2746002 | devel/rubygem-peek-gc-rails5 | rubygem-peek-gc-rails5 | 0.0.2 | FreeBSD:13:i386 | | | latest 2745932 | devel/rubygem-peek-rails5 | rubygem-peek-rails5 | 1.0.1 | FreeBSD:13:i386 | | | latest 2745897 | devel/rubygem-peek-rblineprof-rails5 | rubygem-peek-rblineprof-rails5 | 0.2.0 | FreeBSD:13:i386 | | | latest 2745876 | devel/rubygem-peek-redis-rails5 | rubygem-peek-redis-rails5 | 1.2.0 | FreeBSD:13:i386 | | | latest 2749944 | security/picosha2 | picosha2 | 1.0.0.23 | FreeBSD:13:i386 | | | latest 2749043 | sysutils/atf-rk3399 | atf-rk3399 | v2.1 | FreeBSD:13:i386 | | | latest 2746664 | sysutils/py-leviathan | py27-leviathan | 0.1.1 | FreeBSD:13:i386 | | | latest 2746885 | sysutils/py-leviathan | py37-leviathan | 0.1.1 | FreeBSD:13:i386 | | | latest 2737873 | sysutils/swapusage | swapusage | 1.0.0 | FreeBSD:13:i386 | | | latest 2744052 | www/rubygem-gon-rails5 | rubygem-gon-rails5 | 6.2.1 | FreeBSD:13:i386 | | | latest (18 rows)
This is a development database and those ports do not exist on this host. It could also be an out-of-date ports_origin table, but I haven’t checked that yet. All of those ports were added to the FreeBSD repo on these dates:
- February 5 2019
- April 15 2019
- May 26-29 2019
They are present in the production database. Nothing to be done here.
However, the system will have to flag such entries for later followup.