Updating the packages table – INSERT, UPDATE, DELETE – plus some ZFS configuration

In the previous post I documented how the packages_raw table is updated to populate the abi_id and port_id columns. This post will discuss how that data is then pushed into the packages table.

Thoughts

When updates arrive, they arrive in quantity. The latest update was 30474 records. From that, we will have some new packages, some will go away, and some will be updates. Many will have no change. I see two approaches:

  1. UPDATE where found and nothing to change, INSERT where not found, DELETE where no longer present.
  2. DELETE everything related to this import, INSERT everything imported.

Option 1 will have far less database churn. Option 2 will maximize it.

Let’s look at option 1 mostly because I like that approach best and it should be straight forward to implement.

That approach also makes me want to keep statistics on package building changes.

UPDATE

I want to update only records which have changes. This works:

SELECT *
  FROM packages_raw PR, packages P
 WHERE P.abi_id           = PR.abi_id
   AND P.package_set      = PR.package_set
   AND P.package_name     = PR.package_name
   AND P.package_version != PR.package_version
   AND PR.abi             = 'FreeBSD:13:amd64'
   AND PR.package_set     = 'latest';

Therefore I only want to update those rows. Let’s try this query:

UPDATE packages p
   SET package_version = PR.package_version
  FROM packages_raw PR
 WHERE P.abi_id           = PR.abi_id
   AND P.package_set      = PR.package_set
   AND P.package_name     = PR.package_name
   AND P.package_version != PR.package_version;
                                                                    QUERY PLAN                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------
 Update on packages p  (cost=168509.43..183856.43 rows=392902 width=70) (actual time=11500.885..11500.885 rows=0 loops=1)
   ->  Merge Join  (cost=168509.43..183856.43 rows=392902 width=70) (actual time=7492.742..10230.503 rows=2562 loops=1)
         Merge Cond: ((p.abi_id = pr.abi_id) AND (p.package_set = pr.package_set) AND (p.package_name = pr.package_name))
         Join Filter: (p.package_version <> pr.package_version)
         Rows Removed by Join Filter: 455631
         ->  Sort  (cost=82431.21..83603.67 rows=468984 width=48) (actual time=2378.262..3650.466 rows=459029 loops=1)
               Sort Key: p.abi_id, p.package_set, p.package_name
               Sort Method: external merge  Disk: 26824kB
               ->  Seq Scan on packages p  (cost=0.00..9400.84 rows=468984 width=48) (actual time=0.277..1327.490 rows=459029 loops=1)
         ->  Materialize  (cost=86078.22..88376.89 rows=459733 width=36) (actual time=5114.403..6312.587 rows=459496 loops=1)
               ->  Sort  (cost=86078.22..87227.55 rows=459733 width=36) (actual time=5114.397..6247.665 rows=459496 loops=1)
                     Sort Key: pr.abi_id, pr.package_set, pr.package_name
                     Sort Method: external merge  Disk: 21408kB
                     ->  Seq Scan on packages_raw pr  (cost=0.00..17695.33 rows=459733 width=36) (actual time=0.283..3710.428 rows=459733 loops=1)
 Planning Time: 4.093 ms
 Execution Time: 11501.204 ms
(16 rows)

freshports.dev=# 

By this time it was late and I finished this for the day. Good thing I did…

Let ZFS ARC cache all the database

The next morning, I started looking at shared_buffers and random_page_cost.

I was worried about this update taking 10 seconds, but it does a join across two 450,000 row tables to then update 2500 rows. See https://explain.depesz.com/s/xlWa. I knew I had set this before, so I went searching for values. I checked my production server, where I found it shared_buffers = 128MB. Eventually I found my blog post about caching in ZFS vs caching in PostgreSQL: Website slowness – the investigation.

After setting zfs set primarycache=all, I did the steps found in that post to pre-cache the database:

[dan@x8dtu:~] $ sudo zfs snapshot zroot/data/pg01-postgres@for.caching
[dan@x8dtu:~] $ sudo zfs send zroot/data/pg01-postgres@for.caching > /dev/null
[dan@x8dtu:~] $ sudo zfs destroy zroot/data/pg01-postgres@for.caching
[dan@x8dtu:~] $ 

This server also runs off SSD for which random_page_cost=1.1 is appropriate.

Don’t process the whole table

For all of these operations we must be cautious regarding the contents of packages_raw.
We can never assume that packages_raw contains entries for all ABI and package_set combination already present in the packages table. Rather, it will contain only those sets which were recently imported. If the check and import implemented with sufficient frequency, it will detect new repos immediately and packages_raw will contain data from just one repo.

For this testing, we are using ABI FreeBSD:13:amd64 and package set latest.

The UPDATE: testing with a smaller set and more ZFS caching

Now the results have gone from about 7 or 11 seconds to about 1.5 seconds:

begin;
set track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) 
UPDATE packages p
   SET package_version = PR.package_version
  FROM packages_raw PR
 WHERE P.abi_id           = PR.abi_id
   AND P.package_set      = PR.package_set
   AND P.package_name     = PR.package_name
   AND P.package_version != PR.package_version
   AND PR.abi             = 'FreeBSD:13:amd64'
   AND PR.package_set     = 'latest';
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on packages p  (cost=15321.52..26841.38 rows=17989 width=69) (actual time=305.336..305.336 rows=0 loops=1)
   Buffers: shared hit=9313 read=7201 dirtied=1105 written=10
   I/O Timings: read=72.845
   ->  Hash Join  (cost=15321.52..26841.38 rows=17989 width=69) (actual time=234.095..264.130 rows=469 loops=1)
         Hash Cond: ((pr.abi_id = p.abi_id) AND (pr.package_name = p.package_name))
         Join Filter: (p.package_version <> pr.package_version)
         Rows Removed by Join Filter: 30880
         Buffers: shared hit=154 read=6374 dirtied=135
         I/O Timings: read=48.906
         ->  Bitmap Heap Scan on packages_raw pr  (cost=480.97..11900.62 rows=19088 width=36) (actual time=17.957..34.032 rows=31401 loops=1)
               Recheck Cond: ((abi = 'FreeBSD:13:amd64'::text) AND (package_set = 'latest'::package_sets))
               Heap Blocks: exact=530
               Buffers: shared read=1668
               I/O Timings: read=17.758
               ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..476.20 rows=19088 width=0) (actual time=17.832..17.832 rows=62784 loops=1)
                     Index Cond: ((abi = 'FreeBSD:13:amd64'::text) AND (package_set = 'latest'::package_sets))
                     Buffers: shared read=773
                     I/O Timings: read=10.732
         ->  Hash  (cost=10655.10..10655.10 rows=279030 width=47) (actual time=210.372..210.373 rows=275766 loops=1)
               Buckets: 524288  Batches: 1  Memory Usage: 26085kB
               Buffers: shared hit=154 read=4706 dirtied=135
               I/O Timings: read=31.148
               ->  Seq Scan on packages p  (cost=0.00..10655.10 rows=279030 width=47) (actual time=0.023..113.982 rows=275766 loops=1)
                     Filter: (package_set = 'latest'::package_sets)
                     Rows Removed by Filter: 183263
                     Buffers: shared hit=154 read=4706 dirtied=135
                     I/O Timings: read=31.148
 Planning Time: 0.583 ms
 Execution Time: 308.639 ms
(29 rows)

freshports.dev=# 

See https://explain.depesz.com/s/N6KT for details.

The INSERT

We have to rows in packages_raw which are not found in packages.

How do we find that data?

begin;
set track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) 
SELECT *
  FROM packages_raw PR
 WHERE NOT EXISTS (
 SELECT *
   FROM packages P
  WHERE P.abi_id           = PR.abi_id
    AND P.package_set      = PR.package_set
    AND P.package_name     = PR.package_name
    AND P.package_version  = PR.package_version
    AND PR.abi             = 'FreeBSD:13:amd64'
    AND PR.package_set     = 'latest')
 AND PR.abi_id  IS NOT NULL
 AND PR.port_id IS NOT NULL
 AND PR.abi             = 'FreeBSD:13:amd64'
 AND PR.package_set     = 'latest';
                                                                             QUERY PLAN                                                                              
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1480.76..21125.96 rows=15908 width=79) (actual time=21.782..137.166 rows=503 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=127168
   ->  Nested Loop Anti Join  (cost=480.76..18535.16 rows=6628 width=79) (actual time=13.563..122.878 rows=168 loops=3)
         Join Filter: ((pr.abi = 'FreeBSD:13:amd64'::text) AND (pr.package_set = 'latest'::package_sets))
         Buffers: shared hit=127168
         ->  Parallel Bitmap Heap Scan on packages_raw pr  (cost=480.34..11732.96 rows=6892 width=79) (actual time=9.732..21.140 rows=10461 loops=3)
               Recheck Cond: ((abi = 'FreeBSD:13:amd64'::text) AND (package_set = 'latest'::package_sets))
               Filter: ((abi_id IS NOT NULL) AND (port_id IS NOT NULL))
               Rows Removed by Filter: 6
               Heap Blocks: exact=201
               Buffers: shared hit=1668
               ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..476.20 rows=19088 width=0) (actual time=16.749..16.749 rows=62784 loops=1)
                     Index Cond: ((abi = 'FreeBSD:13:amd64'::text) AND (package_set = 'latest'::package_sets))
                     Buffers: shared hit=773
         ->  Index Only Scan using packages_all_idx on packages p  (cost=0.42..0.97 rows=1 width=29) (actual time=0.008..0.008 rows=1 loops=31383)
               Index Cond: ((abi_id = pr.abi_id) AND (package_name = pr.package_name) AND (package_set = pr.package_set) AND (package_set = 'latest'::package_sets))
               Filter: (package_version = pr.package_version)
               Rows Removed by Filter: 0
               Heap Fetches: 31349
               Buffers: shared hit=125500
 Planning Time: 1.609 ms
 Execution Time: 137.356 ms
(24 rows)

freshports.dev=# 

The two IS NOT NULL clauses avoids inserting package data for which we did not find a matching port. This is not uncommon on the development server, but will be flagged on the production host.

The INSERT then becomes:

begin;
set track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) 
INSERT INTO packages (abi_id, port_id, package_version, package_name, package_set)
SELECT abi_id, port_id, package_version, package_name, package_set
  FROM packages_raw PR
 WHERE NOT EXISTS (
 SELECT *
   FROM packages P
  WHERE P.abi_id           = PR.abi_id
    AND P.package_set      = PR.package_set
    AND P.package_name     = PR.package_name
    AND P.package_version  = PR.package_version
    AND PR.abi             = 'FreeBSD:13:amd64'
    AND PR.package_set     = 'latest')
 AND PR.abi_id  IS NOT NULL
 AND PR.port_id IS NOT NULL
 AND PR.abi             = 'FreeBSD:13:amd64'
 AND PR.package_set     = 'latest';
                                                                             QUERY PLAN                                                                              
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Insert on packages  (cost=480.76..28278.42 rows=15908 width=58) (actual time=341.049..341.050 rows=0 loops=1)
   Buffers: shared hit=133908 read=1 dirtied=828 written=9
   I/O Timings: read=0.048
   ->  Nested Loop Anti Join  (cost=480.76..28278.42 rows=15908 width=58) (actual time=21.860..313.677 rows=503 loops=1)
         Join Filter: ((pr.abi = 'FreeBSD:13:amd64'::text) AND (pr.package_set = 'latest'::package_sets))
         Buffers: shared hit=127747 dirtied=24
         ->  Bitmap Heap Scan on packages_raw pr  (cost=480.34..11899.98 rows=16541 width=51) (actual time=16.643..48.263 rows=31383 loops=1)
               Recheck Cond: ((abi = 'FreeBSD:13:amd64'::text) AND (package_set = 'latest'::package_sets))
               Filter: ((abi_id IS NOT NULL) AND (port_id IS NOT NULL))
               Rows Removed by Filter: 18
               Heap Blocks: exact=530
               Buffers: shared hit=1668
               ->  Bitmap Index Scan on packages_raw_abi_package_set_idx  (cost=0.00..476.20 rows=19088 width=0) (actual time=16.354..16.354 rows=62784 loops=1)
                     Index Cond: ((abi = 'FreeBSD:13:amd64'::text) AND (package_set = 'latest'::package_sets))
                     Buffers: shared hit=773
         ->  Index Only Scan using packages_all_idx on packages p  (cost=0.42..0.97 rows=1 width=29) (actual time=0.007..0.007 rows=1 loops=31383)
               Index Cond: ((abi_id = pr.abi_id) AND (package_name = pr.package_name) AND (package_set = pr.package_set) AND (package_set = 'latest'::package_sets))
               Filter: (package_version = pr.package_version)
               Rows Removed by Filter: 0
               Heap Fetches: 31450
               Buffers: shared hit=125576 dirtied=24
 Planning Time: 1.607 ms
 Execution Time: 341.237 ms
(23 rows)

freshports.dev=# 

See https://explain.depesz.com/s/ylCc.

The DELETE

We must be cautious with this query. The goal is to remove entries from the packages table which are not available from the repo.

We want to find entries in the packages table which are not found in the package_raw table, the inverse of what we do for INSERT.

Originally, I thought package_version was required in this SELECT, but that will pick up entries which exist, but only the package version has changed. If we always did the DELETE after the UPDATE, the inclusion of package_version in the comparison would still work, but it is better to not include it.

begin;
set track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM packages P
 WHERE P.abi_id      = 11
   AND P.package_set = 'latest'
   AND NOT EXISTS (
     SELECT *
       FROM packages_raw PR
      WHERE P.abi_id       = PR.abi_id
        AND P.package_set  = PR.package_set
        AND P.package_name = PR.package_name
        AND PR.abi         = 'FreeBSD:13:amd64'
        AND PR.package_set = 'latest');
                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on packages p  (cost=3.08..6108.16 rows=19353 width=12) (actual time=52.662..52.662 rows=0 loops=1)
   Buffers: shared hit=35880
   ->  Merge Anti Join  (cost=3.08..6108.16 rows=19353 width=12) (actual time=23.933..52.586 rows=36 loops=1)
         Merge Cond: (p.package_name = pr.package_name)
         Join Filter: ((p.abi_id = pr.abi_id) AND (p.package_set = pr.package_set))
         Buffers: shared hit=35844
         ->  Index Scan using packages_all_idx on packages p  (cost=0.42..6057.09 rows=19354 width=28) (actual time=0.018..15.736 rows=31384 loops=1)
               Index Cond: ((abi_id = 11) AND (package_set = 'latest'::package_sets))
               Buffers: shared hit=27257
         ->  Sort  (cost=2.66..2.66 rows=1 width=29) (actual time=23.518..25.581 rows=31383 loops=1)
               Sort Key: pr.package_name
               Sort Method: quicksort  Memory: 3278kB
               Buffers: shared hit=8587
               ->  Index Scan using packages_raw_all on packages_raw pr  (cost=0.42..2.65 rows=1 width=29) (actual time=0.021..14.889 rows=31383 loops=1)
                     Index Cond: ((abi_id = 11) AND (package_set = 'latest'::package_sets))
                     Filter: (abi = 'FreeBSD:13:amd64'::text)
                     Buffers: shared hit=8587
 Planning Time: 0.421 ms
 Execution Time: 52.712 ms
(19 rows)

See https://explain.depesz.com/s/ysWG for details.

Putting it all into one function

This is the function, complete with stats updates.

CREATE OR REPLACE FUNCTION UpdatePackagesFromRawPackages( IN a_abi_name text, IN a_package_set package_sets)
    RETURNS TABLE(inserts integer, updates integer, deletes integer)
    LANGUAGE plpgsql VOLATILE
    AS $$
  DECLARE
    l_abi_id  integer;
    l_inserts integer;
    l_deletes integer;
    l_updates integer;
  BEGIN
    SELECT id
      INTO l_abi_id
      FROM abi
     WHERE name = a_abi_name;

    IF NOT FOUND THEN
        RAISE EXCEPTION 'cannot find id for ABI %', a_abi_name;
    END IF;

    UPDATE packages p
       SET package_version = PR.package_version
      FROM packages_raw PR
     WHERE P.abi_id           = PR.abi_id
       AND P.package_set      = PR.package_set
       AND P.package_name     = PR.package_name
       AND P.package_version != PR.package_version
       AND PR.abi             = a_abi_name
       AND PR.package_set     = a_package_set;

    GET DIAGNOSTICS l_updates = ROW_COUNT;

    INSERT INTO packages (abi_id, port_id, package_version, package_name, package_set)
    SELECT abi_id, port_id, package_version, package_name, package_set
      FROM packages_raw PR
     WHERE NOT EXISTS (
     SELECT *
       FROM packages P
      WHERE P.abi_id           = PR.abi_id
        AND P.package_set      = PR.package_set
        AND P.package_name     = PR.package_name
        AND P.package_version  = PR.package_version
        AND PR.abi             = a_abi_name
        AND PR.package_set     = a_package_set)
     AND PR.abi_id  IS NOT NULL
     AND PR.port_id IS NOT NULL
     AND PR.abi             = a_abi_name
     AND PR.package_set     = a_package_set;

    GET DIAGNOSTICS l_inserts = ROW_COUNT;

    DELETE FROM packages P
     WHERE P.abi_id      = l_abi_id
       AND P.package_set = a_package_set
       AND NOT EXISTS (
         SELECT *
           FROM packages_raw PR
          WHERE P.abi_id       = PR.abi_id
            AND P.package_set  = PR.package_set
            AND P.package_name = PR.package_name
            AND PR.abi         = a_abi_name
            AND PR.package_set = a_package_set);

    GET DIAGNOSTICS l_deletes = ROW_COUNT;

    INSERT INTO package_imports (abi_id, package_set, date, inserts, updates, deletes)
                         values (l_abi_id, a_package_set, CURRENT_TIMESTAMP, l_inserts, l_updates, l_deletes);

    RETURN QUERY (SELECT l_inserts, l_updates, l_deletes);

  END;
$$;


And it seems to work as expected. More testing to follow.

freshports.dev=# begin; select * from UpdatePackagesFromRawPackages('FreeBSD:13:amd64', 'latest');
BEGIN
 inserts | updates | deletes 
---------+---------+---------
      35 |     469 |      36
(1 row)

freshports.dev=# select * from package_imports;
 id | abi_id | package_set |        date        | inserts | updates | deletes 
----+--------+-------------+--------------------+---------+---------+---------
  1 |     11 | latest      | 20:43:46.194598+00 |      35 |     469 |      36
(1 row)

freshports.dev=# rollback;

That is a total of 540 records updates, out of 30,474 records imported. That’s far better than deleting and inserting all that data.

More work on this over the weekend.

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive

Leave a Comment

Scroll to Top