Changing how we record ports which need cache clearing

In this post I’ll talk about how the code initially took the easy way out and how recent changes have tightened that up. In short, the code would raise N messages when a single notification is sufficient.

Today the full vuxml database was reloaded into FreshPorts. This affects 3,454 ports and marks 60,436 commits (or revisions) as vulnerable. When you look at a port page, you can see the skulls that indicate a known vuln beside the commits in question (look under Commit History). That amounts to 565,576 unique combinations of commits, ports, and vuxml entries.

After all that is processed (over about 21.5 hours on the current server), we need to flush the cache for each item that changed (ports, daily summaries, etc.).

How cache clearing works

When a port is updated, a trigger on the ports table adds an entry to the cache_clearing_ports table. A long-running daemon LISTENs on a notification channel; when the trigger’s NOTIFY arrives, the daemon reads the table and processes the entries, clearing the cache.
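
In rough terms, the daemon’s side of that protocol looks like this (a sketch only: the real daemon is a separate program, and the drain step here is my own illustration):

-- Subscribe to the channel the trigger (shown below) will NOTIFY on:
LISTEN port_updated;

-- When a notification arrives, drain the queue. DELETE ... RETURNING
-- hands back the deleted rows in one statement, telling the daemon
-- which caches to clear:
DELETE FROM cache_clearing_ports
RETURNING port_id, category, port;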

That description is simplified: updates occur in many places, and each one invokes its own trigger. For this post, we will deal with only one.

The current cache_clearing_ports table

This is the current cache_clearing_ports table:

freshports.org=# \d cache_clearing_ports
                                        Table "public.cache_clearing_ports"
   Column   |            Type             | Collation | Nullable |                     Default                      
------------+-----------------------------+-----------+----------+--------------------------------------------------
 id         | integer                     |           | not null | nextval('cache_clearing_ports_id_seq'::regclass)
 port_id    | integer                     |           | not null | 
 category   | text                        |           | not null | 
 port       | text                        |           | not null | 
 date_added | timestamp without time zone |           | not null | now()
Indexes:
    "cache_clearing_ports_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "cache_clearing_ports_port_id_fkey" FOREIGN KEY (port_id) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE

freshports.org=# 

As you can see, duplicate entries are allowed; nothing in the table prevents the same port from being queued twice. However, the vuxml data is processed as a single commit, and that can result in a very large number of duplicates.

How many duplicates?

Checking here, we have just over 1 million entries.

freshports.org=# select count(*) from cache_clearing_ports;
  count  
---------
 1005407
(1 row)

How unique are they?

Not very. They cover only about 3,500 distinct ports:

freshports.org=# select count(distinct port_id) from commit_log_ports_vuxml;
 count 
-------
  3454
(1 row)
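
Put those two numbers together: 1,005,407 queue entries for roughly 3,454 ports works out to about 291 entries per port, on average. A grouped count would show how the duplicates pile up (an illustrative query, not part of the session above):

SELECT port_id, count(*) AS queued
  FROM cache_clearing_ports
 GROUP BY port_id
 ORDER BY queued DESC
 LIMIT 10;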

I recall thinking when I wrote this code: duplicates are no big deal. The cache gets cleared twice. So what?

It takes time to read through the cache_clearing_ports table. A long time, when you have about 1 million entries.

The trigger

This is the existing trigger. Note that it fires FOR EACH ROW, so every update to a port queues another entry, duplicates included:

CREATE OR REPLACE FUNCTION ports_clear_cache() RETURNS TRIGGER AS $$
   DECLARE
      l_port      text;
      l_category  text;
   BEGIN
      IF TG_OP = 'UPDATE' THEN
            SELECT category, name
              INTO l_category, l_port
              FROM ports_all
             WHERE id = NEW.id;

            INSERT INTO cache_clearing_ports (port_id, category, port)
                 VALUES (NEW.id, l_category, l_port);

            NOTIFY port_updated;
      END IF;

      -- when a port changes, add an entry to the cache clearing table
      RETURN NEW;
   END
$$ LANGUAGE 'plpgsql';

  DROP TRIGGER IF EXISTS ports_clear_cache ON ports;
CREATE TRIGGER ports_clear_cache
    AFTER UPDATE ON ports
    FOR EACH ROW
    EXECUTE PROCEDURE ports_clear_cache();

This is how I propose to change the table. First, I’m creating a new identical table to play with:

freshports.dev=# create table cache_clearing_ports_test (like cache_clearing_ports INCLUDING ALL);
CREATE TABLE

Just what I need. This creates a new table with the same columns, defaults, and indexes (LIKE does not copy the foreign key constraint, as the output below shows), so I can play with it without affecting ongoing database operations on my dev server.

This is the DDL I would run:

ALTER TABLE public.cache_clearing_ports_test
    ADD CONSTRAINT cache_clearing_ports_port_id_idx UNIQUE (port_id);

COMMENT ON CONSTRAINT cache_clearing_ports_port_id_idx ON public.cache_clearing_ports_test
    IS 'Let''s store just one instance per port_id';
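
To confirm the comment was stored, the system catalogs can be queried (a check I’ve added here for illustration; it was not part of my session):

-- obj_description() returns the comment attached to a database object;
-- here, the comment on the new unique constraint.
SELECT obj_description(oid, 'pg_constraint')
  FROM pg_constraint
 WHERE conname = 'cache_clearing_ports_port_id_idx'
   AND conrelid = 'cache_clearing_ports_test'::regclass;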

The DDL for the real table would be:

ALTER TABLE public.cache_clearing_ports
    ADD CONSTRAINT cache_clearing_ports_port_id_idx UNIQUE (port_id);

COMMENT ON CONSTRAINT cache_clearing_ports_port_id_idx ON public.cache_clearing_ports
    IS 'Let''s store just one instance per port_id';
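
One caveat: ADD CONSTRAINT ... UNIQUE will fail if the table already holds duplicate port_id values, and as shown above, the live table has about a million rows covering only ~3,500 ports. Something like this would have to run first (my sketch; it keeps the oldest entry for each port):

-- Remove duplicate queue entries, keeping the lowest id per port_id,
-- so the UNIQUE constraint can be added.
DELETE FROM cache_clearing_ports a
 USING cache_clearing_ports b
 WHERE a.port_id = b.port_id
   AND a.id > b.id;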

This is the amended test table:

freshports.dev=# \d cache_clearing_ports_test
                                      Table "public.cache_clearing_ports_test"
   Column   |            Type             | Collation | Nullable |                     Default                      
------------+-----------------------------+-----------+----------+--------------------------------------------------
 id         | integer                     |           | not null | nextval('cache_clearing_ports_id_seq'::regclass)
 port_id    | integer                     |           | not null | 
 category   | text                        |           | not null | 
 port       | text                        |           | not null | 
 date_added | timestamp without time zone |           | not null | now()
Indexes:
    "cache_clearing_ports_test_pkey" PRIMARY KEY, btree (id)
    "cache_clearing_ports_port_id_idx" UNIQUE CONSTRAINT, btree (port_id)

freshports.dev=# 

Next step: fix the trigger, which will produce duplicate-key errors if left as is.

The new trigger

The INSERT command has an ON CONFLICT clause, with an optional conflict_target, which is very useful for this situation.

CREATE OR REPLACE FUNCTION ports_clear_cache() RETURNS TRIGGER AS $$
   DECLARE
      l_port      text;
      l_category  text;
   BEGIN
      IF TG_OP = 'UPDATE' THEN
            SELECT category, name
              INTO l_category, l_port
              FROM ports_all
             WHERE id = NEW.id;

            INSERT INTO cache_clearing_ports (port_id, category, port)
                 VALUES (NEW.id, l_category, l_port)
            ON CONFLICT ON CONSTRAINT cache_clearing_ports_port_id_idx DO NOTHING;

            NOTIFY port_updated;
      END IF;

      -- when a port changes, add an entry to the cache clearing table
      RETURN NEW;
   END
$$ LANGUAGE 'plpgsql';

What this says:

  • On the INSERT
  • IF there is a conflict on the unique port_id index
  • THEN do nothing

There are many other things you can do ON CONFLICT, but in my case, doing nothing is what I want.
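
A quick demonstration against the test table (the values are made up; the test table starts out empty):

-- The first insert adds a row; the second hits the unique constraint
-- and is silently skipped.
INSERT INTO cache_clearing_ports_test (port_id, category, port)
     VALUES (1, 'security', 'vuxml')
ON CONFLICT ON CONSTRAINT cache_clearing_ports_port_id_idx DO NOTHING;

INSERT INTO cache_clearing_ports_test (port_id, category, port)
     VALUES (1, 'security', 'vuxml')
ON CONFLICT ON CONSTRAINT cache_clearing_ports_port_id_idx DO NOTHING;

-- Only one row for port_id 1 ends up in the table:
SELECT count(*) FROM cache_clearing_ports_test WHERE port_id = 1;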

But wait! There’s more!

I went one step further. Why NOTIFY if we don’t have to?

CREATE OR REPLACE FUNCTION ports_clear_cache() RETURNS TRIGGER AS $$
   DECLARE
      l_port      text;
      l_category  text;
      l_row_count bigint;
   BEGIN
      IF TG_OP = 'UPDATE' THEN
            SELECT category, name
              INTO l_category, l_port
              FROM ports_all
             WHERE id = NEW.id;

            INSERT INTO cache_clearing_ports (port_id, category, port)
                 VALUES (NEW.id, l_category, l_port)
            ON CONFLICT ON CONSTRAINT cache_clearing_ports_port_id_idx DO NOTHING;

            GET DIAGNOSTICS l_row_count = ROW_COUNT;
            IF l_row_count > 0 THEN
                  NOTIFY port_updated;
            END IF;
      END IF;

      -- when a port changes, add an entry to the cache clearing table
      RETURN NEW;
   END
$$ LANGUAGE 'plpgsql';

In this solution, the code checks to see if anything was inserted. If so, it issues the NOTIFY.
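
GET DIAGNOSTICS ... ROW_COUNT reports how many rows the most recent SQL command affected, and an insert skipped by DO NOTHING reports 0. Here is a standalone way to watch that happen against the test table (again with made-up values):

DO $$
DECLARE
   l_row_count bigint;
BEGIN
   -- A fresh port_id inserts one row, so ROW_COUNT is 1...
   INSERT INTO cache_clearing_ports_test (port_id, category, port)
        VALUES (2, 'www', 'apache24')
   ON CONFLICT ON CONSTRAINT cache_clearing_ports_port_id_idx DO NOTHING;
   GET DIAGNOSTICS l_row_count = ROW_COUNT;
   RAISE NOTICE 'first insert:  % row(s)', l_row_count;

   -- ...and the duplicate is skipped, so ROW_COUNT is 0: no NOTIFY needed.
   INSERT INTO cache_clearing_ports_test (port_id, category, port)
        VALUES (2, 'www', 'apache24')
   ON CONFLICT ON CONSTRAINT cache_clearing_ports_port_id_idx DO NOTHING;
   GET DIAGNOSTICS l_row_count = ROW_COUNT;
   RAISE NOTICE 'second insert: % row(s)', l_row_count;
END
$$;

The net effect: a mass update like the vuxml reload now queues at most one row per port and fires at most one NOTIFY per port, instead of hundreds of each.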
