Jan 262020
 

FreshPorts had a cache clearing issue recently. It lingered. For a long time. It took me a while to figure it out. It turned out to be a Python 3 compatibility issue.

In this post:

  • PostgreSQL 12.1
  • FreeBSD 12.1

I will outline how database changes invoke external scripts asynchronously which then clear the cache. I will go into the debugging of the issue and how the bug was fixed.

How caching works

FreshPorts caches data on disk. When you go to visit a page, the website checks the cache on disk, and if not found, it runs the queries, creates the HTML then saves it to disk.

Really? You cache to disk? Why?

Q: Why don’t you using memcached or some other caching tool?

A: FreshPorts was created long before such fancy tools.

The database side of caching

FreshPorts uses PostgreSQL, an amazing database which just happens to be open source and freely available. The initial work was done on a lesser database system, but then I was introduced to PostgreSQL in July 2000, which had grown-up features I expected, such as stored procedures and relational integrity. I never looked back.

Cache clearing in FreshPorts relies upon asynchronous notification using the LISTEN and NOTIFY commands.

A small python script (about 220 lines) issues several LISTEN commands, then starts polling.

This is the connection:

import configparser

config = configparser.ConfigParser()
config.read('/usr/local/etc/freshports/fp-listen.ini')

DSN = 'host=' + config['database']['HOST'] + ' dbname=' + config['database']['DBNAME'] + ' user=' + config['database']['DBUSER'] + ' password=' + config['database']['PASSWORD']

conn = psycopg2.connect(DSN)
curs = conn.cursor()

Next, we read from the database a list of LISTEN channels and then issue a LISTEN command for that channel

curs.execute("SELECT name, script_name FROM listen_for ORDER BY id")
listens_for = curs.fetchall()

listens = dict()
print ("These are the (event name, script name) pairs we are ready for:")
for listen in listens_for:
  curs.execute("LISTEN %s" % listen[0])
  listens[listen[0]] = listen[1]
  print ("('%s', '%s')" % (listen[0], listen[1]))

In this case, name is a channel and script_name is what the Python script will run when it encounters a NOTIFY for that channel.

This is the main loop of the code:

while 1:
  if select.select([conn],[],[],5)==([],[],[]):
    syslog.syslog(syslog.LOG_NOTICE, 'timeout! *************')
  else:
    conn.poll()
    syslog.syslog(syslog.LOG_NOTICE, 'Just woke up! *************')
    while conn.notifies:
      notify = conn.notifies.pop(0);
      # in real life, do something with each...
      syslog.syslog(syslog.LOG_NOTICE, "Got NOTIFY: %d, %s, %s" % (notify.pid, notify.channel, notify.payload));
      if notify.channel in listens:
        syslog.syslog(syslog.LOG_NOTICE, "found key %s" % (notify.channel));
        clear_cache = True;
        if listens[notify.channel]   == 'listen_port':
          syslog.syslog(syslog.LOG_NOTICE, "invoking RemoveCacheEntry()");
          RemoveCacheEntry()
        elif listens[notify.channel] == 'listen_date_updated':
          syslog.syslog(syslog.LOG_NOTICE, "invoking ClearDateCacheEntries()");
          ClearDateCacheEntries()
        ...
        ...
        else:
          clear_cache = False;
          syslog.syslog(syslog.LOG_ERR, "Code does not know what to do when '%s' is found." % notify.channel)

        if clear_cache:
          ClearMiscCaches()

      else:
        syslog.syslog(syslog.LOG_NOTICE, 'no such key in listens array for %s!' % (notify.channel))

This isn’t great code, patches & suggestions welcomed.

I remember this worked differently in earlier versions of PostgreSQL, but this approach has been in use for several years now. The poll command times out occasionally, and when it does, we just loop again.

When we receive a NOTIFY, we verify it is in the list of what we LISTEN for, then go through the list and invoke the appropriate script.

The database side

Triggers will invoke a NOTIFY when and where relevant. An update on the commit_log table will add an entry to the cache_clearing_dates table and invoke NOTIFY date_updated.

How cache clearing works

With the database issuing the NOTIFY, the script receives the NOTIFY date_updated. It will then query the cache_clearing_dates table to see what dates need to be cleared. On disk this looks like this:

[dan@dev-nginx01:/var/service/fp-listen/fp-listen] $ sudo ls -l /var/db/freshports/cache/daily/2020/01
total 340
-rw-rw-r--  1 www  freshports  143080 Jan  9 23:26 08.daily.head
-rw-rw-r--  1 www  freshports  107017 Jan 11 04:57 10.daily.head
-rw-rw-r--  1 www  freshports  152478 Jan 21 14:26 17.daily.head
-rw-rw-r--  1 www  freshports  129303 Jan 25 18:12 23.daily.head

Thus, the code just deletes the given file pertaining to the date in question.

Finding the issue

Debugging the caching was challenging in that the first step was to reproduce the issue.

I started the small Python app on my dev FreshPorts jail:

[root@dev-nginx01:/var/service/fp-listen] # echo ./fp-listen/fp-listen.py | su -fm freshports
These are the (event name, script name) pairs we are ready for:
('port_updated', 'listen_port')
('ports_moved', 'listen_ports_moved')
('ports_updating', 'listen_ports_updating')
('vuxml', 'listen_vuxml')
('category_new', 'listen_category_new')
('date_updated', 'listen_date_updated')

The output you see is what the app has read from the listen_for table. The events are data-driven. See lines 14-18 above.

freshports.dev=# SELECT * FROM listen_for;
 id |      name      |      script_name      
----+----------------+-----------------------
  5 | port_updated   | listen_port
  6 | ports_moved    | listen_ports_moved
  7 | ports_updating | listen_ports_updating
  8 | vuxml          | listen_vuxml
 10 | category_new   | listen_category_new
 11 | date_updated   | listen_date_updated
(6 rows)

freshports.dev=# 

Now that the code is running, in a separate ssh session, I connect to the database and manually issue a NOTIFY command:

freshports.dev=# NOTIFY date_updated;
NOTIFY
freshports.dev=# 

BOOM, the Python application dies in a heap:

Traceback (most recent call last):
  File "./fp-listen/fp-listen.py", line 190, in 
    if listens.has_key(notify.channel):
AttributeError: 'dict' object has no attribute 'has_key'
[root@dev-nginx01:/var/service/fp-listen] #

NOTE:

  1. the code you see above has already been patched
  2. line 190 in the code is actually line 29 as shown above

The fix

The fix was:

-      if listens.has_key(notify.channel):
+      if notify.channel in listens:

This fix is seen online 8 above and is a known and documented Python 2 – > Python 3 migration issue

Look at all the backed up queues!

When fp-listen gets a NOTIFY for ports or dates which need clearing, that data can contain a payload, but this must be shorter than 8000 bytes. As mentioned above, we store this data in tables instead of sending it directly to the Python app.

Looking in the tables:

[dan@x8dtu-pg02:~] $ ': psql freshports.org
Password for user dan: 
psql (12.1)
Type "help" for help.

freshports.org=# select count(*) from cache_clearing_ports;
 count 
-------
 20508
(1 row)

freshports.org=# select count(*) from cache_clearing_dates;
 count 
-------
   290
(1 row)

freshports.org=# 

So how long has this been an issue:

freshports.org=# select * from cache_clearing_dates order by date_added asc limit 5;
   id   | date_to_clear |   date_added    
--------+---------------+-----------------
 172300 | 2020-01-22    | 00:08:21.443837
 172301 | 2020-01-22    | 00:12:37.328254
 172302 | 2020-01-22    | 00:30:32.244191
 172303 | 2020-01-22    | 00:41:27.386203
 172304 | 2020-01-22    | 00:43:57.517707
(5 rows)

freshports.org=# select * from cache_clearing_dates order by date_added desc limit 5;
   id   | date_to_clear |   date_added    
--------+---------------+-----------------
 172299 | 2020-01-21    | 23:58:53.74748
 172298 | 2020-01-21    | 23:49:24.205331
 172188 | 2020-01-20    | 23:20:03.807726
 172297 | 2020-01-21    | 23:16:04.518664
 172187 | 2020-01-20    | 23:06:55.273125
(5 rows)

freshports.org=# 

Oh, it seems the cache_clearing_dates.date_added be a timestamp. I’ll fix that issue later.

Let’s try the ports cache queue:

freshports.org=# select * from cache_clearing_ports order by date_added asc limit 5;
    id    | port_id | category |     port      |         date_added         
----------+---------+----------+---------------+----------------------------
 16669514 |   56394 | graphics | py-pygeoapi   | 2020-01-20 17:23:04.747768
 16669515 |   16343 | devel    | py-setuptools | 2020-01-20 17:23:04.747768
 16669516 |   56394 | graphics | py-pygeoapi   | 2020-01-20 17:23:04.747768
 16669517 |   49106 | lang     | python37      | 2020-01-20 17:23:04.747768
 16669513 |   56394 | graphics | py-pygeoapi   | 2020-01-20 17:23:04.747768
(5 rows)

freshports.org=# select * from cache_clearing_ports order by date_added desc limit 5;
    id    | port_id | category |       port       |         date_added         
----------+---------+----------+------------------+----------------------------
 16691617 |   37057 | print    | R-cran-knitr     | 2020-01-22 15:37:13.407464
 16691618 |   37124 | textproc | R-cran-rmarkdown | 2020-01-22 15:37:13.407464
 16691615 |   37124 | textproc | R-cran-rmarkdown | 2020-01-22 15:37:13.407464
 16691616 |   37124 | textproc | R-cran-rmarkdown | 2020-01-22 15:37:13.407464
 16691619 |   37051 | textproc | R-cran-yaml      | 2020-01-22 15:37:13.407464
(5 rows)

freshports.org=# 

From this, the problem arose on January 20th.

FYI, the tables look like this:

freshports.org=# \d cache_clearing_ports 
                                        Table "public.cache_clearing_ports"
   Column   |            Type             | Collation | Nullable |                     Default                      
------------+-----------------------------+-----------+----------+--------------------------------------------------
 id         | integer                     |           | not null | nextval('cache_clearing_ports_id_seq'::regclass)
 port_id    | integer                     |           | not null | 
 category   | text                        |           | not null | 
 port       | text                        |           | not null | 
 date_added | timestamp without time zone |           | not null | now()
Indexes:
    "cache_clearing_ports_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "cache_clearing_ports_port_id_fkey" FOREIGN KEY (port_id) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE

freshports.org=# \d cache_clearing_dates
                                       Table "public.cache_clearing_dates"
    Column     |          Type          | Collation | Nullable |                     Default                      
---------------+------------------------+-----------+----------+--------------------------------------------------
 id            | integer                |           | not null | nextval('cache_clearing_dates_id_seq'::regclass)
 date_to_clear | date                   |           |          | 
 date_added    | time without time zone |           |          | now()

freshports.org=# 

Releasing the code

After I fixed the code, packaged it, and installed it on test, I encountered this error:

Jan 22 17:28:32 test-nginx01 fp-listen[62850]: ClearMiscCaches() is clearing '/var/db/freshports/cache/new/*'

Eh? That’s the wrong directory. It should be news not new. Here’s the code in question

def ClearMiscCaches():
  syslog.syslog(syslog.LOG_NOTICE, 'invoked: ClearMiscCaches()');
  filenameglob = config['dirs']['NEWS_CACHE_DIR'];
  syslog.syslog(syslog.LOG_NOTICE, 'ClearMiscCaches() is clearing %s' % (filenameglob));
  for filename in glob.glob(filenameglob):
    syslog.syslog(syslog.LOG_NOTICE, 'removing %s' % (filename))
    try:
      if os.path.isfile(filename):
        os.remove(filename)
      else:
        shutil.rmtree(filename)

    except (OSError, err):
      if err[0] == 2:
        pass  # no file to delete, so no worries

      else:
        syslog.syslog(syslog.LOG_CRIT, 'ERROR: error deleting cache entry.  Error message is %s' % (err))
        continue

Here is the entry from fp-listen.ini (see line 4):

NEWS_CACHE_DIR  = '%(BASEDIR)s/cache/new/*'

That should be:

NEWS_CACHE_DIR  = '%(BASEDIR)s/cache/news/*'

I had to make this fix on each server, other than dev. It was incorrect everywhere.

I restarted the service, which is controlled by daemontools:

sudo svc -t /var/service/fp-listen

I found the code on line 5 was not working as expected. No syslog entries were being recorded. glob.glob(filenameglob) was not enumerating everything. Tests in a small script worked. That same test script when run by the freshports user also yielded expected results. I could not locate a reason why this would fail in fp-listen.

New ClearMiscCaches() code

After reviewing the documentation for glob, I stumbled across Path and used it. I changed the ClearMiscCaches() code as shown below:

def ClearMiscCaches():
  syslog.syslog(syslog.LOG_NOTICE, 'invoked: ClearMiscCaches()');
  
  news_cache_dir = config['dirs']['NEWS_CACHE_DIR'];
  syslog.syslog(syslog.LOG_NOTICE, 'ClearMiscCaches() is clearing out entries in %s' % (news_cache_dir));

  for filename in Path(news_cache_dir).iterdir():
    syslog.syslog(syslog.LOG_NOTICE, 'ClearMiscCaches() is removing %s' % (filename))
    try:
      if Path(filename).is_file():
        Path(filename).unlink()
      else:
        shutil.rmtree(filename)

The code on line 5 is the key. We are iterating over a path to get the files we need. dotfiles will be overlooked, which is sufficient for our needs. We do not use dotfiles (e.g. .Caching_Not_Done_Like_This).

It was very strange, because glob.glob(filenameglob) worked in other parts of the fp-listen code, but not in this directory.

New package

With that code change, the new package was created and made it way through test, stage, and beta servers before being moved into production.

The cause of the problem seems to be the migration from Python 2 to Python 3. I would like to more testing on the glob issue, but I have moved onto other issues.

Why LISTEN/NOTIFY?

LISTEN/NOTIFY is a valuable tool when your database needs to invoke external events. Let the transaction[s] complete and then the external code will eventually do your bidding via the LISTEN portion of the event.

Hopefully, this will open up a whole new avenue of solutions for you.

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive