FreshPorts had a cache clearing issue recently. It lingered. For a long time. It took me a while to figure it out. It turned out to be a Python 3 compatibility issue.
In this post:
- PostgreSQL 12.1
- FreeBSD 12.1
I will outline how database changes asynchronously invoke external scripts that clear the cache, then walk through debugging the issue and how the bug was fixed.
How caching works
FreshPorts caches data on disk. When you visit a page, the website checks the cache on disk; if the page is not found there, it runs the queries, creates the HTML, then saves it to disk.
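For illustration, here is a minimal sketch of that cache-aside pattern in Python. The function and path names are hypothetical, not FreshPorts' actual implementation:

import os

CACHE_DIR = '/var/db/freshports/cache'   # hypothetical cache location

def run_queries_and_build_html(page_key):
    # stand-in for the real database queries and HTML generation
    return '<html>page for %s</html>' % page_key

def fetch_page(page_key):
    cache_file = os.path.join(CACHE_DIR, page_key + '.head')
    if os.path.isfile(cache_file):
        with open(cache_file) as f:
            return f.read()                        # cache hit: serve the saved HTML
    html = run_queries_and_build_html(page_key)    # cache miss: build the page
    with open(cache_file, 'w') as f:
        f.write(html)                              # save it for the next visitor
    return html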
Really? You cache to disk? Why?
Q: Why don’t you use memcached or some other caching tool?
A: FreshPorts was created long before such fancy tools existed.
The database side of caching
FreshPorts uses PostgreSQL, an amazing database which just happens to be open source and freely available. The initial work was done on a lesser database system, but in July 2000 I was introduced to PostgreSQL, which had the grown-up features I expected, such as stored procedures and referential integrity. I never looked back.
Cache clearing in FreshPorts relies upon asynchronous notification using the LISTEN and NOTIFY commands.
A small Python script (about 220 lines) issues several LISTEN commands, then starts polling.
This is the connection:
import configparser
import psycopg2
import select   # used by the main loop below
import syslog

config = configparser.ConfigParser()
config.read('/usr/local/etc/freshports/fp-listen.ini')

DSN = 'host=' + config['database']['HOST'] + ' dbname=' + config['database']['DBNAME'] + \
      ' user=' + config['database']['DBUSER'] + ' password=' + config['database']['PASSWORD']

conn = psycopg2.connect(DSN)
curs = conn.cursor()
Next, we read from the database the list of LISTEN channels and issue a LISTEN command for each one:
curs.execute("SELECT name, script_name FROM listen_for ORDER BY id") listens_for = curs.fetchall() listens = dict() print ("These are the (event name, script name) pairs we are ready for:") for listen in listens_for: curs.execute("LISTEN %s" % listen[0]) listens[listen[0]] = listen[1] print ("('%s', '%s')" % (listen[0], listen[1]))
In this case, name is a channel and script_name is what the Python script will run when it encounters a NOTIFY for that channel.
This is the main loop of the code:
while 1:
    if select.select([conn], [], [], 5) == ([], [], []):
        syslog.syslog(syslog.LOG_NOTICE, 'timeout! *************')
    else:
        conn.poll()
        syslog.syslog(syslog.LOG_NOTICE, 'Just woke up! *************')
        while conn.notifies:
            notify = conn.notifies.pop(0)
            # in real life, do something with each...
            syslog.syslog(syslog.LOG_NOTICE, "Got NOTIFY: %d, %s, %s" % (notify.pid, notify.channel, notify.payload))
            if notify.channel in listens:
                syslog.syslog(syslog.LOG_NOTICE, "found key %s" % (notify.channel))
                clear_cache = True
                if listens[notify.channel] == 'listen_port':
                    syslog.syslog(syslog.LOG_NOTICE, "invoking RemoveCacheEntry()")
                    RemoveCacheEntry()
                elif listens[notify.channel] == 'listen_date_updated':
                    syslog.syslog(syslog.LOG_NOTICE, "invoking ClearDateCacheEntries()")
                    ClearDateCacheEntries()
                # ... more elif branches, one per channel, elided ...
                else:
                    clear_cache = False
                    syslog.syslog(syslog.LOG_ERR, "Code does not know what to do when '%s' is found." % notify.channel)
                if clear_cache:
                    ClearMiscCaches()
            else:
                syslog.syslog(syslog.LOG_NOTICE, 'no such key in listens array for %s!' % (notify.channel))
This isn’t great code, patches & suggestions welcomed.
I remember this worked differently in earlier versions of PostgreSQL, but this approach has been in use for several years now. The select() call times out after five seconds of inactivity, and when it does, we just loop again.
When we receive a NOTIFY, we verify its channel is one we LISTEN for, then invoke the appropriate script for that channel.
The database side
Triggers will invoke a NOTIFY when and where relevant. An update on the commit_log table will add an entry to the cache_clearing_dates table and invoke NOTIFY date_updated.
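The actual trigger definitions are not shown in this post, but the idea looks something like this sketch, run once at setup time through the same psycopg2 connection. The commit_log column name (commit_date) is an assumption; the real trigger may differ:

# a rough sketch, not the real FreshPorts trigger
curs.execute("""
CREATE FUNCTION cache_clearing_dates_add() RETURNS trigger AS $$
BEGIN
    -- queue the affected date, then wake up any listeners
    INSERT INTO cache_clearing_dates (date_to_clear) VALUES (NEW.commit_date);
    NOTIFY date_updated;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER commit_log_cache_clearing
    AFTER INSERT OR UPDATE ON commit_log
    FOR EACH ROW EXECUTE FUNCTION cache_clearing_dates_add();
""")
conn.commit()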
How cache clearing works
When the database issues NOTIFY date_updated, the script receives that notification. It then queries the cache_clearing_dates table to see which dates need to be cleared. On disk, the cache looks like this:
[dan@dev-nginx01:/var/service/fp-listen/fp-listen] $ sudo ls -l /var/db/freshports/cache/daily/2020/01
total 340
-rw-rw-r--  1 www  freshports  143080 Jan  9 23:26 08.daily.head
-rw-rw-r--  1 www  freshports  107017 Jan 11 04:57 10.daily.head
-rw-rw-r--  1 www  freshports  152478 Jan 21 14:26 17.daily.head
-rw-rw-r--  1 www  freshports  129303 Jan 25 18:12 23.daily.head
Thus, the code just deletes the file pertaining to the date in question.
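The real ClearDateCacheEntries() is not shown in this post, but assuming it reads the queue table and unlinks the matching files, a sketch could look like this (reusing the conn and curs objects from earlier):

import os

def ClearDateCacheEntries():
    # fetch the queued dates, remove each date's cache file, then dequeue
    curs.execute("SELECT id, date_to_clear FROM cache_clearing_dates")
    for row_id, date_to_clear in curs.fetchall():
        cache_file = date_to_clear.strftime('/var/db/freshports/cache/daily/%Y/%m/%d.daily.head')
        try:
            os.remove(cache_file)
        except FileNotFoundError:
            pass  # nothing cached for that date
        curs.execute("DELETE FROM cache_clearing_dates WHERE id = %s", (row_id,))
    conn.commit()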
Finding the issue
Debugging the caching was challenging; the first step was reproducing the issue.
I started the small Python app on my dev FreshPorts jail:
[root@dev-nginx01:/var/service/fp-listen] # echo ./fp-listen/fp-listen.py | su -fm freshports
These are the (event name, script name) pairs we are ready for:
('port_updated', 'listen_port')
('ports_moved', 'listen_ports_moved')
('ports_updating', 'listen_ports_updating')
('vuxml', 'listen_vuxml')
('category_new', 'listen_category_new')
('date_updated', 'listen_date_updated')
The output you see is what the app has read from the listen_for table. The events are data-driven; see the channel-loading code above.
freshports.dev=# SELECT * FROM listen_for;
 id |      name      |      script_name
----+----------------+-----------------------
  5 | port_updated   | listen_port
  6 | ports_moved    | listen_ports_moved
  7 | ports_updating | listen_ports_updating
  8 | vuxml          | listen_vuxml
 10 | category_new   | listen_category_new
 11 | date_updated   | listen_date_updated
(6 rows)

freshports.dev=#
Now that the code is running, in a separate ssh session, I connect to the database and manually issue a NOTIFY command:
freshports.dev=# NOTIFY date_updated;
NOTIFY
freshports.dev=#
BOOM, the Python application dies in a heap:
Traceback (most recent call last):
  File "./fp-listen/fp-listen.py", line 190, in <module>
    if listens.has_key(notify.channel):
AttributeError: 'dict' object has no attribute 'has_key'
[root@dev-nginx01:/var/service/fp-listen] #
NOTE:
- the code you see above has already been patched
- line 190 of the deployed script is the channel check in the main loop shown above (which now reads if notify.channel in listens:)
The fix
The fix was:
- if listens.has_key(notify.channel):
+ if notify.channel in listens:
The fixed line is the channel check in the main loop shown above. dict.has_key() was removed in Python 3; this is a known and documented Python 2 to Python 3 migration issue.
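If you want to see the difference for yourself, a two-line demonstration:

listens = {'date_updated': 'listen_date_updated'}

# Python 2 only; raises AttributeError under Python 3:
#   listens.has_key('date_updated')

# works in both Python 2 and Python 3:
print('date_updated' in listens)   # True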
Look at all the backed up queues!
When fp-listen gets a NOTIFY for ports or dates which need clearing, the notification can carry a payload, but a payload must be shorter than 8000 bytes. As mentioned above, we store the details in tables instead of sending them directly to the Python app.
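For completeness, a payload can be attached from SQL with pg_notify(channel, payload) and read in the handler via notify.payload; this sketch shows the mechanism we deliberately avoid for bulk data:

# pg_notify() takes the channel and payload as ordinary arguments;
# payloads are capped at just under 8000 bytes, hence the queue tables.
curs.execute("SELECT pg_notify('date_updated', '2020-01-22')")
conn.commit()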
Looking in the tables:
[dan@x8dtu-pg02:~] $ psql freshports.org
Password for user dan:
psql (12.1)
Type "help" for help.

freshports.org=# select count(*) from cache_clearing_ports;
 count
-------
 20508
(1 row)

freshports.org=# select count(*) from cache_clearing_dates;
 count
-------
   290
(1 row)

freshports.org=#
So how long has this been an issue?
freshports.org=# select * from cache_clearing_dates order by date_added asc limit 5;
   id   | date_to_clear |   date_added
--------+---------------+-----------------
 172300 | 2020-01-22    | 00:08:21.443837
 172301 | 2020-01-22    | 00:12:37.328254
 172302 | 2020-01-22    | 00:30:32.244191
 172303 | 2020-01-22    | 00:41:27.386203
 172304 | 2020-01-22    | 00:43:57.517707
(5 rows)

freshports.org=# select * from cache_clearing_dates order by date_added desc limit 5;
   id   | date_to_clear |   date_added
--------+---------------+-----------------
 172299 | 2020-01-21    | 23:58:53.74748
 172298 | 2020-01-21    | 23:49:24.205331
 172188 | 2020-01-20    | 23:20:03.807726
 172297 | 2020-01-21    | 23:16:04.518664
 172187 | 2020-01-20    | 23:06:55.273125
(5 rows)

freshports.org=#
Oh, it seems cache_clearing_dates.date_added should be a timestamp, not just a time; that is why the descending sort above interleaves January 20 and January 21 entries. I’ll fix that issue later.
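When I do, the fix will presumably be something along these lines (untested; the USING clause has to invent a date for the existing time-only values):

curs.execute("""
ALTER TABLE cache_clearing_dates
    ALTER COLUMN date_added TYPE timestamp without time zone
        USING (CURRENT_DATE + date_added),
    ALTER COLUMN date_added SET DEFAULT now();
""")
conn.commit()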
Let’s try the ports cache queue:
freshports.org=# select * from cache_clearing_ports order by date_added asc limit 5;
    id    | port_id | category |     port      |         date_added
----------+---------+----------+---------------+----------------------------
 16669514 |   56394 | graphics | py-pygeoapi   | 2020-01-20 17:23:04.747768
 16669515 |   16343 | devel    | py-setuptools | 2020-01-20 17:23:04.747768
 16669516 |   56394 | graphics | py-pygeoapi   | 2020-01-20 17:23:04.747768
 16669517 |   49106 | lang     | python37      | 2020-01-20 17:23:04.747768
 16669513 |   56394 | graphics | py-pygeoapi   | 2020-01-20 17:23:04.747768
(5 rows)

freshports.org=# select * from cache_clearing_ports order by date_added desc limit 5;
    id    | port_id | category |       port       |         date_added
----------+---------+----------+------------------+----------------------------
 16691617 |   37057 | print    | R-cran-knitr     | 2020-01-22 15:37:13.407464
 16691618 |   37124 | textproc | R-cran-rmarkdown | 2020-01-22 15:37:13.407464
 16691615 |   37124 | textproc | R-cran-rmarkdown | 2020-01-22 15:37:13.407464
 16691616 |   37124 | textproc | R-cran-rmarkdown | 2020-01-22 15:37:13.407464
 16691619 |   37051 | textproc | R-cran-yaml      | 2020-01-22 15:37:13.407464
(5 rows)

freshports.org=#
From this, it appears the problem arose on January 20th.
FYI, the tables look like this:
freshports.org=# \d cache_clearing_ports
                                Table "public.cache_clearing_ports"
   Column   |            Type             | Collation | Nullable |                     Default
------------+-----------------------------+-----------+----------+--------------------------------------------------
 id         | integer                     |           | not null | nextval('cache_clearing_ports_id_seq'::regclass)
 port_id    | integer                     |           | not null |
 category   | text                        |           | not null |
 port       | text                        |           | not null |
 date_added | timestamp without time zone |           | not null | now()
Indexes:
    "cache_clearing_ports_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "cache_clearing_ports_port_id_fkey" FOREIGN KEY (port_id) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE

freshports.org=# \d cache_clearing_dates
                             Table "public.cache_clearing_dates"
    Column     |          Type          | Collation | Nullable |                     Default
---------------+------------------------+-----------+----------+--------------------------------------------------
 id            | integer                |           | not null | nextval('cache_clearing_dates_id_seq'::regclass)
 date_to_clear | date                   |           |          |
 date_added    | time without time zone |           |          | now()

freshports.org=#
Releasing the code
After I fixed the code, packaged it, and installed it on test, I encountered this error:
Jan 22 17:28:32 test-nginx01 fp-listen[62850]: ClearMiscCaches() is clearing '/var/db/freshports/cache/new/*'
Eh? That’s the wrong directory. It should be news, not new. Here’s the code in question:
def ClearMiscCaches():
    # uses the glob, os, shutil, and syslog modules imported at the top of the script
    syslog.syslog(syslog.LOG_NOTICE, 'invoked: ClearMiscCaches()')
    filenameglob = config['dirs']['NEWS_CACHE_DIR']
    syslog.syslog(syslog.LOG_NOTICE, 'ClearMiscCaches() is clearing %s' % (filenameglob))
    for filename in glob.glob(filenameglob):
        syslog.syslog(syslog.LOG_NOTICE, 'removing %s' % (filename))
        try:
            if os.path.isfile(filename):
                os.remove(filename)
            else:
                shutil.rmtree(filename)
        except OSError as err:
            if err.errno == 2:
                pass  # no file to delete, so no worries
            else:
                syslog.syslog(syslog.LOG_CRIT, 'ERROR: error deleting cache entry. Error message is %s' % (err))
            continue
Here is the entry from fp-listen.ini (the NEWS_CACHE_DIR value read near the top of ClearMiscCaches()):
NEWS_CACHE_DIR = '%(BASEDIR)s/cache/new/*'
That should be:
NEWS_CACHE_DIR = '%(BASEDIR)s/cache/news/*'
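As an aside, the %(BASEDIR)s reference works because configparser's default BasicInterpolation substitutes other keys from the same section. A quick demonstration with made-up values:

import configparser

config = configparser.ConfigParser()
config.read_string("""
[dirs]
BASEDIR = /var/db/freshports
NEWS_CACHE_DIR = %(BASEDIR)s/cache/news/*
""")
print(config['dirs']['NEWS_CACHE_DIR'])   # /var/db/freshports/cache/news/*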
I had to make this fix on every server other than dev; the value was incorrect everywhere else.
I restarted the service, which is controlled by daemontools:
sudo svc -t /var/service/fp-listen
I then found the glob.glob(filenameglob) loop was not working as expected. No syslog entries were being recorded, and the glob was not enumerating everything. The same pattern worked in a small test script, and that test script also yielded the expected results when run as the freshports user. I could not locate a reason why this would fail only inside fp-listen.
New ClearMiscCaches() code
After reviewing the documentation for glob, I stumbled across pathlib's Path and used it. I changed the ClearMiscCaches() code as shown below:
def ClearMiscCaches():
    # Path comes from: from pathlib import Path
    syslog.syslog(syslog.LOG_NOTICE, 'invoked: ClearMiscCaches()')
    # assumes NEWS_CACHE_DIR now names the directory itself, without the '/*' glob suffix
    news_cache_dir = config['dirs']['NEWS_CACHE_DIR']
    syslog.syslog(syslog.LOG_NOTICE, 'ClearMiscCaches() is clearing out entries in %s' % (news_cache_dir))
    for filename in Path(news_cache_dir).iterdir():
        syslog.syslog(syslog.LOG_NOTICE, 'ClearMiscCaches() is removing %s' % (filename))
        try:
            if Path(filename).is_file():
                Path(filename).unlink()
            else:
                shutil.rmtree(filename)
        except OSError as err:
            if err.errno == 2:
                pass  # already gone, no worries
            else:
                syslog.syslog(syslog.LOG_CRIT, 'ERROR: error deleting cache entry. Error message is %s' % (err))
The Path(news_cache_dir).iterdir() loop is the key: we iterate over the directory to get the files we need. One difference from glob: iterdir() does not skip dotfiles (e.g. .Caching_Not_Done_Like_This), but that makes no difference for our needs, as we do not use dotfiles in the cache.
It was very strange, because glob.glob(filenameglob) worked in other parts of the fp-listen code, but not in this directory.
New package
With that code change, the new package was created and made its way through the test, stage, and beta servers before being moved into production.
The cause of the problem seems to be the migration from Python 2 to Python 3. I would like to do more testing on the glob issue, but I have moved on to other issues.
Why LISTEN/NOTIFY?
LISTEN/NOTIFY is a valuable tool when your database needs to trigger external events. Let the transactions complete, and the external code will eventually do your bidding via the LISTEN side of the event.
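One detail that makes this pattern safe: notifications are transactional. A NOTIFY issued inside a transaction is only delivered to listeners when that transaction commits, so listeners never see events for data that was rolled back. A minimal sketch using psycopg2 (connection details hypothetical):

import psycopg2

conn = psycopg2.connect('dbname=freshports')   # hypothetical DSN
curs = conn.cursor()
curs.execute("NOTIFY date_updated")   # queued, invisible to listeners so far
conn.commit()                         # listeners receive the notification now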
Hopefully, this will open up a whole new avenue of solutions for you.