Dan Langille

I've been playing with Open Source software, starting with FreeBSD, since New Zealand Post installed DSL on my street in 1998. From there, I started writing at The FreeBSD Diary, moving my work here after I discovered WordPress. Along the way, I started the BSDCan and PGCon conferences. I slowly moved from software development into full-time systems administration and now work for a very well-known company which has been a big force in the security industry.

Mar 29 2020
 

When I started this, I thought it would be straightforward and take only a few hours at most. It turned into multiple hours spread over two days.

This is part two of a two-part post. The first part dealt with changing a database column from time to timestamp.

In this post:

  1. PostgreSQL 12.2
  2. Nagios 3.5
  3. FreeBSD 12.1

The queries

The two queries I want to run are:

SELECT count(*) FROM cache_clearing_dates WHERE date_added < now() - interval '1 hour';
SELECT count(*) FROM cache_clearing_ports WHERE date_added < now() - interval '1 hour';

The script

Combining those two into a script, I came up with this, after a bit of trial and error:

#!/bin/sh

# This script checks the cache_clearing_ports and cache_clearing_dates in
# a freshports database

# arguments
# * server   - IP address or hostname of the PostgreSQL database server
# * database - database name
# * user     - database user
#
# port is always the default value: 5432
#
# The password must be in ~/.pgpass for the Unix user which runs this script.
#
# For Nagios, that defaults to ~nagios/.pgpass
#
# See https://www.postgresql.org/docs/12/libpq-pgpass.html
#

#
# arguments used to create simple output of a single value
# and to never issue a password prompt
#
psql="/usr/local/bin/psql --no-align --tuples-only --no-password "

server=$1
database=$2
user=$3

# we return zero, all good, by default
result=0

# the two database queries we will run
query_dates="SELECT count(*) FROM cache_clearing_dates WHERE date_added < now() - interval '1 hour';"
query_ports="SELECT count(*) FROM cache_clearing_ports WHERE date_added < now() - interval '1 hour';"

# how is the dates table?

dates=`$psql -c "$query_dates" --host=$server $database $user`
status=$?
if [ $status -ne 0 ] ; then
  # when we hit an error, we do not care about the rest of the checks
  echo "error on dates query: $status $dates"
  exit 2
fi
# uncomment for debugging
#echo all ok with dates query

# how is the ports table?

ports=`$psql -c "$query_ports" --host=$server $database $user`
status=$?
if [ $status -ne 0 ] ; then
  # when we hit an error, we don't care about the rest of the checks
  echo "error on ports query: $status $ports"
  exit 2
fi
# uncomment for debugging
#echo all ok with ports query

# check the results

if [ "$ports" != "0" ] ; then
  echo There are $ports ports entries in the cache
  result=2
fi

if [ "$dates" != "0" ] ; then
  echo There are $dates dates entries in the cache
  result=2
fi

if [ $result -eq 0 ] ; then
  echo All OK
fi

exit $result
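
Before nrpe can run the script, it must be executable by the nagios user. Assuming the path used below, that is one command:

$ sudo chmod 755 /usr/local/libexec/nagios-custom/check_freshports_cache_clearing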

How is this script invoked?

I run this script via net-mgmt/nrpe3. In general, I run that tool on every host and jail. The line within the configuration file looks like this:

$ grep check_freshports_cache_clearing /usr/local/etc/nrpe.cfg
command[check_freshports_cache_clearing]=/usr/local/libexec/nagios-custom/check_freshports_cache_clearing pg03.int.example.org freshports.dev nagios

I store my personal Nagios scripts in a non-default location: /usr/local/libexec/nagios-custom

Usually, Nagios scripts are in /usr/local/libexec/nagios

Why? Mostly so I can keep them all in one place, and there is never any chance of file name collision.
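From the Nagios server side, a quick way to exercise that command definition is check_nrpe, which ships with the nrpe port. The plugin path and hostname here are assumptions; adjust for your install:

$ /usr/local/libexec/nagios/check_nrpe -H dev-ingress01.example.org -c check_freshports_cache_clearing
All OK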

Script parameters

Please note the following about these parameters:

  • very simple
  • rather inflexible
  • they all must be provided
  • they must appear in this order

The parameters are:

  1. pg03.int.example.org : server – IP address or hostname of the PostgreSQL database server
  2. freshports.dev : database – database name
  3. nagios : database user

Do not confuse the nagios database user with the nagios unix user. They are two separate things.

The nagios user above is part of the PostgreSQL database system. I will show you how I created that later.

The nagios unix user is created by the Nagios port. Installing Nagios or nrpe is outside the scope of this post.

Password

I am a big fan of never putting sensitive information on the command line, such as passwords. Even if nobody else is ever on your system, don’t do it.

The solution here is the .pgpass file, as documented by the PostgreSQL project.

The script is run by nrpe which runs as the nagios user.

$ ps auwwx | grep nrpe
nagios     85203  0.0  0.0 15812  6368  -  IsJ  01:54   0:01.48 /usr/local/sbin/nrpe3 -c /usr/local/etc/nrpe.cfg -d
dan        15654  0.0  0.0 11324  2648  3  S+J  18:10   0:00.00 grep nrpe

It is because the script runs as the nagios user that the .pgpass file resides at ~nagios/.pgpass. It must be chown nagios and chmod 0600. My file looks like this, with contents slightly modified for public display:

$ sudo cat ~nagios/.pgpass
# hostname:port:database:username:password
pg03.int.example.org:5432:freshports.dev:nagios:notmypassword
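
If you are creating that file from scratch, something like this sets up the required ownership and permissions (a sketch; the password line still has to be added):

$ sudo touch ~nagios/.pgpass
$ sudo chown nagios:nagios ~nagios/.pgpass
$ sudo chmod 0600 ~nagios/.pgpass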

The permissions look like this:

$ ls -l  ~nagios/.pgpass
-rw-------  1 nagios  nagios  111 Mar 28 19:30 /var/spool/nagios/.pgpass

Why /var/spool/nagios?

Because of this:

$ grep nagios /etc/passwd 
nagios:*:181:181:Nagios pseudo-user:/var/spool/nagios:/usr/sbin/nologin

That is why. That is the home directory for the nagios user as defined in the UIDs file of the FreeBSD ports tree.

Creating the database user & permissions

These are the commands I used to create the nagios database user and the permissions required to query the tables.

First, I create a new group, or role. The group has SELECT rights on the tables.

CREATE ROLE reporting;
GRANT SELECT ON cache_clearing_dates TO GROUP reporting;
GRANT SELECT ON cache_clearing_ports TO GROUP reporting;

Here, I create the nagios database user and add them to the role I just created.

CREATE ROLE nagios WITH LOGIN PASSWORD 'notmypassword';
GRANT reporting TO nagios;

Of note:

  • WITH LOGIN – allows this user to log in. A role having the LOGIN attribute can be thought of as a user.
  • PASSWORD ‘notmypassword’ – specifies the password for this user. It also appears in the .pgpass file for the user which will be running this script.
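
To confirm the new role can run the monitoring queries, a quick manual test from a host allowed by pg_hba.conf should work (a sketch, reusing the .pgpass entry shown earlier):

$ psql --host=pg03.int.example.org --username=nagios --no-password -c "SELECT count(*) FROM cache_clearing_ports;" freshports.dev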

Testing the script

When testing the script from the command line, I did it this way:

[dan@dev-ingress01:~] $ /usr/local/libexec/nagios-custom/check_freshports_cache_clearing pg03.int.example.org freshports.dev nagios
All OK

I ran this script as my regular user and I also created a ~/.pgpass file for myself, which looks just like the one for the nagios unix user.

Things to check

If you can’t get this to work, the things to check are:

  • pg_hba.conf file for the PostgreSQL database server
  • permissions on the database tables
  • permissions on the .pgpass files in question
  • make sure it runs fine from the command line before trying nrpe

Monitoring is key

Monitoring is key to early detection of problems. When something should always be in a given state, it should be monitored to ensure it is in that state. Having a monitoring system which will provide early detection of anomalies helps me fix problems early, rather than later.

I try to monitor everything. Connections to expected hosts. Services running. Plenty of disk space.

If you’re not monitoring yet, start slowly and build it up. That time will be well spent.

Hope this helps.

Mar 29 2020
 

The plan for today was not decided until after I got out of bed. I sat down with coffee and looked through the list of tasks. I found one which appealed: monitor the cache_clearing_ports & cache_clearing_dates tables

This is part one of a two-part story. The first part deals with fixing a database table before doing the planned work in part two, a Nagios check.

In this post:

  1. PostgreSQL 12.2
  2. Nagios 3.5
  3. FreeBSD 12.1

FreshPorts stores a lot of data in the database, but caches pages on disk. Specific events clear that cache, and the two tables listed will contain lists of items to remove from the cache.

Disk? Cache?

FreshPorts was built long before modern cache tools existed. We built our own. They work great for what is needed. The server uses ZFS and anything frequently used from disk is also cached in memory (see ZFS ARC). A page takes longer to load the first time it is built, but afterwards, it is fast enough.

The checking query

Sometimes the cache clearing lags. Sometimes there is a bug in the code. Whatever the reason behind the lag, I want to be notified if there are entries in this table which are more than 60 minutes old.

Here is the query I started with:

freshports.dev=# SELECT count(*) FROM cache_clearing_dates WHERE date_added > now() - interval '1 hour';
ERROR:  operator does not exist: time without time zone > timestamp with time zone
LINE 1: ...unt(*) FROM cache_clearing_dates WHERE date_added > now() - ...
                                                             ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

What? Why doesn’t that work? Look at the caret above. The comparison operator is the issue. I can’t compare those two values.

Which two values?

A time without time zone and a timestamp with time zone.

Where do I have a time field? Let’s check the table.

freshports.dev=# \d cache_clearing_dates
                                       Table "public.cache_clearing_dates"
    Column     |          Type          | Collation | Nullable |                     Default                      
---------------+------------------------+-----------+----------+--------------------------------------------------
 id            | integer                |           | not null | nextval('cache_clearing_dates_id_seq'::regclass)
 date_to_clear | date                   |           |          | 
 date_added    | time without time zone |           |          | now()

freshports.dev=# 

Oh. The date_added column should be a timestamp without time zone. This table was created incorrectly. I checked test, stage, and prod. All the same.

The cache_clearing_ports table is correct:

freshports.org=# \d public.cache_clearing_ports
                                        Table "public.cache_clearing_ports"
   Column   |            Type             | Collation | Nullable |                     Default                      
------------+-----------------------------+-----------+----------+--------------------------------------------------
 id         | integer                     |           | not null | nextval('cache_clearing_ports_id_seq'::regclass)
 port_id    | integer                     |           | not null | 
 category   | text                        |           | not null | 
 port       | text                        |           | not null | 
 date_added | timestamp without time zone |           | not null | now()
Indexes:
    "cache_clearing_ports_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "cache_clearing_ports_port_id_fkey" FOREIGN KEY (port_id) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE

Fixing cache_clearing_dates

This is the table structure I started with:

freshports.dev=# \d cache_clearing_dates
                                       Table "public.cache_clearing_dates"
    Column     |          Type          | Collation | Nullable |                     Default                      
---------------+------------------------+-----------+----------+--------------------------------------------------
 id            | integer                |           | not null | nextval('cache_clearing_dates_id_seq'::regclass)
 date_to_clear | date                   |           |          | 
 date_added    | time without time zone |           |          | now()

freshports.dev=#

Ummm, shouldn’t date_added be timestamp without time zone?

I’m sure that field does not have a date component in it. It should.

The tests

I wanted verification this was wrong. I was sure, but I wanted to be positively sure.

The better way to do this, and I thought of this after I started this blog post, is:

freshports.dev=# SELECT now()::time without time zone;
       now       
-----------------
 15:07:35.428659
(1 row)

freshports.dev=# 

To compose this test, take the Default value from the date_added row above and cast it using the Type from the same row.

What I really want is this:

freshports.dev=# SELECT now()::timestamp without time zone;
            now             
----------------------------
 2020-03-28 15:13:55.946128
(1 row)

freshports.dev=# 

Yes, that’s better. I want to know how long this entry has been in the table. A date is required for that.
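
Once the column is a proper timestamp, the age of an entry becomes a simple subtraction; something along these lines is what the new type makes possible (a sketch):

freshports.dev=# SELECT id, now()::timestamp - date_added AS age FROM cache_clearing_dates;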

Don’t try this at home

What did I try first? I don’t recommend this, but it worked for me.

freshports.dev=# INSERT INTO public.cache_clearing_dates (date_to_clear) VALUES ('2020-01-01');
INSERT 0 1
freshports.dev=# select * from cache_clearing_dates;
   id   | date_to_clear |   date_added    
--------+---------------+-----------------
 162763 | 2020-01-01    | 14:51:33.617565
(1 row)

freshports.dev=# rollback;

Yeah, that column has to change.

The change

PLEASE NOTE that in my specific case, this table has no data, so there is no existing data to convert. I will show you one way to convert data, but if you have data to convert, you may need to find answers elsewhere.

Here is a first attempt. Notice the transactions in the examples which follow. If you have a lot of data in there, please be aware of how this might block your applications. Mind you, if you have such a table, and applications, I would think you would already be aware of those issues.

First failure

This should be easy. ALTER TABLE…

freshports.dev=# ALTER TABLE cache_clearing_dates ALTER COLUMN date_added SET DATA TYPE timestamp without time zone;
ERROR:  column "date_added" cannot be cast automatically to type timestamp without time zone
HINT:  You might need to specify "USING date_added::timestamp without time zone".
freshports.dev=# rollback;
ROLLBACK

That’s new to me. Let’s read up on that a bit. Oh yes, the USING clause is how you tell PostgreSQL to convert your existing data.

Overwriting the column

THE FOLLOWING WILL LOSE DATA. What I am doing in the following command will set all rows to now() and existing values will be lost.

freshports.dev=# begin;
BEGIN
freshports.dev=# ALTER TABLE cache_clearing_dates ALTER COLUMN date_added SET DATA TYPE timestamp without time zone USING now();
ALTER TABLE
freshports.dev=# \d cache_clearing_dates
                                          Table "public.cache_clearing_dates"
    Column     |            Type             | Collation | Nullable |                     Default                      
---------------+-----------------------------+-----------+----------+--------------------------------------------------
 id            | integer                     |           | not null | nextval('cache_clearing_dates_id_seq'::regclass)
 date_to_clear | date                        |           |          | 
 date_added    | timestamp without time zone |           |          | now()

freshports.dev=# rollback;
ROLLBACK
freshports.dev=# 

Prepending a fixed date value

This attempt will prepend the current date to all values in the table.

freshports.dev=# begin;
BEGIN
freshports.dev=# -- put some test data into my table
freshports.dev=# INSERT INTO public.cache_clearing_dates (date_to_clear) VALUES ('2020-01-01');
INSERT 0 1
freshports.dev=# -- test out the prepend code
freshports.dev=# select now()::date + date_added from public.cache_clearing_dates;
          ?column?          
----------------------------
 2020-03-28 15:27:22.119605
(1 row)

freshports.dev=# -- alter the table
freshports.dev=# ALTER TABLE cache_clearing_dates ALTER COLUMN date_added SET DATA TYPE timestamp without time zone USING now()::date + date_added;
ALTER TABLE
freshports.dev=# SELECT * FROM public.cache_clearing_dates;
   id   | date_to_clear |         date_added         
--------+---------------+----------------------------
 162765 | 2020-01-01    | 2020-03-28 15:27:22.119605
(1 row)

freshports.dev=# -- undo my changes
freshports.dev=# rollback;
ROLLBACK

The change I used

This is the command I used to update my table:

freshports.dev=# ALTER TABLE cache_clearing_dates ALTER COLUMN date_added SET DATA TYPE timestamp without time zone USING now()::date + date_added;
ALTER TABLE

This is the test I used to verify I was getting what I wanted:

freshports.dev=# begin; insert into public.cache_clearing_dates (date_to_clear) values ('2020-01-01');
BEGIN
INSERT 0 1
freshports.dev=# SELECT * FROM public.cache_clearing_dates;
   id   | date_to_clear |         date_added         
--------+---------------+----------------------------
 162766 | 2020-01-01    | 2020-03-28 15:30:22.439249
(1 row)

freshports.dev=# rollback;
ROLLBACK
freshports.dev=# 

The queries for the Nagios check

The queries are straightforward. There should be zero entries more than 60 minutes old:

freshports.dev=# SELECT count(*) FROM cache_clearing_dates WHERE date_added < now() - interval '1 hour';
 count 
-------
     0
(1 row)

freshports.dev=# SELECT count(*) FROM cache_clearing_ports WHERE date_added < now() - interval '1 hour';
 count 
-------
     0
(1 row)

freshports.dev=# 

I combined those two into a script, which appears in part two of this post.

Jan 26 2020
 

FreshPorts had a cache clearing issue recently. It lingered. For a long time. It took me a while to figure it out. It turned out to be a Python 3 compatibility issue.

In this post:

  • PostgreSQL 12.1
  • FreeBSD 12.1

I will outline how database changes invoke external scripts asynchronously which then clear the cache. I will go into the debugging of the issue and how the bug was fixed.

How caching works

FreshPorts caches data on disk. When you go to visit a page, the website checks the cache on disk, and if not found, it runs the queries, creates the HTML then saves it to disk.

Really? You cache to disk? Why?

Q: Why don’t you using memcached or some other caching tool?

A: FreshPorts was created long before such fancy tools.

The database side of caching

FreshPorts uses PostgreSQL, an amazing database which just happens to be open source and freely available. The initial work was done on a lesser database system, but then I was introduced to PostgreSQL in July 2000, which had grown-up features I expected, such as stored procedures and relational integrity. I never looked back.

Cache clearing in FreshPorts relies upon asynchronous notification using the LISTEN and NOTIFY commands.

A small Python script (about 220 lines) issues several LISTEN commands, then starts polling.

This is the connection:

import configparser
import psycopg2
# select and syslog are used by the main loop further down
import select
import syslog

config = configparser.ConfigParser()
config.read('/usr/local/etc/freshports/fp-listen.ini')

DSN = 'host=' + config['database']['HOST'] + ' dbname=' + config['database']['DBNAME'] + ' user=' + config['database']['DBUSER'] + ' password=' + config['database']['PASSWORD']

conn = psycopg2.connect(DSN)
curs = conn.cursor()

Next, we read from the database a list of LISTEN channels and then issue a LISTEN command for each channel:

curs.execute("SELECT name, script_name FROM listen_for ORDER BY id")
listens_for = curs.fetchall()

listens = dict()
print ("These are the (event name, script name) pairs we are ready for:")
for listen in listens_for:
  curs.execute("LISTEN %s" % listen[0])
  listens[listen[0]] = listen[1]
  print ("('%s', '%s')" % (listen[0], listen[1]))

In this case, name is a channel and script_name is what the Python script will run when it encounters a NOTIFY for that channel.

This is the main loop of the code:

while 1:
  if select.select([conn],[],[],5)==([],[],[]):
    syslog.syslog(syslog.LOG_NOTICE, 'timeout! *************')
  else:
    conn.poll()
    syslog.syslog(syslog.LOG_NOTICE, 'Just woke up! *************')
    while conn.notifies:
      notify = conn.notifies.pop(0);
      # in real life, do something with each...
      syslog.syslog(syslog.LOG_NOTICE, "Got NOTIFY: %d, %s, %s" % (notify.pid, notify.channel, notify.payload));
      if notify.channel in listens:
        syslog.syslog(syslog.LOG_NOTICE, "found key %s" % (notify.channel));
        clear_cache = True;
        if listens[notify.channel]   == 'listen_port':
          syslog.syslog(syslog.LOG_NOTICE, "invoking RemoveCacheEntry()");
          RemoveCacheEntry()
        elif listens[notify.channel] == 'listen_date_updated':
          syslog.syslog(syslog.LOG_NOTICE, "invoking ClearDateCacheEntries()");
          ClearDateCacheEntries()
        ...
        ...
        else:
          clear_cache = False;
          syslog.syslog(syslog.LOG_ERR, "Code does not know what to do when '%s' is found." % notify.channel)

        if clear_cache:
          ClearMiscCaches()

      else:
        syslog.syslog(syslog.LOG_NOTICE, 'no such key in listens array for %s!' % (notify.channel))

This isn’t great code, patches & suggestions welcomed.

I remember this worked differently in earlier versions of PostgreSQL, but this approach has been in use for several years now. The select call times out occasionally, and when it does, we just loop again.

When we receive a NOTIFY, we verify it is in the list of what we LISTEN for, then go through the list and invoke the appropriate script.

The database side

Triggers will invoke a NOTIFY when and where relevant. An update on the commit_log table will add an entry to the cache_clearing_dates table and invoke NOTIFY date_updated.
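
The real triggers are more involved, but a minimal sketch of the pattern looks like this (the commit_log column name commit_date is hypothetical):

CREATE OR REPLACE FUNCTION queue_date_for_clearing() RETURNS trigger AS $$
BEGIN
  -- queue the work in the table, then wake up any listeners
  INSERT INTO cache_clearing_dates (date_to_clear) VALUES (NEW.commit_date::date);
  NOTIFY date_updated;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER commit_log_queue_date
  AFTER INSERT OR UPDATE ON commit_log
  FOR EACH ROW EXECUTE PROCEDURE queue_date_for_clearing();

The INSERT queues the work; the NOTIFY merely wakes the listener. That separation matters because a NOTIFY payload is limited in size (more on that below), and because the notification is only delivered if the transaction commits.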

How cache clearing works

With the database issuing the NOTIFY, the script receives the NOTIFY date_updated. It will then query the cache_clearing_dates table to see what dates need to be cleared. On disk this looks like this:

[dan@dev-nginx01:/var/service/fp-listen/fp-listen] $ sudo ls -l /var/db/freshports/cache/daily/2020/01
total 340
-rw-rw-r--  1 www  freshports  143080 Jan  9 23:26 08.daily.head
-rw-rw-r--  1 www  freshports  107017 Jan 11 04:57 10.daily.head
-rw-rw-r--  1 www  freshports  152478 Jan 21 14:26 17.daily.head
-rw-rw-r--  1 www  freshports  129303 Jan 25 18:12 23.daily.head

Thus, the code just deletes the given file pertaining to the date in question.
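
In shell terms, clearing one date from the listing above amounts to no more than this (illustrative only; the Python function does the equivalent):

$ rm /var/db/freshports/cache/daily/2020/01/23.daily.head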

Finding the issue

Debugging the caching was challenging in that the first step was to reproduce the issue.

I started the small Python app on my dev FreshPorts jail:

[root@dev-nginx01:/var/service/fp-listen] # echo ./fp-listen/fp-listen.py | su -fm freshports
These are the (event name, script name) pairs we are ready for:
('port_updated', 'listen_port')
('ports_moved', 'listen_ports_moved')
('ports_updating', 'listen_ports_updating')
('vuxml', 'listen_vuxml')
('category_new', 'listen_category_new')
('date_updated', 'listen_date_updated')

The output you see is what the app has read from the listen_for table. The events are data-driven; see the LISTEN loop above.

freshports.dev=# SELECT * FROM listen_for;
 id |      name      |      script_name      
----+----------------+-----------------------
  5 | port_updated   | listen_port
  6 | ports_moved    | listen_ports_moved
  7 | ports_updating | listen_ports_updating
  8 | vuxml          | listen_vuxml
 10 | category_new   | listen_category_new
 11 | date_updated   | listen_date_updated
(6 rows)

freshports.dev=# 

Now that the code is running, in a separate ssh session, I connect to the database and manually issue a NOTIFY command:

freshports.dev=# NOTIFY date_updated;
NOTIFY
freshports.dev=# 

BOOM, the Python application dies in a heap:

Traceback (most recent call last):
  File "./fp-listen/fp-listen.py", line 190, in 
    if listens.has_key(notify.channel):
AttributeError: 'dict' object has no attribute 'has_key'
[root@dev-nginx01:/var/service/fp-listen] #

NOTE:

  1. the code you see above has already been patched
  2. line 190 of the script corresponds to the channel-membership test in the (already patched) main loop shown above

The fix

The fix was:

-      if listens.has_key(notify.channel):
+      if notify.channel in listens:

This fix is seen in the main loop above and is a known and documented Python 2 to Python 3 migration issue.
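
The replacement works identically under Python 3, which is why it is the documented migration path; a trivial demonstration:

$ python3 -c "listens = {'date_updated': 'listen_date_updated'}; print('date_updated' in listens)"
True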

Look at all the backed up queues!

When fp-listen gets a NOTIFY for ports or dates which need clearing, that notification could carry a payload, but a payload must be shorter than 8000 bytes. As mentioned above, we store this data in tables instead of sending it directly to the Python app.

Looking in the tables:

[dan@x8dtu-pg02:~] $ psql freshports.org
Password for user dan: 
psql (12.1)
Type "help" for help.

freshports.org=# select count(*) from cache_clearing_ports;
 count 
-------
 20508
(1 row)

freshports.org=# select count(*) from cache_clearing_dates;
 count 
-------
   290
(1 row)

freshports.org=# 

So how long has this been an issue?

freshports.org=# select * from cache_clearing_dates order by date_added asc limit 5;
   id   | date_to_clear |   date_added    
--------+---------------+-----------------
 172300 | 2020-01-22    | 00:08:21.443837
 172301 | 2020-01-22    | 00:12:37.328254
 172302 | 2020-01-22    | 00:30:32.244191
 172303 | 2020-01-22    | 00:41:27.386203
 172304 | 2020-01-22    | 00:43:57.517707
(5 rows)

freshports.org=# select * from cache_clearing_dates order by date_added desc limit 5;
   id   | date_to_clear |   date_added    
--------+---------------+-----------------
 172299 | 2020-01-21    | 23:58:53.74748
 172298 | 2020-01-21    | 23:49:24.205331
 172188 | 2020-01-20    | 23:20:03.807726
 172297 | 2020-01-21    | 23:16:04.518664
 172187 | 2020-01-20    | 23:06:55.273125
(5 rows)

freshports.org=# 

Oh, it seems cache_clearing_dates.date_added should be a timestamp. I’ll fix that issue later.

Let’s try the ports cache queue:

freshports.org=# select * from cache_clearing_ports order by date_added asc limit 5;
    id    | port_id | category |     port      |         date_added         
----------+---------+----------+---------------+----------------------------
 16669514 |   56394 | graphics | py-pygeoapi   | 2020-01-20 17:23:04.747768
 16669515 |   16343 | devel    | py-setuptools | 2020-01-20 17:23:04.747768
 16669516 |   56394 | graphics | py-pygeoapi   | 2020-01-20 17:23:04.747768
 16669517 |   49106 | lang     | python37      | 2020-01-20 17:23:04.747768
 16669513 |   56394 | graphics | py-pygeoapi   | 2020-01-20 17:23:04.747768
(5 rows)

freshports.org=# select * from cache_clearing_ports order by date_added desc limit 5;
    id    | port_id | category |       port       |         date_added         
----------+---------+----------+------------------+----------------------------
 16691617 |   37057 | print    | R-cran-knitr     | 2020-01-22 15:37:13.407464
 16691618 |   37124 | textproc | R-cran-rmarkdown | 2020-01-22 15:37:13.407464
 16691615 |   37124 | textproc | R-cran-rmarkdown | 2020-01-22 15:37:13.407464
 16691616 |   37124 | textproc | R-cran-rmarkdown | 2020-01-22 15:37:13.407464
 16691619 |   37051 | textproc | R-cran-yaml      | 2020-01-22 15:37:13.407464
(5 rows)

freshports.org=# 

From this, the problem arose on January 20th.

FYI, the tables look like this:

freshports.org=# \d cache_clearing_ports 
                                        Table "public.cache_clearing_ports"
   Column   |            Type             | Collation | Nullable |                     Default                      
------------+-----------------------------+-----------+----------+--------------------------------------------------
 id         | integer                     |           | not null | nextval('cache_clearing_ports_id_seq'::regclass)
 port_id    | integer                     |           | not null | 
 category   | text                        |           | not null | 
 port       | text                        |           | not null | 
 date_added | timestamp without time zone |           | not null | now()
Indexes:
    "cache_clearing_ports_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "cache_clearing_ports_port_id_fkey" FOREIGN KEY (port_id) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE

freshports.org=# \d cache_clearing_dates
                                       Table "public.cache_clearing_dates"
    Column     |          Type          | Collation | Nullable |                     Default                      
---------------+------------------------+-----------+----------+--------------------------------------------------
 id            | integer                |           | not null | nextval('cache_clearing_dates_id_seq'::regclass)
 date_to_clear | date                   |           |          | 
 date_added    | time without time zone |           |          | now()

freshports.org=# 

Releasing the code

After I fixed the code, packaged it, and installed it on test, I encountered this error:

Jan 22 17:28:32 test-nginx01 fp-listen[62850]: ClearMiscCaches() is clearing '/var/db/freshports/cache/new/*'

Eh? That’s the wrong directory. It should be news, not new. Here’s the code in question:

# this function relies on the glob, os, shutil, and syslog imports,
# and on the config object read from fp-listen.ini earlier in fp-listen.py
def ClearMiscCaches():
  syslog.syslog(syslog.LOG_NOTICE, 'invoked: ClearMiscCaches()');
  filenameglob = config['dirs']['NEWS_CACHE_DIR'];
  syslog.syslog(syslog.LOG_NOTICE, 'ClearMiscCaches() is clearing %s' % (filenameglob));
  for filename in glob.glob(filenameglob):
    syslog.syslog(syslog.LOG_NOTICE, 'removing %s' % (filename))
    try:
      if os.path.isfile(filename):
        os.remove(filename)
      else:
        shutil.rmtree(filename)

    except OSError as err:
      if err.errno == 2:
        pass  # errno 2 is ENOENT: no file to delete, so no worries

      else:
        syslog.syslog(syslog.LOG_CRIT, 'ERROR: error deleting cache entry.  Error message is %s' % (err))
        continue

Here is the entry from fp-listen.ini (the value read into filenameglob above):

NEWS_CACHE_DIR  = '%(BASEDIR)s/cache/new/*'

That should be:

NEWS_CACHE_DIR  = '%(BASEDIR)s/cache/news/*'

I had to make this fix on each server, other than dev. It was incorrect everywhere.

I restarted the service, which is controlled by daemontools:

sudo svc -t /var/service/fp-listen
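
To confirm the service came back up, svstat from the same daemontools package does the job (the pid and uptime here are illustrative):

$ sudo svstat /var/service/fp-listen
/var/service/fp-listen: up (pid 12345) 42 seconds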

I found the glob.glob(filenameglob) loop was not working as expected. No syslog entries were being recorded; glob.glob(filenameglob) was not enumerating everything. Tests in a small script worked. That same test script, when run by the freshports user, also yielded expected results. I could not locate a reason why this would fail in fp-listen.

New ClearMiscCaches() code

After reviewing the documentation for glob, I stumbled across Path and used it. I changed the ClearMiscCaches() code as shown below:

# this version relies on: from pathlib import Path
def ClearMiscCaches():
  syslog.syslog(syslog.LOG_NOTICE, 'invoked: ClearMiscCaches()');

  news_cache_dir = config['dirs']['NEWS_CACHE_DIR'];
  syslog.syslog(syslog.LOG_NOTICE, 'ClearMiscCaches() is clearing out entries in %s' % (news_cache_dir));

  for filename in Path(news_cache_dir).iterdir():
    syslog.syslog(syslog.LOG_NOTICE, 'ClearMiscCaches() is removing %s' % (filename))
    try:
      if Path(filename).is_file():
        Path(filename).unlink()
      else:
        shutil.rmtree(filename)

    except OSError as err:
      syslog.syslog(syslog.LOG_CRIT, 'ERROR: error deleting cache entry.  Error message is %s' % (err))

The Path(news_cache_dir).iterdir() call is the key. We are iterating over the directory to get the files we need. Unlike glob, iterdir() does not skip dotfiles, but that does not matter here: we do not use dotfiles (e.g. .Caching_Not_Done_Like_This).

It was very strange, because glob.glob(filenameglob) worked in other parts of the fp-listen code, but not in this directory.

New package

With that code change, the new package was created and made its way through the test, stage, and beta servers before being moved into production.

The cause of the problem seems to be the migration from Python 2 to Python 3. I would like to do more testing on the glob issue, but I have moved on to other issues.

Why LISTEN/NOTIFY?

LISTEN/NOTIFY is a valuable tool when your database needs to invoke external events. Let the transaction[s] complete and then the external code will eventually do your bidding via the LISTEN portion of the event.
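
You can watch the mechanism with nothing more than two psql sessions (the channel name here is made up):

-- session 1
freshports.dev=# LISTEN demo_channel;
LISTEN

-- session 2
freshports.dev=# NOTIFY demo_channel;
NOTIFY

-- session 1 reports the notification at its next interaction with the server:
Asynchronous notification "demo_channel" received from server process with PID 12345.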

Hopefully, this will open up a whole new avenue of solutions for you.

Jan 25 2020
 

I’ve noticed a lot of sanity test failures related to devel/binutils lately. Every single commit.

Issue 133 deals with that. Here, I plan to document what happened.

FreshPorts has, since inception, worked off the commit emails. The incoming email is parsed and XML created. The XML is then used to update the database.

I started looking in the message archives. I confirmed that the original email indicated deletes were occurring.

Yet, the XML had no deletes:

[dan@dev-ingress01:/var/db/freshports/message-queues/archive/2019_11/2019_11_15] $ cat 2019.11.15.10.09.16.85661.txt.xml 
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE UPDATES PUBLIC "-//FreshPorts//DTD FreshPorts 2.0//EN" "https://www.freshports.org/docs/fp-updates.dtd">

<UPDATES Version="1.3.2.1">
    <UPDATE>
        <DATE Month="11" Day="15" Year="2019"></DATE>
        <TIME Timezone="UTC" Minute="9" Second="3" Hour="10"></TIME>
        <OS Repo="ports" Branch="head" Id="FreeBSD"></OS>
        <LOG>Update binutils to 2.33.1

While here, convert binutils into flavors, that ease a lot maintenance
and helps cleaning out the code.

This is inspired by the same work on going on the xtoolchained gcc by jhb@

PR:		241756
Exp-run:	antoine (portmgr)
Discussed with:	jhb
Differential Revision:	https://reviews.freebsd.org/D22258</LOG>
        <PEOPLE>
            <UPDATER Handle="bapt"></UPDATER>
        </PEOPLE>
        <MESSAGE Id="201911151009.xAFA9393081239@repo.freebsd.org" EncodingLosses="false" Subject="svn commit: r517642 - in head: . devel devel/aarch64-binutils devel/aarch64-none-elf-binutils devel/aarch64-none-elf-gcc devel/amd64-binutils devel/arduino-tools devel/arm-gnueabi-binutils devel/ar...">
            <DATE Year="2019" Day="15" Month="11"></DATE>
            <TIME Second="3" Minute="9" Timezone="UTC" Hour="10"></TIME>
            <REPOSITORY>ports</REPOSITORY>
            <REVISION>517642</REVISION>
            <TO Email="ports-committers@freebsd.org"></TO>
            <TO Email="svn-ports-all@freebsd.org,"></TO>
        </MESSAGE>
        <FILES>
            <FILE Action="Add" Path="head/devel/binutils/pkg-plist-aarch64" Revision="517642"></FILE>
            <FILE Path="head/devel/binutils/pkg-plist-amd64" Revision="517642" Action="Add"></FILE>
            <FILE Action="Add" Revision="517642" Path="head/devel/binutils/pkg-plist-arm-gnueabi"></FILE>
            <FILE Action="Add" Revision="517642" Path="head/devel/binutils/pkg-plist-arm-none-eabi"></FILE>
            <FILE Path="head/devel/binutils/pkg-plist-avr" Revision="517642" Action="Add"></FILE>
            <FILE Action="Add" Path="head/devel/binutils/pkg-plist-i386" Revision="517642"></FILE>
            <FILE Action="Add" Revision="517642" Path="head/devel/binutils/pkg-plist-mingw32"></FILE>
            <FILE Revision="517642" Path="head/devel/binutils/pkg-plist-mips" Action="Add"></FILE>
            <FILE Action="Add" Revision="517642" Path="head/devel/binutils/pkg-plist-mips64"></FILE>
            <FILE Path="head/devel/binutils/pkg-plist-powerpc64" Revision="517642" Action="Add"></FILE>
            <FILE Action="Add" Path="head/devel/binutils/pkg-plist-riscv64" Revision="517642"></FILE>
            <FILE Revision="517642" Path="head/devel/binutils/pkg-plist-s390x" Action="Add"></FILE>
            <FILE Revision="517642" Path="head/devel/binutils/pkg-plist-sparc64" Action="Add"></FILE>
        </FILES>
        <PR Id="241756"></PR>
    </UPDATE>
</UPDATES>
[dan@dev-ingress01:/var/db/freshports/message-queues/archive/2019_11/2019_11_15] $ 

Test the code

Testing the code, I didn’t get anything different out:

[freshports@dev-ingress01 /usr/local/libexec/freshports]$ cat /var/db/freshports/message-queues/archive/2019_11/2019_11_15/2019.11.15.10.09.16.85661.txt.raw | ./process_mail.pl 
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE UPDATES PUBLIC "-//FreshPorts//DTD FreshPorts 2.0//EN" "https://www.freshports.org/docs/fp-updates.dtd">

<UPDATES Version="1.3.2.1">
    <UPDATE>
        <DATE Year="2019" Month="11" Day="15"></DATE>
        <TIME Second="3" Minute="9" Hour="10" Timezone="UTC"></TIME>
        <OS Id="FreeBSD" Branch="head" Repo="ports"></OS>
        <LOG>Update binutils to 2.33.1

While here, convert binutils into flavors, that ease a lot maintenance
and helps cleaning out the code.

This is inspired by the same work on going on the xtoolchained gcc by jhb@

PR:		241756
Exp-run:	antoine (portmgr)
Discussed with:	jhb
Differential Revision:	https://reviews.freebsd.org/D22258</LOG>
        <PEOPLE>
            <UPDATER Handle="bapt"></UPDATER>
        </PEOPLE>
        <MESSAGE EncodingLosses="false" Id="201911151009.xAFA9393081239@repo.freebsd.org" Subject="svn commit: r517642 - in head: . devel devel/aarch64-binutils devel/aarch64-none-elf-binutils devel/aarch64-none-elf-gcc devel/amd64-binutils devel/arduino-tools devel/arm-gnueabi-binutils devel/ar...">
            <DATE Month="11" Year="2019" Day="15"></DATE>
            <TIME Hour="10" Second="3" Minute="9" Timezone="UTC"></TIME>
            <REPOSITORY>ports</REPOSITORY>
            <REVISION>517642</REVISION>
            <TO Email="ports-committers@freebsd.org"></TO>
            <TO Email="svn-ports-all@freebsd.org,"></TO>
        </MESSAGE>
        <FILES>
            <FILE Action="Add" Revision="517642" Path="head/devel/binutils/pkg-plist-aarch64"></FILE>
            <FILE Path="head/devel/binutils/pkg-plist-amd64" Revision="517642" Action="Add"></FILE>
            <FILE Path="head/devel/binutils/pkg-plist-arm-gnueabi" Revision="517642" Action="Add"></FILE>
            <FILE Revision="517642" Action="Add" Path="head/devel/binutils/pkg-plist-arm-none-eabi"></FILE>
            <FILE Revision="517642" Action="Add" Path="head/devel/binutils/pkg-plist-avr"></FILE>
            <FILE Revision="517642" Action="Add" Path="head/devel/binutils/pkg-plist-i386"></FILE>
            <FILE Revision="517642" Action="Add" Path="head/devel/binutils/pkg-plist-mingw32"></FILE>
            <FILE Revision="517642" Action="Add" Path="head/devel/binutils/pkg-plist-mips"></FILE>
            <FILE Action="Add" Revision="517642" Path="head/devel/binutils/pkg-plist-mips64"></FILE>
            <FILE Revision="517642" Action="Add" Path="head/devel/binutils/pkg-plist-powerpc64"></FILE>
            <FILE Path="head/devel/binutils/pkg-plist-riscv64" Action="Add" Revision="517642"></FILE>
            <FILE Revision="517642" Action="Add" Path="head/devel/binutils/pkg-plist-s390x"></FILE>
            <FILE Path="head/devel/binutils/pkg-plist-sparc64" Revision="517642" Action="Add"></FILE>
        </FILES>
        <PR Id="241756"></PR>
    </UPDATE>
</UPDATES>
[freshports@dev-ingress01 /usr/local/libexec/freshports]$ 

It must be the code, or a change in email layout.

The above includes only Add actions, which directly relate to the Added: entries in the email.

Finding the issue

I found the issue. This was the fix:

-		last if($line =~ /^Directory Properties:/);
+		next if($line =~ /^Directory Properties:/);

In the existing code, when a line containing only Directory Properties: was encountered, that signaled the end of processing for this commit.

I conclude that the email layout changed.

This meant that the deletes were never being processed, given the location of this text within the original email.

With the above code, I ran the script again and dumped the XML into a file. I originally wanted to show you a diff, but by its nature XML is not always the same each time it is generated. For example, the attributes within an element do not necessarily appear in the same order. As I was typing this, I thought there MUST be an online XML diff tool. I found https://www.corefiling.com/opensource/xmldiff/ and created this for your viewing pleasure:

+       <FILE Action='Add' Path='head/devel/binutils/pkg-plist-aarch64-none-elf' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/aarch64-binutils/' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/aarch64-none-elf-binutils/' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/amd64-binutils/' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/arm-gnueabi-binutils/' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/arm-none-eabi-binutils/' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/avr-binutils/' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/i386-binutils/' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/mingw32-binutils/' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/mips-binutils/' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/mips64-binutils/' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/powerpc64-binutils/' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/riscv64-binutils/' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/s390x-binutils/' Revision='517642'></FILE>
+       <FILE Action='Remove' Path='head/devel/sparc64-binutils/' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/MOVED' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/devel/Makefile' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/devel/aarch64-none-elf-gcc/Makefile' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/devel/arduino-tools/Makefile' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/devel/arm-none-eabi-gcc492/Makefile' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/devel/arm-none-eabi-newlib/Makefile' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/devel/avr-gcc/Makefile' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/devel/avr-libc/Makefile' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/devel/binutils/Makefile' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/devel/binutils/distinfo' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/devel/mingw32-gcc/Makefile' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/devel/powerpc64-gcc/Makefile' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/devel/riscv64-gcc/Makefile' Revision='517642'></FILE>
+       <FILE Action='Modify' Path='head/devel/sope4/Makefile' Revision='517642'></FILE>

That is a lot of missing data.

Running the new XML

After running this:

[freshports@dev-ingress01 /usr/local/libexec/freshports]$ ./load_xml_into_db.pl  /tmp/2019.11.15.10.09.16.85661-new.txt.xml -O > /tmp/2019.11.15.10.09.16.85661-new.txt.loading
Use of uninitialized value $depends in concatenation (.) or string at /usr/local/lib/perl5/site_perl/FreshPorts/port.pm line 1363.
Use of uninitialized value $depends in string eq at /usr/local/lib/perl5/site_perl/FreshPorts/port.pm line 1364.
Use of uninitialized value $depends in concatenation (.) or string at /usr/local/lib/perl5/site_perl/FreshPorts/port.pm line 1363.
Use of uninitialized value $depends in string eq at /usr/local/lib/perl5/site_perl/FreshPorts/port.pm line 1364.
Use of uninitialized value $depends in concatenation (.) or string at /usr/local/lib/perl5/site_perl/FreshPorts/port.pm line 1363.
Use of uninitialized value $depends in string eq at /usr/local/lib/perl5/site_perl/FreshPorts/port.pm line 1364.
Use of uninitialized value in concatenation (.) or string at /usr/local/lib/perl5/site_perl/FreshPorts/ports_generate_plist.pm line 40.
Use of uninitialized value in string eq at /usr/local/lib/perl5/site_perl/FreshPorts/ports_generate_plist.pm line 50.
Use of uninitialized value in split at /usr/local/lib/perl5/site_perl/FreshPorts/port.pm line 1296.
[freshports@dev-ingress01 /usr/local/libexec/freshports]$ 

devel/s390x-binutils was marked as deleted. Score!

I’ll fix up those other errors next, but they are beyond the scope of this post.

Not shown here: I went back and re-processed each subsequent commit on devel/binutils

Thanks for coming to my TED talk.

Jan 09 2020
 

XML can be tricky. XML plays a heavy role in FreshPorts. Since the early days, converting to XML and then loading into the database has been a priority. This choice means the database-loading code doesn’t need to know much about the source of the data. It also means we can change the data source without modifying the database-loading code.

You might ask: why aren’t you using JSON?

FreshPorts predates JSON by several years.

Today I found an issue caused by some characters in a commit. I am not blaming the committer. There is no fault there. This is about the code and how it can be improved to not fall over.

The commit email

Looking at the email file on disk, I found this:

Author: pkubaj
Date: Wed Jan  8 21:36:57 2020
New Revision: 522460
URL: https://svnweb.freebsd.org/changeset/ports/522460

Log:
  multimedia/obs-studio: fix build on powerpc64E
  

That E after 64 (a literal Ctrl-E, code point 0x05), that’s it.

The commit in the repo

The commit was on multimedia/obs-studio at Wed Jan 8 21:36:57 2020 UTC. Viewing that commit within the repo, nothing seems amiss.

Not seeing it there makes me question: is that website front end hiding the issue?

svn log

Let’s try viewing the svn log:

[dan@pkg01:~/ports/head] $ svn log -r 522460 
------------------------------------------------------------------------
r522460 | pkubaj | 2020-01-08 21:36:57 +0000 (Wed, 08 Jan 2020) | 10 lines

multimedia/obs-studio: fix build on powerpc64

Merge upstream commit to use GCC's SSE->AltiVec translation. Since it depends on compiling with GCC, it only works on ELFv1. Hopefully it will be possible to build it on ELFv2 in the future.

Also use luajit only where it's actually available. Since it's optional, the port builds anyway.

PR:		243199
Approved by:	yuri (maintainer)
MFH:		2020Q1 (fix build blanket)

------------------------------------------------------------------------
[dan@pkg01:~/ports/head] $ 

OK we don’t see it there, but if I pipe the output through less, I see it:

[dan@pkg01:~/ports/head] $ svn log -r 522460 | less
------------------------------------------------------------------------
r522460 | pkubaj | 2020-01-08 21:36:57 +0000 (Wed, 08 Jan 2020) | 10 lines

multimedia/obs-studio: fix build on powerpc64^E

Merge upstream commit to use GCC's SSE->AltiVec translation. Since it depends on compiling with GCC, it only works on ELFv1. Hopefully it will be possible to build it on ELFv2 in the future.

Also use luajit only where it's actually available. Since it's optional, the port builds anyway.

PR:             243199
Approved by:    yuri (maintainer)
MFH:            2020Q1 (fix build blanket)

------------------------------------------------------------------------
(END)

That’s interesting. Piping the output through more gave similar results.

The error

The error I see is:


Code point \u0005 is not a valid character in XML at /usr/local/lib/perl5/site_perl/FreshPorts/process_svn_mail.pm line 183.

The generated XML is truncated: it terminates at the LOG element, the text which contains the non-printing character.

What’s next?

I’m going to consult with others and see how to fix this. The code in question is perl and I’m not the best at that.

Solved with this patch

--- ingress/modules/trunk/process_svn_mail.pm	2019/01/02 21:26:27	5200
+++ ingress/modules/trunk/process_svn_mail.pm	2020/01/09 17:16:10	5244
@@ -52,6 +52,11 @@
 
 	$Log = &GetLog($message);
 
+	# re https://news.freshports.org/2020/01/09/code-point-u0005-is-not-a-valid-character-in-xml/
+	#    https://twitter.com/FreshPorts/status/1215286202691211264
+	#	
+	$Log =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
+
 #print "log: '$Log'\n";
 
 	if ($Log eq '') {

The most relevant hint was: “Just look at the XML spec, it says exactly which control characters are allowed.” by Garrett Wollman. Based on that, I found the suggestion for the code I used above: the character class whitelists the code points XML 1.0 allows (Tab, LF, CR, and the printable ranges) and the substitution strips everything else.

Very nice. Very simple.
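
A quick way to convince yourself the substitution does what it claims, from the command line (assuming perl is on the path):

$ printf 'powerpc64\005\n' | perl -pe 's/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g' | od -c
0000000   p   o   w   e   r   p   c   6   4  \n
0000012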

I implemented on the dev server. It worked. Progressing through test, stage, and onto prod, it worked.

Thank you.

Dec 21 2019
 

The website has been slow for some time. I think it is because of increased disk-io.

In this post:

  • FreeBSD 12.0
  • PostgreSQL 11.5
  • LibreNMS does the graphs via SNMP

The diskio overall

By increased, I mean it has gone from about 0.1k to about 1.6k, peaking at 140-150. This is steady and persistent.

This graph shows the jump which occurred earlier this year.

FreshPorts disk io

My eyes can’t tell for sure which two drives are involved, but going to the graphs for individual drives, it is clearly ada3 and ada2.

ada2

This is for ada2:

ada2 diskio

ada3

And the same period for ada3:

ada3 diskio

zroot

zpool status confirms this is the zroot pool.

$ zpool status
  pool: main_tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(7) for details.
  scan: scrub repaired 0 in 0 days 02:43:45 with 0 errors on Wed Dec 18 06:32:53 2019
config:

	NAME        STATE     READ WRITE CKSUM
	main_tank   ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    ada0p1  ONLINE       0     0     0
	    ada1p1  ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(7) for details.
  scan: scrub repaired 0 in 0 days 00:04:40 with 0 errors on Thu Dec 19 03:10:19 2019
config:

	NAME        STATE     READ WRITE CKSUM
	zroot       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    ada2p3  ONLINE       0     0     0
	    ada3p3  ONLINE       0     0     0

errors: No known data errors

The filesystems

What’s on there?

[dan@x8dtu:~] $ zfs list -r zroot
NAME                                    USED  AVAIL  REFER  MOUNTPOINT
zroot                                  26.8G  78.8G    96K  /zroot
zroot/ROOT                             5.92G  78.8G    96K  none
zroot/ROOT/default                     5.92G  78.8G  4.19G  /
zroot/TESTING                           616K  78.8G    88K  /zroot/TESTING
zroot/TESTING/top                       528K  78.8G    88K  /zroot/TESTING/top
zroot/TESTING/top/test1                  88K  78.8G    88K  /zroot/TESTING/top/test1
zroot/TESTING/top/test2                  88K  78.8G    88K  /zroot/TESTING/top/test2
zroot/TESTING/top/test3                  88K  78.8G    88K  /zroot/TESTING/top/test3
zroot/TESTING/top/test4                  88K  78.8G    88K  /zroot/TESTING/top/test4
zroot/TESTING/top/test5                  88K  78.8G    88K  /zroot/TESTING/top/test5
zroot/data                             20.8G  78.8G    88K  /zroot/data
zroot/data/dan-pg                        88K  78.8G    88K  /usr/home/dan/pg
zroot/data/freshports                  4.51G  78.8G    88K  none
zroot/data/freshports/repo             4.51G  78.8G    88K  none
zroot/data/freshports/repo/PORTS-head  4.51G  78.8G  4.51G  /iocage/jails/x8dtu-ingress01/root/var/db/freshports/ports-jail/var/db/repos/PORTS-head
zroot/data/ingress01-testing            675M  78.8G   675M  /iocage/jails/x8dtu-ingress01/root/usr/home/dan/tmp-fast
zroot/data/pg01-postgres               15.6G  78.8G  15.6G  /iocage/jails/x8dtu-pg01/root/var/db/postgres
zroot/data/pg02-postgres               9.25M  78.8G  9.25M  /iocage/jails/x8dtu-pg02/root/var/db/postgres
zroot/tmp                               120K  78.8G   120K  /tmp
zroot/usr                               192K  78.8G    96K  /usr
zroot/usr/src                            96K  78.8G    96K  /usr/src
zroot/var                              9.77M  78.8G    96K  /var
zroot/var/audit                          96K  78.8G    96K  /var/audit
zroot/var/crash                          96K  78.8G    96K  /var/crash
zroot/var/log                          8.48M  78.8G  8.48M  /var/log
zroot/var/mail                          120K  78.8G   120K  /var/mail
zroot/var/tmp                           912K  78.8G   912K  /var/tmp
[dan@x8dtu:~] $ 

What is likely to have the traffic?

  • zroot/data/freshports/repo/PORTS-head – that’s the main FreeBSD ports repo
  • zroot/data/pg01-postgres – the PostgreSQL database server (pg02 is another database server, but not used)
  • PostgreSQL backends

The average number of PostgreSQL backends has jumped from about 1 to 3. This is the PostgreSQL pg01 jail.

PostgreSQL backend pg01 jail

What changed?

Two things come to mind:

  1. caching – I added caching on disk for some of the web pages. If anything, this should reduce the disk io.
  2. recordsize – I changed some parameters on both zroot/data/pg01-postgres and zroot/data/pg02-postgres

Looking back at command history, I see:

[dan@x8dtu:~] $ history | grep recordsize
    7  zfs get recordsize,redundant_metadata,primarycache,logbias zroot/data/pg01-postgres postgres                                  11.6M  63.2G  9.18M  /iocage/jails/x8dtu-pg02/root/var/db/postgres
    8  zfs get recordsize,redundant_metadata,primarycache,logbias zroot/data/pg01-postgres zroot/data/pg02-postgres
    9  sudo zfs set recordsize=8k              redundant_metadata=most              primarycache=metadata              logbias=throughput zroot/data/pg01-postgres zroot/data/pg02-postgres
   10  zfs get recordsize,redundant_metadata,primarycache,logbias zroot/data/pg01-postgres zroot/data/pg02-postgres
  496  history | grep recordsize

These are the current settings for pg01, the current production database server:

[dan@x8dtu:~] $ zfs get recordsize,redundant_metadata,primarycache,logbias zroot/data/pg01-postgres 
NAME                      PROPERTY            VALUE         SOURCE
zroot/data/pg01-postgres  recordsize          8K            local
zroot/data/pg01-postgres  redundant_metadata  most          local
zroot/data/pg01-postgres  primarycache        metadata      local
zroot/data/pg01-postgres  logbias             throughput    local
[dan@x8dtu:~] $ 

Looking at the parent, these are the likely settings before the change:

[dan@x8dtu:~] $ zfs get recordsize,redundant_metadata,primarycache,logbias zroot/data
NAME        PROPERTY            VALUE         SOURCE
zroot/data  recordsize          128K          default
zroot/data  redundant_metadata  all           default
zroot/data  primarycache        all           default
zroot/data  logbias             latency       default
[dan@x8dtu:~] $ 

What changed?

  • recordsize – from 128K to 8K
  • redundant_metadata – from all to most
  • primarycache – from all to metadata
  • logbias – from latency to throughput

top

An extract from top shows:

 pid: 43619;  load averages:  1.01,  1.39,  1.48           up 34+18:05:39  15:19:57
195 processes: 9 running, 186 sleeping
CPU:  1.9% user,  0.0% nice,  6.5% system,  0.0% interrupt, 91.6% idle
Mem: 338M Active, 720M Inact, 102G Wired, 85G Free
ARC: 62G Total, 48G MFU, 8620M MRU, 3603K Anon, 759M Header, 5407M Other
     47G Compressed, 117G Uncompressed, 2.47:1 Ratio
Swap: 4096M Total, 4096M Free

Lots of free RAM (85G).

Plenty of ARC in use (62G).

Theories

Could the change in recordsize cause this issue? Some records are now 128K, some are 8K?

If yes, then shutting down the database and copying the data to a new dataset would make everything 8K. Note that the copy has to happen at the file level: recordsize only applies to newly written files, and a plain zfs send | zfs recv preserves the existing block layout.
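
If it came to that, the rewrite might look like this sketch (the pg01-rewritten dataset name is hypothetical, rsync is assumed to be installed, and PostgreSQL must be stopped first):

# create the target dataset with the desired recordsize
sudo zfs create -o recordsize=8k zroot/data/pg01-rewritten
# a file-level copy rewrites every file using the new recordsize;
# zfs send | zfs recv would preserve the old block sizes instead
sudo rsync -a /iocage/jails/x8dtu-pg01/root/var/db/postgres/ /zroot/data/pg01-rewritten/
# then swap the mountpoints and restart PostgreSQL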

The jump in the number of backend workers: could that be because queries now take longer and requests from the websites are now overlapping?

If the cause is one or a combination of redundant_metadata, primarycache, and logbias, changing them back and observing might show a difference.

    Setting changes

    The following commentary is based on a IRC chat with Allan Jude.

    primarycache=metadata means that ZFS will cache only the ZFS metadata, not any PostgreSQL database data. The idea behind this configuration is to have no ZFS data in the ARC, thereby avoiding both PostgreSQL and ZFS caching the same data twice. However, we might be better off letting ZFS cache this data, since the dataset is only 15.6GB and it will all fit into ARC compressed.

    logbias=throughput tells ZFS to optimize for throughput. This will not do synchronous writes to the ZIL. This will make commits to the database take longer. Setting this back to latency is probably better for a database.

    At about Fri Dec 20 15:38 UTC 2019, I ran: zfs set primarycache=all zroot/data/pg01-postgres

    To speed up the caching, I ran these commands:

    [dan@x8dtu:~] $ sudo zfs snapshot zroot/data/pg01-postgres@for.caching
    [dan@x8dtu:~] $ sudo zfs send zroot/data/pg01-postgres@for.caching > /dev/null
    [dan@x8dtu:~] $ sudo zfs destroy zroot/data/pg01-postgres@for.caching
    [dan@x8dtu:~] $ 
    

    That reads the entire dataset, pulling it into the ARC.
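
    One way to confirm the warm-up is working is to watch the ARC size grow while the send runs (a FreeBSD sysctl; the value is in bytes):

    $ sysctl kstat.zfs.misc.arcstats.size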

    PostgreSQL database settings

    This is what PostgreSQL is working with:

    shared_buffers = 128MB

    That is spread rather thinly for a 15.6GB database.

    As a wise person recently said: the key to database performance with ZFS is you need exactly one of [database|filesystem] caching the database. 0 of them caching or both of them caching is bad. In years past I’ve had better luck letting the database cache than ZFS caching. But that was before compressed ARC.
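
    If you took the other path and let PostgreSQL do the caching, raising shared_buffers might look like this. A sketch only: the 4GB figure is illustrative, not a recommendation, and shared_buffers only takes effect after a restart.

    $ psql -U postgres -c "ALTER SYSTEM SET shared_buffers = '4GB';"
    $ sudo service postgresql restart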

    20 minutes later

    Now the graph is trending down. This is the past 6 hours.

    pg01 disk io going down

    Nearly two hours later

    Yes, I declare this fixed. The overall disk I/O has dropped back to what it was before the change.

    The number of PostgreSQL backends has also dropped.

    PostgreSQL backends back to normal

    28 hours later

    I waited until the next day to include this graph. It clearly shows how the IO has dropped back to the same level as before. This graph covers a two-month period.

    two-month graph

    Thanks

    Sorry it took so long to get to this. Glad I had Allan and Josh to help out.

Sep 272019
 

I was just creating a new jail for working on git & FreshPorts. I was intrigued to see that iocage uses zfs send | zfs receive to create the new jail:

[dan@slocum:~] $ ps auwwx | grep iocage
root      64166    3.7  0.0   12788    4036  1  D+   21:16         0:06.10 zfs send system/iocage/releases/12.0-RELEASE/root@git-dev
root      64167    2.8  0.0   12752    4036  1  S+   21:16         0:03.60 zfs receive -F system/iocage/jails/git-dev/root
root      63910    0.0  0.0   16480    7384  1  I+   21:16         0:00.01 sudo iocage create -r 12.0-RELEASE --thickjail --name git-dev
root      63911    0.0  0.0   53344   42484  1  I+   21:16         0:01.01 /usr/local/bin/python3.6 /usr/local/bin/iocage create -r 12.0-RELEASE --thickjail --name git-dev
dan       67954    0.0  0.0   11288    2732  3  S+   21:18         0:00.00 grep iocage
[dan@slocum:~] $ 

More later, after I get this jail configured.

Edit: 2019-09-28

From Twitter:

Something is being copied, is that a cached version of the jail template?

The answer is a local copy of FreeBSD 12.0-RELEASE:

[dan@slocum:~] $ zfs list -r system/iocage/releases
NAME                                       USED  AVAIL  REFER  MOUNTPOINT
system/iocage/releases                    3.15G  15.9T   176K  /iocage/releases
system/iocage/releases/11.2-RELEASE       1.44G  15.9T   176K  /iocage/releases/11.2-RELEASE
system/iocage/releases/11.2-RELEASE/root  1.44G  15.9T  1.44G  /iocage/releases/11.2-RELEASE/root
system/iocage/releases/12.0-RELEASE       1.71G  15.9T   176K  /iocage/releases/12.0-RELEASE
system/iocage/releases/12.0-RELEASE/root  1.71G  15.9T  1.71G  /iocage/releases/12.0-RELEASE/root
[dan@slocum:~] $ 

What’s in there?

[dan@slocum:~] $ ls /iocage/releases/12.0-RELEASE/root
COPYRIGHT boot      etc       libexec   mnt       proc      root      sys       usr
bin       dev       lib       media     net       rescue    sbin      tmp       var
[dan@slocum:~] $ 
Sep 222019
 

We have the first commit processed via git into FreshPorts. Details are in this git comment.

Work remaining:

  1. check out that commit into the working copy of the files
  2. run make -V on the working copy to get the refreshed values for the port[s] affected by this commit

The 2nd part requires very little code change.

The 1st part is just playing with git.

My thanks to Sergey Kozlov for his code which creates the XML FreshPorts needs for commit processing. That has been a great time saver for me.

Sep 182019
 

I want to move FreshPorts towards using commit hooks and away from depending upon incoming emails for processing new commits.

Much of the following came from a recent Twitter post.

You might think: why are we using emails? Why? Because we can. They were the easiest and simplest approach. It is a time-proven solution. Look at https://docs.freshports.org/ and you can see the original ideas from 2001. That is over 18 years of providing data.

If email is so good, why stop?

Because we can.

And we won’t stop using email.

Email will stay around as a fall-back position. Commit hooks are a tighter dependency upon a third party and require close cooperation. Should that relationship sour, the cooperation may terminate.

If web-hooks proceed, email processing will be modified to introduce an N-minute delay. After leaving the N-minute queue, the mail will be:

  • ignored if the commit has already been processed
  • processed if the commit is not in the database
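
A sketch of that post-delay check; the REVISION variable, the database name, and the process_the_email step are stand-ins for illustration, and the commit_log table is described below:

REVISION=352332   # hypothetical: the revision extracted from the delayed email
count=$(psql --no-align --tuples-only -c \
  "SELECT count(*) FROM commit_log WHERE svn_revision = ${REVISION}" freshports)
if [ "$count" = "0" ]; then
  process_the_email   # stand-in for the existing email processing
fi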

How is a commit identified?

Email processing is based upon the Message-Id header, which is stored in the database. Duplicates are ignored.

I am not sure if we also check the subversion revision number. That might be wise. There is an index, but it is not unique.

If we move to commit-hooks, message-id will not be available. We will have to change to relying upon the revision number or, in the case of git, the commit hash.

ACTIONS:

  • add a unique index to commit_log.svn_revision
  • remove not null constraint on commit_log.message_id
  • add commit_log.commit_hash with a unique index
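
In SQL, those actions might look like this. A sketch only; the index names are invented:

ALTER TABLE commit_log ALTER COLUMN message_id DROP NOT NULL;
ALTER TABLE commit_log ADD COLUMN commit_hash text;
CREATE UNIQUE INDEX commit_log_svn_revision_uidx ON commit_log (svn_revision);
CREATE UNIQUE INDEX commit_log_commit_hash_uidx ON commit_log (commit_hash);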

Commit processing

Regardless of how we get notified of a new commit, we must be able to put our local copy of the repo into the state as of a given commit.

For subversion, we do this:

svn up -r REVISION

After this, various commands, such as make -V, are run to extract the necessary values from the ports tree (as of the commit). This information includes PORTVERSION, PORTREVISION, etc. You can see why it is vital to have everything in our ports tree reflect the repo as of that particular commit.
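
For example, something like this (the port origin here is arbitrary):

$ make -C /usr/ports/textproc/p5-XML-RSS -V PORTVERSION -V PORTREVISION -V MAINTAINER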

For git, it is similar:

git checkout HASH

The same scripts, as described above, would be run.

Commit hooks

These are the assumptions for a commit hook:

  1. the hook gets triggered exactly once per commit
  2. the hook is fast, so as not to slow down commits

In order to be fast, the hook must pass the basic information along to a daemon, which puts it into a persistent queue; another daemon then processes that queue.

I am using hare and hared here as examples only because I am familiar with them. They won’t actually do what I need, but if I were to fork them and modify them for this specific task, I think they would do the job rather well.

My initial thoughts are:

  1. The hook invokes something like hare (see also sysutils/hare), which sends a UDP packet to something else; a sketch of this appears after the list. The packet contains the commit revision number (if subversion) or hash (if git).
  2. The UDP packet is received by something like hared (same link as above for hare, but available via sysutils/py-hared).
  3. hared then adds the data to a queue. What type of queue and where it is located is for later design.
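
Here is a minimal sketch of step 1 as a Subversion post-commit hook, using nc(1) in place of hare; the listener host and port are invented:

#!/bin/sh
# post-commit: Subversion passes the repository path and the new revision
REPOS="$1"
REV="$2"
# fire-and-forget UDP datagram to a hared-style listener (host/port invented)
echo "svn:${REV}" | nc -u -w 1 hared.example.org 5151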

Commit iteration

When processing email, looping through the emails is your iteration. When you have no email, you need something else to iterate through.

git commit iteration

I think this is the command we want to use when iterating through git commits:

git rev-list HASH..HEAD

Where HASH is the hash of our most recently processed commit. Most recently is not necessarily the last one we committed. It is the commit with the most recent timestamp. Here is an example:

$ git rev-list ee38cccad8f76b807206165324e7bf771aa981dc..HEAD
0ca46774ac207517724eb48338c04a4dbde0728a
a3361806ab49043fca46f81a0edc2357b7d3947c

Using the above, perhaps the logic for processing commits will be:

# a new commit has been detected
git pull
# list the new commits, oldest first
for commit in $(git rev-list --reverse HASH..HEAD); do
  git checkout $commit
  # magic: run the extraction scripts against this checkout
done

subversion commit iteration

With subversion we have a revision id, which is an integer.

The repo can be queried for its highest revision via:

svn log -r HEAD
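
For scripting, svn info can print just the number (svn 1.9 or later; ^/ refers to the repository root from within a working copy):

$ svn info --show-item revision ^/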

With that revision number, the code to process the commits is:

rev=$((LastCommitProcessed + 1))
while [ $rev -le $LatestCommit ]; do
  svn up -r $rev
  # process that commit
  rev=$((rev + 1))
done

How do we handle gaps in the subversion revision sequence? If we have commits 5, 7, and 8, where is commit 6? How do we note that commit 6 does not exist and that we need to get it? What if the repo has no commit 6?

Current status

Extracting the required data from the repo instead of the email should be straightforward. It must still be tested and verified.

Iterating through the commits still needs to be proven to work. Hopefully that can start soon.

Sep 032019
 

I’m trying to think of a list of things that FreshPorts can do which might be useful.

I can think of these:

  • provides an example dependency line, e.g. p5-XML-RSS>0:textproc/p5-XML-RSS
  • list of dependencies for a port
  • list of ports depending upon this port
  • Default configuration options
  • what packages install a given file (e.g. bin/unzip)
  • what ports does this person maintain?
  • which Makefiles contain a reference to bunzip?
  • search results can be plain-text consisting of a list of foo/bar ports
  • The Maximum Effort checkbox on the search page does nothing.
  • Committers can be notified of sanity test failures after the commit
  • Find a commit, any commit, based on SVN revision number, e.g. : https://www.freshports.org/commit.php?revision=352332

Any more ideas?