This post deals with adding a new feature: A table of which FreeBSD version and architecture has a given package
The goal: parse the packagesite.yaml supplied with each repo and load the results into the FreshPorts database.
A bunch of the background in the above URL, and I don’t want to repeat it here.
This post is mostly my notes on who this is being done.
What’s in the file
The file is labelled as .yaml, which is a superset of JSON. I first started parsing the file with Python but ran into encoding errors:
Things to be determined
There are a number of items, not so urgent, which must be resolved.
- list of valid ABI combinations – we have a starting point
Starting with Python
I started off parsing packagesite.yaml with this Python script. This produced a tab-delimited file which could then be loaded into the database.
The creation of the CSV file took about 3 minutes.
Posting for help
Last night, I posted on social media looking for help:
The next morning brought many suggestions (thank you).
The most elegant solution was the use of textproc/jq (homepage).
The new working proof-of-concepts
This is the one-liner for creating the tab-delimited file:
$ time jq -rc '[1, .origin, .name, .version] | @tsv' < ~/tmp/FreeBSD\:12\:amd64/latest/packagesite.yaml > packagesite.csv real0m1.351s user0m1.295s sys0m0.055s
That is plenty fast for processing a 31491-line file.
Loading the code into the database is still plenty fast as well:
$ time ./import-via-copy-packagesite.py real 0m2.077s user 0m0.126s sys 0m0.016s
The database tables
The data is imported into this table:
freshports.dev=# \d packages_raw Table "public.packages_raw" Column | Type | Collation | Nullable | Default -----------------+---------+-----------+----------+------------------------------ id | bigint | | not null | generated always as identity abi_id | integer | | not null | package_origin | text | | not null | package_name | text | | not null | package_version | text | | not null | Indexes: "packages_raw_pkey" PRIMARY KEY, btree (id) "fki_packages_raw_abi_id_fk" btree (abi_id) "packages_raw_package_name_idx" btree (package_name) Foreign-key constraints: "packages_raw_abi_id_fk" FOREIGN KEY (abi_id) REFERENCES abi(id) NOT VALID freshports.dev=#
That data looks like this:
freshports.dev=# SELECT * FROM packages_raw ORDER BY package_name LIMIT 5; id | abi_id | package_origin | package_name | package_version --------+--------+---------------------------+------------------+----------------- 125562 | 1 | games/0ad | 0ad | 0.0.23b_9 124778 | 1 | security/0d1n | 0d1n | 2.3 124774 | 1 | games/0verkill | 0verkill | 0.16_2 124768 | 1 | security/1password-client | 1password-client | 0.9.3 124760 | 1 | games/2048 | 2048 | 0.9.1 (5 rows) freshports.dev=#
That just allows me to run queries on the data before copying into this table:
freshports.dev=# \d packages Table "public.packages" Column | Type | Collation | Nullable | Default --------------+---------+-----------+----------+------------------------------ id | bigint | | not null | generated always as identity abi_id | integer | | not null | port_id | integer | | not null | version | text | | not null | package_name | text | | not null | Indexes: "packages_pkey" PRIMARY KEY, btree (id) "fki_packages_abi_id_fk" btree (abi_id) "fki_packages_port_id_fk" btree (port_id) "packages_package_name_idx" btree (package_name) Foreign-key constraints: "packages_abi_id_fk" FOREIGN KEY (abi_id) REFERENCES abi(id) NOT VALID "packages_port_id_fk" FOREIGN KEY (port_id) REFERENCES ports(id) NOT VALID freshports.dev=#
That data looks like this:
freshports.dev=# SELECT * FROM packages ORDER BY package_name LIMIT 5; id | abi_id | port_id | version | package_name -------+--------+---------+-----------+------------------ 63781 | 1 | 31186 | 0.0.23b_9 | 0ad 63000 | 1 | 41577 | 2.3 | 0d1n 62996 | 1 | 5241 | 0.16_2 | 0verkill 62990 | 1 | 45139 | 0.9.3 | 1password-client 62982 | 1 | 35082 | 0.9.1 | 2048 (5 rows) freshports.dev=#
Parsing the raw data into relational data
How does data get from packages_raw into packages?
Via this query:
freshports.dev=# EXPLAIN ANALYSE freshports.dev-# INSERT INTO packages(abi_id, port_id, package_name, version) freshports.dev-# SELECT abi_id, freshports.dev-# GetPort(package_origin), freshports.dev-# package_name, freshports.dev-# package_version freshports.dev-# FROM packages_raw freshports.dev-# WHERE GetPort(package_origin) IS NOT NULL; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------ Insert on packages (cost=0.00..17084.83 rows=31334 width=45) (actual time=34310.269..34310.269 rows=0 loops=1) -> Subquery Scan on "*SELECT*" (cost=0.00..17084.83 rows=31334 width=45) (actual time=13.996..33835.542 rows=31473 loops=1) -> Seq Scan on packages_raw (cost=0.00..16693.16 rows=31334 width=29) (actual time=13.957..33788.113 rows=31473 loops=1) Filter: (getport(package_origin) IS NOT NULL) Rows Removed by Filter: 18 Planning Time: 0.271 ms Trigger for constraint packages_abi_id_fk: time=354.821 calls=31473 Trigger for constraint packages_port_id_fk: time=2243.644 calls=31473 Execution Time: 36916.031 ms (9 rows) freshports.dev=#
Some explaining:
- GetPort(package_origin) converts package_origin (e.g. games/0ad) into a port id – if there is no matching port, that’s something to deal with later. Why no matching port? In my dev database, it can easily occur.
- There might be a better way to do this, but so far so good. Suggestions welcome.
What did not get converted?
This query helps me find out what data is not being copied from the packages_raw table to the packages table:
freshports.dev=# SELECT PR.* FROM packages_raw PR WHERE NOT EXISTS (SELECT * FROM packages P where PR.package_name = P.package_name); id | abi_id | package_origin | package_name | package_version --------+--------+--------------------------------------+--------------------------------+----------------- 96128 | 1 | sysutils/swapusage | swapusage | 1.0.0 100602 | 1 | devel/p5-MooX-Singleton | p5-MooX-Singleton | 1.20 101008 | 1 | devel/py-aspy.yaml | py37-aspy.yaml | 1.3.0 101036 | 1 | devel/py-nodeenv | py37-nodeenv | 1.3.3 101891 | 1 | devel/rubygem-peek-rblineprof-rails5 | rubygem-peek-rblineprof-rails5 | 0.2.0 101953 | 1 | devel/rubygem-peek-gc-rails5 | rubygem-peek-gc-rails5 | 0.0.2 103834 | 1 | devel/py-nodeenv | py27-nodeenv | 1.3.3 104220 | 1 | sysutils/py-leviathan | py37-leviathan | 0.1.1 107173 | 1 | sysutils/atf-rk3399 | atf-rk3399 | v2.1 117527 | 1 | security/picosha2 | picosha2 | 1.0.0.23 99899 | 1 | devel/p5-File-ShareDir-Tarball | p5-File-ShareDir-Tarball | 0.2.2 100063 | 1 | devel/p5-MooseX-MungeHas | p5-MooseX-MungeHas | 0.011 100562 | 1 | devel/py-aspy.yaml | py27-aspy.yaml | 1.3.0 101881 | 1 | devel/rubygem-peek-redis-rails5 | rubygem-peek-redis-rails5 | 1.2.0 101917 | 1 | devel/rubygem-peek-rails5 | rubygem-peek-rails5 | 1.0.1 101941 | 1 | databases/rubygem-peek-mysql2-rails5 | rubygem-peek-mysql2-rails5 | 1.2.0 106471 | 1 | sysutils/py-leviathan | py27-leviathan | 0.1.1 107871 | 1 | www/rubygem-gon-rails5 | rubygem-gon-rails5 | 6.2.1 (18 rows) freshports.dev=#
That query runs quickly, very quickly. There are indexes on package_name in both tables.
What’s next?
Next up is the display of this data on each port page. That’s where port_id comes into play.
freshports.dev-# PKG.package_name, freshports.dev-# PKG.version freshports.dev-# FROM packages PKG, ABI freshports.dev-# WHERE PKG.port_id = 31186 freshports.dev-# AND PKG.abi_id = ABI.id; name | package_name | version ------------------+--------------+----------- FreeBSD:12:amd64 | 0ad | 0.0.23b_9 (1 row) freshports.dev=#
Hmmm, I should rename packages.version to package.package_version for consistency with packages_raw.
packages_raw is designed to hold multiple entries for a given port, for the different ABI (application binary interface). Have a look at https://pkg.freebsd.org to see the various packages sets.
Example screen shot from security/acme.sh