Parsing the URL

When FreshPorts started, it used unique numbers to identify ports. This number was the primary key in the ports table. It was during a talk at a local LUG that I discovered how easy it would be to implement the solution you see now (e.g. sysutils/bacula-server. On the FreshPorts webserver, there is no sysutils directory. So how does FreshPorts parse the URL and know what to display?

It’s PHP magic combined with Apache directives. The root of this magic is this directive in the virtual host for FreshPorts:

ErrorDocument   404 /missing.php

When you browse to http://www.freshports.org/sysutils/bacula-server/, Apache cannot satisify the request and raises a 404 (NOT FOUND) error. The directive shown above tells apache to redirect to a local URL-path (missing.php) to handle the problem. It is there that the real magic occurs.

What Magic Is This?

It was Rasmus’ discussion of how the PHP help works that gave me the inspiration for the solution. Think of something in PHP that you want to look up. Then slap that term on the URL for their website. You’ll get help. You might the topic you want. It’s pretty amazing. That’s what FreshPorts does.

I will now describe how FreshPorts does its particular magic. Why? For two reasons:

  • When I describe something in detail, even something which I am quite familiar with, it reinforces that information in my mind and helps me to understand it even better
  • I want to share how FreshPorts does this

FreshPorts looks at the value supplied in $_SERVER['REDIRECT_URL']. In our example, this value would be /sysutils/bacula-server/. There are some other things it does, none of which are relevant here, but that’s it.

FreshPorts stores the entire source tree in in database. Not the file contents, but the directory and file structure. This is vast overkill. It could be simplier. But then it would be just for ports. The database has always been designed with source in general and ports in specific. Really, what FreshPorts looks for is the directory ports/sysutils/bacula-server. If if finds that in the database, it knows we have a port.

Each directory of the path is referred to as an element. In the path above, the elements are:

  1. ports
  2. sysutils
  3. bacula-server

Every element has an entry in the element table. This is a self referential table defined as:

CREATE TABLE element
(
id serial NOT NULL,
name text NOT NULL,
parent_id int4,
directory_file_flag char(1) NOT NULL,
status char(1) NOT NULL,
CONSTRAINT element_pkey PRIMARY KEY (id),
CONSTRAINT "$1" FOREIGN KEY (parent_id) REFERENCES element (id) ON UPDATE RESTRICT ON DELETE CASCADE,
CONSTRAINT element_directory_file_flag CHECK (directory_file_flag = 'F'::bpchar OR directory_file_flag = 'D'::bpchar),
CONSTRAINT element_status CHECK (status = 'A'::bpchar OR status = 'D'::bpchar)
)

The entries in this table, for the URL in question, are:

freshports.org=# select * from element where id in (1, 218, 204995);
   id   |     name      | parent_id | directory_file_flag | status
--------+---------------+-----------+---------------------+--------
      1 | ports         |           | D                   | A
    218 | sysutils      |         1 | D                   | A
 204995 | bacula-server |       218 | D                   | A
(3 rows)

freshports.org=#

Yes, I looked up the id values and used them in this query. But there, you see what it looks like. And probably more interesting is this output:

freshports.org=# select * from element where id in (1, 218, 204995) or parent_id = 204995;
   id   |         name         | parent_id | directory_file_flag | status
--------+----------------------+-----------+---------------------+--------
      1 | ports                |           | D                   | A
    218 | sysutils             |         1 | D                   | A
 204995 | bacula-server        |       218 | D                   | A
 230518 | pkg-message          |    204995 | F                   | A
 212066 | files                |    204995 | D                   | A
 212065 | distinfo             |    204995 | F                   | A
 205748 | pkg-deinstall        |    204995 | F                   | A
 205000 | pkg-plist.client     |    204995 | F                   | A
 204999 | pkg-plist            |    204995 | F                   | A
 204998 | pkg-install.client   |    204995 | F                   | A
 204997 | pkg-deinstall.client |    204995 | F                   | A
 204996 | Makefile             |    204995 | F                   | A
(12 rows)

freshports.org=#

You can clearly see all the files and directories involved in this port. It’s pretty easy to follow.

Categories, real and virtual

Virtual categories represent a very different problem. Secondary catgories also present a challenge. The categories table is defined as:

CREATE TABLE categories
(
id serial NOT NULL,
is_primary bool NOT NULL,
element_id int4,
name text NOT NULL,
description text,
CONSTRAINT categories_pkey PRIMARY KEY (id),
CONSTRAINT "$1" FOREIGN KEY (element_id) REFERENCES element (id) ON UPDATE CASCADE ON DELETE CASCADE
)

Primary categories exist on disk. sysutils is a primary category. A port exists within a primary category but may also be placed in secondary catgories and in virtual categories. Virtual categories are those that do not exist on disk. They are merely collections of like ports. A secondary category may be either a primary or a virtual category. afterstep is a virtual category.

All categories, virtual and primary, are present in the categories table. Relationships between ports and categories are handled by the ports_categories table:

CREATE TABLE ports_categories
(
port_id int4 NOT NULL,
category_id int4 NOT NULL,
CONSTRAINT ports_categories_pkey PRIMARY KEY (port_id, category_id),
CONSTRAINT "$1" FOREIGN KEY (port_id) REFERENCES ports (id) ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT "$2" FOREIGN KEY (category_id) REFERENCES categories (id) ON UPDATE CASCADE ON DELETE CASCADE
)

Where are the ports?

The ports are in the ports table…. shocking, I know.

freshports.org=# select id, element_id, category_id, short_description from ports where element_id = 204995;
  id   | element_id | category_id |          short_description
-------+------------+-------------+--------------------------------------
 13982 |     204995 |          20 | The network backup solution (server)
(1 row)

What categories is this port in?

freshports.org=# select * from ports_categories where port_id = 13982;
 port_id | category_id
---------+-------------
   13982 |          20
(1 row)

freshports.org=# select * from categories where id = 20;
 id | is_primary | element_id |   name   |   description
----+------------+------------+----------+------------------
 20 | t          |        218 | sysutils | System utilities
(1 row)

A faster way

It was during the investigation of a recent bug concerning virtual categories that I thought about a faster way to do things. The current method makes use of this stored procedure:

select * from elementGet('ports/sysutils/bacula');
   id   |  name  | type | status | iscategory | isport
--------+--------+------+--------+------------+--------
 131289 | bacula | D    | D      | f          | t

That takes about 20-55 ms.

But I was thinking about an easier way that would also work for virtual categories:

SELECT P.id, C.id, EP.name, C.name
  FROM element EP, ports P, categories C, ports_categories PC
WHERE EP.name        = 'bacula-server'
  AND EP.id          = P.element_id
  AND C.name         = 'sysutils'
  AND PC.port_id     = P.id
  AND PC.category_id = C.id;

And that takes about 20ms. About the same time! WOO HOO!

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive

Leave a Comment

Scroll to Top