I noticed just now that the database has not been keeping up with the rapid changes to the incoming XML. In general, anything in the XML should eventually get into the database. That is just not happening. Case in port: repo name.
This is now becoming an issue when we link to the SVN repository. Look at this commit and you’ll see that the URL to the repo is:
/base/releng/9.1/sys/amd64/amd64/initcpu.c => http://svnweb.freebsd.org/ports/head//base/releng/9.1/sys/amd64/amd64/initcpu.c?view=log#rev240240
That’s wrong. This is a src commit. It should link to http://svnweb.freebsd.org/base/releng/9.1/sys/amd64/amd64/initcpu.c?view=log#rev240240
Hmm, looking at this more closely, I see the URL is nearly right. It just contains the wrong prefix. Remove /ports/head// and it’s fine.
What about port commits? devel/p4ftpd http://svnweb.freebsd.org/ports/head/head/devel/p4ftpd/Makefile?view=log#rev303919
Yeah. That’s wrong too. It should be http://svnweb.freebsd.org/ports/head/devel/p4ftpd/Makefile?view=log#rev303919 (i.e. remove /head/)
Really, the repo prefix for svnweb is wrong in both cases.
As for doc commits, see index.sgml which links to http://svnweb.freebsd.org/ports/head//doc/head/en_US.ISO8859-1/htdocs/releng/index.sgml?view=log#rev39514 but really needs to have /ports/head// removed from the URL.
I think this calls for a repo table, which points to the correct repo URLs.
CREATE TABLE repo( id serial not null, name text not null, description text not null, svn_hostname text not null, path_to_repo text not null ); create unique index repo_id on repo(id); create unique index repo_name on repo(name); create unique index repo_description on repo(description); create unique index repo_path_to_repo on repo(path_to_repo); insert into repo (name, description, svn_hostname, path_to_repo) values ('ports', 'The FreeBSD Ports tree', 'svnweb.freebsd.org', '/ports'); insert into repo (name, description, svn_hostname, path_to_repo) values ('doc', 'The FreeBSD doc tree', 'svnweb.freebsd.org', '/doc'); insert into repo (name, description, svn_hostname, path_to_repo) values ('src', 'The FreeBSD src tree', 'svnweb.freebsd.org', '/base');
Which gives us:
freshports.org=# select * from repo; id | name | description | svn_hostname | path_to_repo ----+-------+------------------------+--------------------+-------------- 1 | ports | The FreeBSD Ports tree | svnweb.freebsd.org | /ports 2 | doc | The FreeBSD doc tree | svnweb.freebsd.org | /doc 3 | src | The FreeBSD src tree | svnweb.freebsd.org | /base (3 rows) freshports.org=#
Now, let’s link the commit_log table to the repo table:
ALTER TABLE commit_log add repo_id integer; alter table commit_log add foreign key (repo_id) references repo(id) on update cascade on delete restrict;
Next step: populate that field… This should be interesting. Let’s start with the doc tree:
UPDATE commit_log CL SET repo_id = (SELECT id FROM repo WHERE name = 'doc') WHERE CL.id in ( SELECT distinct CL.id FROM commit_log_elements CLE, commit_log CL, element_pathname EP WHERE CLE.element_id = EP.element_id AND CLE.commit_log_id = CL.id AND EP.pathname like '/doc/%' );
That updated 14,849 records. Let’s see about the ports tree next. Hmm, what if a port affects more than one repo…? That should not happen, but let’s just check:
SELECT * FROM ( SELECT distinct CL.id, CL.repo_id FROM commit_log_elements CLE, commit_log CL, element_pathname EP WHERE CLE.element_id = EP.element_id AND CLE.commit_log_id = CL.id AND EP.pathname like '/ports/%' ) AS TMP WHERE TMP.repo_id is not null; id | repo_id -------+--------- 45075 | 2 (1 row)
This finds all commits against the ports tree for which a repo is already assigned. In this case, it must be the doc repo because that is the only assignment we have done so far. Let’s look at that commit.
freshports.org=# select message_id from commit_log where id = 45075; message_id -------------------------------------------------- 200208072021.g77KLCRq064264@freefall.freebsd.org (1 row)
OMG, are you kidding me? There’s three different repos on that commit! Oh well, I’ll just let it be converted to a port commit because I don’t want to special case the following SQL statement, which sets the repo for all the port commmits:
UPDATE commit_log CL SET repo_id = (SELECT id FROM repo WHERE name = 'ports') WHERE CL.id in ( SELECT distinct CL.id FROM commit_log_elements CLE, commit_log CL, element_pathname EP WHERE CLE.element_id = EP.element_id AND CLE.commit_log_id = CL.id AND EP.pathname like '/ports/%' );
That updated 262,478 records.
Now let’s see about the src commits. But first, we’ll check for commits to src, which affected another repo. We did a similar check for ports, after setting the docs commits.
freshports.org=# SELECT * FROM freshports.org-# ( freshports.org(# SELECT distinct CL.id, CL.repo_id freshports.org(# FROM commit_log_elements CLE, commit_log CL, element_pathname EP freshports.org(# WHERE CLE.element_id = EP.element_id freshports.org(# AND CLE.commit_log_id = CL.id freshports.org(# AND EP.pathname like '/base/%' freshports.org(# ) AS TMP freshports.org-# WHERE TMP.repo_id is not null; id | repo_id -------+--------- 27009 | 2 28918 | 2 30685 | 2 32472 | 2 36499 | 2 39683 | 2 (6 rows) freshports.org=# select E.*, element_pathname(id) from element E where id in (select element_id from commit_log_elements where commit_log_id in (27009,28918,30685,32472,36499,39683)); id | name | parent_id | directory_file_flag | status | element_pathname -------+----------------------------+-----------+---------------------+--------+--------------------------------------------------------------- 77514 | param.h | 76996 | F | A | /base/head/sys/sys/param.h 79703 | authors.ent | 77843 | F | A | /doc/head/en_US.ISO8859-1/share/sgml/authors.ent 78787 | article.sgml | 78786 | F | A | /doc/head/en_US.ISO8859-1/articles/contributors/article.sgml 81174 | netchild.key | 77796 | F | D | /doc/head/en_US.ISO8859-1/books/handbook/pgpkeys/netchild.key 77339 | news.xml | 77009 | F | D | /www/en/news/news.xml 77797 | chapter.sgml | 77796 | F | A | /doc/head/en_US.ISO8859-1/books/handbook/pgpkeys/chapter.sgml 78449 | calendar.freebsd | 77807 | F | A | /base/head/usr.bin/calendar/calendars/calendar.freebsd 77799 | pgpkeys.ent | 77796 | F | D | /doc/head/en_US.ISO8859-1/books/handbook/pgpkeys/pgpkeys.ent 83011 | book.sgml | 83010 | F | A | /doc/head/en_US.ISO8859-1/books/porters-handbook/book.sgml 8730 | freebsd.committers.markers | 8729 | F | A | /ports/head/astro/xearth/files/freebsd.committers.markers (10 rows) freshports.org=#
OK, there are all doc. Given that we will be setting src here, I think it’s OK to override those six old commits.
- 200206250815.g5P8FZn39049@freefall.freebsd.org
- 200204031503.g33F3eX91734@freefall.freebsd.org
- 200203211640.g2LGeYH71997@freefall.freebsd.org
- 200205292104.g4TL4QC94947@freefall.freebsd.org
- 200204302354.g3UNs4004251@freefall.freebsd.org
- 200204161217.g3GCHsm65354@freefall.freebsd.org
Now let’s do the src records:
UPDATE commit_log CL SET repo_id = (SELECT id FROM repo WHERE name = 'src') WHERE CL.id in ( SELECT distinct CL.id FROM commit_log_elements CLE, commit_log CL, element_pathname EP WHERE CLE.element_id = EP.element_id AND CLE.commit_log_id = CL.id AND EP.pathname like '/base/%' );
That affected 125,040 records.
How many commits are not assigned to a repo?
freshports.org=# SELECT count(*) FROM commit_log WHERE repo_id IS NULL; count ------- 43077 (1 row) freshports.org=#
Hmmm, where are those files?
freshports.org=# select E.*, element_pathname(id) from element E where id in (select element_id from commit_log_elements where commit_log_id in (select id from commit_log where repo_id IS NULL LIMIT 100)); id | name | parent_id | directory_file_flag | status | element_pathname -------+-----------------------+-----------+---------------------+--------+------------------------------------------------ 90424 | Makefile | 90423 | F | A | /www/en/donations/Makefile 90425 | index.sgml | 90423 | F | A | /www/en/donations/index.sgml 90426 | index.xsl | 77008 | F | A | /www/en/index.xsl 77005 | modules | 77004 | F | A | /CVSROOT/modules 77273 | schedule.sgml | 77272 | F | A | /www/en/releases/4.6R/schedule.sgml 77313 | index.sgml | 77312 | F | A | /www/en/smp/index.sgml 77359 | news.xml | 77358 | F | A | /www/en/gnome/news.xml 78052 | index.xsl | 78003 | F | A | /www/ja/index.xsl 78588 | bsdcon-devsummit.sgml | 78586 | F | A | /www/en/events/2002/bsdcon-devsummit.sgml 78593 | Makefile | 77008 | F | A | /www/en/Makefile 78734 | submitting.sgml | 78730 | F | A | /www/en/docproj/submitting.sgml 78735 | translations.sgml | 78730 | F | A | /www/en/docproj/translations.sgml 78761 | bylaws.sgml | 78742 | F | A | /www/en/internal/bylaws.sgml 79000 | access | 77004 | F | A | /CVSROOT/access 84072 | avail | 77004 | F | A | /CVSROOT/avail 84200 | exclude | 77004 | F | A | /CVSROOT/exclude 90073 | cvsweb.conf-netbsd | 84490 | F | A | /projects/cvsweb/cvsweb.conf-netbsd 90074 | cvsweb.conf-openbsd | 84490 | F | A | /projects/cvsweb/cvsweb.conf-openbsd 84456 | myths.sgml | 84452 | F | A | /www/en/advocacy/myths.sgml 84492 | ChangeLog | 84490 | F | A | /projects/cvsweb/ChangeLog 85014 | cvsweb.cgi | 84490 | F | A | /projects/cvsweb/cvsweb.cgi 85045 | cvsweb.cgi | 79749 | F | A | /www/en/cgi/cvsweb.cgi 85294 | index.sgml | 85293 | F | A | /www/en/projects/c99/index.sgml 86233 | missing_handler.cgi | 79749 | F | A | /www/en/cgi/missing_handler.cgi 86234 | pds.cgi | 79749 | F | A | /www/en/cgi/pds.cgi 86362 | ports.cgi | 79749 | F | A | /www/en/cgi/ports.cgi 89779 | README | 84489 | F | A | /projects/README 89819 | url.cgi | 79749 | F | A | /www/en/cgi/url.cgi 89635 | relnotes.sgml | 77008 | F | A | /www/en/relnotes.sgml 89744 | Makefile | 89743 | F | D | /projects/trustedbsd/www/Makefile 89745 | beastie.page | 89743 | F | D | /projects/trustedbsd/www/beastie.page 89746 | bib-stylesheet.xsl | 89743 | F | D | /projects/trustedbsd/www/bib-stylesheet.xsl 89831 | components.gif | 89743 | F | D | /projects/trustedbsd/www/components.gif 89747 | components.page | 89743 | F | D | /projects/trustedbsd/www/components.page 89832 | componentsgray.gif | 89743 | F | D | /projects/trustedbsd/www/componentsgray.gif 89748 | dev-stylesheet.xsl | 89743 | F | D | /projects/trustedbsd/www/dev-stylesheet.xsl 89749 | developers.dev | 89743 | F | D | /projects/trustedbsd/www/developers.dev 89750 | developers.gif | 89743 | F | D | /projects/trustedbsd/www/developers.gif 89751 | developersgray.gif | 89743 | F | D | /projects/trustedbsd/www/developersgray.gif 89752 | docs.bib | 89743 | F | D | /projects/trustedbsd/www/docs.bib 89753 | docs.gif | 89743 | F | D | /projects/trustedbsd/www/docs.gif 89754 | docsgray.gif | 89743 | F | D | /projects/trustedbsd/www/docsgray.gif 89755 | heading.xsl | 89743 | F | D | /projects/trustedbsd/www/heading.xsl 89756 | home.gif | 89743 | F | D | /projects/trustedbsd/www/home.gif 89757 | home.page | 89743 | F | D | /projects/trustedbsd/www/home.page 89758 | homegray.gif | 89743 | F | D | /projects/trustedbsd/www/homegray.gif 89759 | horizontalbar1.gif | 89743 | F | D | /projects/trustedbsd/www/horizontalbar1.gif 89760 | includes.xsl | 89743 | F | D | /projects/trustedbsd/www/includes.xsl 89761 | legal.page | 89743 | F | D | /projects/trustedbsd/www/legal.page 89762 | mailinglists.gif | 89743 | F | D | /projects/trustedbsd/www/mailinglists.gif 89763 | mailinglists.page | 89743 | F | D | /projects/trustedbsd/www/mailinglists.page 89764 | mailinglistsgray.gif | 89743 | F | D | /projects/trustedbsd/www/mailinglistsgray.gif 89765 | minipemaquidlight.jpg | 89743 | F | D | /projects/trustedbsd/www/minipemaquidlight.jpg 89766 | news.page | 89743 | F | D | /projects/trustedbsd/www/news.page 89767 | page.css | 89743 | F | D | /projects/trustedbsd/www/page.css 89768 | page.xsl | 89743 | F | D | /projects/trustedbsd/www/page.xsl 89769 | pemaquidlight.jpg | 89743 | F | D | /projects/trustedbsd/www/pemaquidlight.jpg 89770 | powerlogo.gif | 89743 | F | D | /projects/trustedbsd/www/powerlogo.gif 89771 | resources.gif | 89743 | F | D | /projects/trustedbsd/www/resources.gif 89772 | sidebar.xml | 89743 | F | D | /projects/trustedbsd/www/sidebar.xml 89773 | trusted.web.mk | 89743 | F | D | /projects/trustedbsd/www/trusted.web.mk 89774 | trustydibmed2.jpg | 89743 | F | D | /projects/trustedbsd/www/trustydibmed2.jpg 89775 | trustyshadow.gif | 89743 | F | D | /projects/trustedbsd/www/trustyshadow.gif 89776 | trustyteenie.gif | 89743 | F | D | /projects/trustedbsd/www/trustyteenie.gif 89777 | verticalbar1.gif | 89743 | F | D | /projects/trustedbsd/www/verticalbar1.gif 89778 | wwwheader.gif | 89743 | F | D | /projects/trustedbsd/www/wwwheader.gif 77339 | news.xml | 77009 | F | D | /www/en/news/news.xml 78769 | statistic.sgml | 78742 | F | A | /www/en/internal/statistic.sgml (68 rows) freshports.org=#
Oh. Now there’s some stuff I forgot we were recording… I’ll look into those and start dealing with them later, but here is a list of the distinct root directory of the files in all those commits:
freshports.org=# select distinct split_part(element_pathname(id), '/', 2) from element E where id in (select element_id from commit_log_elements where commit_log_id in (select id from commit_log where repo_id IS NULL )); split_part ----------------------------------------------- releng svnadmin vendor-crypto vendor-sys COPYRIGHT COPYRIGHT (contents, props changed) COPYRIGHT (props changed) CVSROOT CVSROOT-doc CVSROOT-ports CVSROOT-src Directory Properties: LOCKS (props changed) MAINTAINERS MAINTAINERS (contents, props changed) MAINTAINERS (props changed) Makefile Makefile (contents, props changed) Makefile (props changed) Makefile.inc1 Makefile.inc1 (contents, props changed) Makefile.inc1 (props changed) Makefile.mips NetBSD ObsoleteFiles.inc ObsoleteFiles.inc (contents, props changed) ObsoleteFiles.inc (props changed) README (props changed) Replaced: UPDATING UPDATING (contents, props changed) UPDATING (props changed) bin bind9 byacc cddl clang com_err compiler-rt contrib crypto de_DE.ISO8859-1 distrib dtracetoolkit en_US.ISO8859-1 es_ES.ISO8859-1 etc file flex fr_FR.ISO8859-1 games gcc gnu gperf groff illumos include it_IT.ISO8859-15 ja_JP.eucJP kerberos5 ldns less lib libarchive libc++ libcxxrt libexec libpcap libz llvm mn_MN.UTF-8 netcat nl_NL.ISO8859-1 one-true-awk openpam openresolv pciids ports projects release rescue ru_RU.KOI8-R sbin secure share sys tcpdump tcsh tools tzdata unbound usr.bin usr.sbin www xz-embedded zh_CN.GB2312 (96 rows) freshports.org=#
Now, how about getting repo_id set when the data is coming into the database?
Here is the perl for saving a new commit as it comes in:
$sql = "insert into commit_log (id, message_id, message_date, message_subject, date_added, commit_date, committer, description, system_id, svn_revision, repo_id, encoding_losses) values ( \ $this->{id}, " . $dbh->quote($this->{message_id}) . ", " . $dbh->quote($this->{message_date}) . ", " . $dbh->quote($this->{message_subject}) . ", " . $dbh->quote($this->{date_added}) . ", " . $dbh->quote($this->{commit_date}) . ", " . $dbh->quote($this->{committer}) . ", " . $dbh->quote($this->{description}) . ", $this->{system_id}, " . $dbh->quote($this->{revision}) . ", (SELECT id FROM repo WHERE name = " . $dbh->quote($this->{repo}) . "), $this->{encoding_losses}::boolean)";
And the change to the XML processing code was simple:
$commit_log->{repo} = $Updates{repository};
It’s getting late. I’ll work on the original problem (links to svnweb) later.