Sep 082012
 

I noticed just now that the database has not been keeping up with the rapid changes to the incoming XML. In general, anything in the XML should eventually get into the database. That is just not happening. Case in port: repo name.

This is now becoming an issue when we link to the SVN repository. Look at this commit and you’ll see that the URL to the repo is:

/base/releng/9.1/sys/amd64/amd64/initcpu.c => http://svnweb.freebsd.org/ports/head//base/releng/9.1/sys/amd64/amd64/initcpu.c?view=log#rev240240

That’s wrong. This is a src commit. It should link to http://svnweb.freebsd.org/base/releng/9.1/sys/amd64/amd64/initcpu.c?view=log#rev240240

Hmm, looking at this more closely, I see the URL is nearly right. It just contains the wrong prefix. Remove /ports/head// and it’s fine.

What about port commits? devel/p4ftpd http://svnweb.freebsd.org/ports/head/head/devel/p4ftpd/Makefile?view=log#rev303919

Yeah. That’s wrong too. It should be http://svnweb.freebsd.org/ports/head/devel/p4ftpd/Makefile?view=log#rev303919 (i.e. remove /head/)

Really, the repo prefix for svnweb is wrong in both cases.

As for doc commits, see index.sgml which links to http://svnweb.freebsd.org/ports/head//doc/head/en_US.ISO8859-1/htdocs/releng/index.sgml?view=log#rev39514 but really needs to have /ports/head// removed from the URL.

I think this calls for a repo table, which points to the correct repo URLs.

CREATE TABLE repo(
   id           serial not null,
   name         text   not null,
   description  text   not null,
   svn_hostname text   not null,
   path_to_repo text   not null   
);

create unique index repo_id          on repo(id);
create unique index repo_name        on repo(name);
create unique index repo_description on repo(description);
create unique index repo_path_to_repo   on repo(path_to_repo);

insert into repo (name, description, svn_hostname, path_to_repo) values ('ports', 'The FreeBSD Ports tree', 'svnweb.freebsd.org', '/ports');
insert into repo (name, description, svn_hostname, path_to_repo) values ('doc',   'The FreeBSD doc tree',   'svnweb.freebsd.org', '/doc');
insert into repo (name, description, svn_hostname, path_to_repo) values ('src',   'The FreeBSD src tree',   'svnweb.freebsd.org', '/base');

Which gives us:

freshports.org=# select * from repo;
 id | name  |      description       |    svn_hostname    | path_to_repo 
----+-------+------------------------+--------------------+--------------
  1 | ports | The FreeBSD Ports tree | svnweb.freebsd.org | /ports
  2 | doc   | The FreeBSD doc tree   | svnweb.freebsd.org | /doc
  3 | src   | The FreeBSD src tree   | svnweb.freebsd.org | /base
(3 rows)

freshports.org=#

Now, let’s link the commit_log table to the repo table:

ALTER TABLE commit_log add repo_id integer;
alter table commit_log
    add foreign key (repo_id) 
       references repo(id) on update cascade on delete restrict;

Next step: populate that field… This should be interesting. Let’s start with the doc tree:

UPDATE commit_log CL
   SET repo_id = (SELECT id FROM repo WHERE name = 'doc')
 WHERE CL.id in
(
SELECT distinct CL.id
  FROM commit_log_elements CLE, commit_log CL, element_pathname EP
 WHERE CLE.element_id = EP.element_id
   AND CLE.commit_log_id = CL.id
   AND EP.pathname like '/doc/%'
);

That updated 14,849 records. Let’s see about the ports tree next. Hmm, what if a port affects more than one repo…? That should not happen, but let’s just check:

SELECT * FROM
(
SELECT distinct CL.id, CL.repo_id
  FROM commit_log_elements CLE, commit_log CL, element_pathname EP
 WHERE CLE.element_id = EP.element_id
   AND CLE.commit_log_id = CL.id
   AND EP.pathname like '/ports/%'
) AS TMP
WHERE TMP.repo_id is not null;

  id   | repo_id 
-------+---------
 45075 |       2
(1 row)

This finds all commits against the ports tree for which a repo is already assigned. In this case, it must be the doc repo because that is the only assignment we have done so far. Let’s look at that commit.

freshports.org=# select message_id from commit_log where id = 45075;
                    message_id                    
--------------------------------------------------
 200208072021.g77KLCRq064264@freefall.freebsd.org
(1 row)

OMG, are you kidding me? There’s three different repos on that commit! Oh well, I’ll just let it be converted to a port commit because I don’t want to special case the following SQL statement, which sets the repo for all the port commmits:

UPDATE commit_log CL
   SET repo_id = (SELECT id FROM repo WHERE name = 'ports')
 WHERE CL.id in
(
SELECT distinct CL.id
  FROM commit_log_elements CLE, commit_log CL, element_pathname EP
 WHERE CLE.element_id = EP.element_id
   AND CLE.commit_log_id = CL.id
   AND EP.pathname like '/ports/%'
);

That updated 262,478 records.

Now let’s see about the src commits. But first, we’ll check for commits to src, which affected another repo. We did a similar check for ports, after setting the docs commits.

freshports.org=# SELECT * FROM
freshports.org-# (
freshports.org(# SELECT distinct CL.id, CL.repo_id
freshports.org(#   FROM commit_log_elements CLE, commit_log CL, element_pathname EP
freshports.org(#  WHERE CLE.element_id = EP.element_id
freshports.org(#    AND CLE.commit_log_id = CL.id
freshports.org(#    AND EP.pathname like '/base/%'
freshports.org(# ) AS TMP
freshports.org-# WHERE TMP.repo_id is not null;
  id   | repo_id 
-------+---------
 27009 |       2
 28918 |       2
 30685 |       2
 32472 |       2
 36499 |       2
 39683 |       2
(6 rows)

freshports.org=# select E.*, element_pathname(id) from element E where id in (select element_id from commit_log_elements where commit_log_id in (27009,28918,30685,32472,36499,39683));
  id   |            name            | parent_id | directory_file_flag | status |                       element_pathname                        
-------+----------------------------+-----------+---------------------+--------+---------------------------------------------------------------
 77514 | param.h                    |     76996 | F                   | A      | /base/head/sys/sys/param.h
 79703 | authors.ent                |     77843 | F                   | A      | /doc/head/en_US.ISO8859-1/share/sgml/authors.ent
 78787 | article.sgml               |     78786 | F                   | A      | /doc/head/en_US.ISO8859-1/articles/contributors/article.sgml
 81174 | netchild.key               |     77796 | F                   | D      | /doc/head/en_US.ISO8859-1/books/handbook/pgpkeys/netchild.key
 77339 | news.xml                   |     77009 | F                   | D      | /www/en/news/news.xml
 77797 | chapter.sgml               |     77796 | F                   | A      | /doc/head/en_US.ISO8859-1/books/handbook/pgpkeys/chapter.sgml
 78449 | calendar.freebsd           |     77807 | F                   | A      | /base/head/usr.bin/calendar/calendars/calendar.freebsd
 77799 | pgpkeys.ent                |     77796 | F                   | D      | /doc/head/en_US.ISO8859-1/books/handbook/pgpkeys/pgpkeys.ent
 83011 | book.sgml                  |     83010 | F                   | A      | /doc/head/en_US.ISO8859-1/books/porters-handbook/book.sgml
  8730 | freebsd.committers.markers |      8729 | F                   | A      | /ports/head/astro/xearth/files/freebsd.committers.markers
(10 rows)

freshports.org=# 

OK, there are all doc. Given that we will be setting src here, I think it’s OK to override those six old commits.

  1. 200206250815.g5P8FZn39049@freefall.freebsd.org
  2. 200204031503.g33F3eX91734@freefall.freebsd.org
  3. 200203211640.g2LGeYH71997@freefall.freebsd.org
  4. 200205292104.g4TL4QC94947@freefall.freebsd.org
  5. 200204302354.g3UNs4004251@freefall.freebsd.org
  6. 200204161217.g3GCHsm65354@freefall.freebsd.org

Now let’s do the src records:

UPDATE commit_log CL
   SET repo_id = (SELECT id FROM repo WHERE name = 'src')
 WHERE CL.id in
(
SELECT distinct CL.id
  FROM commit_log_elements CLE, commit_log CL, element_pathname EP
 WHERE CLE.element_id = EP.element_id
   AND CLE.commit_log_id = CL.id
   AND EP.pathname like '/base/%'
);

That affected 125,040 records.

How many commits are not assigned to a repo?

freshports.org=# SELECT count(*) FROM commit_log WHERE repo_id IS NULL;
 count 
-------
 43077
(1 row)

freshports.org=#

Hmmm, where are those files?

freshports.org=# select E.*, element_pathname(id) 
  from element E 
 where id in (select element_id 
                from commit_log_elements 
               where commit_log_id in (select id 
                                         from commit_log
                                        where repo_id IS NULL
                                        LIMIT 100));
  id   |         name          | parent_id | directory_file_flag | status |                element_pathname                
-------+-----------------------+-----------+---------------------+--------+------------------------------------------------
 90424 | Makefile              |     90423 | F                   | A      | /www/en/donations/Makefile
 90425 | index.sgml            |     90423 | F                   | A      | /www/en/donations/index.sgml
 90426 | index.xsl             |     77008 | F                   | A      | /www/en/index.xsl
 77005 | modules               |     77004 | F                   | A      | /CVSROOT/modules
 77273 | schedule.sgml         |     77272 | F                   | A      | /www/en/releases/4.6R/schedule.sgml
 77313 | index.sgml            |     77312 | F                   | A      | /www/en/smp/index.sgml
 77359 | news.xml              |     77358 | F                   | A      | /www/en/gnome/news.xml
 78052 | index.xsl             |     78003 | F                   | A      | /www/ja/index.xsl
 78588 | bsdcon-devsummit.sgml |     78586 | F                   | A      | /www/en/events/2002/bsdcon-devsummit.sgml
 78593 | Makefile              |     77008 | F                   | A      | /www/en/Makefile
 78734 | submitting.sgml       |     78730 | F                   | A      | /www/en/docproj/submitting.sgml
 78735 | translations.sgml     |     78730 | F                   | A      | /www/en/docproj/translations.sgml
 78761 | bylaws.sgml           |     78742 | F                   | A      | /www/en/internal/bylaws.sgml
 79000 | access                |     77004 | F                   | A      | /CVSROOT/access
 84072 | avail                 |     77004 | F                   | A      | /CVSROOT/avail
 84200 | exclude               |     77004 | F                   | A      | /CVSROOT/exclude
 90073 | cvsweb.conf-netbsd    |     84490 | F                   | A      | /projects/cvsweb/cvsweb.conf-netbsd
 90074 | cvsweb.conf-openbsd   |     84490 | F                   | A      | /projects/cvsweb/cvsweb.conf-openbsd
 84456 | myths.sgml            |     84452 | F                   | A      | /www/en/advocacy/myths.sgml
 84492 | ChangeLog             |     84490 | F                   | A      | /projects/cvsweb/ChangeLog
 85014 | cvsweb.cgi            |     84490 | F                   | A      | /projects/cvsweb/cvsweb.cgi
 85045 | cvsweb.cgi            |     79749 | F                   | A      | /www/en/cgi/cvsweb.cgi
 85294 | index.sgml            |     85293 | F                   | A      | /www/en/projects/c99/index.sgml
 86233 | missing_handler.cgi   |     79749 | F                   | A      | /www/en/cgi/missing_handler.cgi
 86234 | pds.cgi               |     79749 | F                   | A      | /www/en/cgi/pds.cgi
 86362 | ports.cgi             |     79749 | F                   | A      | /www/en/cgi/ports.cgi
 89779 | README                |     84489 | F                   | A      | /projects/README
 89819 | url.cgi               |     79749 | F                   | A      | /www/en/cgi/url.cgi
 89635 | relnotes.sgml         |     77008 | F                   | A      | /www/en/relnotes.sgml
 89744 | Makefile              |     89743 | F                   | D      | /projects/trustedbsd/www/Makefile
 89745 | beastie.page          |     89743 | F                   | D      | /projects/trustedbsd/www/beastie.page
 89746 | bib-stylesheet.xsl    |     89743 | F                   | D      | /projects/trustedbsd/www/bib-stylesheet.xsl
 89831 | components.gif        |     89743 | F                   | D      | /projects/trustedbsd/www/components.gif
 89747 | components.page       |     89743 | F                   | D      | /projects/trustedbsd/www/components.page
 89832 | componentsgray.gif    |     89743 | F                   | D      | /projects/trustedbsd/www/componentsgray.gif
 89748 | dev-stylesheet.xsl    |     89743 | F                   | D      | /projects/trustedbsd/www/dev-stylesheet.xsl
 89749 | developers.dev        |     89743 | F                   | D      | /projects/trustedbsd/www/developers.dev
 89750 | developers.gif        |     89743 | F                   | D      | /projects/trustedbsd/www/developers.gif
 89751 | developersgray.gif    |     89743 | F                   | D      | /projects/trustedbsd/www/developersgray.gif
 89752 | docs.bib              |     89743 | F                   | D      | /projects/trustedbsd/www/docs.bib
 89753 | docs.gif              |     89743 | F                   | D      | /projects/trustedbsd/www/docs.gif
 89754 | docsgray.gif          |     89743 | F                   | D      | /projects/trustedbsd/www/docsgray.gif
 89755 | heading.xsl           |     89743 | F                   | D      | /projects/trustedbsd/www/heading.xsl
 89756 | home.gif              |     89743 | F                   | D      | /projects/trustedbsd/www/home.gif
 89757 | home.page             |     89743 | F                   | D      | /projects/trustedbsd/www/home.page
 89758 | homegray.gif          |     89743 | F                   | D      | /projects/trustedbsd/www/homegray.gif
 89759 | horizontalbar1.gif    |     89743 | F                   | D      | /projects/trustedbsd/www/horizontalbar1.gif
 89760 | includes.xsl          |     89743 | F                   | D      | /projects/trustedbsd/www/includes.xsl
 89761 | legal.page            |     89743 | F                   | D      | /projects/trustedbsd/www/legal.page
 89762 | mailinglists.gif      |     89743 | F                   | D      | /projects/trustedbsd/www/mailinglists.gif
 89763 | mailinglists.page     |     89743 | F                   | D      | /projects/trustedbsd/www/mailinglists.page
 89764 | mailinglistsgray.gif  |     89743 | F                   | D      | /projects/trustedbsd/www/mailinglistsgray.gif
 89765 | minipemaquidlight.jpg |     89743 | F                   | D      | /projects/trustedbsd/www/minipemaquidlight.jpg
 89766 | news.page             |     89743 | F                   | D      | /projects/trustedbsd/www/news.page
 89767 | page.css              |     89743 | F                   | D      | /projects/trustedbsd/www/page.css
 89768 | page.xsl              |     89743 | F                   | D      | /projects/trustedbsd/www/page.xsl
 89769 | pemaquidlight.jpg     |     89743 | F                   | D      | /projects/trustedbsd/www/pemaquidlight.jpg
 89770 | powerlogo.gif         |     89743 | F                   | D      | /projects/trustedbsd/www/powerlogo.gif
 89771 | resources.gif         |     89743 | F                   | D      | /projects/trustedbsd/www/resources.gif
 89772 | sidebar.xml           |     89743 | F                   | D      | /projects/trustedbsd/www/sidebar.xml
 89773 | trusted.web.mk        |     89743 | F                   | D      | /projects/trustedbsd/www/trusted.web.mk
 89774 | trustydibmed2.jpg     |     89743 | F                   | D      | /projects/trustedbsd/www/trustydibmed2.jpg
 89775 | trustyshadow.gif      |     89743 | F                   | D      | /projects/trustedbsd/www/trustyshadow.gif
 89776 | trustyteenie.gif      |     89743 | F                   | D      | /projects/trustedbsd/www/trustyteenie.gif
 89777 | verticalbar1.gif      |     89743 | F                   | D      | /projects/trustedbsd/www/verticalbar1.gif
 89778 | wwwheader.gif         |     89743 | F                   | D      | /projects/trustedbsd/www/wwwheader.gif
 77339 | news.xml              |     77009 | F                   | D      | /www/en/news/news.xml
 78769 | statistic.sgml        |     78742 | F                   | A      | /www/en/internal/statistic.sgml
(68 rows)

freshports.org=#

Oh. Now there’s some stuff I forgot we were recording… I’ll look into those and start dealing with them later, but here is a list of the distinct root directory of the files in all those commits:

freshports.org=# select distinct split_part(element_pathname(id), '/', 2) 
  from element E 
 where id in (select element_id 
                from commit_log_elements 
               where commit_log_id in (select id 
                                         from commit_log
                                        where repo_id IS NULL
                                        ));
                  split_part                   
-----------------------------------------------
   releng
   svnadmin
   vendor-crypto
   vendor-sys
 COPYRIGHT
 COPYRIGHT   (contents, props changed)
 COPYRIGHT   (props changed)
 CVSROOT
 CVSROOT-doc
 CVSROOT-ports
 CVSROOT-src
 Directory Properties:
 LOCKS   (props changed)
 MAINTAINERS
 MAINTAINERS   (contents, props changed)
 MAINTAINERS   (props changed)
 Makefile
 Makefile   (contents, props changed)
 Makefile   (props changed)
 Makefile.inc1
 Makefile.inc1   (contents, props changed)
 Makefile.inc1   (props changed)
 Makefile.mips
 NetBSD
 ObsoleteFiles.inc
 ObsoleteFiles.inc   (contents, props changed)
 ObsoleteFiles.inc   (props changed)
 README   (props changed)
 Replaced:
 UPDATING
 UPDATING   (contents, props changed)
 UPDATING   (props changed)
 bin
 bind9
 byacc
 cddl
 clang
 com_err
 compiler-rt
 contrib
 crypto
 de_DE.ISO8859-1
 distrib
 dtracetoolkit
 en_US.ISO8859-1
 es_ES.ISO8859-1
 etc
 file
 flex
 fr_FR.ISO8859-1
 games
 gcc
 gnu
 gperf
 groff
 illumos
 include
 it_IT.ISO8859-15
 ja_JP.eucJP
 kerberos5
 ldns
 less
 lib
 libarchive
 libc++
 libcxxrt
 libexec
 libpcap
 libz
 llvm
 mn_MN.UTF-8
 netcat
 nl_NL.ISO8859-1
 one-true-awk
 openpam
 openresolv
 pciids
 ports
 projects
 release
 rescue
 ru_RU.KOI8-R
 sbin
 secure
 share
 sys
 tcpdump
 tcsh
 tools
 tzdata
 unbound
 usr.bin
 usr.sbin
 www
 xz-embedded
 zh_CN.GB2312
(96 rows)

freshports.org=#  

Now, how about getting repo_id set when the data is coming into the database?

Here is the perl for saving a new commit as it comes in:

      $sql = "insert into commit_log (id, message_id, message_date, message_subject, date_added, commit_date, 
                                                                                        committer, description, system_id, svn_revision, repo_id, encoding_losses) values ( \
                                $this->{id},
                                " . $dbh->quote($this->{message_id})      . ",
                                " . $dbh->quote($this->{message_date})    . ",
                                " . $dbh->quote($this->{message_subject}) . ",
                                " . $dbh->quote($this->{date_added})      . ",
                                " . $dbh->quote($this->{commit_date})     . ",
                                " . $dbh->quote($this->{committer})       . ",
                                " . $dbh->quote($this->{description})     . ",
                                $this->{system_id},
                                " . $dbh->quote($this->{revision})        . ",
                                (SELECT id FROM repo WHERE name = " . $dbh->quote($this->{repo}) . "),
                                $this->{encoding_losses}::boolean)";
 

And the change to the XML processing code was simple:

       $commit_log->{repo}                     = $Updates{repository};

It’s getting late. I’ll work on the original problem (links to svnweb) later.

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive