Duplicate dependency issues – avoiding duplicate rows

Databases use relational integrity to enforce expected situations. A common scenario is duplicates. Case in point, I present the port_dependencies table:

freshports.org=# \d port_dependencies
                    Table "public.port_dependencies"
         Column         |     Type     | Collation | Nullable | Default 
------------------------+--------------+-----------+----------+---------
 port_id                | integer      |           | not null | 
 port_id_dependent_upon | integer      |           | not null | 
 dependency_type        | character(1) |           | not null | 
Indexes:
    "port_dependencies_pkey" PRIMARY KEY, btree (port_id, port_id_dependent_upon, dependency_type)
Foreign-key constraints:
    "port_dependencies_port_id_dependent_upon_fkey" FOREIGN KEY (port_id_dependent_upon) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE
    "port_dependencies_port_id_fkey" FOREIGN KEY (port_id) REFERENCES ports(id) ON UPDATE CASCADE ON DELETE CASCADE
Triggers:
    port_dependencies_delete_clear_cache AFTER DELETE ON port_dependencies FOR EACH ROW EXECUTE PROCEDURE port_dependencies_delete_clear_cache()
    port_dependencies_insert_clear_cache AFTER INSERT ON port_dependencies FOR EACH ROW EXECUTE PROCEDURE port_dependencies_insert_clear_cache()

freshports.org=# 

For those not familiar with FreeBSD ports, each port (you could also refer to them as a package or application) can have zero or more dependencies. The FreshPorts database extracts and lists these dependencies for convenient viewing.

As you can see in the above table definition, for a specific (port_id), a given dependency type (port_id_dependent_upon, dependency_type) can only occur once. There is no reason for a port to be dependent upon something twice.

Or so you, or I, might think.

Then this happened:

Could not execute SQL SELECT PortsDependenciesAdd( 'net/ceph', 'devel/boost-python-libs', 'L' ) as result ... 
maybe invalid? ERROR:  duplicate key value violates unique constraint "port_dependencies_pkey"

It was part of this commit on net/ceph. Let’s find out where FreshPorts went wrong.

Clearly, something tried to record a duplicate dependency. The code doesn’t look for it, and did not detect the violation.

The log file

FreshPorts logs a lot of stuff, because it can. It has always been useful for situations such as this.

Here is what I found:

depends with this: 'libboost_python.so:devel/boost-python-libs@py27 libboost_python.so:devel/boost-python-libs
libboost_thread.so:devel/boost-libs libleveldb.so:databases/leveldb libnss3.so:security/nss
libcryptopp.so:security/cryptopp libsnappy.so:archivers/snappy libcurl.so:ftp/curl libxml2.so:textproc/libxml2
libexpat.so:textproc/expat2 liblz4.so:archivers/liblz4 libplds4.so:devel/nspr libtcmalloc.so:devel/google-perftools
libfuse.so:sysutils/fusefs-libs libintl.so:devel/gettext-runtime libldap-2.4.so.2:net/openldap24-client'

The next entry referred to:

The 'L' depends are: devel/boost-python-libs@py27 - devel/boost-python-libs - devel/boost-libs - 
databases/leveldb - security/nss - security/cryptopp - archivers/snappy - ftp/curl - textproc/libxml2 - 
textproc/expat2 - archivers/liblz4 - devel/nspr - devel/google-perftools - sysutils/fusefs-libs -
devel/gettext-runtime - net/openldap24-client

Then the code starts processing them, one by one:

adding in devel/boost-python-libs@py27 which converts to 'devel/boost-python-libs'
sql is SELECT PortsDependenciesAdd( 'net/ceph', 'devel/boost-python-libs', 'L' ) as result
result is 1

As you can see, it strips the flavor (@py27) from the dependency. I think I know where this is going.

And then the next one:

adding in devel/boost-python-libs which converts to 'devel/boost-python-libs'
sql is SELECT PortsDependenciesAdd( 'net/ceph', 'devel/boost-python-libs', 'L' ) as result
Could not execute SQL SELECT PortsDependenciesAdd( 'net/ceph', 'devel/boost-python-libs', 'L' ) as result ...
maybe invalid? ERROR:  duplicate key value violates unique constraint "port_dependencies_pkey"
DETAIL:  Key (port_id, port_id_dependent_upon, dependency_type)=(45100, 25346, L) already exists.
CONTEXT:  SQL statement "INSERT INTO port_dependencies (port_id, port_id_dependent_upon, dependency_type)
            VALUES (l_PortID, l_PortIDDependency, p_DependencyType)"
PL/pgSQL function portsdependenciesadd(text,text,text) line 19 at SQL statement
that failed

Ahh, OK, that’s why. I see now.

Let’s go to the source.

The source

This isn’t the source, it’s the Makefile from that commit. Looking at lines 25-28 we see:

25	LIB_DEPENDS=    \
26	        ${PY_BOOST} \
27	        libboost_python.so:devel/boost-python-libs \
28	        libboost_thread.so:devel/boost-libs \

I bet it’s coming from here:

$ make -V PY_BOOST
libboost_python.so:devel/boost-python-libs@py27

Yes, that’s it. Now we know where and why the duplicates are arising.

How to fix this

I could take one of two approaches:

  1. Detect and eliminate duplicates within the code.
  2. Within the database, ignore attempts to add duplicates.

The database is wonderful at doing this. I could write code, and it might be buggy. I trust the database for this, much more than I trust my code. Plus, I’ll have less code to maintain.

I’ll take option #2.

The code

Here is the statement which inserts the tuple into the database:

INSERT INTO port_dependencies (port_id, port_id_dependent_upon, dependency_type)
      VALUES (l_PortID, l_PortIDDependency, p_DependencyType);

There is no detection of duplicate rows.

In different situations, I have used this approach to detect duplicates during insertion:

INSERT INTO port_dependencies (port_id, port_id_dependent_upon, dependency_type)
  SELECT l_PortID, l_PortIDDependency, p_DependencyType
   WHERE NOT EXISTS (SELECT port_id, port_id_dependent_upon, dependency_type
                       FROM port_dependencies
                      WHERE port_id                = l_PortID
                        AND port_id_dependent_upon = l_PortIDDependency
                        AND dependency_type        = p_DependencyType);

The above may be easier to follow if I do the variable substitution:

 INSERT INTO port_dependencies (port_id, port_id_dependent_upon, dependency_type)
  SELECT 45100, 25346, 'L'
   WHERE NOT EXISTS (SELECT port_id, port_id_dependent_upon, dependency_type
                       FROM port_dependencies
                      WHERE port_id                = 45100
                        AND port_id_dependent_upon = 25346
                        AND dependency_type        = 'L');

That works well, exactly as expected. It does what I need.

But there is a better way

Then I posted on Twitter about this, with the above example. My cousin, Scott Walsh, mentioned ON CONFLICT. This feature arrived with PostgreSQL 9.5 (see docs), which was released in early 2016. I haven’t kept up. FreshPorts added dependency support in 2011.

I immediately tried this approach:

INSERT INTO port_dependencies (port_id, port_id_dependent_upon, dependency_type)
freshports.org-#   SELECT 45100, 25346, 'L'
freshports.org-# ON CONFLICT DO NOTHING;
INSERT 0 0

That’s great.

Then I got to thinking… there are other conflicts which could occur, errors which should be brought to my attention. Foreign key violations come to mind.

Reading more of the documentation came up with this solution:

freshports.org=# INSERT INTO port_dependencies (port_id, port_id_dependent_upon, dependency_type)
freshports.org-#   SELECT 45100, 25346, 'L'
freshports.org-# ON CONFLICT ON CONSTRAINT port_dependencies_pkey DO NOTHING;
INSERT 0 0

EDIT 2017.03.17

I see I’ve left my SELECT in that INSERT. The following is what I will use in production:

INSERT INTO port_dependencies (port_id, port_id_dependent_upon, dependency_type)
    VALUES (l_PortID, l_PortIDDependency, p_DependencyType)
    ON CONFLICT ON CONSTRAINT port_dependencies_pkey DO NOTHING;

When running my first test of that code, this is what was found in the logs (two duplicate inserts, no failures):

sql is SELECT PortsDependenciesAdd( 'net/ceph', 'devel/boost-python-libs', 'L' ) as result
NOTICE:   branch is head
NOTICE:   GetPort pathname is=/ports/head/net/ceph
NOTICE:   branch is head
NOTICE:   GetPort pathname is=/ports/head/devel/boost-python-libs
result is 1
adding in devel/boost-python-libs which converts to 'devel/boost-python-libs'
sql is SELECT PortsDependenciesAdd( 'net/ceph', 'devel/boost-python-libs', 'L' ) as result
NOTICE:   branch is head
NOTICE:   GetPort pathname is=/ports/head/net/ceph
NOTICE:   branch is head
NOTICE:   GetPort pathname is=/ports/head/devel/boost-python-libs
result is 1

That’s exactly what I need. Phew.


This code deals specifically with one constraint, port_dependencies_pkey. If that constraint is violated, we have a duplicate key: do nothing, ignore it, as if nothing happened. Please move along, nothing to see here, thank you.

This, for my situation, is exactly what I want. Instead of removing duplicates in the front end code, I can let the database silently handle it. Score.

Thanks to those who helped me find this solution. I hope it helps you too. This is not a one-size-fits-all solution, but it might be applicable to your needs.

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive

Leave a Comment

Scroll to Top