Replacing a column search with a full text search solution

One of the many useful features on FreshPorts is: what port[s] install[s] this file? That’s the pkg-plist search option. pkg-plist is a file which “lists all the files installed by the port”. However not all ports have a pkg-plist file because the list is compiled automatically. That is why the configure_plist table was created to replace the ports.pkg_plist column. The creation of that table broke the search function because it was then working on outdated data.

Last week, I blogged about how I created a new a new stored procedure for pulling back the pkg-plist information.

The new table

By new, I mean new to this search solution. The table looks like this:

freshports.dev=# \d generate_plist
                                Table "public.generate_plist"
     Column     |  Type   | Collation | Nullable |                  Default                   
----------------+---------+-----------+----------+--------------------------------------------
 id             | bigint  |           | not null | nextval('generate_plist_id_seq'::regclass)
 port_id        | integer |           | not null | 
 installed_file | text    |           | not null | 
Indexes:
    "generate_plist_installed_file_gin_idx" gin (to_tsvector('english'::regconfig, installed_file))
    "generate_plist_installed_file_idx" btree (installed_file)
    "generate_plist_pk" btree (id)
    "generate_plist_port_id_idx" btree (port_id)

freshports.dev=# 

For a given port, there will be N entries in this table, one for each item in the pkg-plist. You will notice some indexes on the installed_file column.

A simple search

With the old solution, we could search like this, but it took time:

AND P.pkg_plist ILIKE '%share/examples/acme.sh/dnsapi/dns_1984hosting.sh%'

Over 2.7 seconds in fact.

Let’s try looking in the generate_plist table instead. I was reading the Tables and Indexes section of the PostgreSQL Full Text Search and decided to start with this:

ALTER TABLE public.generate_plist
    ADD COLUMN textsearchable_index_col tsvector GENERATED ALWAYS AS (to_tsvector('english'::regconfig, installed_file)) STORED;

CREATE INDEX generate_plist_textsearch_idx
    ON public.generate_plist USING gin
    (textsearchable_index_col)
    TABLESPACE pg_default;

With that, the following search was pretty damn nice.

freshports.dev=# explain analyse    SELECT
        port_id
    FROM
        generate_plist
    WHERE
        textsearchable_index_col @@ to_tsquery('share/examples/acme.sh/dnsapi/dns_1984hosting.sh')
;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on generate_plist  (cost=367.55..2143.46 rows=426 width=4) (actual time=2.116..2.123 rows=2 loops=1)
   Recheck Cond: (textsearchable_index_col @@ to_tsquery('share/examples/acme.sh/dnsapi/dns_1984hosting.sh'::text))
   Heap Blocks: exact=2
   ->  Bitmap Index Scan on generate_plist_textsearch_idx  (cost=0.00..367.44 rows=426 width=0) (actual time=2.095..2.095 rows=2 loops=1)
         Index Cond: (textsearchable_index_col @@ to_tsquery('share/examples/acme.sh/dnsapi/dns_1984hosting.sh'::text))
 Planning Time: 0.434 ms
 Execution Time: 2.208 ms
(7 rows)

But this was … umm, what?

freshports.dev=# SELECT
        port_id
    FROM
        generate_plist
    WHERE
        textsearchable_index_col @@ to_tsquery('dns_1984hosting.sh')
;
 port_id 
---------
(0 rows)

freshports.dev=# 

Nothing found? What?

Full text search is not like that

Full text search is not like that. My search on the full path name would have worked just fine with the existing index:

freshports.dev=# explain analyse SELECT
        port_id
    FROM
        generate_plist
    WHERE
        installed_file = 'share/examples/acme.sh/dnsapi/dns_1984hosting.sh';
                                                                    QUERY PLAN                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using generate_plist_installed_file_idx on generate_plist  (cost=0.56..28.66 rows=6 width=4) (actual time=0.046..0.052 rows=2 loops=1)
   Index Cond: (installed_file = 'share/examples/acme.sh/dnsapi/dns_1984hosting.sh'::text)
 Planning Time: 0.222 ms
 Execution Time: 0.088 ms
(4 rows)

freshports.dev=# 

In fact, it’s faster.

But that’s not what is needed. Let’s look at the use case.

Searching for file names

We want to search for the full path or the file name. Or just dns_1984hosting.sh or dns_1984hosting.

With help from RhodiumToad on IRC, tsearch is all about matching words (technically lexemes), not arbitrary substrings. We need to think in terms of lexemes. For example:

freshports.dev=# select ts_debug('english', 'foo/bar');
                           ts_debug                           
--------------------------------------------------------------
 (file,"File or path name",foo/bar,{simple},simple,{foo/bar})
(1 row)

freshports.dev=# 

So let’s split the file name into distinct words. Let’s convert the / to space and add a new column for that:

ALTER TABLE public.generate_plist
    ADD COLUMN textsearchable_index_col2 tsvector GENERATED ALWAYS AS (to_tsvector('english'::regconfig, translate(installed_file, '/'::text, ' '::text))) STORED;

CREATE INDEX generate_plist_textsearch_idx2
    ON public.generate_plist USING gin
    (textsearchable_index_col2)
    TABLESPACE pg_default;

Let’s try searching now:

freshports.dev=# EXPLAIN ANALYSE
WITH short_list AS (
    SELECT
        port_id, installed_file
    FROM
        generate_plist
    WHERE
     textsearchable_index_col2 @@ to_tsquery('dns_1984hosting.sh')
)
  select P.id, element_pathname(P.element_id), SL.installed_file
  FROM ports P, short_list SL
 WHERE P.id = SL.port_id;
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=171.64..25352.59 rows=2981 width=88) (actual time=1.337..1.506 rows=2 loops=1)
   ->  Bitmap Heap Scan on generate_plist  (cost=171.35..12102.68 rows=2981 width=56) (actual time=1.031..1.040 rows=2 loops=1)
         Recheck Cond: (textsearchable_index_col2 @@ to_tsquery('dns_1984hosting.sh'::text))
         Heap Blocks: exact=2
         ->  Bitmap Index Scan on generate_plist_textsearch_idx2  (cost=0.00..170.61 rows=2981 width=0) (actual time=1.015..1.015 rows=2 loops=1)
               Index Cond: (textsearchable_index_col2 @@ to_tsquery('dns_1984hosting.sh'::text))
   ->  Index Scan using ports_pkey on ports p  (cost=0.29..4.19 rows=1 width=8) (actual time=0.014..0.014 rows=1 loops=2)
         Index Cond: (id = generate_plist.port_id)
 Planning Time: 1.064 ms
 Execution Time: 1.596 ms
(10 rows)

freshports.dev=# 

Umm, that’s fast.

But if we try the file without the .sh we get nothing:

freshports.dev=# WITH short_list AS (
    SELECT
        port_id, installed_file
    FROM
        generate_plist
    WHERE
     textsearchable_index_col2 @@ to_tsquery('dns_1984hosting')
)
  select P.id, element_pathname(P.element_id), SL.installed_file
  FROM ports P, short_list SL
 WHERE P.id = SL.port_id;
 id | element_pathname | installed_file 
----+------------------+----------------
(0 rows)

freshports.dev=# 

Another column for that

Humor me here while I add another column:

ALTER TABLE public.generate_plist
    ADD COLUMN textsearchable_index_col3 tsvector GENERATED ALWAYS AS (to_tsvector('english'::regconfig, translate(installed_file, '/.'::text, '  '::text))) STORED;

CREATE INDEX generate_plist_textsearch_idx3
    ON public.generate_plist USING gin
    (textsearchable_index_col3)
    TABLESPACE pg_default;

This translates both / and . to a space. Now we have:

freshports.dev=# WITH short_list AS (
    SELECT
        port_id, installed_file
    FROM
        generate_plist
    WHERE
     textsearchable_index_col3 @@ to_tsquery('dns_1984hosting')
)
  select P.id, element_pathname(P.element_id), SL.installed_file
  FROM ports P, short_list SL
 WHERE P.id = SL.port_id;
  id   |            element_pathname             |                  installed_file                  
-------+-----------------------------------------+--------------------------------------------------
 59654 | /ports/branches/2020Q3/security/acme.sh | share/examples/acme.sh/dnsapi/dns_1984hosting.sh
 43508 | /ports/head/security/acme.sh            | share/examples/acme.sh/dnsapi/dns_1984hosting.sh
(2 rows)

freshports.dev=# 

Which is still lightning fast.

Doing them all at once

Creating a query which uses all three columns is still performant:

freshports.dev=# explain analyse WITH short_list AS (
    SELECT
        DISTINCT port_id, installed_file
    FROM
        generate_plist
    WHERE
        textsearchable_index_col  @@ to_tsquery('bacula')
     OR textsearchable_index_col2 @@ to_tsquery('bacula')
     OR textsearchable_index_col3 @@ to_tsquery('bacula')
)
  select P.id, element_pathname(P.element_id), SL.installed_file
  FROM ports P, short_list SL
 WHERE P.id = SL.port_id;
                                                                                                       QUERY PLAN                                                                                                       
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=50810.64..53335.38 rows=7790 width=88) (actual time=273.782..520.103 rows=1872 loops=1)
   Hash Cond: (sl.port_id = p.id)
   ->  Subquery Scan on sl  (cost=35602.05..35757.85 rows=7790 width=56) (actual time=14.561..16.078 rows=1872 loops=1)
         ->  HashAggregate  (cost=35602.05..35679.95 rows=7790 width=56) (actual time=14.558..15.495 rows=1872 loops=1)
               Group Key: generate_plist.port_id, generate_plist.installed_file
               ->  Bitmap Heap Scan on generate_plist  (cost=1705.16..35563.02 rows=7806 width=56) (actual time=11.540..12.771 rows=1872 loops=1)
                     Recheck Cond: ((textsearchable_index_col @@ to_tsquery('bacula'::text)) OR (textsearchable_index_col2 @@ to_tsquery('bacula'::text)) OR (textsearchable_index_col3 @@ to_tsquery('bacula'::text)))
                     Heap Blocks: exact=116
                     ->  BitmapOr  (cost=1705.16..1705.16 rows=7808 width=0) (actual time=11.489..11.490 rows=0 loops=1)
                           ->  Bitmap Index Scan on generate_plist_textsearch_idx  (cost=0.00..415.44 rows=426 width=0) (actual time=2.994..2.995 rows=36 loops=1)
                                 Index Cond: (textsearchable_index_col @@ to_tsquery('bacula'::text))
                           ->  Bitmap Index Scan on generate_plist_textsearch_idx2  (cost=0.00..170.61 rows=2981 width=0) (actual time=1.373..1.374 rows=1857 loops=1)
                                 Index Cond: (textsearchable_index_col2 @@ to_tsquery('bacula'::text))
                           ->  Bitmap Index Scan on generate_plist_textsearch_idx3  (cost=0.00..1113.26 rows=4401 width=0) (actual time=7.117..7.117 rows=1872 loops=1)
                                 Index Cond: (textsearchable_index_col3 @@ to_tsquery('bacula'::text))
   ->  Hash  (cost=14174.04..14174.04 rows=63004 width=8) (actual time=258.282..258.283 rows=63065 loops=1)
         Buckets: 32768  Batches: 4  Memory Usage: 877kB
         ->  Seq Scan on ports p  (cost=0.00..14174.04 rows=63004 width=8) (actual time=0.089..219.485 rows=63065 loops=1)
 Planning Time: 1.227 ms
 Execution Time: 521.244 ms
(20 rows)

freshports.dev=# 

At what cost?

What does the table look like now:

freshports.dev=# \d generate_plist
                                                                          Table "public.generate_plist"
          Column           |   Type   | Collation | Nullable |                                                      Default                                                      
---------------------------+----------+-----------+----------+-------------------------------------------------------------------------------------------------------------------
 id                        | bigint   |           | not null | nextval('generate_plist_id_seq'::regclass)
 port_id                   | integer  |           | not null | 
 installed_file            | text     |           | not null | 
 textsearchable_index_col  | tsvector |           |          | generated always as (to_tsvector('english'::regconfig, installed_file)) stored
 textsearchable_index_col2 | tsvector |           |          | generated always as (to_tsvector('english'::regconfig, translate(installed_file, '/'::text, ' '::text))) stored
 textsearchable_index_col3 | tsvector |           |          | generated always as (to_tsvector('english'::regconfig, translate(installed_file, '/.'::text, '  '::text))) stored
Indexes:
    "generate_plist_installed_file_gin_idx" gin (to_tsvector('english'::regconfig, installed_file))
    "generate_plist_installed_file_idx" btree (installed_file)
    "generate_plist_pk" btree (id)
    "generate_plist_port_id_idx" btree (port_id)
    "generate_plist_textsearch_idx" gin (textsearchable_index_col)
    "generate_plist_textsearch_idx2" gin (textsearchable_index_col2)
    "generate_plist_textsearch_idx3" gin (textsearchable_index_col3)

How much space does this take:



freshports.dev=# select count(*) from generate_plist;
  count  
---------
 8478513
(1 row)


freshports.dev=# SELECT pg_table_size('generate_plist');
 pg_table_size 
---------------
    3578167296
(1 row)

freshports.dev=# select pg_total_relation_size('generate_plist');
 pg_total_relation_size 
------------------------
             6730162176
(1 row)

That’s 8.5 million rows taking up 3.3 GB but with indexes it’s 6.3 GB.

Disk is cheap. Time is expensive. I say the cost is worth the performance.

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive

Leave a Comment

Scroll to Top