Postgres full text search with Django

NC Jazz Beat is a web application I created recently. It provides information about live jazz in Central North Carolina, mostly regarding upcoming performances. It has short blurbs on venues and some of the musicians and their recent recordings. It is implemented in Python 3 and Django and Django CMS, along with various other Python packages and Django apps.

I’ve worked with Django CMS before on web applications of different complexity, with widely varying amounts of content. Some of my experiences have lead to concerns about some key aspects of creating non-trivial content in CMS plugins:

  • Leaking of site design considerations into the content creation flow
  • Storing valuable content together with how it is displayed on the site

With NC Jazz Beat I’ve experimented with the use of an Article model which exists separate from the CMS. Use of Article minimizes the amount of data which has its canonical version stored in Django CMS plugins. Articles come in various flavors, including events and record albums and other common article types; CMS plugins are provided to place entire articles or article teasers on a page, as well as to create feeds such as upcoming events or recently updated articles. Articles also have a traditional Django view, such that all active articles have a URL on the site whether or not they are placed on a CMS page using one of the plugins. Many of the individual articles for NC Jazz Beat are created programmatically from importing data either from a spreadsheet or from other resources. This mode of content creation, the use of Article, is suitable for a variety of web applications.

Because of the heavy reliance on the Article model, it is currently feasible to implement site search as a search of the Article model. (Very little visible data exists in other models, such as those for the Text and other common CMS plugins.) Previously I’ve used Haystack in conjunction with Solr or Elasticsearch, but for this simple scenario I wanted to experiment with Postgres full text search and the Django ORM interfaces to it.

The first working version of my search view was simply:

class ArticleSearchResultView(ListView):
    model = Article
    template_name = 'articles/article_search_results.html'
    context_object_name = 'articles'
    paginate_by = 5

    def get_queryset(self):
        qs = Article.objects.filter(visible=True)

        try:
            search_string = self.request.GET['q']
            qs = qs.annotate(
                search=(
                    SearchVector('title') +
                    SearchVector('content') +
                    SearchVector('location') +
                    SearchVector('byline')
                ),
            ).filter(search=search_string)
        except KeyError:
            return Article.objects.none()

        return qs.order_by('starts_at')

(Note that django.contrib.postgres must be added to INSTALLED_APPS in order to use Postgres full text search via the Django ORM.)

Because most articles on this site are events, and past events are automatically made invisible shortly after the event, the ordering by event start time above provides a reasonable “ranking” of matches. (Effort around using proper search “ranking” may be needed in the future.)

This initial search implementation was effective in most cases. However, one of the important jazz venues in the area is Beyù Caffè in Durham, North Carolina. Note the accented characters in the name of the venue, which I try to consistently replicate when creating articles that pertain to Beyù. When the web application user searched for “beyu”, none of these events were found by the initial search implementation. (Even here in the American South, “Beyù” wasn’t the only useful search target with accented characters; another is Baron Tymas’ album Montréal.)

Here is an interactive session showing the lack of search hits, for contrast with a later version:

In [1]: from django.contrib.postgres.search import SearchVector

In [2]: from articles.models import Article

In [3]: search_string = 'beyu'

In [4]: qs = qs.annotate(
   ...:   search=(
   ...:     SearchVector('title') +
   ...:     SearchVector('content') +
   ...:     SearchVector('location') +
   ...:     SearchVector('byline')
   ...:   ),
   ...: ).filter(search=search_string)

In [5]: qs
Out[5]: <QuerySet []>

The solution is to create a Postgres search configuration which ignores accents, and specify that search configuration on the query. Here are example commands to create an appropriate Postgres search configuration named english_unaccent:

$ sudo -u postgres psql my_project_db
psql (9.5.10)
Type "help" for help.

my_project_db=# create extension if not exists unaccent;
CREATE EXTENSION
my_project_db=# create text search configuration english_unaccent(copy=english);
CREATE TEXT SEARCH CONFIGURATION
my_project_db=# alter text search configuration english_unaccent alter mapping for hword, hword_part, word with unaccent, english_stem;
ALTER TEXT SEARCH CONFIGURATION
my_project_db=#

(Creating the extension requires Postgres SUPERUSER privileges, which my usual db role does not have; thus these commands were run as the system user postgres.)

Specifying the new search configuration on the query:

In [1]: from django.contrib.postgres.search import SearchQuery, SearchVector

In [2]: from articles.models import Article

In [3]: search_string = 'beyu'

In [4]: qs = Article.objects.filter(visible=True).annotate(
   ...:   search=(
   ...:     SearchVector('title', config='english_unaccent') +
   ...:     SearchVector('content', config='english_unaccent') +
   ...:     SearchVector('location', config='english_unaccent') +
   ...:     SearchVector('byline', config='english_unaccent')
   ...:   ),
   ...: ).filter(search=SearchQuery(search_string, config='english_unaccent'))

In [5]: qs
Out[5]: <QuerySet [<Article: BEYÙ CAFFÈ (venue)>, <Article: Brett's OpenMic w/The Usual Suspects (Event at Jan 03 2018, 08:00 PM)>]>

Are we done? Not quite; there are a couple of further considerations. As the Article and its search view are part of a reusable Django app, I need to control the use of a particular search configuration, if any, from the consuming project. Also, the extra database configuration needs to be handled during the deploy of the project.

A Django setting is used to optionally specify the name of a search configuration. Here is the setting used in the NC Jazz Beat project, corresponding to the code shown previously:

ARTICLE_SEARCH_SETTINGS = {
   'config': 'english_unaccent',
}

The new code for the search view is:

class ArticleSearchResultView(ListView):
   model = Article
   template_name = 'articles/article_search_results.html'
   context_object_name = 'articles'
   paginate_by = 5

   def __init__(self, *args, **kwargs):
       super().__init__(*args, **kwargs)
       self.config_kwargs = {}
       search_settings = getattr(settings, 'ARTICLE_SEARCH_SETTINGS', {})
       if 'config' in search_settings:
           self.config_kwargs['config'] = search_settings['config']

   def get_queryset(self):
       qs = Article.objects.filter(visible=True)

       try:
           search_string = self.request.GET['q']
           qs = qs.annotate(
               search=(
                   SearchVector('title', **self.config_kwargs) +
                   SearchVector('content', **self.config_kwargs) +
                   SearchVector('location', **self.config_kwargs) +
                   SearchVector('byline', **self.config_kwargs)
               ),
           ).filter(search=SearchQuery(search_string, **self.config_kwargs))
       except KeyError:
           return Article.objects.none()

       return qs.order_by('starts_at')

In the project’s Ansible deploy script, the extension is set up using Ansible’s postgresql_ext module. Next, this script is run via psql to set up the search configuration, if it does not already exist:

BEGIN;
   DO $$ BEGIN
       IF NOT EXISTS (SELECT cfgname FROM pg_ts_config WHERE cfgname = 'english_unaccent') THEN
           CREATE TEXT SEARCH CONFIGURATION english_unaccent(COPY=english);
           ALTER TEXT SEARCH CONFIGURATION english_unaccent ALTER MAPPING FOR hword, hword_part, word WITH unaccent, english_stem;
       END IF;
   END $$;
COMMIT;

A remaining glitch was found in my scripting to load the production database into my development environment. The script runs without Postgres SUPERUSER privileges, so the unescape extension was not created when the database was created. Thus, this command was added to enable the extension after the database was created from the production database dump:

$ sudo -u postgres psql jazzbeat -c "create extension if not exists unaccent;"

Are we done now? Yes, for the time being. Depending on how content is created in the future, I may yet move to a normal Haystack-driven search with Solr or Elasticsearch.

References: