Handling a huge amount of fulltext searches part 2 – the internals

by Ochronus on February 7, 2009

In part one, I generally drafted the ingredients needed for the system, now it’s time to have a deeper look at the internal workings, processes and connections.

The system consists of two components, the so-called backend and the frontend. The frontend is a layer between the users and the data – in the traditional model-view-controller (MVC) paradigm, it covers both the controller and the view part. It receives user input (the searches), gets the data from the backend, processes it, and transforms the result to a form which is user-, or further processing friendly.
The backend does the heavy lifting (full text searching, caching). Now let’s see how they build up.

Backend

The first and most importand piece of software here is sphinx. Download and compile it (yes, go for the development versions), then comes configuration. Sphinx’s config is basically based on source-index pairs and some global stuff. You tell it what to index (the source part), this can be the resultset of a database query, text files, etc.. We want to index mysql, so let’s configure it accordingly:
(test case scenario: we have a huge database of books)

source books {
type = mysql
sql_host = localhost
sql_user = some_user
sql_pass = some_pass
sql_port = 3306 # optional, default is 3306
# unicode ftw! – we can specify multiple queries which are to be executed before the real data query
sql_query_pre = SET NAMES utf8
sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
sql_query_pre = SET CHARACTER_SET_SERVER=utf8
sql_query_pre = SET CHARACTER_SET_CLIENT=utf8
# now go for the main query – select all the fields we want to have indexed or as an attribute – id is the id returned by default
sql_query = SELECT Id, AuthorId, UNIX_TIMESTAMP(PubDate) as PubTs, Title, Excerpt, ToC FROM LotsaBooks
# now define the attributes
sql_attr_uint = AuthorId # reference to another table with authors
sql_attr_timestamp = PubTs
}

# now define the index, look at sphinx docs for info
index books {
docinfo = extern
mlock = 0
morphology = none
min_word_len = 2
charset_type = utf-8
html_strip = 1
path = /opt/local/sphinx/data/
}

.
.
.

# there’s more config to it, but that’s mostly default.

Now that we have sphinx configured, let’s index the data : run indexer –all, then fire up searchd.
The first item is checked on our list.

The next weapon is memcached. Install it, configure it (memory settings according to the estimated size of cached data), then fire it up. Now we need to come up with some strategy to map sphinx results to cacheable items. Sphinx returns result ids and attributes, in our case the id corresponds to the Id field in our table, so mapping is straightforward – the memchace key for each book entry has to contain the Id of the book in some way. The keys can be the id’s themselves, or if we look a bit more forward, let them be more distinguishable: call them BOOK-$id, like BOOK-2344. For each book we want to return it’s author, it’s title, the date of publication, the table of contents and a small excerpt, thus we need to cache these. Do we? Not all of them. We already have the AuthorId, and the timestamp of the publication date as sphinx attributes. What if we cache the authors’ names in memcache, not for every book? We save much memory by doing so. So now we need another array of memcache keys: AUTHOR-123 and alike. This way the keys are fully distinct and can be generated by knowing what we want to get from memcache (book or author, plus an id). We now have the concept, let’s write a small script which loads data from our tables to memcache. It’s easy, quick and dirty. I don’t feel the need to specify such a script, but for the curious: it involves two iterations: the first over the result of a SELECT …. FROM Books, generating hashes and storing them to memcache, the second doing the same on the resultset of a SELECT … FROM Authors. One small tip, which can also save much memory: memcache apis store data serialized – which means that the keys of our hashes are to be counted as data to, so the length of the keys’ names matter! Therefore try to come up with small key names, like ‘i’ for an id, ‘n’ for a name, ‘e’ for excerpt, etc.. Also look very carefully not to include data redundantly – remember, we already get the ids (bookid and authorid) and the publication date from sphinx for free!
Actually we’re done on the backend. We have memcache and sphinx searchd filled with data and running.

Frontend

Now comes the part with the most coding.
In part #3 :)

Previous post:

Next post: