Handling a huge amount of fulltext searches
How do you handle a massive number of fulltext searches? MySQL? Been there, done that. It’s a no-go for average servers. PostgreSQL with Tsearch2?
Here is a nice solution cooked up from Ruby, Thin, memcached and Sphinx.
PostgreSQL: I can’t say anything about it, since I haven’t tried it on live servers yet, though it’s praised a lot.
Let’s forget about databases for a moment, though. Do we really need them for a simple, non-combined fulltext search-data retrieval cycle? Of course not.
Index the data (which in this case comes from a database, whether from a simple table or the result of a complex multi-join query), store the index, query the standalone search daemon, cache the data for the results, serve it all through a high-throughput (or should I say low-overhead?) HTTP server, and voilà, it’s done. So much for the recipe. Now let’s look at the components more closely.
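To make the recipe concrete, here is a minimal sketch in plain Ruby, with in-memory hashes standing in for Sphinx and memcached. Every name in it (ROWS, build_index, build_cache, search) is made up for illustration; it only shows the shape of the index, cache and lookup cycle and is not the real implementation.

```ruby
# Stand-ins for the real components; every name here is hypothetical.
ROWS = {
  1 => { title: "Thin is fast",     body: "a small ruby web server" },
  2 => { title: "Sphinx indexing",  body: "fulltext search daemon" },
  3 => { title: "Memcached basics", body: "a distributed cache server" }
}

# "Indexing": map every word to the ids of the rows that contain it.
def build_index(rows)
  index = Hash.new { |hash, word| hash[word] = [] }
  rows.each do |id, row|
    words = "#{row[:title]} #{row[:body]}".downcase.scan(/\w+/).uniq
    words.each { |word| index[word] << id }
  end
  index
end

# "Caching": mirror the rows under the same ids the index returns.
def build_cache(rows)
  rows.each_with_object({}) { |(id, row), cache| cache[id.to_s] = row }
end

# The request path: ask the index for ids, then read the rows from the cache.
# The database is not touched here at all.
def search(index, cache, word)
  ids = index.fetch(word.downcase, [])
  ids.map { |id| cache[id.to_s] }.compact
end

INDEX = build_index(ROWS)
CACHE = build_cache(ROWS)
```

Calling `search(INDEX, CACHE, "server")` returns the two rows whose text contains "server" (ids 1 and 3), and an unknown word returns an empty array.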
First ingredient: MySQL/PostgreSQL/some other data source. That’s easy. We already have our precious and massive amount of data there.
Second ingredient: Sphinx. Get it here. Yes, I know, it’s new, it’s fresh, it’s immature. You’re free to use any well-tested, mature indexer. Sphinx worked fine for me, though. And it’s fast. Very fast. Tell it which data to index, choose the primary key it should return with the results, choose your attributes, then run the indexing. When it’s done, launch searchd.
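The point of choosing a primary key and attributes is that a search hit carries only the document id plus those attributes, not the text itself. The following is a toy model of that idea in plain Ruby; Hit, HITS and filter_hits are hypothetical names and not part of any Sphinx API, and the attribute names are invented.

```ruby
# A toy model of a search hit: the document id plus its attributes, no text.
# Hit, HITS and filter_hits are hypothetical names, not a Sphinx API.
Hit = Struct.new(:id, :attrs)

HITS = [
  Hit.new(10, { category_id: 1, created_at: 1_200_000_000 }),
  Hit.new(11, { category_id: 2, created_at: 1_210_000_000 }),
  Hit.new(12, { category_id: 1, created_at: 1_220_000_000 })
]

# Keep the hits whose attributes match every given filter, newest first.
def filter_hits(hits, filters)
  matching = hits.select do |hit|
    filters.all? { |name, value| hit.attrs[name] == value }
  end
  matching.sort_by { |hit| -hit.attrs[:created_at] }
end

# Only the ids survive; they become the keys for the data lookup.
ids = filter_hits(HITS, { category_id: 1 }).map(&:id)
```

Here `ids` ends up as `[12, 10]`: filtering and sorting happen on attributes alone, and the ids are all that is needed to fetch the actual rows afterwards.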
Third ingredient: Memcached. Our old friend can be found here. Everybody knows memcached. Everybody loves memcached. If you don’t know it yet, don’t waste your time with my blog post, go and get to know it. It can become your very best friend in no time. Now use it as a mirror of our indexed data – store the data by the same key sphinx returns for its search hits.
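Mirroring the data under the same key could look roughly like the sketch below. FakeCache is a hypothetical in-memory stand-in with a memcached-like set/get surface, and the "doc:" key prefix and the JSON serialization are assumptions made for this example, not something the setup requires.

```ruby
require "json"

# FakeCache is a hypothetical in-memory stand-in with a memcached-like
# set/get surface; a real client would talk to the memcached daemon instead.
class FakeCache
  def initialize
    @store = {}
  end

  def set(key, value)
    @store[key] = value
  end

  def get(key)
    @store[key]
  end
end

# Build the cache key from the id the search daemon returns for a hit.
# The "doc:" prefix is an arbitrary choice for this sketch.
def cache_key(id)
  "doc:#{id}"
end

# Mirror every row into the cache as a JSON string, keyed by its id.
def warm_cache(cache, rows)
  rows.each { |id, row| cache.set(cache_key(id), JSON.generate(row)) }
end

cache = FakeCache.new
warm_cache(cache, { 42 => { "title" => "Thin is fast" } })
row = JSON.parse(cache.get(cache_key(42)))
```

Because the key is derived from nothing but the id, a search hit for document 42 can be turned into a cache lookup without any extra mapping.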
Fourth ingredient: Ah, at last, the super-fast webserver. If you’ve seen my blog before, you have probably already guessed what’s coming: Thin. I won’t say much about it now, read my older post or use ye olde faithful Google. This is good. Thin is really simple. Thin is fast. Thin scales well across multiple cores, processors and servers. Thin won’t be the bottleneck of your software architecture. Thin can be the nexus which connects your users’ searches, Sphinx and memcached.
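Thin serves Rack applications, so the "nexus" can be as small as one object that responds to call(env). The sketch below assumes that shape; SEARCH and DOCS are hypothetical hashes standing in for the search daemon and memcached, and the `q` parameter name and JSON response are choices made for this example.

```ruby
require "json"
require "uri"

# Hypothetical stand-ins: SEARCH maps a query word to document ids (the search
# daemon's job), DOCS maps ids to cached rows (memcached's job).
SEARCH = { "ruby" => [1, 2] }
DOCS   = { 1 => { "title" => "Thin" }, 2 => { "title" => "Rack" } }

# A Rack application is any object that responds to call(env) and returns
# [status, headers, body]; Thin can serve such an object.
APP = lambda do |env|
  params = URI.decode_www_form(env["QUERY_STRING"].to_s).to_h
  ids    = SEARCH.fetch(params["q"].to_s.downcase, [])
  rows   = ids.map { |id| DOCS[id] }.compact
  [200, { "content-type" => "application/json" }, [JSON.generate(rows)]]
end

# Exercising the app directly, without a server in front of it.
status, _headers, body = APP.call({ "QUERY_STRING" => "q=ruby" })
```

Since the app is a plain object, it can be tested by calling it with a hash, as in the last line; to put it behind Thin it would go into a rackup file with `run APP`.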
Spice and stuff: a good Ruby Sphinx API – try this. You’ll also need a decent memcached API in Ruby – go for the creators’ recommendation, and use the new libmemcached client from here. Why use this instead of other APIs? Because it’s much faster (C-based), fully supported and recommended by the creators of memcached, and it supports multiple GETs! Multiple GETs make the whole software architecture faster because they eliminate the per-request overhead of issuing a separate GET for every key, as the classic approach does. Gather your keys in an array, GET the data, iterate through the resulting array, and there you go!
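The saving from multiple GETs is easiest to see by counting round trips. The sketch below uses CountingCache, a hypothetical in-memory stand-in; its get and get_multi methods are invented for this illustration and are not claimed to match the libmemcached client's actual method names.

```ruby
# CountingCache is a hypothetical stand-in that counts round trips, so the
# difference between many single GETs and one multi-GET is visible.
class CountingCache
  attr_reader :round_trips

  def initialize(data)
    @data = data
    @round_trips = 0
  end

  # One key, one round trip.
  def get(key)
    @round_trips += 1
    @data[key]
  end

  # Many keys, still one round trip; missing keys are left out of the result.
  def get_multi(keys)
    @round_trips += 1
    keys.each_with_object({}) do |key, found|
      found[key] = @data[key] if @data.key?(key)
    end
  end
end

DATA = { "1" => "first", "2" => "second", "3" => "third" }
KEYS = %w[1 2 3 4]

# Classic approach: one request per key.
one_by_one = CountingCache.new(DATA)
KEYS.each { |key| one_by_one.get(key) }

# Multi-GET: gather the keys in an array and fetch them in one request.
batched = CountingCache.new(DATA)
results = batched.get_multi(KEYS)
```

For these four keys the classic loop makes four round trips and the batched call makes one; `results` holds the three keys that exist and simply omits the missing one.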
Check back later for some implementation details ;)
3 comments


Generally speaking, the appropriate answer for something like this is a compiled language with efficient machine-near data structures, such as C++ or Forth, since then you can just multihash the words and keep a large bucket histogram. If you’re trying to handle a “huge” amount of searches, you shouldn’t be hobbling yourself by working with general-case tools.
You’re likely to see two orders of magnitude improvement with a well chosen string hash and a non-generalized hash table. The C++ unordered_multimap in Boost is almost certainly what you’re looking for.
@John Haugeland
Generally speaking, you’re absolutely right, but consider development time. In about two hours I managed to put together a system which handles around 120 req/s with a concurrency of 50 on low-grade hardware. Of course I’m not talking about Facebook- or Twitter-sized load :)
Try SOLR.