Tuesday, September 16, 2008

Day 2 - Session 2 - Of Haystacks & Needles

Presented by Derick Rethans (dr@ez.no)
Slides at: http://derickrethans.nl/talks.php


Before you can search, you must build an index, which requires:
  • Finding documents to index (crawling)
  • Separating the docs into indexable units (tokenizing)
  • Massaging the found units (stemming)
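The three steps above can be sketched in a few lines of Python. This is only a rough illustration, not anything shown in the talk: the in-memory `docs` dict stands in for a real crawler, the stop-word list and synonym map are made up, the tokenizer is one reading of "continuous letters", and `stem()` is a toy suffix stripper rather than the Porter algorithm.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical stop-word list and synonym map, for illustration only.
STOP_WORDS = {"the", "of", "and", "or", "a", "in"}
SYNONYMS = {"colour": "color", "aeroplane": "airplane"}


def normalize(text):
    """Lowercase and strip accents (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Keep runs of letters and digits; everything else is a separator."""
    return re.findall(r"[a-z0-9]+", normalize(text))


def stem(word):
    """Toy suffix stripper, NOT the Porter algorithm."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def index_terms(text):
    """Tokenize, drop stop words, stem, then map synonyms."""
    terms = []
    for token in tokenize(text):
        if token in STOP_WORDS:
            continue
        stemmed = stem(token)
        terms.append(SYNONYMS.get(stemmed, stemmed))
    return terms


def build_index(docs):
    """Inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index


# Stand-in for the crawl step: two tiny "documents".
docs = {
    1: "The colour of the skies",
    2: "Aeroplanes and cafés",
}
index = build_index(docs)
print(sorted(index))
```

Looking up a term is then just `index["sky"]`, which gives the ids of the documents that contain it after stemming.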
Crawling:
  • Domain specific: file system, CMS, Google, etc.
  • Should capture the different fields of a document: title, description, meta-tags, body.
  • Crawling strategy must be determined based on domain.
  • Text:
    • Tokenizing strategies: global, whitespace (explode on space), continuous letters (like whitespace, but includes special chars)
    • Define stop-words that won't be indexed (the, of, and, or, etc.)
    • Define synonyms (e.g., British vs. American words)
    • Normalize text (replace special chars with regular chars)
    • Japanese/Chinese texts are difficult, and require special tools to interpret.
  • Stemming:
    • Porter stemming
    • Language dependent
    • Many algorithms exist.
    • ex: arrival -> arrive, skies -> sky
    • Alternatively can use soundex or metaphone
  • Types of searching:
    • Words, phrases, boolean: airplane, "red wine", wine -red
    • Faceted (categorized) search: limit results by categories derived from the found results; usually an iterative process. A facet can be "document type", among others.
  • MySQL full-text searching: use MATCH() ... AGAINST(). It has a lot of limitations, though.
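To make the search types above a bit more concrete, here is a small Python sketch over a positional index. The documents, the "document type" facet values, and all function names are invented for the example; a real engine would also tokenize and stem the query the same way it did the documents.

```python
from collections import Counter

# Hypothetical documents: id -> (document type, text).
DOCS = {
    1: ("article", "red wine from france"),
    2: ("article", "white wine and red meat"),
    3: ("recipe", "wine sauce"),
    4: ("recipe", "airplane food"),
}


def build_positions(docs):
    """Positional index: term -> {doc_id: [word positions]}."""
    index = {}
    for doc_id, (_doc_type, text) in docs.items():
        for pos, term in enumerate(text.split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index


INDEX = build_positions(DOCS)


def search_word(term):
    """Ids of all documents containing the term."""
    return set(INDEX.get(term, {}))


def search_phrase(phrase):
    """Ids of documents where the words appear adjacent and in order."""
    words = phrase.split()
    if not words:
        return set()
    hits = set()
    for doc_id, starts in INDEX.get(words[0], {}).items():
        for start in starts:
            if all(start + i in INDEX.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words)):
                hits.add(doc_id)
    return hits


def search_without(include, exclude):
    """Boolean query such as 'wine -red'."""
    return search_word(include) - search_word(exclude)


def facet_counts(doc_ids):
    """Count hits per document type, to offer as narrowing options."""
    return Counter(DOCS[d][0] for d in doc_ids)


print(search_word("airplane"))
print(search_phrase("red wine"))
print(search_without("wine", "red"))
print(facet_counts(search_word("wine")))
```

The facet counts are what makes the process iterative: the user picks one of the offered categories, the result set shrinks, and the counts are recomputed for the smaller set.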
Apache Lucene:
  • Implemented in Java
  • Powerful query types
  • Ranked searching
  • Fielded searching
  • Proximity queries (search for words close to another word)
  • Zend_Search_Lucene is a port to PHP, but not as feature rich (although growing). Its field types:
    • Keyword: not tokenized or stemmed
    • UnIndexed: stored, but not indexed
    • Binary: stored binary data, not indexed
    • Text: tokenized, indexed, and stored
    • UnStored: tokenized and indexed, but not stored
Apache Solr: Lucene access via a web service.
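As a rough idea of what "Lucene via a web service" looks like from the client side, the sketch below only builds the request URLs for a few Lucene-syntax queries (fielded, proximity, boolean) and does not send them. The host, port, and `/solr/select` path are assumptions about a default local install and will differ per setup, and `solr_url` is a made-up helper name.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust to your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def solr_url(query, rows=10):
    """Build a GET URL that passes a Lucene-syntax query as 'q'."""
    return SOLR_SELECT + "?" + urlencode({"q": query, "rows": rows})


# Fielded phrase query: the phrase must occur in the title field.
print(solr_url('title:"red wine"'))
# Proximity query: both words within 5 positions of each other.
print(solr_url('"red wine"~5'))
# Boolean query: must not contain "red".
print(solr_url("wine -red"))
```

Fetching one of these URLs (for example with `urllib.request.urlopen`) would return the ranked results from the Solr server, assuming one is running at that address.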


So, this session is actually rather boring... I was expecting more theory, rather than a rundown of the most commonly used tools. I'm interested in Lucene, sure, but in reality I'd like to know how search engines work... Surely Google doesn't use Lucene?

Oh well, onto the next session!

- Spaz
