KillerSpaz @ ZendCon: Day 2 - Session 2

Presented by Derick Rethans (dr@ez.no)
Slides at: http://derickrethans.nl/talks.php

Before searching, you must index, which requires:

Finding documents to index (crawling)
Separating the documents into indexable units (tokenizing)
Massaging the found units (stemming)
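
To make the three steps concrete, here is a minimal sketch of a tiny inverted index. It is not from the talk: the function names, the sample documents, and the toy "strip a trailing s" stemmer are all assumptions made for illustration, and the "crawl" is just a loop over a dict.

```python
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase tokens on whitespace (the simplest tokenizer)."""
    return text.lower().split()

def stem(token):
    """Toy stand-in for a real stemmer: strips a trailing plural 's'."""
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def build_index(documents):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():  # "crawling": here we just walk a dict
        for token in tokenize(text):
            index[stem(token)].add(doc_id)
    return index

# Hypothetical sample documents.
documents = {
    1: "Red wines from France",
    2: "Airplane tickets to France",
}
index = build_index(documents)
print(sorted(index["france"]))  # [1, 2]
print(sorted(index["wine"]))    # [1]
```

Searching then becomes a dictionary lookup on the stemmed query term instead of a scan over every document.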

Crawling:

Domain specific: file system, CMS, Google, etc.
Should capture the different fields of a document: title, description, meta-tags, body.
The crawling strategy must be determined based on the domain.

Text:

Global, whitespace (explode on space), continuous letters (like whitespace, but also splits on special chars)
Define stop-words that won't be included (the, of, and, or, etc.)
Define synonyms (e.g., British vs. American words)
Normalize text (replace special chars with regular chars)
Japanese/Chinese texts are difficult, and require special tools to interpret.
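
A rough sketch of how these text steps could fit together, assuming a tiny stop-word set and a hypothetical synonym map (neither comes from the talk). It compares whitespace tokenizing with a "continuous letters" tokenizer, here taken to mean runs of letters and digits with everything else acting as a separator.

```python
import re
import unicodedata

STOP_WORDS = {"the", "of", "and", "or"}
# Hypothetical synonym map: British spelling -> American spelling.
SYNONYMS = {"colour": "color", "aeroplane": "airplane"}

def normalize(text):
    """Lowercase and replace accented characters with their plain equivalents."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize_whitespace(text):
    """Split on whitespace only; punctuation stays attached to the tokens."""
    return text.split()

def tokenize_letters(text):
    """Keep runs of letters/digits; every other character is a separator."""
    return re.findall(r"[^\W_]+", text)

def analyze(text, tokenizer=tokenize_letters):
    """Normalize, tokenize, drop stop-words, then map synonyms."""
    tokens = tokenizer(normalize(text))
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in tokens]

text = "The colour of the café: ISO/IEC 9075"
print(analyze(text, tokenize_whitespace))  # ['color', 'cafe:', 'iso/iec', '9075']
print(analyze(text))                       # ['color', 'cafe', 'iso', 'iec', '9075']
```

Note how the whitespace tokenizer leaves "cafe:" and "iso/iec" as single tokens, which would not match a search for "cafe" or "iso".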

Stemming:

Porter stemming
Language dependent
Many algorithms exist.
Ex: arrival -> arrive, skies -> sky
Alternatively, you can use soundex or metaphone.
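
As an illustration of the phonetic alternative, here is a sketch of the classic American Soundex rules. PHP ships soundex() and metaphone() as built-ins; this Python version is my own rendering of the algorithm, not code from the talk, and it is not a stemmer.

```python
# Consonant groups and their Soundex digits; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character Soundex code: first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]

print(soundex("Robert"))   # R163
print(soundex("Rupert"))   # R163
print(soundex("Tymczak"))  # T522
```

Words that sound alike get the same code, so indexing the code instead of the word lets "Robert" match "Rupert".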

Types of searching:

Words, phrases, boolean: airplane, "red wine", "wine -red"
Faceted (categorized) search: limit results by categories derived from the found results, usually an iterative process. A facet can be "document type", etc.
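
A small sketch of the faceting idea, assuming search results arrive as a list of dicts with a hypothetical "type" field: count the categories present in the current results, let the user pick one, then repeat on the narrowed set.

```python
from collections import Counter

# Hypothetical results of an earlier search for "wine".
results = [
    {"title": "Red wine basics", "type": "article"},
    {"title": "Wine cellar tour", "type": "video"},
    {"title": "Choosing a wine", "type": "article"},
]

def facet_counts(results, field):
    """Count how many results fall into each value of the given field."""
    return Counter(doc[field] for doc in results)

def narrow(results, field, value):
    """Keep only the results whose field matches the chosen facet value."""
    return [doc for doc in results if doc[field] == value]

counts = facet_counts(results, "type")
print(counts["article"], counts["video"])  # 2 1

# The user picks the "article" facet; the next round works on this subset.
articles = narrow(results, "type", "article")
print([doc["title"] for doc in articles])  # ['Red wine basics', 'Choosing a wine']
```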

MySQL FULLTEXT searching: uses MATCH() and AGAINST(). It also has a lot of limitations.
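
For reference, a sketch of what such a query might look like when built from Python. The table name "articles", the "id" column, and the column list are hypothetical, and the query assumes a FULLTEXT index exists on exactly the columns passed to MATCH(). The %s placeholders follow the parameter style used by the common Python MySQL drivers; nothing is executed here.

```python
def fulltext_query(columns, table):
    """Build a parameterized MySQL full-text query string.

    Column and table names are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    cols = ", ".join(columns)
    match = f"MATCH({cols}) AGAINST (%s)"
    return (
        f"SELECT id, {match} AS score FROM {table} "
        f"WHERE {match} ORDER BY score DESC"
    )

sql = fulltext_query(["title", "body"], "articles")
# The search phrase is passed twice, once per placeholder.
params = ("red wine", "red wine")
print(sql)
```

With a driver connection this would be run as cursor.execute(sql, params), which keeps the search phrase out of the SQL string itself.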

Apache Lucene:

Implemented in Java
Powerful query types
Ranked searching
Fielded searching
Proximity queries (search for words close to another word)
Zend Lucene is a port to PHP, but not as feature-rich (although growing).
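
To show what a proximity query checks, here is a simplified sketch over a list of token positions. In Lucene's query syntax a proximity search is written as a quoted phrase followed by ~N, e.g. "red wine"~3. The distance measure below (absolute difference in token positions) is my own simplification and is not how Lucene computes it internally.

```python
def positions(tokens, word):
    """Return every index at which word occurs in the token list."""
    return [i for i, t in enumerate(tokens) if t == word]

def within(tokens, first, second, max_distance):
    """True if the two words occur at most max_distance positions apart."""
    return any(
        abs(i - j) <= max_distance
        for i in positions(tokens, first)
        for j in positions(tokens, second)
    )

tokens = "a glass of red and sparkling wine".split()
print(within(tokens, "red", "wine", 3))    # True: positions 3 and 6
print(within(tokens, "glass", "wine", 3))  # False: positions 1 and 6
```

This only works if the index stores token positions, not just which documents contain a term.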

Keyword fields: not tokenized or stemmed
UnIndexed fields
Binary fields
Text fields: tokenized
UnStored fields: tokenized and indexed, but not stored

Apache Solr: Lucene access via a web service.

So, this session is actually rather boring... I was expecting more theory rather than discussing the tools that are mostly used. I'm interested in Lucene, sure, but in reality I'd like to know how search engines work... Surely Google doesn't use Lucene?

Oh well, onto the next session!

- Spaz

Tuesday, September 16, 2008

Day 2 - Session 2 - of Haystacks & Needles
