Tuesday, March 3, 2009

Highlighting features with latest Lucene Domain Index 2.4.0.1.0 - Part I - lhighlight() ancillary operator

One of the most requested features for Lucene Domain Index was highlighting. The Lucene core distribution has provided highlighting for a long time, but the question was how to integrate it easily into the SQL syntax.
I decided to implement two kinds of highlighting in this release (2.4.0.1.0), which is a maintenance release for Lucene 2.4.0 core.
The first is the lhighlight(NUMBER):VARCHAR2 ancillary operator. Ancillary operators are functions bound to the current lcontains() execution; the two are connected through the correlation ID, which is the NUMBER argument of lhighlight() and the last argument of lcontains(). For example:
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ lhighlight(1) txt,lscore(1) sc,subject
2 FROM emails where lcontains(bodytext,'security OR mysql','subject:ASC',1)>0;

The above query returns rows with a VARCHAR2 value containing the matching text from the bodytext column, with each hit wrapped in a tag such as <B>security</B>; the remaining columns are the score and the subject of the email.
There are two important points here. First, the highlighted text is returned as a VARCHAR2, which means its length will be less than 32K. This is not a big issue, because the highlighted text is usually a small part of the whole text, shown to users to give them more information for manual disambiguation. Second, ancillary operators receive the text to highlight from the RDBMS engine, and not all SQL types are supported; the current implementation is built for VARCHAR2, CLOB and XMLType columns.
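To make the idea of a short highlighted fragment more concrete, here is a minimal Java sketch. It is not the code used by Lucene Domain Index or by the Lucene Highlighter; the class HighlightSketch and its methods highlight() and fragment() are hypothetical names invented for illustration, and the matching is a plain case-insensitive word comparison with no analyzer (so it does no stemming or accent folding).

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not part of Lucene or Lucene Domain Index.
final class HighlightSketch {
    // A word is a run of Unicode letters or digits; anything else is copied through untouched.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Wraps every word whose lower-cased form is in terms with <B>...</B>. */
    static String highlight(String text, Set<String> terms) {
        StringBuilder out = new StringBuilder();
        Matcher m = WORD.matcher(text);
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());          // separators before this word
            String word = m.group();
            if (terms.contains(word.toLowerCase(Locale.ROOT))) {
                out.append("<B>").append(word).append("</B>");
            } else {
                out.append(word);
            }
            last = m.end();
        }
        out.append(text, last, text.length());          // trailing separators
        return out.toString();
    }

    /**
     * Returns the highlighted text around the first matching term, keeping
     * at most context characters on each side, or "" when nothing matches.
     */
    static String fragment(String text, Set<String> terms, int context) {
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            if (terms.contains(m.group().toLowerCase(Locale.ROOT))) {
                int from = Math.max(0, m.start() - context);
                int to = Math.min(text.length(), m.end() + context);
                String prefix = from > 0 ? "..." : "";
                String suffix = to < text.length() ? "..." : "";
                return prefix + highlight(text.substring(from, to), terms) + suffix;
            }
        }
        return "";
    }
}
```

Calling HighlightSketch.fragment(body, Set.of("security", "mysql"), 40) on an email body would give a string that is much shorter than the body itself, which is why a VARCHAR2 return type is usually enough.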
Unlike other Lucene highlighting implementations, the columns to be highlighted are not stored in the Lucene index structure, because the RDBMS engine automatically loads the content to highlight and passes it to the ancillary operator implementation.
The limitation of the lhighlight() function is that it only works with the master column of the index, whereas Lucene Domain Index can index multiple columns at once. For example, for an index created with the Spanish Wikipedia dump:
create index pages_lidx_all on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('PopulateIndex:false;DefaultColumn:text;SyncMode:Deferred;LogLevel:INFO;
Analyzer:org.apache.lucene.analysis.SpanishWikipediaAnalyzer;
ExtraCols:extractValue(object_value,''/page/title'') "title",
extractValue(object_value,''/page/revision/comment'') "comment",
extract(object_value,''/page/revision/text/text()'') "text",
extractValue(object_value,''/page/revision/timestamp'') "revisionDate";
FormatCols:revisionDate(day);IncludeMasterColumn:false;
LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');
In the above index definition the master column is not indexed as is; only the virtual columns title, comment, text and revisionDate are indexed. Highlighting, however, is still evaluated against the whole row, which is of type XMLType, so an example query looks like:
SQL> select /*+ DOMAIN_INDEX_SORT */ lhighlight(1),extractValue(object_value,'/page/title') from pages where lcontains(object_value,'rownum:[1 TO 10] AND (musica tango rock)',1)>0;

<page xmlns="http://www.mediawiki.org/xml/export-0.3/" >
<title> <B>Música</B> de Argentina... [[Latinoamérica|latinoamericanos]] con más desarrollo en su [[<B>música</B>]].

Se encuentra una gran... argentinos, un instrumento tradicional andino]]
Aún se mantiene la <B>música</B> de los [[Indígenas_en_Argentina... de grandes corrientes de [[inmigración|inmigrantes]] europeos, la <B>música</B> argentina se enriqueció
Música de Argentina
musical emparentado con la [[habanera]] y el [[<B>tango</B> (<B>música</B>)|<B>tango</B>]].

==Diferencias con el <B>tango</B>==

Aunque tanto la milonga como el <B>tango</B> están en [[compás]] de 2/4, las 8 [[semicorchea]]s de la milonga están distribuidas en 3 + 3 + 2 en cambio el <B>tango</B> posee un ritmo más «cuadrado». Las letras...]] criticó en algún momento el <B>tango</B> y prefirió la milonga, que no trasmite la melancolía
Milonga (género musical)
The lines Música de Argentina and Milonga (género musical) are the title column returned with each row; they are set in italics only to distinguish them from the highlighted text.
In the next post I'll talk about highlighting using pipelined table functions implemented in Java, a big deal because there is not much information available and no native support for ANYDataSet results. Stay tuned.