Friday, March 27, 2009

Doing facets with latest Lucene Domain Index 2.4.1.1.0

As usual, I am using the announce of new release of Lucene Domain Index to comments some internal detail of Oracle ODCI API.
Today, I will show how to use and works Lucene Domain Index facets.
Latest release implements an aggregate function lfacets() which performs bit set and grouping with Lucene. I am used this article as a jump start for doing facets.
Also I found another post on how to implement an aggregated function using ODCI API, but as with the pipeline table function implementation there is no documentation nor examples on how to do it with Java inside the database.
I am using same approach as the pipeline table function, a mix of PLSQL code plus an internal implementation using Java.
As you can see on the code there is an object type named facets_agg_type which is the implementation of lfacets() aggregated function, under the hood it creates the structure of the return type and call to the Java class which is doing a bit-set-AND operations on Lucene filters.
As an example let me show how to do facets with WikiPedia Spanish dump uploaded to Oracle XMLDB. I can query PAGES table with lcontains() operator as:
select /*+ DOMAIN_INDEX_SORT */ extractValue(object_value,'/page/title')
from pages where
lcontains(object_value,
'rownum:[20 TO 30] AND musica AND revisionDate:[20070101 TO 20070718]')>0;
But faceting is for doing grouping similar to
so using some categories already defined into WikiPedia Spanish web site.
I can create a simple table with the criteria for categorizing WikiPedia corpus.
Here a table with parent relation ship to organize the categories:
create table wiki_categories (
cat_code number(4),
cat_name varchar2(256),
cat_parent number(4),
CONSTRAINT PK_WIKI_CATEGORIES PRIMARY KEY (cat_code),
CONSTRAINT FK_CAT_PARENT FOREIGN KEY (cat_parent)
REFERENCES wiki_categories (cat_code)
);
/

insert into wiki_categories values (1,'Ciencias naturales y formales',null);
insert into wiki_categories values (2,'Ciencias humanas y sociales',null);
insert into wiki_categories values (3,'Artes',null);
insert into wiki_categories values (4,'Sociedad',null);
insert into wiki_categories values (5,'Tecnologías',null);

insert into wiki_categories values (6,'Astronomía',1);
insert into wiki_categories values (7,'Biología',1);
....
insert into wiki_categories values (38,'Exploración espacial',5);
insert into wiki_categories values (39,'Biotecnología',5);
insert into wiki_categories values (40,'Informática',5);
insert into wiki_categories values (41,'Ingeniería',5);
insert into wiki_categories values (42,'Medicina',5);
So with these categories I can use lfacets() aggregated function as:
select ljoin(lfacets('PAGES_LIDX_ALL,'||
case level when 1 then 'text:('|| cat_name||')'
ELSE 'text:('|| PRIOR cat_name||'),text:('|| cat_name||')'
END
)), cat_code,level
FROM wiki_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_code,level;
or:
select ljoin(lfacets('PAGES_LIDX_ALL,'||
case level when 1 then 'text:('|| cat_name||')'
ELSE 'text:('|| PRIOR cat_name||'),text:('|| cat_name||')'
END
)), cat_parent
FROM wiki_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_parent;
getting a result for the second case as:
text:(Ciencias naturales y formales),text:(Astronomía)(1280),text:(Ciencias naturales y formales),text:(Biología)(5144),text:(Ciencias naturales y formales),text:(Matemática)(4729),text:(Ciencias naturales y formales),text:(Física)(8708),text:(Ciencias naturales y formales),text:(Química)(4745),text:(Ciencias naturales y formales),text:(Botánica)(3697)

text:(Ciencias humanas y sociales),text:(Comunicación)(7319),text:(Ciencias humanas y sociales),text:(Filosofía)(6393),text:(Ciencias humanas y sociales),text:(Sociología)(2004),text:(Ciencias humanas y sociales),text:(Religión)(6429),text:(Ciencias humanas y sociales),text:(Psicología)(2508),text:(Ciencias humanas y sociales),text:(Política)(27356),text:(Ciencias humanas y sociales),text:(Mitología)(2672),text:(Ciencias humanas y sociales),text:(Lingüística)(2611),text:(Ciencias humanas y sociales),text:(Mitología)(2672),text:(Ciencias humanas y sociales),text:(Historia)(35727),text:(Ciencias humanas y sociales),text:(Egiptología)(91),text:(Ciencias humanas y sociales),text:(Derecho)(18586),text:(Ciencias humanas y sociales),text:(Economía)(9755)

text:(Artes),text:(Anime y manga)(1604),text:(Artes),text:(Arquitectura)(1981),text:(Artes),text:(Literatura)(2154),text:(Artes),text:(Teatro)(2818),text:(Artes),text:(Televisión)(1861),text:(Artes),text:(Pintura)(3323),text:(Artes),text:(Danza)(876),text:(Artes),text:(Música)(3643),text:(Artes),text:(Cine)(2235)
text:(Sociedad),text:(Gastronomía)(443),text:(Sociedad),text:(Videojuegos)(360),text:(Sociedad),text:(LGBT)(267),text:(Sociedad),text:(Deporte)(3647)

text:(Tecnologías),text:(Exploración espacial)(1375),text:(Tecnologías),text:(Biotecnología)(232),text:(Tecnologías),text:(Ingeniería)(1654),text:(Tecnologías),text:(Informática)(2017),text:(Tecnologías),text:(Medicina)(1085)

text:(Ciencias naturales y formales)(62689),text:(Tecnologías)(13772),text:(Ciencias humanas y sociales)(80857),text:(Sociedad)(25372),text:(Artes)(14515)


Note that I am using ljoin function which converts agg_tbl type (TABLE OF) to a coma separated string.
The computation of the whole WikiPedia dump, around 1 million of documents in my notebook took:
INFO: key '96' main category 'text:(Ciencias naturales y formales)'
INFO: sub category 'text:(Astronomía)'
INFO: sub category 'text:(Biología)'
INFO: sub category 'text:(Matemática)'
INFO: sub category 'text:(Física)'
INFO: sub category 'text:(Química)'
INFO: sub category 'text:(Botánica)'
INFO: Terminating category: 'text:(Ciencias naturales y formales)' Elapsed time: 58 millisecond.
INFO: key '97' main category 'text:(Ciencias humanas y sociales)'
INFO: sub category 'text:(Comunicación)'
INFO: sub category 'text:(Filosofía)'
INFO: sub category 'text:(Sociología)'
INFO: sub category 'text:(Religión)'
INFO: sub category 'text:(Psicología)'
INFO: sub category 'text:(Política)'
INFO: sub category 'text:(Mitología)'
INFO: sub category 'text:(Lingüística)'
INFO: sub category 'text:(Mitología)'
INFO: sub category 'text:(Historia)'
INFO: sub category 'text:(Egiptología)'
INFO: sub category 'text:(Derecho)'
INFO: sub category 'text:(Economía)'
INFO: Terminating category: 'text:(Ciencias humanas y sociales)' Elapsed time: 115 millisecond.
INFO: key '98' main category 'text:(Artes)'
INFO: sub category 'text:(Anime y manga)'
INFO: sub category 'text:(Arquitectura)'
INFO: sub category 'text:(Literatura)'
INFO: sub category 'text:(Teatro)'
INFO: sub category 'text:(Televisión)'
INFO: sub category 'text:(Pintura)'
INFO: sub category 'text:(Danza)'
INFO: sub category 'text:(Música)'
INFO: sub category 'text:(Cine)'
INFO: Terminating category: 'text:(Artes)' Elapsed time: 97 millisecond.
INFO: key '99' main category 'text:(Sociedad)'
INFO: sub category 'text:(Gastronomía)'
INFO: sub category 'text:(Videojuegos)'
INFO: sub category 'text:(LGBT)'
INFO: sub category 'text:(Deporte)'
INFO: Terminating category: 'text:(Sociedad)' Elapsed time: 18 millisecond.
INFO: key '100' main category 'text:(Tecnologías)'
INFO: sub category 'text:(Exploración espacial)'
INFO: sub category 'text:(Biotecnología)'
INFO: sub category 'text:(Ingeniería)'
INFO: sub category 'text:(Informática)'
INFO: sub category 'text:(Medicina)'
INFO: Terminating category: 'text:(Tecnologías)' Elapsed time: 18 millisecond.
INFO: key '101' main category '-ALL-'
INFO: sub category 'text:(Ciencias naturales y formales)'
INFO: sub category 'text:(Tecnologías)'
INFO: sub category 'text:(Ciencias humanas y sociales)'
INFO: sub category 'text:(Sociedad)'
INFO: sub category 'text:(Artes)'
INFO: Terminating category: '-ALL-' Elapsed time: 111 millisecond.
Not so bad considering that I am using a notebook and is not a server optimized for databases purpose. But Oracle have another good chance to speed up this query, a materialized view, let do that:
CREATE MATERIALIZED VIEW wiki_facets
PARALLEL
BUILD IMMEDIATE
REFRESH FORCE AS
select ljoin(lfacets('PAGES_LIDX_ALL,'||
case level when 1 then 'text:('|| cat_name||')'
ELSE 'text:('|| PRIOR cat_name||'),text:('|| cat_name||')'
END
)), cat_parent
FROM wiki_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_parent;
which obviously produce same result:
SQL> set timing on
SQL> select * from wiki_facets;
... same output here ....
Elapsed: 00:00:00.00
SQL>
but with no efforts!!!
You can refresh your materialized view at any point in time depend on your applications constraints.
Under the hood an aggregated function call to the ODCI API first to get the bit set (array of bits marked to 1, one for each document that match) of the first category, for example text:(Ciencias naturales y formales), then it will iterate over all his sub categories:
text:(Astronomía)(1280),
text:(Biología)(5144)
text:(Matemática)(4729)
text:(Física)(8708)
text:(Química)(4745)
text:(Botánica)(3697)
but on each iteration the main category is already computed. Also for caching purposes the Lucene Filter created to get a bit set with the matching documents is stored in Lucene Domain Index memory structures to be re-used by countHits function or lcontains() operator.
So if you execute a query like:
select LuceneDomainIndex.countHits('PAGES_LIDX_ALL','text:(Astronomía)') from dual;

A Lucene Filter required to get the number of hits is already filled by lfacets() function.
I have updated Lucene Domain Index documentation with latest functionality you can read on-line or download as PDF from SF.net web site, and remember that I'll talking about Java inside the database and some internal details of Oracle-Lucene integration next month at CLOUG 2009, see you there....