As usual, I am using the announce of new release of
Lucene Domain Index to comments some internal detail of
Oracle ODCI API.
Today, I will show how to use and works Lucene Domain Index facets.
Latest release implements an aggregate function
lfacets() which performs bit set and grouping with Lucene. I am used
this article as a jump start for doing facets.
Also I found another post on how to implement an
aggregated function using ODCI API, but as with the
pipeline table function implementation there is no documentation nor examples on how to do it with Java inside the database.
I am using same approach as the pipeline table function, a mix of
PLSQL code plus an
internal implementation using Java.
As you can see on the code there is an object type named
facets_agg_type which is the implementation of
lfacets() aggregated function, under the hood it creates the structure of the return type and call to the Java class which is doing a bit-set-AND operations on Lucene filters.
As an example let me show how to do facets with
WikiPedia Spanish dump uploaded to Oracle XMLDB. I can query PAGES table with lcontains() operator as:
select /*+ DOMAIN_INDEX_SORT */ extractValue(object_value,'/page/title')
from pages where
lcontains(object_value,
'rownum:[20 TO 30] AND musica AND revisionDate:[20070101 TO 20070718]')>0;
But faceting is for doing grouping similar to

so using some categories already defined into
WikiPedia Spanish web site.
I can create a simple table with the criteria for categorizing WikiPedia corpus.
Here a table with parent relation ship to organize the categories:
create table wiki_categories (
cat_code number(4),
cat_name varchar2(256),
cat_parent number(4),
CONSTRAINT PK_WIKI_CATEGORIES PRIMARY KEY (cat_code),
CONSTRAINT FK_CAT_PARENT FOREIGN KEY (cat_parent)
REFERENCES wiki_categories (cat_code)
);
/
insert into wiki_categories values (1,'Ciencias naturales y formales',null);
insert into wiki_categories values (2,'Ciencias humanas y sociales',null);
insert into wiki_categories values (3,'Artes',null);
insert into wiki_categories values (4,'Sociedad',null);
insert into wiki_categories values (5,'Tecnologías',null);
insert into wiki_categories values (6,'Astronomía',1);
insert into wiki_categories values (7,'Biología',1);
....
insert into wiki_categories values (38,'Exploración espacial',5);
insert into wiki_categories values (39,'Biotecnología',5);
insert into wiki_categories values (40,'Informática',5);
insert into wiki_categories values (41,'Ingeniería',5);
insert into wiki_categories values (42,'Medicina',5);
So with these categories I can use lfacets() aggregated function as:
select ljoin(lfacets('PAGES_LIDX_ALL,'||
case level when 1 then 'text:('|| cat_name||')'
ELSE 'text:('|| PRIOR cat_name||'),text:('|| cat_name||')'
END
)), cat_code,level
FROM wiki_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_code,level;
or:
select ljoin(lfacets('PAGES_LIDX_ALL,'||
case level when 1 then 'text:('|| cat_name||')'
ELSE 'text:('|| PRIOR cat_name||'),text:('|| cat_name||')'
END
)), cat_parent
FROM wiki_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_parent;
getting a result for the second case as:
text:(Ciencias naturales y formales),text:(Astronomía)(1280),text:(Ciencias naturales y formales),text:(Biología)(5144),text:(Ciencias naturales y formales),text:(Matemática)(4729),text:(Ciencias naturales y formales),text:(Física)(8708),text:(Ciencias naturales y formales),text:(Química)(4745),text:(Ciencias naturales y formales),text:(Botánica)(3697)
text:(Ciencias humanas y sociales),text:(Comunicación)(7319),text:(Ciencias humanas y sociales),text:(Filosofía)(6393),text:(Ciencias humanas y sociales),text:(Sociología)(2004),text:(Ciencias humanas y sociales),text:(Religión)(6429),text:(Ciencias humanas y sociales),text:(Psicología)(2508),text:(Ciencias humanas y sociales),text:(Política)(27356),text:(Ciencias humanas y sociales),text:(Mitología)(2672),text:(Ciencias humanas y sociales),text:(Lingüística)(2611),text:(Ciencias humanas y sociales),text:(Mitología)(2672),text:(Ciencias humanas y sociales),text:(Historia)(35727),text:(Ciencias humanas y sociales),text:(Egiptología)(91),text:(Ciencias humanas y sociales),text:(Derecho)(18586),text:(Ciencias humanas y sociales),text:(Economía)(9755)
text:(Artes),text:(Anime y manga)(1604),text:(Artes),text:(Arquitectura)(1981),text:(Artes),text:(Literatura)(2154),text:(Artes),text:(Teatro)(2818),text:(Artes),text:(Televisión)(1861),text:(Artes),text:(Pintura)(3323),text:(Artes),text:(Danza)(876),text:(Artes),text:(Música)(3643),text:(Artes),text:(Cine)(2235)
text:(Sociedad),text:(Gastronomía)(443),text:(Sociedad),text:(Videojuegos)(360),text:(Sociedad),text:(LGBT)(267),text:(Sociedad),text:(Deporte)(3647)
text:(Tecnologías),text:(Exploración espacial)(1375),text:(Tecnologías),text:(Biotecnología)(232),text:(Tecnologías),text:(Ingeniería)(1654),text:(Tecnologías),text:(Informática)(2017),text:(Tecnologías),text:(Medicina)(1085)
text:(Ciencias naturales y formales)(62689),text:(Tecnologías)(13772),text:(Ciencias humanas y sociales)(80857),text:(Sociedad)(25372),text:(Artes)(14515)
Note that I am using
ljoin function which converts
agg_tbl type (TABLE OF) to a coma separated string.
The computation of the whole
WikiPedia dump, around 1 million of documents in my notebook took:
INFO: key '96' main category 'text:(Ciencias naturales y formales)'
INFO: sub category 'text:(Astronomía)'
INFO: sub category 'text:(Biología)'
INFO: sub category 'text:(Matemática)'
INFO: sub category 'text:(Física)'
INFO: sub category 'text:(Química)'
INFO: sub category 'text:(Botánica)'
INFO: Terminating category: 'text:(Ciencias naturales y formales)' Elapsed time: 58 millisecond.
INFO: key '97' main category 'text:(Ciencias humanas y sociales)'
INFO: sub category 'text:(Comunicación)'
INFO: sub category 'text:(Filosofía)'
INFO: sub category 'text:(Sociología)'
INFO: sub category 'text:(Religión)'
INFO: sub category 'text:(Psicología)'
INFO: sub category 'text:(Política)'
INFO: sub category 'text:(Mitología)'
INFO: sub category 'text:(Lingüística)'
INFO: sub category 'text:(Mitología)'
INFO: sub category 'text:(Historia)'
INFO: sub category 'text:(Egiptología)'
INFO: sub category 'text:(Derecho)'
INFO: sub category 'text:(Economía)'
INFO: Terminating category: 'text:(Ciencias humanas y sociales)' Elapsed time: 115 millisecond.
INFO: key '98' main category 'text:(Artes)'
INFO: sub category 'text:(Anime y manga)'
INFO: sub category 'text:(Arquitectura)'
INFO: sub category 'text:(Literatura)'
INFO: sub category 'text:(Teatro)'
INFO: sub category 'text:(Televisión)'
INFO: sub category 'text:(Pintura)'
INFO: sub category 'text:(Danza)'
INFO: sub category 'text:(Música)'
INFO: sub category 'text:(Cine)'
INFO: Terminating category: 'text:(Artes)' Elapsed time: 97 millisecond.
INFO: key '99' main category 'text:(Sociedad)'
INFO: sub category 'text:(Gastronomía)'
INFO: sub category 'text:(Videojuegos)'
INFO: sub category 'text:(LGBT)'
INFO: sub category 'text:(Deporte)'
INFO: Terminating category: 'text:(Sociedad)' Elapsed time: 18 millisecond.
INFO: key '100' main category 'text:(Tecnologías)'
INFO: sub category 'text:(Exploración espacial)'
INFO: sub category 'text:(Biotecnología)'
INFO: sub category 'text:(Ingeniería)'
INFO: sub category 'text:(Informática)'
INFO: sub category 'text:(Medicina)'
INFO: Terminating category: 'text:(Tecnologías)' Elapsed time: 18 millisecond.
INFO: key '101' main category '-ALL-'
INFO: sub category 'text:(Ciencias naturales y formales)'
INFO: sub category 'text:(Tecnologías)'
INFO: sub category 'text:(Ciencias humanas y sociales)'
INFO: sub category 'text:(Sociedad)'
INFO: sub category 'text:(Artes)'
INFO: Terminating category: '-ALL-' Elapsed time: 111 millisecond.
Not so bad considering that I am using a notebook and is not a server optimized for databases purpose. But Oracle have another good chance to speed up this query, a
materialized view, let do that:
CREATE MATERIALIZED VIEW wiki_facets
PARALLEL
BUILD IMMEDIATE
REFRESH FORCE AS
select ljoin(lfacets('PAGES_LIDX_ALL,'||
case level when 1 then 'text:('|| cat_name||')'
ELSE 'text:('|| PRIOR cat_name||'),text:('|| cat_name||')'
END
)), cat_parent
FROM wiki_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_parent;
which obviously produce same result:
SQL> set timing on
SQL> select * from wiki_facets;
... same output here ....
Elapsed: 00:00:00.00
SQL>
but with
no efforts!!!
You can refresh your materialized view at any point in time depend on your applications constraints.
Under the hood an aggregated function call to the
ODCI API first to get the bit set (array of bits marked to 1, one for each document that match) of the first category, for example text:(Ciencias naturales y formales), then it will iterate over all his sub categories:
text:(Astronomía)(1280),
text:(Biología)(5144)
text:(Matemática)(4729)
text:(Física)(8708)
text:(Química)(4745)
text:(Botánica)(3697)
but on each iteration the main category is already computed. Also for caching purposes the Lucene Filter created to get a bit set with the matching documents is stored in Lucene Domain Index memory structures to be re-used by
countHits function or
lcontains() operator.
So if you execute a query like:
select LuceneDomainIndex.countHits('PAGES_LIDX_ALL','text:(Astronomía)') from dual;
A Lucene Filter required to get the number of hits is already filled by
lfacets() function.
I have updated
Lucene Domain Index documentation with latest functionality you can read on-line or download as PDF from SF.net web site, and remember that I'll talking about Java inside the database and some internal details of Oracle-Lucene integration next month at
CLOUG 2009, see you there....