Tuesday, October 6, 2009

Experiences on Java programming within the databases, tips & tricks, tools, open source libraries and more

This is the topic of my talk during unconference this year (Tuesday - Oct 13 - 10am - Overlook II) but, What is an unconference?
An unconference is a conference where the content of the sessions is driven and created by the participants, generally day-by-day during the course of the event, rather than by a single organizer, or small group of organizers, in advance (Source: Wikipedia). Unconferences came from the realization that "the sum of the expertise of the people in the audience is greater than the sum of expertise of the people on stage" (Source: Dave Winer).
During this talk I would like to transfer my experience on Java programming withing the Database since 1999.
Specially looking for three scenarios of work:
  • Traditional Java Stored Procedure programming
  • HTPP/HTTPS services through the XMLDB Servlet connector
  • Domain Index and pipe-line table function
This scenarios are covered by different Open Source projects used as practical examples DBPrism CMS, XMLDB Restlet Connector and Lucene Domain Index.
Each projects included several Open Source Libraries used and many tools for doing deployment and installing.
I hope I can see you all there next week... Marcelo

Wednesday, September 30, 2009

New release of Lucene Domain Index based on Lucene 2.9.0

A new binary distribution of Lucene Domain Index (2.9.0.1.0) for Oracle 10g/11g has been released.
Lucene Domain Index is integration of Lucene Project running inside the Oracle JVM and integrated to the SQL layer by adding a new index type.
This new version uses latest Lucene 2.9.0 core libraries and introduces some of the changes on API.
Here complete list of changes:
  • Tested with Oracle 11gR2, 11gR1 and 10.2 databases.
  • DefaultUserDataStore do a SAX parsing to get text nodes and attributes from an XMLType value.
  • A SimpleLRUCache is used to load rowids and his associated Lucene doc id, this reduce memory consumption when querying very big tables. A new parameter has been added, CachedRowIdSize by default 10000 to control the size of the LRU cache.
  • Lucene Domain Index core was updated to use TopFieldCollector and to avoid computation time when lscore() is not used.
  • Two new parameter has been added NormalizeScore which control when to track the Max Score and when querying, both parameters are consequence of new Lucene Collector API and boost the performance when querying.
  • A table alias L$MT is defined for the master table associated to the index to be used in complex queries associating columns from master tables and columns from dependant tables.
Full documentation is at Google Doc.
Download binary version for Oracle 10g and 11g.
One of the biggest changed introduced into Lucene core libraries are in the TopCollector API, it introduces optimizations when you don't need to compute the score, track maximum score and preserve the Lucene document ID in order.
This changes are reflected in Lucene Domain Index through the usage of lscore() ancilliary operator. For example:
create table emails (
emailFrom VARCHAR2(256),
emailTo VARCHAR2(256),
subject VARCHAR2(4000),
emailDate DATE,
bodyText CLOB)
/
and an index created as:
create index emailbodyText on emails(bodyText) indextype is lucene.LuceneIndex
parameters('Analyzer:org.apache.lucene.analysis.StopAnalyzer;ExtraCols:emailDate "emailDate",subject "subject",emailFrom "emailFrom",emailTo "emailTo"');
-- required to Sort by subject
alter index emailbodyText parameters('FormatCols:subject(NOT_ANALYZED),emailFrom(NOT_ANALYZED),emailTo(NOT_ANALYZED)');
-- do not track max score
alter index emailbodyText parameters('NormalizeScore:false');
this queries reflect when score is computed or not.
SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1) sc,subject
FROM emails where emailfrom like '%@gmail.com' and
lcontains(bodytext,'subject:lucene injection',1)>0
order by lscore(1) DESC;
  • Lucene score is computed and do not track maximum score, the result will be returned using relevance order descendant, default Lucene ordering no extra effort.
SELECT /*+ DOMAIN_INDEX_SORT */ subject
FROM emails
where lcontains(bodytext,'subject:lucene injection','subject:ASC',1)>0;
  • Lucene score is not computed and obviously the maximum score is not tracked.
Note that if we are querying using the optimizer hint DOMAIN_INDEX_SORT and lcontains(..,'subject:ASC',..) the score value is not relevant the result will be returned using the column subject ascending.
Similar result can be obtained using order by SQL, but to do that the RDBMS first collects all the rows that match to lcontains() operator and then do an order by, the difference in this simple test is an optimizer cost 3 over 2.
The NormalizeScore parameter is by default true, to get back compatibility, applications which assume an score in range 0..1 will run without any change, but if you want to get faster response time by avoiding max score computation you can change it by using alter index parameters DDL command.
To see which really means NormalizeScore parameter look a this example using WikiPedia English dump:
select /*+ FIRST_ROWS DOMAIN_INDEX_SORT */ lscore(1),
extractValue(object_value,'/page/title'),
extractValue(object_value,'/page/revision/timestamp') "revisionDate"
from pages where lcontains(object_value,'rownum:[1 TO 2] AND sport','revisionDate:ASC',1)>0;
with NormalizeScore:false returns:
1.56775963306427001953125 SCCA 25-FEB-02 03.43.11.000000000 PM +00:00
1.79172527790069580078125 Sports utility vehicle 04-APR-02 10.31.41.000000000 PM +00:00
with NormalizeScore:true returns:
0.875 SCCA 25-FEB-02 03.43.11.000000000 PM +00:00
1 Sports utility vehicle 04-APR-02 10.31.41.000000000 PM +00:00
Both results are logically because we are looking for a result order by revisionDate:ASC, but the score computed is different.
Well, I will be at Oracle Open World 09, if anybody want to know more about this kind of internal implementation I will talk at the Oracle Unconference and obviously we can meet at the OTN Lounge, see you there Marcelo.

Wednesday, June 17, 2009

Unleash Oracle ODCI API - OOW09 Voting Session

Oracle Open World Voting session is a new way to create the conference session agenda.
I have submited two speaker session, one named "Unleash Oracle ODCI API" that is ready for voting at Oracle Mix comunity.
Oracle Data Cartridge API is provided to implement many powerful functionality such as new Domain Indexes, pipeline tables, and aggregated functions.
The presentation will include an introduction to this API showing many of his features using as example the code of a Lucene Domain Index which is a mix between Java running inside the OJVM and Oracle Object types.
Lucene Domain Index is an open source project which integrates the Apache Lucene IR library as a new Domain Index, providing features has free text searching, faceting, highlighting, filtering at index level, multi table/column indexes and more for 10g/11g databases.
Basically I would like to introduce this exciting API which allows developers to interact directly with the RDBMS engine and adding some examples in Java that are not included into the Oracle Documentation.
Well if you want to see this session at OOW09 please click here, see you there....

Tuesday, April 7, 2009

Oracle Web Services Manager book

Recently I received a copy of the book Oracle Web Services Manager as courtesy of Pack Publishing editorial.
As everybody knows, I am more familiar with REST WS than SOAP and WS-* security standards, so I have read entirely the book :)
This book have a very good introduction to all security related stuff and also a comprehensive introduction to the product Oracle Web Services Manager.
For anybody who want to deploy secure WS using Oracle WS Manager the book will introduce you gradually ,and by examples, on all the steps related to authorization, authentication, digital signatures and many other topics.
On the client side the examples are in .Net technology (I prefer Java) but its a minor disappointment with the complete book content.
For anybody who want to take a closer look before buy the book, here a sample chapter on
Oracle Web services Manager-Authentication and Authorization
for downloading.
Enjoy the book....

Friday, March 27, 2009

Doing facets with latest Lucene Domain Index 2.4.1.1.0

As usual, I am using the announce of new release of Lucene Domain Index to comments some internal detail of Oracle ODCI API.
Today, I will show how to use and works Lucene Domain Index facets.
Latest release implements an aggregate function lfacets() which performs bit set and grouping with Lucene. I am used this article as a jump start for doing facets.
Also I found another post on how to implement an aggregated function using ODCI API, but as with the pipeline table function implementation there is no documentation nor examples on how to do it with Java inside the database.
I am using same approach as the pipeline table function, a mix of PLSQL code plus an internal implementation using Java.
As you can see on the code there is an object type named facets_agg_type which is the implementation of lfacets() aggregated function, under the hood it creates the structure of the return type and call to the Java class which is doing a bit-set-AND operations on Lucene filters.
As an example let me show how to do facets with WikiPedia Spanish dump uploaded to Oracle XMLDB. I can query PAGES table with lcontains() operator as:
select /*+ DOMAIN_INDEX_SORT */ extractValue(object_value,'/page/title')
from pages where
lcontains(object_value,
'rownum:[20 TO 30] AND musica AND revisionDate:[20070101 TO 20070718]')>0;
But faceting is for doing grouping similar to
so using some categories already defined into WikiPedia Spanish web site.
I can create a simple table with the criteria for categorizing WikiPedia corpus.
Here a table with parent relation ship to organize the categories:
create table wiki_categories (
cat_code number(4),
cat_name varchar2(256),
cat_parent number(4),
CONSTRAINT PK_WIKI_CATEGORIES PRIMARY KEY (cat_code),
CONSTRAINT FK_CAT_PARENT FOREIGN KEY (cat_parent)
REFERENCES wiki_categories (cat_code)
);
/

insert into wiki_categories values (1,'Ciencias naturales y formales',null);
insert into wiki_categories values (2,'Ciencias humanas y sociales',null);
insert into wiki_categories values (3,'Artes',null);
insert into wiki_categories values (4,'Sociedad',null);
insert into wiki_categories values (5,'Tecnologías',null);

insert into wiki_categories values (6,'Astronomía',1);
insert into wiki_categories values (7,'Biología',1);
....
insert into wiki_categories values (38,'Exploración espacial',5);
insert into wiki_categories values (39,'Biotecnología',5);
insert into wiki_categories values (40,'Informática',5);
insert into wiki_categories values (41,'Ingeniería',5);
insert into wiki_categories values (42,'Medicina',5);
So with these categories I can use lfacets() aggregated function as:
select ljoin(lfacets('PAGES_LIDX_ALL,'||
case level when 1 then 'text:('|| cat_name||')'
ELSE 'text:('|| PRIOR cat_name||'),text:('|| cat_name||')'
END
)), cat_code,level
FROM wiki_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_code,level;
or:
select ljoin(lfacets('PAGES_LIDX_ALL,'||
case level when 1 then 'text:('|| cat_name||')'
ELSE 'text:('|| PRIOR cat_name||'),text:('|| cat_name||')'
END
)), cat_parent
FROM wiki_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_parent;
getting a result for the second case as:
text:(Ciencias naturales y formales),text:(Astronomía)(1280),text:(Ciencias naturales y formales),text:(Biología)(5144),text:(Ciencias naturales y formales),text:(Matemática)(4729),text:(Ciencias naturales y formales),text:(Física)(8708),text:(Ciencias naturales y formales),text:(Química)(4745),text:(Ciencias naturales y formales),text:(Botánica)(3697)

text:(Ciencias humanas y sociales),text:(Comunicación)(7319),text:(Ciencias humanas y sociales),text:(Filosofía)(6393),text:(Ciencias humanas y sociales),text:(Sociología)(2004),text:(Ciencias humanas y sociales),text:(Religión)(6429),text:(Ciencias humanas y sociales),text:(Psicología)(2508),text:(Ciencias humanas y sociales),text:(Política)(27356),text:(Ciencias humanas y sociales),text:(Mitología)(2672),text:(Ciencias humanas y sociales),text:(Lingüística)(2611),text:(Ciencias humanas y sociales),text:(Mitología)(2672),text:(Ciencias humanas y sociales),text:(Historia)(35727),text:(Ciencias humanas y sociales),text:(Egiptología)(91),text:(Ciencias humanas y sociales),text:(Derecho)(18586),text:(Ciencias humanas y sociales),text:(Economía)(9755)

text:(Artes),text:(Anime y manga)(1604),text:(Artes),text:(Arquitectura)(1981),text:(Artes),text:(Literatura)(2154),text:(Artes),text:(Teatro)(2818),text:(Artes),text:(Televisión)(1861),text:(Artes),text:(Pintura)(3323),text:(Artes),text:(Danza)(876),text:(Artes),text:(Música)(3643),text:(Artes),text:(Cine)(2235)
text:(Sociedad),text:(Gastronomía)(443),text:(Sociedad),text:(Videojuegos)(360),text:(Sociedad),text:(LGBT)(267),text:(Sociedad),text:(Deporte)(3647)

text:(Tecnologías),text:(Exploración espacial)(1375),text:(Tecnologías),text:(Biotecnología)(232),text:(Tecnologías),text:(Ingeniería)(1654),text:(Tecnologías),text:(Informática)(2017),text:(Tecnologías),text:(Medicina)(1085)

text:(Ciencias naturales y formales)(62689),text:(Tecnologías)(13772),text:(Ciencias humanas y sociales)(80857),text:(Sociedad)(25372),text:(Artes)(14515)


Note that I am using ljoin function which converts agg_tbl type (TABLE OF) to a coma separated string.
The computation of the whole WikiPedia dump, around 1 million of documents in my notebook took:
INFO: key '96' main category 'text:(Ciencias naturales y formales)'
INFO: sub category 'text:(Astronomía)'
INFO: sub category 'text:(Biología)'
INFO: sub category 'text:(Matemática)'
INFO: sub category 'text:(Física)'
INFO: sub category 'text:(Química)'
INFO: sub category 'text:(Botánica)'
INFO: Terminating category: 'text:(Ciencias naturales y formales)' Elapsed time: 58 millisecond.
INFO: key '97' main category 'text:(Ciencias humanas y sociales)'
INFO: sub category 'text:(Comunicación)'
INFO: sub category 'text:(Filosofía)'
INFO: sub category 'text:(Sociología)'
INFO: sub category 'text:(Religión)'
INFO: sub category 'text:(Psicología)'
INFO: sub category 'text:(Política)'
INFO: sub category 'text:(Mitología)'
INFO: sub category 'text:(Lingüística)'
INFO: sub category 'text:(Mitología)'
INFO: sub category 'text:(Historia)'
INFO: sub category 'text:(Egiptología)'
INFO: sub category 'text:(Derecho)'
INFO: sub category 'text:(Economía)'
INFO: Terminating category: 'text:(Ciencias humanas y sociales)' Elapsed time: 115 millisecond.
INFO: key '98' main category 'text:(Artes)'
INFO: sub category 'text:(Anime y manga)'
INFO: sub category 'text:(Arquitectura)'
INFO: sub category 'text:(Literatura)'
INFO: sub category 'text:(Teatro)'
INFO: sub category 'text:(Televisión)'
INFO: sub category 'text:(Pintura)'
INFO: sub category 'text:(Danza)'
INFO: sub category 'text:(Música)'
INFO: sub category 'text:(Cine)'
INFO: Terminating category: 'text:(Artes)' Elapsed time: 97 millisecond.
INFO: key '99' main category 'text:(Sociedad)'
INFO: sub category 'text:(Gastronomía)'
INFO: sub category 'text:(Videojuegos)'
INFO: sub category 'text:(LGBT)'
INFO: sub category 'text:(Deporte)'
INFO: Terminating category: 'text:(Sociedad)' Elapsed time: 18 millisecond.
INFO: key '100' main category 'text:(Tecnologías)'
INFO: sub category 'text:(Exploración espacial)'
INFO: sub category 'text:(Biotecnología)'
INFO: sub category 'text:(Ingeniería)'
INFO: sub category 'text:(Informática)'
INFO: sub category 'text:(Medicina)'
INFO: Terminating category: 'text:(Tecnologías)' Elapsed time: 18 millisecond.
INFO: key '101' main category '-ALL-'
INFO: sub category 'text:(Ciencias naturales y formales)'
INFO: sub category 'text:(Tecnologías)'
INFO: sub category 'text:(Ciencias humanas y sociales)'
INFO: sub category 'text:(Sociedad)'
INFO: sub category 'text:(Artes)'
INFO: Terminating category: '-ALL-' Elapsed time: 111 millisecond.
Not so bad considering that I am using a notebook and is not a server optimized for databases purpose. But Oracle have another good chance to speed up this query, a materialized view, let do that:
CREATE MATERIALIZED VIEW wiki_facets
PARALLEL
BUILD IMMEDIATE
REFRESH FORCE AS
select ljoin(lfacets('PAGES_LIDX_ALL,'||
case level when 1 then 'text:('|| cat_name||')'
ELSE 'text:('|| PRIOR cat_name||'),text:('|| cat_name||')'
END
)), cat_parent
FROM wiki_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_parent;
which obviously produce same result:
SQL> set timing on
SQL> select * from wiki_facets;
... same output here ....
Elapsed: 00:00:00.00
SQL>
but with no efforts!!!
You can refresh your materialized view at any point in time depend on your applications constraints.
Under the hood an aggregated function call to the ODCI API first to get the bit set (array of bits marked to 1, one for each document that match) of the first category, for example text:(Ciencias naturales y formales), then it will iterate over all his sub categories:
text:(Astronomía)(1280),
text:(Biología)(5144)
text:(Matemática)(4729)
text:(Física)(8708)
text:(Química)(4745)
text:(Botánica)(3697)
but on each iteration the main category is already computed. Also for caching purposes the Lucene Filter created to get a bit set with the matching documents is stored in Lucene Domain Index memory structures to be re-used by countHits function or lcontains() operator.
So if you execute a query like:
select LuceneDomainIndex.countHits('PAGES_LIDX_ALL','text:(Astronomía)') from dual;

A Lucene Filter required to get the number of hits is already filled by lfacets() function.
I have updated Lucene Domain Index documentation with latest functionality you can read on-line or download as PDF from SF.net web site, and remember that I'll talking about Java inside the database and some internal details of Oracle-Lucene integration next month at CLOUG 2009, see you there....

Monday, March 23, 2009

APEX proxy user support with DBPrism, aka middle tier authentication

FYI a post which I had started in October 2008, now is ready but it was stored in this link.
Basically it covers a step by step guide to use APEX with proxy user support, aka middle tier authentication, very important if you want to use APEX with VPD database.

Monday, March 9, 2009

New binary release of Lucene Domain Index for 10g-11g

Lucene Dev team just released a new production version (2.4.1) of the Lucene IR library, it was not announce yet, but its available at Apache download area.
So We released a new binary distribution compiled and tested with this new release.
There is only one change which fix an incompatibility issue between 10g and 11g database inside the pipeline table functions used for highlighting, the problem was related to how the internal JDBC driver return a column of SQL Date as object, 11g returns a java.sql.Timestamp and 10g returns a java.sql.Date. This affects the run-time behavior of the pipeline table functions which on the fly translates the Java Result Set to PLSQL execution layer.
I have made some test with Lucene 2.9 code base, available through CVS, and Lucene Domain Index works fine a fastest than previous releases, as usual with any Lucene code release.
This compilation can be done following the instruction of Lucene Domain Index documentation, section 2.3 - Install instruction for compiling from sources.
Finally remember that Lucene Domain Index is an updated solution for Oracle applications written in any language, Phyton, Ruby, .Net, PHP, PLSQL,... ;)