So, I decided to use pipeline table functions implemented in Java, another exciting functionality provided by ODCI API, but the problem is that the example and the documentation do not provides any hint on how to deal with transients types (AnyData, AnyDataSet and AnyType), AnyDataSet and AnyType allows you to create a function which returns any kind of queries, a required condition for a general purpose highlighting function.
Digging the web, I found only one valuable example of pipeline table function working with AnyType and AnyDataSet in PLSQL, the article from oracle-developer.net by Adrian Billington was a good jump-start for the solution.
According to Oracle 11g JDBC documentation there is support for AnyData, AnyDataSet and AnyType but are not allowed in Call Spec definition :( also this is only valid for 11g and I want to still supporting Lucene Domain Index for 10g databases.
Finally the solution was to use a mix between PLSQL and Java. Developers who are interested to see how can do that could take look at the code on line, CreatePipelineFunction.sql and Highlighter.java, the trick was to return from Java execution environment one row at time in object type having a place holder for each base SQL type, this type is rt_fetch_attributes and a collection RowInfo. If anyone have a better option please drop me an email or put a comment here!!.
OK, after this in depth introduction lets see some example of using highlighting features by using pipeline table functions, there are two options:
- Phighlight(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, stmt IN VARCHAR2) RETURN ANYDATASET
- Rhighlight(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, rType IN VARCHAR2, rws IN SYS_REFCURSOR) RETURN ANYDATASET
SELECT * FROM
TABLE(phighlight(
'EMAILBODYTEXT',
'lucene OR mysql',
'SUBJECT,BODYTEXT',
'select lscore(1) sc,e.* from eMails e
where lcontains(bodytext,''security OR mysql'',''subject:ASC'',1)>0'
));
- EMAILBODYTEXT is a Lucene Domain Index defined for table eMails as:
create table eMails (
emailFrom VARCHAR2(256),
emailTo VARCHAR2(256),
subject VARCHAR2(4000),
emailDate DATE,
bodyText CLOB)
create index emailbodyText on eMails(bodyText)
indextype is lucene.LuceneIndex
parameters('Analyzer:org.apache.lucene.analysis.StopAnalyzer;
ExtraCols:emailDate "emailDate",subject "subject",emailFrom "emailFrom",emailTo "emailTo"');
- 'lucene OR mysql' is Lucene Query used for highlighting, it should be equals to second argument of lcontains() to be consistent, except for the Lucene Domain Index extension rownum:[n TO m] AND used for pagination purpose.
- 'SUBJECT,BODYTEXT' are a coma separated list of columns to be highlighted, any other columns of the query result will be passed as is. Note that I am using uppercase because I am not defining alias at the SQL query, also only columns returned by JDBC as type String or CLOB will be highlighted.
Last argument of phighlight() is any SQL query to be parsed by DBMS_SQL package, this query must return columns of JDBC types String, BigDecimal, Timestamp, CLOB, TIMESTAMP, TIMESTAMPTZ and TIMESTAMPLTZ. Remember to use double single quotes ('') for escaping the SQL quote (') inside this argument.
SUBJECT column have results highlighted as "Re: <B>lucene</B> injection", ..., "<B>lucene</B> injection".
BODYTEXT column have values as "On Dec 21, 2006, at 4:56 AM, Deepan wrote:> I am bothered about security problems with <B>lucene</B>. Is it vulnerable to> any kind of injection like <B>mysql</B> injection? many times the query from> user is passed to <B>lucene</B> for search without validating.Rest easy. There are no known security issues with <B>Lucene</B>, and ithas even undergone a recent static code analysis by Fortify (see thelucene-dev e-mail list",..,"I am bothered about security problems with <B>lucene</B>. Is it vulnerable to any kind of injection like <B>mysql</B> injection? many times the query fromuser is passed to <B>lucene</B> for search without validating.".
A variant of phighlight() is rhighlight() which, in my opinion, is better than the above because eliminate the risk of SQL Injection by passing an insecure SQL string, here an example:
SELECT * FROM
TABLE(rhighlight(
'EMAILBODYTEXT',
'lucene OR mysql',
'SUBJECT,BODYTEXT',
'EMAILRSET',
CURSOR(select /*+ DOMAIN_INDEX_SORT FIRST_ROW */ lscore(1) sc,e.*
from eMails e where lcontains(bodytext,'security OR mysql','subject:ASC',1)>0)
));
- 'EMAILRSET' is collection of EMAILR type an object type defined to hold the result of the query passed as CURSOR(..). A SYS_REFCURSOR could be passed as argument from JDBC as .setObject(n,java.sql.ResultSet) method. This collection look like:
CREATE TYPE EMAILR AS OBJECT
(
sc NUMBER,
emailFrom VARCHAR2(256),
emailTo VARCHAR2(256),
subject VARCHAR2(4000),
emailDate DATE,
bodyText CLOB
);
CREATE OR REPLACE TYPE EMAILRSET AS TABLE OF EMAILR;
- last argument uses CURSOR function to pass an SQL query as SYS_REFCURSOR.
As you can see pipeline table functions are a powerful solution to do special treatment to queries. I mean some time you can not express all functionality in pure SQL or a functionality like highlighting can not be implemented in PLSQL, in that cases implementing pipeline table functions in Java using Oracle OJVM is the only solution.
At last, Lucene is close to get a next production release 2.4.1, my next post will be when this release is public and I'll upload a new version of Lucene Domain Index compiled a tested with it.
Also Lucene Domain Index and many other tips using Java inside the Database will be explained in my planned presentation at CLOUG 2009, see you there...