Tuesday, December 18, 2007

Uploading WikiPedia Dumps to Oracle databases

Wikipedia dumps are a good set of documents if you want to test, for example, search capabilities.
The WikiMedia download site has XML dumps for all Wikipedia sites; in particular, there is a very big compressed dump of all the pages in English, whose size is around 3 GB.
We would like to load this very big XML file into regular Oracle tables.
There are plenty of utilities for uploading Wikipedia dumps to relational databases, but none of them is tailored for Oracle.
So we can fill this gap by creating XMLType tables and columns based on an XML schema and using a SAX upload utility to load the XML content.
Note: When you register an XML schema with Oracle XML DB, you can specify the storage model to use for the XML instance documents that reference the XML schema. If that specified storage model is binary XML, then you can only use that XML schema to create XMLType tables and columns that are stored as binary XML. Conversely, if the storage model specified for the registered XML schema is not binary XML, then you can only use that XML schema to create XMLType tables and columns that are not stored as binary XML, that is, stored object-relationally or as CLOB instances.
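For reference, the storage model is chosen at registration time. A minimal sketch of a binary XML registration in 11g (illustrative only, not used in this post; it assumes the 11g OPTIONS parameter and the DBMS_XMLSCHEMA.REGISTER_BINARYXML constant):

BEGIN
  -- register a schema for binary XML storage instead of object-relational storage;
  -- GENTYPES is not needed because no object types are generated for binary XML
  DBMS_XMLSCHEMA.registerSchema(
    SCHEMAURL => 'http://www.mediawiki.org/xml/export-0.3/',
    SCHEMADOC => XDBURIType('/public/export-0.3.xsd').getClob(),
    LOCAL     => TRUE,
    GENTYPES  => FALSE,
    GENTABLES => TRUE,
    OPTIONS   => DBMS_XMLSCHEMA.REGISTER_BINARYXML);
END;
/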

First, we need to annotate the WikiPedia XSD export file to tell Oracle how to transform a page tag into an object type stored in an object-relational table; you can see a complete annotated XSD file here. It basically declares a new top-level element named "page" to give the SAXLoader application a table in which to store all page metadata entries. It also creates two XML DB self-managed master-detail (one-to-many) relationships: pages to revisions and pages to uploads.
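The complete annotated file has all the details, but as a rough sketch (the xdb:* values here are illustrative assumptions, not copied from the real file) the annotations look like this:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:xdb="http://xmlns.oracle.com/xdb"
           xmlns="http://www.mediawiki.org/xml/export-0.3/"
           targetNamespace="http://www.mediawiki.org/xml/export-0.3/"
           elementFormDefault="qualified">

  <!-- top-level "page" element mapped to its own default table -->
  <xs:element name="page" type="PageType" xdb:defaultTable="PAGES"/>

  <xs:complexType name="PageType" xdb:SQLType="PageType" xdb:maintainDOM="false">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="id" type="xs:positiveInteger" minOccurs="0"/>
      <xs:element name="restrictions" type="xs:string" minOccurs="0"/>
      <!-- out-of-line storage: one-to-many master-detail rows in their own tables -->
      <xs:element name="revision" type="RevisionType" maxOccurs="unbounded"
                  xdb:SQLInline="false" xdb:defaultTable="PAGES_REVISIONS"/>
      <xs:element name="upload" type="UploadType" minOccurs="0" maxOccurs="unbounded"
                  xdb:SQLInline="false" xdb:defaultTable="PAGES_UPLOADS"/>
    </xs:sequence>
  </xs:complexType>

  <!-- RevisionType and UploadType are annotated the same way; the revision
       text element is mapped to a CLOB with xdb:SQLType="CLOB" -->

</xs:schema>

xdb:SQLInline="false" is what turns the nested revision and upload collections into rows of their own default tables, which is where the XDB$XMLTYPE_REF_LIST_T columns shown below come from.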

You can register the schema and create all the tables with:

BEGIN
  DBMS_XMLSCHEMA.registerURI(
    'http://www.mediawiki.org/xml/export-0.3/',
    '/public/export-0.3.xsd',
    LOCAL=>TRUE, GENTYPES=>TRUE, GENBEAN=>FALSE, GENTABLES=>TRUE);
END;
/

Before executing this PL/SQL block, upload export-0.3.xsd to the /public directory of the XML DB repository, using either the FTP or the WebDAV protocol.
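If you prefer to stay in SQL*Plus, the resource can also be created directly with DBMS_XDB; a minimal sketch, assuming a directory object named XMLDIR that points to the directory where export-0.3.xsd was downloaded:

DECLARE
  res BOOLEAN;
BEGIN
  -- create /public/export-0.3.xsd in the XML DB repository
  -- from a file on the database server file system
  res := DBMS_XDB.createResource('/public/export-0.3.xsd',
                                 bfilename('XMLDIR', 'export-0.3.xsd'),
                                 nls_charset_id('AL32UTF8'));
  COMMIT;
END;
/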
After executing the registerURI procedure, three new tables will appear in your Oracle schema:
SQL> desc pages
Name Null? Type
----------------------------------------- -------- ----------------------------
TABLE of SYS.XMLTYPE(XMLSchema "http://www.mediawiki.org/xml/export-0.3/" Element "page") STORAGE Object-relational TYPE "PageType"

SQL> desc pages_revisions
Name Null? Type
----------------------------------------- -------- ----------------------------
TABLE of SYS.XMLTYPE(XMLSchema "http://www.mediawiki.org/xml/export-0.3/" Element "revision") STORAGE Object-relational TYPE "RevisionType"

SQL> desc pages_uploads
Name Null? Type
----------------------------------------- -------- ----------------------------
TABLE of SYS.XMLTYPE(XMLSchema "http://www.mediawiki.org/xml/export-0.3/" Element "upload") STORAGE Object-relational TYPE "UploadType"

With their underlying types:
SQL> desc "PageType"
"PageType" is NOT FINAL
Name
Null?
Type
SYS_XDBPD$
XDB.XDB$RAW_LIST_T
title

VARCHAR2(4000 CHAR)
id

NUMBER(38)
restrictions

VARCHAR2(4000 CHAR)
revision

XDB.XDB$XMLTYPE_REF_LIST_T
upload

XDB.XDB$XMLTYPE_REF_LIST_T

SQL> desc "RevisionType"
"RevisionType" is NOT FINAL

Name
Null?
Type
SYS_XDBPD$
XDB.XDB$RAW_LIST_T
id

NUMBER(38)
timestamp
VARCHAR2(4000 CHAR)
contributor
ContributorType
comment
VARCHAR2(4000 CHAR)
text

CLOB

SQL> desc "UploadType"
"UploadType" is NOT FINAL

Name
Null?
Type
SYS_XDBPD$
XDB.XDB$RAW_LIST_T
timestamp
VARCHAR2(4000 CHAR)
contributor
ContributorType
comment
VARCHAR2(4000 CHAR)
filename
VARCHAR2(4000 CHAR)
src

VARCHAR2(4000 CHAR)
size

NUMBER(38)

By using three tables, we can choose to index some columns with any index type supported by Oracle.
The annotated schema also tells Oracle to use a CLOB column for storing the user-edited text (the text column of the pages_revisions table).
Shredding the WikiDump XML files with an XML DB annotated schema thus provides object-relational storage for otherwise unstructured data.
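For example, a plain B-tree index on the page title can be added through extractValue, which Oracle rewrites to the underlying relational column of the object-relational storage (a sketch; the index name pages_title_idx is arbitrary):

-- B-tree index on /page/title; with object-relational storage the
-- extractValue() expression is rewritten to the underlying column
create index pages_title_idx on pages p (extractValue(value(p), '/page/title'));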
Finally, to upload any WikiPedia dump XML file using the OTN SAXLoader example, we use a modified configuration file like:
<?xml version='1.0' encoding='windows-1252'?>
<Connection>
<Driver>OCI</Driver>
<Hostname>localhost</Hostname>
<Port>1521</Port>
<ServiceName>orcl</ServiceName>
<SID>orcl</SID>
<ServerMode>DEDICATED</ServerMode>
<Schema>eswiki</Schema>
<Password>eswiki</Password>
<SourceXML>/home/backup/eswiki-20071202-pages-articles.xml</SourceXML>
<Element>page</Element>
<SchemaURL>http://www.mediawiki.org/xml/export-0.3/</SchemaURL>
<Table>PAGES</Table>
<ThreadCount>2</ThreadCount>
<CommitCharge>25</CommitCharge>
<Logger>FILE</Logger>
</Connection>

Using modest hardware to upload the Spanish WikiPedia dump, we can see upload output like:

[mochoa@p1 build]$ more TraceOutputFile-20074317184301.log
Connection Settings =
<?xml version = '1.0' encoding = 'windows-1252'?>
<Connection>
<Driver>OCI</Driver>
<Hostname>localhost</Hostname>
<Port>1521</Port>
<ServiceName>orcl</ServiceName>
<SID>orcl</SID>
<ServerMode>DEDICATED</ServerMode>
<Schema>eswiki</Schema>
<Password>eswiki</Password>
<SourceXML>/home/backup/eswiki-20071202-pages-articles.xml</SourceXML>
<Element>page</Element>
<SchemaURL>http://www.mediawiki.org/xml/export-0.3/</SchemaURL>
<Table>PAGES</Table>
<ThreadCount>2</ThreadCount>
<CommitCharge>25</CommitCharge>
<Logger>FILE</Logger>
<StartCount>0</StartCount>
<EndCount>10</EndCount>
</Connection>

Driver = OCI
Hostname = localhost
Port = 1521
ServiceName = orcl
Server = DEDICATED
Schema = eswiki
Password = eswiki
Logger = FILE
Database URL = jdbc:oracle:oci8:@(description=(address=(host=localhost)(protocol=tcp)(port=1521))(connect_data=(service_name=orcl)(server=DEDI
CATED)))
BaseApplication.getConnection(): Connecting as eswiki/eswiki@jdbc:oracle:oci8:@(description=(address=(host=localhost)(protocol=tcp)(port=1521)
)(connect_data=(service_name=orcl)(server=DEDICATED)))
BaseApplication.getConnection(): Database Connection Established
BaseApplication.getConnection(): Connecting as eswiki/eswiki@jdbc:oracle:oci8:@(description=(address=(host=localhost)(protocol=tcp)(port=1521)
)(connect_data=(service_name=orcl)(server=DEDICATED)))
Thread Writer_000001 started at Dec 17, 2007 6:43:01 PM ART
Thread Writer_000001 using SQL : insert into PAGES values (xmltype(?).createSchemaBasedXML('http://www.mediawiki.org/xml/export-0.3/'))
BaseApplication.getConnection(): Database Connection Established
Thread Writer_000002 started at Dec 17, 2007 6:43:02 PM ART
Thread Writer_000002 using SQL : insert into PAGES values (xmltype(?).createSchemaBasedXML('http://www.mediawiki.org/xml/export-0.3/'))
Thread Writer_000002 committed after 25 records.

..............
Thread Writer_000002 committed after 25 records.
Thread Writer_000001 completed at Dec 17, 2007 6:43:01 PM ART
Thread Writer_000002 completed at Dec 17, 2007 6:43:02 PM ART
Documents Queued = 583141. Docuuments De-Queued = 583141. Documents Written = 583141.
Writer_000001,Dec 17, 2007 6:43:01 PM ART,Dec 17, 2007 8:12:12 PM ART,293139,805,110,181
Writer_000002,Dec 17, 2007 6:43:02 PM ART,Dec 17, 2007 8:12:12 PM ART,290002,806,416,309

It took around an hour and a half to upload 583,141 pages and 1.61 GB.

Note: this import example works only on 11g databases, because many WikiPedia pages are larger than 64 KB of XML, which is a limitation of 10g databases.


Adding more SQL-like functionality to the WikiDump XML information.

Adding a primary key constraint using XPath /page/id information:

alter table pages add CONSTRAINT pages_pkey PRIMARY KEY (XMLDATA."id");

This constraint will be helpful for queries like:

SELECT extract(object_value,'/page/revision/text/text()') "text"
FROM ESWIKI.PAGES where extractValue(object_value,'/page/id')=10

which produces an execution plan like:

Operation                                                              Cost  Cardinality  Bytes  Access / Filter Predicates
SELECT STATEMENT (ALL_ROWS)                                               1            1     43
SORT (AGGREGATE)                                                                       1   2069
NESTED LOOPS
NESTED LOOPS                                                              3            1   2069
TABLE ACCESS (BY INDEX ROWID) ESWIKI.SYS_NTQY2zg8+I6PngQAB/AQAq6g==       2            1     37
INDEX (RANGE SCAN) ESWIKI.SYS_C007197                                     2            1          access: "NESTED_TABLE_ID"=:B1
INDEX (UNIQUE SCAN) ESWIKI.SYS_C007203                                    0            1          access: "SYS_ALIAS_0"."COLUMN_VALUE"="PAGES_REVISIONS"."SYS_NC_OID$"
TABLE ACCESS (BY INDEX ROWID) ESWIKI.PAGES_REVISIONS                      1            1   2032   filter: SYS_CHECKACL("ACLOID","OWNERID",xmltype(''))=1
TABLE ACCESS (BY INDEX ROWID) ESWIKI.PAGES                                1            1     43   filter: SYS_CHECKACL("ACLOID","OWNERID",xmltype(''))=1
INDEX (UNIQUE SCAN) ESWIKI.PAGES_PKEY (ANALYZED)                          0            1          access: "PAGES"."SYS_NC00010$"=10


And a unique constraint to prevent duplicate revision pages and to give faster access to a specific revision:

alter table pages_revisions add CONSTRAINT pages_revisions_ukey UNIQUE (XMLDATA."id");

The pages_revisions table already has a primary key generated by the master-detail relation discovered by the XSD-to-object-relational transformation, but with this unique constraint you can query the pages_revisions table directly if you want to look up the content of a specific revision:

SELECT extract(object_value,'/revision/text/text()') "text"
FROM ESWIKI.PAGES_REVISIONS where extractValue(object_value,'/revision/id')=13207462

which produces this execution plan:


Operation                                                     Cost  Cardinality  Bytes  Access / Filter Predicates
SELECT STATEMENT (ALL_ROWS)                                      1            1   2035
TABLE ACCESS (BY INDEX ROWID) ESWIKI.PAGES_REVISIONS             1            1   2035  filter: SYS_CHECKACL("ACLOID","OWNERID",xmltype(''))=1
INDEX (UNIQUE SCAN) ESWIKI.PAGES_REVISIONS_UKEY (ANALYZED)       0            1         access: "PAGES_REVISIONS"."SYS_NC00009$"=13207462


Finally, we can index it with Lucene Domain Index using:

create index pages_lidx on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('LogLevel:INFO;Stemmer:Spanish;ExtraCols:extractValue(object_value,''/page/title'') "title",extract(object_value,''/page/revision/text/text()'') "text";IncludeMasterColumn:false;SyncMode:OnLine;BatchCount:1000;LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');

Here is a detailed explanation of the Lucene Domain Index parameters used:

  • LogLevel:INFO causes Lucene Domain Index to send logging information to the Oracle .trc files.
  • Stemmer:Spanish because, as you can see in the SAXLoader configuration file, we uploaded the Spanish WikiPedia dump.
  • ExtraCols:extractValue(object_value,''/page/title'') "title",extract(object_value,''/page/revision/text/text()'') "text" means we want to index only the title and text of the XML page node.
  • IncludeMasterColumn:false means we do not index the complete XML content of the PAGES object-type row, but we still want to be notified through the ODCI API when a row changes.
  • SyncMode:OnLine makes Lucene Domain Index perform a full scan of the PAGES table, enqueue all of its rowids for insert, and return control to the application; a DBMS_AQ callback then indexes the PAGES table in the background.
  • BatchCount:1000 reduces the database resources consumed by the DBMS_AQ callback during the initial indexing: each batch takes 1000 rows, indexes them, and exits.
  • LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING is an Oracle 11g-specific parameter for using SecureFiles LOB storage.
Immediately after the create index command returns, you can start querying the Lucene Domain Index, because read-only access is permitted during the background build process. For example:

SQL> select LuceneDomainIndex.countHits('ESWIKI.PAGES_LIDX','title:Argentina') from dual;

LUCENEDOMAININDEX.COUNTHITS('ESWIKI.PAGES_LIDX','TITLE:ARGENTINA')
------------------------------------------------------------------
117

After each batch of 1000 rows is done, this value will increase. Here is another example:

select lscore(1),extractValue(object_value,'/page/title') from pages
where lcontains(object_value,'rownum:[1 TO 5] AND title:Argentina AND text:indigenas',1)>0

1 Argentina
0.641046464443206787109375 Noroeste argentino
0.625011622905731201171875 Paraná (Argentina)
0.623125135898590087890625 Córdoba (Argentina)
0.601441204547882080078125 Categoría:Indígenas de Argentina

With its execution plan:

Operation                                        Cost  Cardinality  Bytes      Access / Filter Predicates
SELECT STATEMENT (ALL_ROWS)                      3658        28633  281891885
TABLE ACCESS (BY INDEX ROWID) ESWIKI.PAGES       3658        28633  281891885  filter: SYS_CHECKACL("ACLOID","OWNERID",xmltype(''))=1
DOMAIN INDEX ESWIKI.PAGES_LIDX                                                 access: "LUCENE"."LCONTAINS"(SYS_MAKEXML('418347400648C734E040A8C0030115A3',4889,"XMLEXTRA","XMLDATA"),'rownum:[1 TO 5] AND title:Argentina AND text:indigenas',1)>0



Note: we are querying for pages that contain "indigenas" in the text node. Although this node is stored in the PAGES_REVISIONS table, the query can be resolved without accessing the detail table: all the information is stored in the Lucene inverted structure, and the lcontains operator returns the correct set of rowids to the Oracle engine, which then performs a fetch by index rowid, a faster lookup plan. Also, we are using the pagination information injected into the lcontains operator, which eliminates the need for Oracle's Top-N syntax.
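For comparison, without the pagination injected into lcontains the same result would need Oracle's classic Top-N pattern, roughly like this sketch (illustrative only; it assumes ordering by lscore mirrors Lucene's relevance order):

-- classic Top-N pagination done by the database instead of inside the Lucene query
select * from (
  select lscore(1) score, extractValue(object_value, '/page/title') title
    from pages
   where lcontains(object_value, 'title:Argentina AND text:indigenas', 1) > 0
   order by lscore(1) desc
) where rownum <= 5;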




7 comments:

marco.gralike said...

Marcelo, Great Stuff!

Now that the procedure stands - are you going to evaluate Oracle Binary XML storage contra OR storage as well (same goes for Lucene contra XMLIndex)... ?

I would be very interested in the results...

As said,

Thanks!

Marcelo Ochoa said...

Hi Marco:
Sure; I am loading WikiDumps to get a good set of free-text documents to measure performance and compare it to Oracle Text, for example.
Now I am indexing the docs... but as I said in the post, I have modest hardware, so it will take a long time.
On top of that, my SGA is not fine-tuned for a 2 GB RAM machine :( I have to shut down and start up Oracle to prevent the SGA from being swapped in and out by the OS.
Luckily, Lucene Domain Index has scheduled all the docs for indexing; I can shut down the database at any time and the process will restart automatically after startup and continue from the last checkpoint.
The XML DB storage used is almost binary, except for the text node, which is a CLOB, but I won't use XPath extraction on the free-edited text ;)
I'll do some comparisons with Oracle Text, but some functionality that Lucene provides, such as range queries or Lucene Domain Index functional columns, is not available in the latest Text implementation.
Surely Text will index all the docs faster than Lucene Domain Index, but I think that for complex queries Lucene will do the job better.
Best regards, Marcelo.

Мария said...

Take a look at WikiXMLDB

http://wikixmldb.dyndns.org/

It provides a way to query Wikipedia in XQuery.

The Wikipedia dump was parsed into XML and loaded into the Sedna XML database. Now you have the flexibility and power of XQuery applied to rich Wikipedia content!


Marco Gralike said...

For people who still need the original SAX Loader sources, they can now also be downloaded here:

http://www.liberidu.com/blog/?p=470

Marcelo Ochoa said...

Hi Marco:
Thanks for adding a download location for the SAXLoader utility.
I know many people who use it.
Actually, going up a few steps manually in the OTN path, there is also a link to download the code from OTN:
http://www.oracle.com/technology/sample_code/tech/xml/xmldb/index.html
Best regards, Marcelo.