WikiMedia Download Web Site have XML dumps for all WikiPedia Web sites especially they have a very big compressed dumps of all the pages in English which his size is around 3Gbytes.
So with this very big XML file, We would like to load to a regular Oracle tables.
There is a plenty of utilities for uploading WikiPedia Dump to relational databases, but none of them are tailored for Oracle.
So, We can solve this missing part Creating XMLType Tables and Columns Based on XMLSchema and a SAX upload utility to upload the XML content.
Note: When you register an XML schema with Oracle XML DB, you can specify the storage model to use for the XML instance documents that reference the XML schema. If that specified storage model is binary XML, then you can only use that XML schema to create XMLType tables and columns that are stored as binary XML. Conversely, if the storage model specified for the registered XML schema is not binary XML, then you can only use that XML schema to create XMLType tables and columns that are not stored as binary XML, that is, stored object-relationally or as CLOB instances.
First, We need to annotate WikiPedia XSD export file to tell Oracle how to transform a page tag into an Object Type stored in a Object-Relational table, you can see a complete annotated XSD file here. It basically creates a new top element named "page" to give SAXLoader application a table to store all meta-data pages entries. It also creates two XMLDB self managed master detail relationship from pages to revisions and pages to uploads one to many relations.
You can register and creates all tables with:
BEGIN
DBMS_XMLSCHEMA.registerURI(
'http://www.mediawiki.org/xml/export-0.3/',
'/public/export-0.3.xsd',
LOCAL=>TRUE,GENTYPES=>TRUE, GENBEAN=>FALSE, GENTABLES=>TRUE);
END;
/
After executing registerURI procedure 3 new tables will be at your Oracle schema:
SQL> desc pages
Name Null? Type
----------------------------------------- -------- ----------------------------
TABLE of SYS.XMLTYPE(XMLSchema "http://www.mediawiki.org/xml/export-0.3/" Element "page") STORAGE Object-relational TYPE "PageType"
SQL> desc pages_revisions
Name Null? Type
----------------------------------------- -------- ----------------------------
TABLE of SYS.XMLTYPE(XMLSchema "http://www.mediawiki.org/xml/export-0.3/" Element "revision") STORAGE Object-relational TYPE "RevisionType"
SQL> desc pages_uploads
Name Null? Type
----------------------------------------- -------- ----------------------------
TABLE of SYS.XMLTYPE(XMLSchema "http://www.mediawiki.org/xml/export-0.3/" Element "upload") STORAGE Object-relational TYPE "UploadType"
SQL> desc "PageType"
"PageType" is NOT FINAL
Name Null? Type SYS_XDBPD$ XDB.XDB$RAW_LIST_T title VARCHAR2(4000 CHAR) id NUMBER(38) restrictions VARCHAR2(4000 CHAR) revision XDB.XDB$XMLTYPE_REF_LIST_T upload XDB.XDB$XMLTYPE_REF_LIST_T
SQL> desc "RevisionType"
"RevisionType" is NOT FINAL
Name Null? Type SYS_XDBPD$ XDB.XDB$RAW_LIST_T id NUMBER(38) timestamp VARCHAR2(4000 CHAR) contributor ContributorType comment VARCHAR2(4000 CHAR) text CLOB
SQL> desc "UploadType"
"UploadType" is NOT FINAL
Name Null? Type SYS_XDBPD$ XDB.XDB$RAW_LIST_T timestamp VARCHAR2(4000 CHAR) contributor ContributorType comment VARCHAR2(4000 CHAR) filename VARCHAR2(4000 CHAR) src VARCHAR2(4000 CHAR) size NUMBER(38)
Also annotated schema tells to Oracle that use a CLOB column for storing user edited text (column text of table pages_revisions).
Shearing WikiDumps XML files with an XMLDB annotated schema provides an object-relational storage for an unstructured data.
Finally for uploading any WikiPedia Dump XML file using OTN example SAXLoader We use a modified configuration file like:
<?xml version='1.0' encoding='windows-1252'?>
<Connection>
<Driver>OCI</Driver>
<Hostname>localhost</Hostname>
<Port>1521</Port>
<ServiceName>orcl</ServiceName>
<SID>orcl</SID>
<ServerMode>DEDICATED</ServerMode>
<Schema>eswiki</Schema>
<Password>eswiki</Password>
<SourceXML>/home/backup/eswiki-20071202-pages-articles.xml</SourceXML>
<Element>page</Element>
<SchemaURL>http://www.mediawiki.org/xml/export-0.3/</SchemaURL>
<Table>PAGES</Table>
<ThreadCount>2</ThreadCount>
<CommitCharge>25</CommitCharge>
<Logger>FILE</Logger>
</Connection>
Using a modest hardware and uploading Spanish WikiPedia Dump We can see upload output as:
[mochoa@p1 build]$ more TraceOutputFile-20074317184301.log
Connection Settings =
<?xml version = '1.0' encoding = 'windows-1252'?>
<Connection>
<Driver>OCI</Driver>
<Hostname>localhost</Hostname>
<Port>1521</Port>
<ServiceName>orcl</ServiceName>
<SID>orcl</SID>
<ServerMode>DEDICATED</ServerMode>
<Schema>eswiki</Schema>
<Password>eswiki</Password>
<SourceXML>/home/backup/eswiki-20071202-pages-articles.xml</SourceXML>
<Element>page</Element>
<SchemaURL>http://www.mediawiki.org/xml/export-0.3/</SchemaURL>
<Table>PAGES</Table>
<ThreadCount>2</ThreadCount>
<CommitCharge>25</CommitCharge>
<Logger>FILE</Logger>
<StartCount>0</StartCount>
<EndCount>10</EndCount>
</Connection>
Driver = OCI
Hostname = localhost
Port = 1521
ServiceName = orcl
Server = DEDICATED
Schema = eswiki
Password = eswiki
Logger = FILE
Database URL = jdbc:oracle:oci8:@(description=(address=(host=localhost)(protocol=tcp)(port=1521))(connect_data=(service_name=orcl)(server=DEDI
CATED)))
BaseApplication.getConnection(): Connecting as eswiki/eswiki@jdbc:oracle:oci8:@(description=(address=(host=localhost)(protocol=tcp)(port=1521)
)(connect_data=(service_name=orcl)(server=DEDICATED)))
BaseApplication.getConnection(): Database Connection Established
BaseApplication.getConnection(): Connecting as eswiki/eswiki@jdbc:oracle:oci8:@(description=(address=(host=localhost)(protocol=tcp)(port=1521)
)(connect_data=(service_name=orcl)(server=DEDICATED)))
Thread Writer_000001 started at Dec 17, 2007 6:43:01 PM ART
Thread Writer_000001 using SQL : insert into PAGES values (xmltype(?).createSchemaBasedXML('http://www.mediawiki.org/xml/export-0.3/'))
BaseApplication.getConnection(): Database Connection Established
Thread Writer_000002 started at Dec 17, 2007 6:43:02 PM ART
Thread Writer_000002 using SQL : insert into PAGES values (xmltype(?).createSchemaBasedXML('http://www.mediawiki.org/xml/export-0.3/'))
Thread Writer_000002 committed after 25 records.
..............
Thread Writer_000002 committed after 25 records.
Thread Writer_000001 completed at Dec 17, 2007 6:43:01 PM ART
Thread Writer_000002 completed at Dec 17, 2007 6:43:02 PM ART
Documents Queued = 583141. Docuuments De-Queued = 583141. Documents Written = 583141.
Writer_000001,Dec 17, 2007 6:43:01 PM ART,Dec 17, 2007 8:12:12 PM ART,293139,805,110,181
Writer_000002,Dec 17, 2007 6:43:02 PM ART,Dec 17, 2007 8:12:12 PM ART,290002,806,416,309
It took around of one hour and a half for uploading 583141 pages and 1.61Gbytes.
Note: this importing example works only on 11g databases because many WikiPedia pages have more than 64K of XML size and this is a limitation on 10g databases.
Adding more SQL like functionality to WikiDump XML information.
Adding a primary key constraint using XPath /page/id information:
alter table pages add CONSTRAINT pages_pkey PRIMARY KEY (XMLDATA."id");
This constraint will be helpful for queries like:
SELECT extract(object_value,'/page/revision/text/text()') "text"
FROM ESWIKI.PAGES where extractValue(object_value,'/page/id')=10
which produces an execution plan like:
| Optimizer | Cost | Cardinality | Bytes | Partition Start | Partition Stop | Partition Id | ACCESS PREDICATES | FILTER PREDICATES | |
|---|---|---|---|---|---|---|---|---|---|
| SELECT STATEMENT | ALL_ROWS | 1 | 1 | 43 | |||||
| SORT(AGGREGATE) | 1 | 2069 | |||||||
| NESTED LOOPS | |||||||||
| NESTED LOOPS | 3 | 1 | 2069 | ||||||
| TABLE ACCESS(BY INDEX ROWID) ESWIKI.SYS_NTQY2zg8+I6PngQAB/AQAq6g== | 2 | 1 | 37 | ||||||
| INDEX(RANGE SCAN) ESWIKI.SYS_C007197 | 2 | 1 | "NESTED_TABLE_ID"=:B1 | ||||||
| INDEX(UNIQUE SCAN) ESWIKI.SYS_C007203 | 0 | 1 | "SYS_ALIAS_0"."COLUMN_VALUE"="PAGES_REVISIONS"."SYS_NC_OID$" | ||||||
| TABLE ACCESS(BY INDEX ROWID) ESWIKI.PAGES_REVISIONS | 1 | 1 | 2032 | SYS_CHECKACL("ACLOID","OWNERID",xmltype(''))=1 | |||||
| TABLE ACCESS(BY INDEX ROWID) ESWIKI.PAGES | 1 | 1 | 43 | SYS_CHECKACL("ACLOID","OWNERID",xmltype(''))=1 | |||||
| INDEX(UNIQUE SCAN) ESWIKI.PAGES_PKEY | ANALYZED | 0 | 1 | "PAGES"."SYS_NC00010$"=10 |
and a unique constraint to prevent duplicate revision pages and faster access to an specific revision:
alter table pages_revisions add CONSTRAINT pages_revisions_ukey UNIQUE (XMLDATA."id");
pages_revisions table already have a primary key generated by the master detail relation discovered by the XSD to object-relational transformation, but with this unique constraint you can query directly pages_revisions table if you want to look-up an specific revision content:
SELECT extract(object_value,'/revision/text/text()') "text"
FROM ESWIKI.PAGES_REVISIONS where extractValue(object_value,'/revision/id')=13207462
which produces this execution plan:
| Optimizer | Cost | Cardinality | Bytes | Partition Start | Partition Stop | Partition Id | ACCESS PREDICATES | FILTER PREDICATES | |
|---|---|---|---|---|---|---|---|---|---|
| SELECT STATEMENT | ALL_ROWS | 1 | 1 | 2035 | |||||
| TABLE ACCESS(BY INDEX ROWID) ESWIKI.PAGES_REVISIONS | 1 | 1 | 2035 | SYS_CHECKACL("ACLOID","OWNERID",xmltype(''))=1 | |||||
| INDEX(UNIQUE SCAN) ESWIKI.PAGES_REVISIONS_UKEY | ANALYZED | 0 | 1 | "PAGES_REVISIONS"."SYS_NC00009$"=13207462 |
Finally We can index with Lucene Domain Index using:
create index pages_lidx on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('LogLevel:INFO;Stemmer:Spanish;ExtraCols:extractValue(object_value,''/page/title'') "title",extract(object_value,''/page/revision/text/text()'') "text";IncludeMasterColumn:false;SyncMode:OnLine;BatchCount:1000;LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');
Here a detailed parameter usage at Lucene Domain Index:
- LogLevel:INFO, causes that Lucene Domain Index sent logging information to Oracle .trc files
- Stemmer:Spanish, as you can see at SAXLoader configuration file We had uploaded Spanish WikiPedia Dump.
- ExtraCols:extractValue(object_value,''/page/title'') "title",extract(object_value,''/page/revision/text/text()'') "text", We would like to index only title and text of XML page node.
- IncludeMasterColumn:false, We don't want to index all XML content of PAGES's object type row, but We would like to be notified by ODCI API when this row is changed.
- SyncMode:OnLine, Lucene Domain index will performs a full scan on PAGES table en-queueing for insert all his rowids and returns a control to the application, then DBMS_AQ CallBack will index PAGES table in background.
- BatchCount:1000, to reduce Oracle databases resources consumed by DBMS_AQ Callback during initial indexing process batch process takes 1000 rows index them and exits.
- LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING, Oracle 11g specific parameter for using Secure LOB.
SQL> select LuceneDomainIndex.countHits('ESWIKI.PAGES_LIDX','title:Argentina') from dual;
LUCENEDOMAININDEX.COUNTHITS('ESWIKI.PAGES_LIDX','TITLE:ARGENTINA')
------------------------------------------------------------------
117
After each batch of 1000 rows is done this value will be increased. Here other example:
select lscore(1),extractValue(object_value,'/page/title') from pages
where lcontains(object_value,'rownum:[1 TO 5] AND title:Argentina AND text:indigenas',1)>01 Argentina
0.641046464443206787109375 Noroeste argentino
0.625011622905731201171875 Paraná (Argentina)
0.623125135898590087890625 Córdoba (Argentina)
0.601441204547882080078125 Categoría:Indígenas de Argentina
With his execution plan:
| Optimizer | Cost | Cardinality | Bytes | Partition Start | Partition Stop | Partition Id | ACCESS PREDICATES | FILTER PREDICATES | |
|---|---|---|---|---|---|---|---|---|---|
| SELECT STATEMENT | ALL_ROWS | 3658 | 28633 | 281891885 | |||||
| TABLE ACCESS(BY INDEX ROWID) ESWIKI.PAGES | 3658 | 28633 | 281891885 | SYS_CHECKACL("ACLOID","OWNERID",xmltype(''))=1 | |||||
| DOMAIN INDEX ESWIKI.PAGES_LIDX | "LUCENE"."LCONTAINS"(SYS_MAKEXML('418347400648C734E040A8C0030115A3',4889,"XMLEXTRA","XMLDATA"),'rownum:[1 TO 5] AND title:Argentina AND text:indigenas',1)>0 |
Note: We are querying for pages which contains "indigenas" at text node, but this node is stored at PAGES_REVISIONS table but the query can be resolved without accessing to details table, all the information is stored at Lucene inverted structure and lcontains operator returns the correct set of rowids to Oracle engine to performs a fetch operation by index rowid which is a faster execution look-up plan. Also, We are using pagination information injected at lcontains operators which eliminates usage of Top-N Oracle syntax.
7 comments:
Marcelo, Great Stuff!
Now that the procedure stands - are you going to evaluate Oracle Binary XML storage contra OR storage as well (same goes for Lucine contra XMLIndex)... ?
I would be very interested in the results...
As said,
Thanks!
Hi Marco:
Sure; I am loading WikiDumps to get a good set of free text documents to see performance and comparing then to Text for example.
Now I am indexing docs... But as I set at the post, I have a modest hardware so It will take a long time.
More on this, my SGA is not fine tuned to 2Gb RAM machine :( I have to shutdown and startup Oracle to prevent that SGA being swapin/swapout by the OS.
Luckily Lucene Domain Index have scheduled all docs for indexing, I can shutdown the database at any time and the process will start automatically after startup and continue at last checkpoint.
XMLDB storage used is almost binary, except for text node which is CLOB, but I'll not use XPath extraction at free edited text ;)
I'll do some comparison with Oracle Text but some functionality that Lucene provides such as range query or Lucene Domain Index functional columns are not available on latest Text implementation.
Sure Text will index all docs faster than Lucene Domain Index, but I think that for complex queries Lucene will do the job better.
Best regards, Marcelo.
Take a look at WikiXMLDB
http://wikixmldb.dyndns.org/
It provides a way to query Wikipedia in XQuery.
Wikipedia dump was parsed into XML and loaded into Sedna XML database. Now you have the flexibility and power of XQuery applied to rich Wikipedia content!
Take a look at WikiXMLDB
http://wikixmldb.dyndns.org/
It provides a way to query Wikipedia in XQuery.
Wikipedia dump was parsed into XML and loaded into Sedna XML database. Now you have the flexibility and power of XQuery applied to rich Wikipedia content!
For people who are still in need for the original SAX Loader sources, they now can also be downloaded here:
http://www.liberidu.com/blog/?p=470
Hi Marco:
Thanks for adding a download place to get a SAXLoader utility.
I know many people which use them.
Actually, going up a manually a few of step in OTN path there is a link to download the code from OTN too:
http://www.oracle.com/technology/sample_code/tech/xml/xmldb/index.html
Best regards, Marcelo.
Post a Comment