14 November, 2021

Indexing Documents in Oracle

 In the previous post, I had demonstrated how to load documents into an Oracle Database.  

(Note : I am referring to documents as in PDF files, PowerPoint presentations, Word documents and text files in the traditional sense, not "JSON documents").

Here I demonstrate how to use Oracle Text to create indexes and make such documents searchable.

I assume that Oracle Text (the "CTXSYS" schema) has been installed when the database was created.

First, I grant the appropriate privileges.



SQL> alter session set container=pdb1;

Session altered.

SQL> GRANT EXECUTE ON CTXSYS.CTX_CLS TO MYDOCDB;

Grant succeeded.

SQL> GRANT EXECUTE ON CTXSYS.CTX_DDL TO MYDOCDB;

Grant succeeded.

SQL> GRANT EXECUTE ON CTXSYS.CTX_DOC TO MYDOCDB;

Grant succeeded.

SQL> GRANT EXECUTE ON CTXSYS.CTX_OUTPUT TO MYDOCDB;

Grant succeeded.

SQL> GRANT EXECUTE ON CTXSYS.CTX_QUERY TO MYDOCDB;

Grant succeeded.

SQL> GRANT EXECUTE ON CTXSYS.CTX_REPORT TO MYDOCDB;

Grant succeeded.

SQL> GRANT EXECUTE ON CTXSYS.CTX_THES TO MYDOCDB;

Grant succeeded.

SQL> GRANT EXECUTE ON CTXSYS.CTX_ULEXER TO MYDOCDB;

Grant succeeded.

SQL>


Next, I create a CONTEXT Index on my documents table.

SQL> connect mydocdb/mydocdb@pdb1
Connected.
SQL> CREATE INDEX my_documents_ctx_idx ON my_documents(file_content)
  2  INDEXTYPE IS CTXSYS.CONTEXT
  3  /

Index created.

SQL>


My documents are now indexed and searchable. The syntax for a context index query is slightly different.

SQL> select doc_title, doc_description, file_type
  2  from my_documents
  3  where contains(file_content, 'DBA Role') > 0
  4  /

DOC_TITLE                                  DOC_DESCRIPTION                            FILE_TYPE
------------------------------------------ ------------------------------------------ ----------------
Oracle 12c Security                        Security Mechanisms in Oracle 12c          PDF
Monitoring and Diagnostics without OEM     Presentation at Oracle SOS Conference      PDF

SQL>
SQL> select doc_title, doc_description, file_type
  2  from my_documents
  3  where contains(file_content, 'sqlplus hemant/hemant@orclpdb1') > 0
  4  /

DOC_TITLE                                  DOC_DESCRIPTION                            FILE_TYPE
------------------------------------------ ------------------------------------------ ----------------
Flashback Database and DG DR Testing       How to use Flasback Database in DataGuard  DOCX
                                           for DR Testing


SQL>
SQL> select doc_title, doc_description, file_type
  2  from my_documents
  3  where contains(file_content, 'Flex ASM') > 0
  4  /

DOC_TITLE                                  DOC_DESCRIPTION                            FILE_TYPE
------------------------------------------ ------------------------------------------ ----------------
Oracle 12c Availability                    High Availability Options in Oracle 12c    PDF

SQL>
SQL>
SQL> -- unfortunately, some queries on my plain-text SQL file don't work as expected
SQL>
SQL> select doc_title, doc_description, file_type
  2  from my_documents
  3  where contains(file_content, 'doc_number number generated always as identity,') > 0
  4  /
select doc_title, doc_description, file_type
*
ERROR at line 1:
ORA-29902: error in executing ODCIIndexStart() routine
ORA-20000: Oracle Text error:
DRG-50901: text query parser syntax error on line 1, column 48


SQL> select doc_title, doc_description, file_type
  2  from my_documents
  3  where contains(file_content, 'doc_number number generated always as identity') > 0
  4  /

no rows selected

SQL> select doc_title, doc_description, file_type
  2  from my_documents
  3  where contains(file_content, 'doc_number number generated always') > 0
  4  /

no rows selected

SQL> select doc_title, doc_description, file_type
  2  from my_documents
  3  where contains(file_content, '(doc_number number generated always') > 0
  4  /
select doc_title, doc_description, file_type
*
ERROR at line 1:
ORA-29902: error in executing ODCIIndexStart() routine
ORA-20000: Oracle Text error:
DRG-50901: text query parser syntax error on line 1, column 36


SQL> select doc_title, doc_description, file_type
  2  from my_documents
  3  where contains(file_content, 'drop table my_documents') > 0
  4  /

no rows selected

SQL>


Also, apparently the inclusion of "(" and "," and such are tricky when creatimg a query.

Note : The "CONTAINS ... >0" clause is to check the SCORE returned by the filter in the CONTAINS section and to see that it is greater than zero.

Here I have created a CTXSYS.CONTEXT index and it works well for the PDF and DOCX files.  This type of Index is *not* automatically updated when you load a new document.  To update the Index you must call the "CTX_DDL.SYNC_INDEX" procedure.

Using the same method as in my previous blog post, I load a new PPT file (callled "Flashback.ppt").


C:\load_docs_to_oracle>sqlldr mydocdb/mydocdb@pdb1 control=my_documents.ctl

SQL*Loader: Release 19.0.0.0.0 - Production on Sun Nov 14 17:04:55 2021
Version 19.12.0.0.0

Copyright (c) 1982, 2021, Oracle and/or its affiliates.  All rights reserved.

Path used:      Conventional
Commit point reached - logical record count 1

Table MY_DOCUMENTS:
  1 Row successfully loaded.

Check the log file:
  my_documents.log
for more information about the load.

C:\load_docs_to_oracle>


And then sync the index and query again

SQL> select doc_number, doc_title, file_type
  2  from my_documents
  3  order by doc_number
  4  /

DOC_NUMBER DOC_TITLE                                  FILE_TYPE
---------- ------------------------------------------ ----------------
         1 Oracle 12c Availability                    PDF
         2 Oracle 12c Security                        PDF
         3 Monitoring and Diagnostics without OEM     PDF
         4 Setting up Data Guard in 11gR2             DOCX
         5 Flashback Database and DG DR Testing       DOCX
         6 Create Docs Table                          SQL
         7 Flashback                                  PPT

7 rows selected.

SQL> exec CTX_DDL.SYNC_INDEX('my_documents_ctx_idx','4M');

PL/SQL procedure successfully completed.

SQL>
SQL> select doc_title, doc_description, file_type
  2  from my_documents
  3  where contains(file_content, 'Thomas Kyte') > 0
  4  /

DOC_TITLE                                  DOC_DESCRIPTION                            FILE_TYPE
------------------------------------------ ------------------------------------------ ----------------
Flashback                                  Tom Kyte's Presentation on Flashback       PPT

SQL> select doc_title, doc_description, file_type
  2  from my_documents
  3  where contains(file_content, 'Rowids therefore will change') > 0
  4  /

DOC_TITLE                                  DOC_DESCRIPTION                            FILE_TYPE
------------------------------------------ ------------------------------------------ ----------------
Flashback                                  Tom Kyte's Presentation on Flashback       PPT

SQL>


The call to CTX_DDL.SYNC_INDEX contains a memory parameter '4M' specifying the memory to be used for synchronization.

Besides the CTXSYS.CONTEXT IndexType, Oracle also allows a "CTXSYS.CTXCAT" IndexType which can be used to index small documents or text fragements.  The CTXSYS.CTXCAT Index is updated synchronously when a new document is loaded into the table (i.e. a call to CTX_DDL.SYNC_INDEX is not required).

Typically, with a document database with moderate to large documents, you would use CTXSYS.CONTEXT IndexType and configure a scheduler job to call CTX_DDL.SYNC_INDEX periodically (e.g. every 30minutes or every 4hours) to update the Index for new documents that are loaded infrequently and do not need to be searched immediately.

The IndexType automically excludes certain popular words called "STOPWORD"s that may be in "STOPLIST"s.  These are very commonly used words that do not need to be indexed (e.g. words like "the" or "this" or "that".  Such STOPLISTs can be customised as well.
So, your implementation of the indexes is customisable.


See the Oracle Text Application Developers Guide and  Oracle Text Reference for more examples and a detailed study.


No comments: