In the previous post, I had demonstrated how to load documents into an Oracle Database.
(Note : I am referring to documents as in PDF files, PowerPoint presentations, Word documents and text files in the traditional sense, not "JSON documents").
Here I demonstrate how to use Oracle Text to create indexes and make such documents searchable.
I assume that Oracle Text (the "CTXSYS" schema) has been installed when the database was created.
First, I grant the appropriate privileges.
SQL> alter session set container=pdb1; Session altered. SQL> GRANT EXECUTE ON CTXSYS.CTX_CLS TO MYDOCDB; Grant succeeded. SQL> GRANT EXECUTE ON CTXSYS.CTX_DDL TO MYDOCDB; Grant succeeded. SQL> GRANT EXECUTE ON CTXSYS.CTX_DOC TO MYDOCDB; Grant succeeded. SQL> GRANT EXECUTE ON CTXSYS.CTX_OUTPUT TO MYDOCDB; Grant succeeded. SQL> GRANT EXECUTE ON CTXSYS.CTX_QUERY TO MYDOCDB; Grant succeeded. SQL> GRANT EXECUTE ON CTXSYS.CTX_REPORT TO MYDOCDB; Grant succeeded. SQL> GRANT EXECUTE ON CTXSYS.CTX_THES TO MYDOCDB; Grant succeeded. SQL> GRANT EXECUTE ON CTXSYS.CTX_ULEXER TO MYDOCDB; Grant succeeded. SQL>
Next, I create a CONTEXT Index on my documents table.
SQL> connect mydocdb/mydocdb@pdb1 Connected. SQL> CREATE INDEX my_documents_ctx_idx ON my_documents(file_content) 2 INDEXTYPE IS CTXSYS.CONTEXT 3 / Index created. SQL>
My documents are now indexed and searchable. The syntax for a context index query is slightly different.
SQL> select doc_title, doc_description, file_type 2 from my_documents 3 where contains(file_content, 'DBA Role') > 0 4 / DOC_TITLE DOC_DESCRIPTION FILE_TYPE ------------------------------------------ ------------------------------------------ ---------------- Oracle 12c Security Security Mechanisms in Oracle 12c PDF Monitoring and Diagnostics without OEM Presentation at Oracle SOS Conference PDF SQL> SQL> select doc_title, doc_description, file_type 2 from my_documents 3 where contains(file_content, 'sqlplus hemant/hemant@orclpdb1') > 0 4 / DOC_TITLE DOC_DESCRIPTION FILE_TYPE ------------------------------------------ ------------------------------------------ ---------------- Flashback Database and DG DR Testing How to use Flasback Database in DataGuard DOCX for DR Testing SQL> SQL> select doc_title, doc_description, file_type 2 from my_documents 3 where contains(file_content, 'Flex ASM') > 0 4 / DOC_TITLE DOC_DESCRIPTION FILE_TYPE ------------------------------------------ ------------------------------------------ ---------------- Oracle 12c Availability High Availability Options in Oracle 12c PDF SQL> SQL> SQL> -- unfortunately, some queries on my plain-text SQL file don't work as expected SQL> SQL> select doc_title, doc_description, file_type 2 from my_documents 3 where contains(file_content, 'doc_number number generated always as identity,') > 0 4 / select doc_title, doc_description, file_type * ERROR at line 1: ORA-29902: error in executing ODCIIndexStart() routine ORA-20000: Oracle Text error: DRG-50901: text query parser syntax error on line 1, column 48 SQL> select doc_title, doc_description, file_type 2 from my_documents 3 where contains(file_content, 'doc_number number generated always as identity') > 0 4 / no rows selected SQL> select doc_title, doc_description, file_type 2 from my_documents 3 where contains(file_content, 'doc_number number generated always') > 0 4 / no rows selected SQL> select doc_title, doc_description, file_type 2 from my_documents 3 where contains(file_content, '(doc_number number generated always') > 0 4 / select doc_title, doc_description, file_type * ERROR at line 1: ORA-29902: error in executing ODCIIndexStart() routine ORA-20000: Oracle Text error: DRG-50901: text query parser syntax error on line 1, column 36 SQL> select doc_title, doc_description, file_type 2 from my_documents 3 where contains(file_content, 'drop table my_documents') > 0 4 / no rows selected SQL>
Also, apparently the inclusion of "(" and "," and such are tricky when creatimg a query.
Note : The "CONTAINS ... >0" clause is to check the SCORE returned by the filter in the CONTAINS section and to see that it is greater than zero.
And then sync the index and query again
The call to CTX_DDL.SYNC_INDEX contains a memory parameter '4M' specifying the memory to be used for synchronization.
Here I have created a CTXSYS.CONTEXT index and it works well for the PDF and DOCX files. This type of Index is *not* automatically updated when you load a new document. To update the Index you must call the "CTX_DDL.SYNC_INDEX" procedure.
Using the same method as in my previous blog post, I load a new PPT file (callled "Flashback.ppt").
C:\load_docs_to_oracle>sqlldr mydocdb/mydocdb@pdb1 control=my_documents.ctl SQL*Loader: Release 19.0.0.0.0 - Production on Sun Nov 14 17:04:55 2021 Version 19.12.0.0.0 Copyright (c) 1982, 2021, Oracle and/or its affiliates. All rights reserved. Path used: Conventional Commit point reached - logical record count 1 Table MY_DOCUMENTS: 1 Row successfully loaded. Check the log file: my_documents.log for more information about the load. C:\load_docs_to_oracle>
And then sync the index and query again
SQL> select doc_number, doc_title, file_type 2 from my_documents 3 order by doc_number 4 / DOC_NUMBER DOC_TITLE FILE_TYPE ---------- ------------------------------------------ ---------------- 1 Oracle 12c Availability PDF 2 Oracle 12c Security PDF 3 Monitoring and Diagnostics without OEM PDF 4 Setting up Data Guard in 11gR2 DOCX 5 Flashback Database and DG DR Testing DOCX 6 Create Docs Table SQL 7 Flashback PPT 7 rows selected. SQL> exec CTX_DDL.SYNC_INDEX('my_documents_ctx_idx','4M'); PL/SQL procedure successfully completed. SQL> SQL> select doc_title, doc_description, file_type 2 from my_documents 3 where contains(file_content, 'Thomas Kyte') > 0 4 / DOC_TITLE DOC_DESCRIPTION FILE_TYPE ------------------------------------------ ------------------------------------------ ---------------- Flashback Tom Kyte's Presentation on Flashback PPT SQL> select doc_title, doc_description, file_type 2 from my_documents 3 where contains(file_content, 'Rowids therefore will change') > 0 4 / DOC_TITLE DOC_DESCRIPTION FILE_TYPE ------------------------------------------ ------------------------------------------ ---------------- Flashback Tom Kyte's Presentation on Flashback PPT SQL>
The call to CTX_DDL.SYNC_INDEX contains a memory parameter '4M' specifying the memory to be used for synchronization.
Besides the CTXSYS.CONTEXT IndexType, Oracle also allows a "CTXSYS.CTXCAT" IndexType which can be used to index small documents or text fragements. The CTXSYS.CTXCAT Index is updated synchronously when a new document is loaded into the table (i.e. a call to CTX_DDL.SYNC_INDEX is not required).
Typically, with a document database with moderate to large documents, you would use CTXSYS.CONTEXT IndexType and configure a scheduler job to call CTX_DDL.SYNC_INDEX periodically (e.g. every 30minutes or every 4hours) to update the Index for new documents that are loaded infrequently and do not need to be searched immediately.
The IndexType automically excludes certain popular words called "STOPWORD"s that may be in "STOPLIST"s. These are very commonly used words that do not need to be indexed (e.g. words like "the" or "this" or "that". Such STOPLISTs can be customised as well.
So, your implementation of the indexes is customisable.
See the Oracle Text Application Developers Guide and Oracle Text Reference for more examples and a detailed study.
No comments:
Post a Comment