21 September, 2025

AI / LLM : Semantic Similarity Search using Hugging Face with Vectors in an Oracle 23ai Database

 As I have recently gone through a course on LLMs for Chatbots and RAG on OCI (Oracle Cloud Infrastructure), I thought about putting together a quick demo that uses 

A The Hugging Face all-MiniLM-L12-v2 model (with some augmentations done by Oracle for download)

B  The ONNX (Open Neural Network Exchange) Runtime Engine that runs in Oracle 23ai (and, yes, in the 23ai Free downloadable edition, not requiring the Oracle Cloud)

C  Vectors stored in an Oracle Database using the Vector Datatype

D  The Vectors generated by the Hugging Face model for Similarity / Semantic Searches



but DOES NOT USE Oracle Cloud Infrastructure.  (To use OCI for Embedding -- that is, generating "Vectors" from "Tokens" -- I would have to book a Dedicated AI Cluster to "Fine Tune" an existing model, so I am currently staying away from this expensive method.)

[Similarly, I cannot currently build a Generative AI demo, as I would have to set up a model and server resources]


The sequence is :

1. Setup an Oracle 23ai database (either in the Oracle Cloud OR on an Exadata machine  OR the Free edition for On-Premises testing)

2. Create a User to hold the source data and vectors

3. Import the augmented Hugging Face model

4. Load the Source Data as words or sentences or chunks -- here I use a few sentences

5. Compute Vectors for the target column

6.  Optional but recommended : Build a Vector Index. For this demo, with very few rows, I haven't built a Vector Index; if and when I build a demo with a few hundred or thousand sentences or chunks of data, I will create an Index on the Vectors (a sketch of such an index appears just after this list)

7. Use Oracle's VECTOR_DISTANCE  Function for a Semantic Similarity Search
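
For reference, a Vector Index on the vectors table used later in this post might look like the sketch below. This is only a sketch, not something I have run in this demo; the index name, the NEIGHBOR PARTITIONS (IVF) organization and the TARGET ACCURACY value are my own choices :

-- Sketch only : an IVF ("Neighbor Partitions") Vector Index on the SENTENCE_VECTOR
-- column of the MY_DATA_VECTORS table that is created later in this post
create vector index my_data_vectors_vec_ix
on my_data_vectors (sentence_vector)
organization neighbor partitions
distance COSINE
with target accuracy 95;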


So here goes :


User and Grants :
Using the downloadable 23ai Free VM (connected to FREEPDB1), I created the user with these grants :

grant connect to vector_demo identified by password ;

alter user vector_demo default tablespace users ;

alter user vector_demo quota unlimited on users ;

grant DB_DEVELOPER_ROLE to vector_demo ;

grant CREATE MINING MODEL to vector_demo;

CREATE OR REPLACE DIRECTORY DM_DUMP as '/tmp/Load_Model';  -- on the Linux VM

GRANT READ ON DIRECTORY dm_dump TO vector_demo;

GRANT WRITE ON DIRECTORY dm_dump TO vector_demo;

Download and Import the Model :

Download the ZIP file from https://adwc4pm.objectstorage.us-ashburn-1.oci.customer-oci.com/p/VBRD9P8ZFWkKvnfhrWxkpPe8K03-JIoM5h_8EJyJcpE80c108fuUjg7R5L5O7mMZ/n/adwc4pm/b/OML-Resources/o/all_MiniLM_L12_v2_augmented.zip  (it also contains a README).

Place the extracted ONNX file on the Linux Server at /tmp/Load_Model :

SQL> !ls -l /tmp/Load_Model
total 249932
-rw-r--r--. 1 oracle oinstall 122597346 Jul 15  2024 all_MiniLM_L12_v2_augmented.zip
-rw-r--r--. 1 oracle oinstall 133322334 Jul 15  2024 all_MiniLM_L12_v2.onnx
-rw-r--r--. 1 oracle oinstall      4232 Jul 15  2024 README-ALL_MINILM_L12_V2-augmented.txt

SQL>

If necessary, delete the previously imported model from the database (logged in to the SQL command line as vector_demo) :

exec DBMS_VECTOR.DROP_ONNX_MODEL(model_name => 'ALL_MINILM_L12_V2_AUGMENTED', force => true);

Import the model file :

SQL> EXECUTE DBMS_VECTOR.LOAD_ONNX_MODEL(-
directory=>'DM_DUMP',-
file_name=>'all_MiniLM_L12_v2.onnx',-
model_name=>'ALL_MINILM_L12_V2_AUGMENTED',-
metadata => JSON('{"function" : "embedding", "embeddingOutput" : "embedding", "input": {"input": ["DATA"]}}')-
);

The "model_name" is the User Defined Name (i.e. defined by the DBA). In many online demos, I see the model name specified as "doc_model". I preferred to use the name "ALL_MINILM_L12_V2_AUGMENTED". You could use any meaningful name.

Verify the imported model with these queries :

SQL> l
  1  SELECT model_name, attribute_name, attribute_type, data_type, vector_info
  2  FROM user_mining_model_attributes
  3  WHERE model_name = 'ALL_MINILM_L12_V2_AUGMENTED'
  4* ORDER BY ATTRIBUTE_NAME
SQL> /

MODEL_NAME                       ATTRIBUTE_NAME   ATTRIBUTE_TY DATA_TYPE
-------------------------------- ---------------- ------------ ----------------
VECTOR_INFO
--------------------------------------------------------
ALL_MINILM_L12_V2_AUGMENTED      DATA             TEXT         VARCHAR2

ALL_MINILM_L12_V2_AUGMENTED      ORA$ONNXTARGET   VECTOR       VECTOR
VECTOR(384,FLOAT32)

SQL>
SQL> l
  1  SELECT MODEL_NAME, MINING_FUNCTION, ALGORITHM,
  2  ALGORITHM_TYPE, MODEL_SIZE
  3  FROM user_mining_models
  4  WHERE model_name = 'ALL_MINILM_L12_V2_AUGMENTED'
  5* ORDER BY MODEL_NAME
SQL> /

MODEL_NAME                       MINING_FUNCTION                ALGORITHM                      ALGORITHM_ MODEL_SIZE
-------------------------------- ------------------------------ ------------------------------ ---------- ----------
ALL_MINILM_L12_V2_AUGMENTED      EMBEDDING                      ONNX                           NATIVE      133322334

SQL>
SQL> SELECT * FROM DM$VMALL_MINILM_L12_V2_AUGMENTED ORDER BY NAME ;

NAME               VALUE
------------------ ------------------------------------------------
Graph Description  Graph combining tokenizer and main_graph
                   tokenizer main_graph
Graph Name         tokenizer_main_graph
Input[0]           input:string[?]
Output[0]          embedding:float32[?,384]
Producer Name      onnx.compose.merge_models
Version            1

6 rows selected.

SQL>
SQL> SELECT * FROM DM$VPALL_MINILM_L12_V2_AUGMENTED ORDER BY NAME;

NAME               VALUE
------------------ ------------------------------------------------
embeddingOutput    embedding
function           embedding

SQL>
SQL> SELECT * FROM DM$VJALL_MINILM_L12_V2_AUGMENTED;

METADATA
--------------------------------------------------------------------------------
{"function":"embedding","embeddingOutput":"embedding","input":{"input":["DATA"]}

SQL>

Test generation of Embeddings with the Model :


SQL> SELECT VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA) AS embedding;

EMBEDDING
------------------------------------------------------------------------------------------------------------------------------------
[1.65517051E-002,3.19098569E-002,-1.96293015E-002,-3.56926955E-002,

SQL>
SQL> SELECT VECTOR_DIMS(VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA)) as Dimensions_Count;

DIMENSIONS_COUNT
----------------
             384

SQL>

Note : The output is actually a list of 384 dimensions. I copied the whole output from SQL Developer, put it into a Python list, and then did a len(list) to count the number of elements.
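
As a further quick check -- a small sketch, using a function I believe is available in 23ai -- the storage format of the generated embedding can also be queried directly in SQL :

-- Sketch only : report the dimension format of the generated embedding (expected FLOAT32)
SELECT VECTOR_DIMENSION_FORMAT(
         VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA)
       ) as dimension_format;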

Create the Table for the source data (sentences of text in my demo)  and load some data (sentences)


SQL> drop table my_data_source purge;

Table dropped.

SQL> create table my_data_source(id number primary key, my_sentence varchar2(4000));

Table created.

SQL> insert into my_data_source values (1,'The quick brown fox jumped over the lazy dog');

1 row created.

SQL> insert into my_data_source values (2,'she sells sea-shells at the seashore');

1 row created.

SQL> insert into my_data_source values (3,'the fox and dog are brown friends');

1 row created.

SQL> insert into my_data_source values (4,'the elephant knows the fox and dog');

1 row created.

SQL> insert into my_data_source values (5,'the fox, dog and elephant live together');

1 row created.

SQL> insert into my_data_source values (6,'aeroplanes fly in the sky');

1 row created.

SQL> commit;

Commit complete.

SQL>
SQL> select * from my_data_source;

        ID
----------
MY_SENTENCE
------------------------------------------------------------------------------------------------------------------------------------
         1
The quick brown fox jumped over the lazy dog

         2
she sells sea-shells at the seashore

         3
the fox and dog are brown friends

         4
the elephant knows the fox and dog

         5
the fox, dog and elephant live together

         6
aeroplanes fly in the sky


6 rows selected.

SQL>



Create the Table to hold the computed Vectors, using the model that has been loaded


SQL> drop table my_data_vectors;

Table dropped.

SQL> create table my_data_vectors (id number primary key, my_sentence varchar2(4000), sentence_vector vector);

Table created.

SQL> 
SQL> insert into my_data_vectors
  2  select ds.id as id, my_sentence, to_vector(et.embed_vector) sentence_vector
  3  from
  4      my_data_source ds,
  5      dbms_vector_chain.utl_to_embeddings(
  6         dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(ds.my_sentence), json('{"normalize":"all"}')),
  7         json('{"provider":"database", "model":"ALL_MINILM_L12_V2_AUGMENTED"}')) t,
  8      JSON_TABLE(t.column_value, '$[*]' COLUMNS (embed_id NUMBER PATH '$.embed_id', embed_data VARCHAR2(4000) PATH '$.embed_data', embed_vector CLOB PATH '$.embed_vector')) et;
6 rows created.

SQL>
SQL> commit
  2  /

Commit complete.

SQL> select * from my_data_vectors;

        ID
----------
MY_SENTENCE
------------------------------------------------------------------------------------------------------------------------------------
SENTENCE_VECTOR
------------------------------------------------------------------------------------------------------------------------------------
         1
The quick brown fox jumped over the lazy dog
[1.70537992E-003,5.95331714E-002,-2.32173726E-002,3.02353837E-002,

         2
she sells sea-shells at the seashore
[1.2790652E-002,5.97022101E-002,8.24511051E-002,3.93822305E-002,

         3
the fox and dog are brown friends
[7.67209902E-002,-7.82399923E-002,-2.40087509E-002,-2.37390138E-002,

         4
the elephant knows the fox and dog
[4.04452085E-002,-4.15055361E-003,-1.71641614E-002,2.63017584E-002,

         5
the fox, dog and elephant live together
[9.44276601E-002,-4.74944711E-002,2.31287945E-002,2.9055763E-002,

         6
aeroplanes fly in the sky
[1.1147093E-001,2.18752325E-002,-4.58196662E-002,-2.64751501E-002,


6 rows selected.

SQL>
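
As an aside : since each source row in this small demo is a single short sentence that does not really need chunking, the same vectors could, I believe, also have been generated by calling VECTOR_EMBEDDING directly in the INSERT ... SELECT. This is only a sketch of that alternative, not what I actually ran above :

-- Sketch only (assumes each row is one short sentence, so no UTL_TO_CHUNKS is needed)
insert into my_data_vectors (id, my_sentence, sentence_vector)
select ds.id,
       ds.my_sentence,
       vector_embedding(ALL_MINILM_L12_V2_AUGMENTED using ds.my_sentence as DATA)
from   my_data_source ds;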


Do a Semantic Similarity Search for certain phrases :

SQL> l
  1  SELECT VECTOR_DISTANCE(
  2  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quicker yellow fox jumped' as DATA),
  4* COSINE) as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   1.76127E-001

SQL>
SQL> l
  1  SELECT VECTOR_DISTANCE(
  2  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'seashells' as DATA),
  4* COSINE) as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   9.81778E-001

SQL>
SQL> l
  1  SELECT VECTOR_DISTANCE(
  2  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'moon elephants' as DATA),
  4* COSINE)as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   7.51297E-001

SQL>

--- "the quicker yellow fox jumped" is not very far away from "the quick brown fox jumped"
--- "moon elephants" is far away from the "the quick brown fox jumped".
--- "seashells" is much further away.
--- likely because the prebuilt model recognises tnat "fox" and "elephant" have some similarity being animals

SQL> l
  1  SELECT VECTOR_DISTANCE(
  2  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'houshold electricals' as DATA),
  4* COSINE)as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   9.97313E-001

SQL>
-- "household" electricals is further away


SQL> l
  1  SELECT id , my_sentence, vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'sea-shells' as DATA), COSINE) as Vector_Distance
  2  FROM my_data_vectors
  3  ORDER BY vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'sea-shells' as DATA), COSINE)
  4* FETCH FIRST 2 ROWS ONLY
SQL> /

          ID MY_SENTENCE                                      VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
           2 she sells sea-shells at the seashore                3.65937E-001
           6 aeroplanes fly in the sky                           8.43676E-001

SQL>
SQL> l
  1  SELECT id , my_sentence, vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'she sells seashells' as DATA), COSINE) as Vector_Distance
  2  FROM my_data_vectors
  3  ORDER BY vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'she sells seashells' as DATA), COSINE)
  4* FETCH FIRST 2 ROWS ONLY
SQL> /

          ID MY_SENTENCE                                      VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
           2 she sells sea-shells at the seashore                1.46406E-001
           6 aeroplanes fly in the sky                           9.26212E-001

SQL>
--- If I refine my query from "sea-shells" to "she sells seashells", I can get a better match, with the Vector Distance dropping from 3.6*10^-1  to 1.4*10^-1

SQL> l
  1  SELECT id , my_sentence, vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'quick fox' as DATA), COSINE) as Vector_Distance
  2  FROM my_data_vectors
  3  ORDER BY vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'quick fox' as DATA), COSINE)
  4* FETCH FIRST 2 ROWS ONLY
SQL> /

          ID MY_SENTENCE                                      VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
           1 The quick brown fox jumped over the lazy dog        3.22912E-001
           4 the elephant knows the fox and dog                  5.21092E-001

SQL>
--- "quick fox" has a better similarity with sentence 1 than sentence 4


What LLMs do when "generating" the next word (known as a "token") is to find the "token" that has the higher probability -- which, in terms of similarity, corresponds to a lower COSINE Distance.

COSINE Distance measures the Angle between the Vectors.
There are other Methods, such as DOT PRODUCT, EUCLIDEAN (L2), MANHATTAN (L1) etc.
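
To illustrate, the same pair of phrases can be compared with different metrics by passing a different metric keyword to VECTOR_DISTANCE (COSINE, DOT, EUCLIDEAN and MANHATTAN are among the documented keywords). A small sketch, with the output not shown here :

-- Sketch only : the same two embeddings compared using different distance metrics
SELECT
  VECTOR_DISTANCE(
    VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
    VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quicker yellow fox jumped' as DATA),
    COSINE)    as cosine_distance,
  VECTOR_DISTANCE(
    VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
    VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quicker yellow fox jumped' as DATA),
    EUCLIDEAN) as euclidean_distance,
  VECTOR_DISTANCE(
    VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
    VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quicker yellow fox jumped' as DATA),
    DOT)       as dot_distance;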



21 August, 2025

Cross-Instance Deadlock in RAC

 I've created a video demonstration of a Cross-Instance Deadlock in a 2-node RAC environment, showing how to read the trace file.

Note : Only the SQL statement that receives the "deadlock" error fails; the session and the transaction are still valid.


06 August, 2025

An introduction to Oracle Cloud Infrastructure

 A 1-hour video


Description:

This session will cover the fundamentals of Oracle Cloud Infrastructure (OCI) including cloud concepts, OCI structures, and core OCI services. OCI service areas that will be discussed include Identity and Access Management (IAM), networking, compute, storage, and database. You will also learn about Users, Groups, and Policies, along with the core tenets of virtual network design that will help set the foundations for application deployment to OCI. We will also cover terms and definitions for each of the topic areas to establish a foundation for additional sessions.