21 September, 2025

AI / LLM : Semantic Similarity Search using Hugging Face with Vectors in an Oracle 23ai Database

As I have recently gone through a course on LLMs for Chatbots and RAG on OCI (Oracle Cloud Infrastructure), I thought I would put together a quick demo that uses :

A  The Hugging Face all-MiniLM-L12-v2 model (with some augmentations done by Oracle, packaged for download)

B  The ONNX (Open Neural Network Exchange) Runtime Engine that runs inside Oracle 23ai (and, yes, in the downloadable 23ai Free edition, not requiring the Oracle Cloud)

C  Vectors stored in an Oracle Database using the VECTOR Datatype

D  Queries that use the Hugging Face model to generate Vectors for Similarity / Semantic Searches



but DOES NOT USE Oracle Cloud Infrastructure.  (To use OCI for Embedding -- i.e. generating "Vectors" from "Tokens" -- I would have to book a Dedicated AI Cluster to "Fine-Tune" an existing model, so I am currently staying away from this expensive method)

[Similarly, I cannot currently build a Generative AI demo, as I would have to set up a model and server resources]


The sequence is :

1. Setup an Oracle 23ai database (either in the Oracle Cloud OR on an Exadata machine  OR the Free edition for On-Premises testing)

2. Create a User to hold the source data and vectors

3. Import the augmented Hugging Face model

4. Load the Source Data as words or sentences or chunks -- here I use a few sentences

5. Compute Vectors for the target column

6. Optional but recommended : Build a Vector Index.  For this demo, with very few rows, I haven't built a Vector Index; if and when I build a demo with a few hundred or thousand sentences or chunks of data, I will create an Index on the Vectors

7. Use Oracle's VECTOR_DISTANCE  Function for a Semantic Similarity Search


So here goes :


User and Grants :
Using the downloadable 23ai FREE VM, I created the user with these grants :

grant connect to vector_demo identified by password ;

alter user vector_demo default tablespace users ;

alter user vector_demo quota unlimited on users ;

grant DB_DEVELOPER_ROLE to vector_demo ;

grant CREATE MINING MODEL to vector_demo;

CREATE OR REPLACE DIRECTORY DM_DUMP as '/tmp/Load_Model';  -- on the Linux VM

GRANT READ ON DIRECTORY dm_dump TO vector_demo;

GRANT WRITE ON DIRECTORY dm_dump TO vector_demo;

Download and Import the Model :

Download the ZIP file from https://adwc4pm.objectstorage.us-ashburn-1.oci.customer-oci.com/p/VBRD9P8ZFWkKvnfhrWxkpPe8K03-JIoM5h_8EJyJcpE80c108fuUjg7R5L5O7mMZ/n/adwc4pm/b/OML-Resources/o/all_MiniLM_L12_v2_augmented.zip  (it also contains a README)

Place the extracted ONNX file on the Linux Server at /tmp/Load_Model :

SQL> !ls -l /tmp/Load_Model
total 249932
-rw-r--r--. 1 oracle oinstall 122597346 Jul 15  2024 all_MiniLM_L12_v2_augmented.zip
-rw-r--r--. 1 oracle oinstall 133322334 Jul 15  2024 all_MiniLM_L12_v2.onnx
-rw-r--r--. 1 oracle oinstall      4232 Jul 15  2024 README-ALL_MINILM_L12_V2-augmented.txt

SQL>

If necessary, delete the previously imported model from the database (logged in to the SQL command line as vector_demo) :

exec DBMS_VECTOR.DROP_ONNX_MODEL(model_name => 'ALL_MINILM_L12_V2_AUGMENTED', force => true);

Import the model file :

SQL> EXECUTE DBMS_VECTOR.LOAD_ONNX_MODEL(-
directory=>'DM_DUMP',-
file_name=>'all_MiniLM_L12_v2.onnx',-
model_name=>'ALL_MINILM_L12_V2_AUGMENTED',-
metadata => JSON('{"function" : "embedding", "embeddingOutput" : "embedding", "input": {"input": ["DATA"]}}')-
);

The "model_name" is a User-Defined Name (i.e. defined by the DBA). In many online demos, I see the model name specified as "doc_model". I preferred to use the name "ALL_MINILM_L12_V2_AUGMENTED". You could use any meaningful name.

Verify the imported model with these queries :

SQL> l
  1  SELECT model_name, attribute_name, attribute_type, data_type, vector_info
  2  FROM user_mining_model_attributes
  3  WHERE model_name = 'ALL_MINILM_L12_V2_AUGMENTED'
  4* ORDER BY attribute_name
SQL> /

MODEL_NAME                       ATTRIBUTE_NAME   ATTRIBUTE_TY DATA_TYPE
-------------------------------- ---------------- ------------ ----------------
VECTOR_INFO
--------------------------------------------------------
ALL_MINILM_L12_V2_AUGMENTED      DATA             TEXT         VARCHAR2

ALL_MINILM_L12_V2_AUGMENTED      ORA$ONNXTARGET   VECTOR       VECTOR
VECTOR(384,FLOAT32)

SQL>
SQL> l
  1  SELECT model_name, mining_function, algorithm,
  2  algorithm_type, model_size
  3  FROM user_mining_models
  4  WHERE model_name = 'ALL_MINILM_L12_V2_AUGMENTED'
  5* ORDER BY model_name
SQL> /

MODEL_NAME                       MINING_FUNCTION                ALGORITHM                      ALGORITHM_ MODEL_SIZE
-------------------------------- ------------------------------ ------------------------------ ---------- ----------
ALL_MINILM_L12_V2_AUGMENTED      EMBEDDING                      ONNX                           NATIVE      133322334

SQL>
SQL> SELECT * FROM DM$VMALL_MINILM_L12_V2_AUGMENTED ORDER BY name;

NAME               VALUE
------------------ ------------------------------------------------
Graph Description  Graph combining tokenizer and main_graph
                   tokenizer main_graph
Graph Name         tokenizer_main_graph
Input[0]           input:string[?]
Output[0]          embedding:float32[?,384]
Producer Name      onnx.compose.merge_models
Version            1

6 rows selected.

SQL>
SQL> SELECT * FROM DM$VPALL_MINILM_L12_V2_AUGMENTED ORDER BY name;

NAME               VALUE
------------------ ------------------------------------------------
embeddingOutput    embedding
function           embedding

SQL>
SQL> SELECT * FROM DM$VJALL_MINILM_L12_V2_AUGMENTED;

METADATA
--------------------------------------------------------------------------------
{"function":"embedding","embeddingOutput":"embedding","input":{"input":["DATA"]}

SQL>

Test generation of Embeddings with the Model :


SQL> SELECT VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA) AS embedding;

EMBEDDING
------------------------------------------------------------------------------------------------------------------------------------
[1.65517051E-002,3.19098569E-002,-1.96293015E-002,-3.56926955E-002,

SQL>
SQL> SELECT VECTOR_DIMS(VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA)) as Dimensions_Count;

DIMENSIONS_COUNT
----------------
             384

SQL>

note : The output is actually a list of 384 dimensions. I copied the whole output from SQL Developer, put it into a Python list, and then did a len(list) to count the number of elements.
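That counting step can be scripted instead of done by hand. A minimal sketch -- the vector literal below is a truncated, illustrative stand-in for the real 384-element output, not actual model output :

```python
import json

# The VECTOR output of VECTOR_EMBEDDING renders as a JSON-style array
# literal.  This is a truncated, made-up example of such a literal:
vector_literal = "[1.65517051E-002,3.19098569E-002,-1.96293015E-002,-3.56926955E-002]"

# json.loads parses the scientific-notation floats directly into a list
dims = json.loads(vector_literal)

# len() gives the dimension count; the real all-MiniLM-L12-v2 output has 384
print(len(dims))
```

For the full output pasted from SQL Developer, len(dims) would report 384.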

Create the Table for the source data (sentences of text in my demo)  and load some data (sentences)


SQL> drop table my_data_source purge;

Table dropped.

SQL> create table my_data_source(id number primary key, my_sentence varchar2(4000));

Table created.

SQL> insert into my_data_source values (1,'The quick brown fox jumped over the lazy dog');

1 row created.

SQL> insert into my_data_source values (2,'she sells sea-shells at the seashore');

1 row created.

SQL> insert into my_data_source values (3,'the fox and dog are brown friends');

1 row created.

SQL> insert into my_data_source values (4,'the elephant knows the fox and dog');

1 row created.

SQL> insert into my_data_source values (5,'the fox, dog and elephant live together');

1 row created.

SQL> insert into my_data_source values (6,'aeroplanes fly in the sky');

1 row created.

SQL> commit;

Commit complete.

SQL>
SQL> select * from my_data_source;

        ID
----------
MY_SENTENCE
------------------------------------------------------------------------------------------------------------------------------------
         1
The quick brown fox jumped over the lazy dog

         2
she sells sea-shells at the seashore

         3
the fox and dog are brown friends

         4
the elephant knows the fox and dog

         5
the fox, dog and elephant live together

         6
aeroplanes fly in the sky


6 rows selected.

SQL>



Create the Table to hold the computed Vectors, using the model that has been loaded


SQL> drop table my_data_vectors;

Table dropped.

SQL> create table my_data_vectors (id number primary key, my_sentence varchar2(4000), sentence_vector vector);

Table created.

SQL> 
SQL> insert into my_data_vectors
select ds.id as id, my_sentence, to_vector(et.embed_vector) sentence_vector
from
    my_data_source ds,
    dbms_vector_chain.utl_to_embeddings(
       dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(ds.my_sentence), json('{"normalize":"all"}')),
       json('{"provider":"database", "model":"ALL_MINILM_L12_V2_AUGMENTED"}')) t,
    JSON_TABLE(t.column_value, '$[*]' COLUMNS (embed_id NUMBER PATH '$.embed_id', embed_data VARCHAR2(4000) PATH '$.embed_data', embed_vector CLOB PATH '$.embed_vector')) et;
6 rows created.

SQL>
SQL> commit
  2  /

Commit complete.

SQL> select * from my_data_vectors;

        ID
----------
MY_SENTENCE
------------------------------------------------------------------------------------------------------------------------------------
SENTENCE_VECTOR
------------------------------------------------------------------------------------------------------------------------------------
         1
The quick brown fox jumped over the lazy dog
[1.70537992E-003,5.95331714E-002,-2.32173726E-002,3.02353837E-002,

         2
she sells sea-shells at the seashore
[1.2790652E-002,5.97022101E-002,8.24511051E-002,3.93822305E-002,

         3
the fox and dog are brown friends
[7.67209902E-002,-7.82399923E-002,-2.40087509E-002,-2.37390138E-002,

         4
the elephant knows the fox and dog
[4.04452085E-002,-4.15055361E-003,-1.71641614E-002,2.63017584E-002,

         5
the fox, dog and elephant live together
[9.44276601E-002,-4.74944711E-002,2.31287945E-002,2.9055763E-002,

         6
aeroplanes fly in the sky
[1.1147093E-001,2.18752325E-002,-4.58196662E-002,-2.64751501E-002,


6 rows selected.

SQL>
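The INSERT above chains three steps : extract text (utl_to_text), split it into chunks (utl_to_chunks), and embed each chunk (utl_to_embeddings). For a feel of the chunking step only, here is a toy fixed-size word chunker in Python -- this is NOT Oracle's actual chunking algorithm (which supports byte/character/vocabulary limits, overlap and normalization), just an illustration of why a long document becomes several vector rows :

```python
def chunk_words(text, max_words=4):
    """Split text into chunks of at most max_words words each.
    Toy stand-in for DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS, for illustration only."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

sentence = "The quick brown fox jumped over the lazy dog"
chunks = chunk_words(sentence)
print(chunks)
# each chunk would then be passed to the embedding model separately,
# producing one vector row per chunk
```

My demo sentences are short enough to fit in a single chunk each, which is why the INSERT produced exactly 6 rows.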


Do a Semantic Similarity Search for certain phrases :

SQL> l
  1  SELECT VECTOR_DISTANCE(
  2  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quicker yellow fox jumped' as DATA),
  4* COSINE) as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   1.76127E-001

SQL>
SQL> l
  1  SELECT VECTOR_DISTANCE(
  2  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'seashells' as DATA),
  4* COSINE) as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   9.81778E-001

SQL>
SQL> l
  1  SELECT VECTOR_DISTANCE(
  2  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'moon elephants' as DATA),
  4* COSINE)as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   7.51297E-001

SQL>

--- "The quicker yellow fox jumped" is not very far away from "The quick brown fox jumped"
--- "moon elephants" is farther away from "The quick brown fox jumped"
--- "seashells" is much farther still
--- likely because the prebuilt model recognises that "fox" and "elephant" have some similarity, both being animals

SQL> l
  1  SELECT VECTOR_DISTANCE(
  2  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3  VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'houshold electricals' as DATA),
  4* COSINE)as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   9.97313E-001

SQL>
-- "houshold electricals" is the farthest away of these phrases


SQL> l
  1  SELECT id , my_sentence, vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'sea-shells' as DATA), COSINE) as Vector_Distance
  2  FROM my_data_vectors
  3  ORDER BY vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'sea-shells' as DATA), COSINE)
  4* FETCH FIRST 2 ROWS ONLY
SQL> /

          ID MY_SENTENCE                                      VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
           2 she sells sea-shells at the seashore                3.65937E-001
           6 aeroplanes fly in the sky                           8.43676E-001

SQL>
SQL> l
  1  SELECT id , my_sentence, vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'she sells seashells' as DATA), COSINE) as Vector_Distance
  2  FROM my_data_vectors
  3  ORDER BY vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'she sells seashells' as DATA), COSINE)
  4* FETCH FIRST 2 ROWS ONLY
SQL> /

          ID MY_SENTENCE                                      VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
           2 she sells sea-shells at the seashore                1.46406E-001
           6 aeroplanes fly in the sky                           9.26212E-001

SQL>
--- If I refine my query from "sea-shells" to "she sells seashells", I get a better match, with the Vector Distance dropping from about 3.66*10^-1 to 1.46*10^-1
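The FETCH FIRST n ROWS ONLY pattern above is simply a nearest-neighbour ranking by cosine distance. A pure-Python sketch of the same idea, with tiny made-up 3-dimensional vectors standing in for the real 384-dimensional embeddings (the numbers are illustrative, not model output) :

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, the same quantity as VECTOR_DISTANCE(..., COSINE)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy stand-ins for the sentence_vector column (NOT real embeddings)
stored = {
    "she sells sea-shells at the seashore": [0.9, 0.1, 0.0],
    "aeroplanes fly in the sky":            [0.1, 0.8, 0.2],
    "the fox and dog are brown friends":    [0.2, 0.2, 0.9],
}

query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "she sells seashells"

# Equivalent of ORDER BY vector_distance(...) FETCH FIRST 2 ROWS ONLY
top2 = sorted(stored.items(), key=lambda kv: cosine_distance(query_vector, kv[1]))[:2]
for sentence, vec in top2:
    print(f"{cosine_distance(query_vector, vec):.4f}  {sentence}")
```

Without a Vector Index, the database does the same thing exhaustively over all rows; an index lets it approximate this ranking without scanning everything.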

SQL> l
  1  SELECT id , my_sentence, vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'quick fox' as DATA), COSINE) as Vector_Distance
  2  FROM my_data_vectors
  3  ORDER BY vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'quick fox' as DATA), COSINE)
  4* FETCH FIRST 2 ROWS ONLY
SQL> /

          ID MY_SENTENCE                                      VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
           1 The quick brown fox jumped over the lazy dog        3.22912E-001
           4 the elephant knows the fox and dog                  5.21092E-001

SQL>
--- "quick fox" has a better similarity to sentence 1 than to sentence 4


What a Similarity Search does is find the stored entry whose Vector is "closest" to the query -- here, the one with the lowest COSINE Distance. (Similarly, when an LLM "generates" the next word -- known as a "token" -- it picks from the candidates with the highest probability.)

COSINE Distance measures the Angle between the Vectors.
There are other Methods, such as DOT PRODUCT, EUCLIDEAN (L2), MANHATTAN (L1), etc.
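The metrics behave differently. A small sketch with toy 2-dimensional vectors (not real embeddings) showing that COSINE depends only on the angle between vectors, while EUCLIDEAN also sees their magnitudes :

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: depends only on the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def dot_product(a, b):
    """Raw dot product: grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """L2 (straight-line) distance between the vector endpoints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v = [1.0, 0.0]
w = [2.0, 0.0]   # same direction as v, twice the length
u = [0.0, 1.0]   # perpendicular to v

print(cosine_distance(v, w))   # 0.0 -- same direction, magnitude ignored
print(euclidean(v, w))         # 1.0 -- magnitude difference still counts
print(cosine_distance(v, u))   # 1.0 -- perpendicular vectors
```

Because sentence embeddings encode meaning in direction rather than length, COSINE is the usual choice for semantic search, as in the queries above.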


Some Useful References :
https://docs.oracle.com/en/database/oracle/oracle-database/23/vecse/sql-quick-start-using-vector-embedding-model-uploaded-database.html  

https://docs.oracle.com/en/database/oracle/oracle-database/23/vecse/import-pretrained-models-onnx-format-vector-generation-database.html

https://blogs.oracle.com/machinelearning/post/use-our-prebuilt-onnx-model-now-available-for-embedding-generation-in-oracle-database-23ai

https://adrianotanaka.com.br/index.php/2025/04/16/loading-an-onnx-model-for-oracle-ai-vector-search

https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2

https://onnx.ai/

https://docs.oracle.com/en-us/iaas/Content/generative-ai/home.htm

https://docs.oracle.com/en-us/iaas/Content/generative-ai-agents/home.htm

https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/vector_distance.html

https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/vector_embedding.html
