As I have recently gone through a course on LLMs for Chatbots and RAG on OCI (Oracle Cloud Infrastructure), I thought about putting together a quick demo that uses
A The Hugging Face all-MiniLM-L12-v2 model (with some augmentations done by Oracle for download)
B The ONNX (Open Neural Network Exchange) Runtime Engine that runs in Oracle 23ai (and, yes, in the 23ai Free downloadable edition, not requiring the Oracle Cloud)
C Vectors stored in an Oracle Database using the Vector Datatype
D Use the Hugging Face model to generate Vectors for Similarity / Semantic Searches
but DOES NOT USE Oracle Cloud Infrastructure. (To use OCI for Embedding -- generating "Vectors" from "Tokens" -- I would have to book a Dedicated AI Cluster to "Fine-Tune" an existing model, so I am currently staying away from this expensive option.)
[Similarly, I cannot currently build a Generative AI demo, as I would have to set up a model and server resources]
The sequence is :
1. Setup an Oracle 23ai database (either in the Oracle Cloud OR on an Exadata machine OR the Free edition for On-Premises testing)
2. Create a User to hold the source data and vectors
3. Import the augmented Hugging Face model
4. Load the Source Data as words or sentences or chunks -- here I use a few sentences
5. Compute Vectors for the target column
6. Optional but recommended : Build a Vector Index. For this demo, with very few rows, I haven't built one; if and when I build a demo with a few hundred or thousand sentences or chunks of data, I will create an Index on the Vectors
7. Use Oracle's VECTOR_DISTANCE Function for a Semantic Similarity Search
So here goes :
Using the 23ai FREEPDB downloadable VM, I created the user with these grants :

grant connect to vector_demo identified by password;
alter user vector_demo default tablespace users;
alter user vector_demo quota unlimited on users;
grant DB_DEVELOPER_ROLE to vector_demo;
grant CREATE MINING MODEL to vector_demo;
CREATE OR REPLACE DIRECTORY DM_DUMP as '/tmp/Load_Model';  -- on the Linux VM
GRANT READ ON DIRECTORY dm_dump TO vector_demo;
GRANT WRITE ON DIRECTORY dm_dump TO vector_demo;
Download and Import the Model :
Download the ZIP file from https://adwc4pm.objectstorage.us-ashburn-1.oci.customer-oci.com/p/VBRD9P8ZFWkKvnfhrWxkpPe8K03-JIoM5h_8EJyJcpE80c108fuUjg7R5L5O7mMZ/n/adwc4pm/b/OML-Resources/o/all_MiniLM_L12_v2_augmented.zip (it also contains a README). Place the extracted ONNX file on the Linux Server at /tmp/Load_Model :

SQL> !ls -l /tmp/Load_Model
total 249932
-rw-r--r--. 1 oracle oinstall 122597346 Jul 15  2024 all_MiniLM_L12_v2_augmented.zip
-rw-r--r--. 1 oracle oinstall 133322334 Jul 15  2024 all_MiniLM_L12_v2.onnx
-rw-r--r--. 1 oracle oinstall      4232 Jul 15  2024 README-ALL_MINILM_L12_V2-augmented.txt

SQL>

If necessary, delete the previously imported model from the database (logged in to the SQL command line as vector_demo) :

exec DBMS_VECTOR.DROP_ONNX_MODEL(model_name => 'ALL_MINILM_L12_V2_AUGMENTED', force => true);

Import the model file :

SQL> EXECUTE DBMS_VECTOR.LOAD_ONNX_MODEL(-
       directory=>'DM_DUMP',-
       file_name=>'all_MiniLM_L12_v2.onnx',-
       model_name=>'ALL_MINILM_L12_V2_AUGMENTED',-
       metadata => JSON('{"function" : "embedding", "embeddingOutput" : "embedding", "input": {"input": ["DATA"]}}')-
     );

The "model_name" is a User Defined Name (i.e. chosen by the DBA). In many online demos, I see the model name specified as "doc_model"; I preferred to use the name "ALL_MINILM_L12_V2_AUGMENTED".
You could use any meaningful name.

Verify the imported model with these queries :

SQL> l
  1  SELECT model_name, attribute_name, attribute_type, data_type, vector_info
  2  FROM user_mining_model_attributes
  3  WHERE model_name = 'ALL_MINILM_L12_V2_AUGMENTED'
  4* ORDER BY ATTRIBUTE_NAME
SQL> /

MODEL_NAME                       ATTRIBUTE_NAME   ATTRIBUTE_TY DATA_TYPE
-------------------------------- ---------------- ------------ ----------------
VECTOR_INFO
--------------------------------------------------------
ALL_MINILM_L12_V2_AUGMENTED      DATA             TEXT         VARCHAR2

ALL_MINILM_L12_V2_AUGMENTED      ORA$ONNXTARGET   VECTOR       VECTOR
VECTOR(384,FLOAT32)

SQL>
SQL> l
  1  SELECT MODEL_NAME, MINING_FUNCTION, ALGORITHM,
  2  ALGORITHM_TYPE, MODEL_SIZE
  3  FROM user_mining_models
  4  WHERE model_name = 'ALL_MINILM_L12_V2_AUGMENTED'
  5* ORDER BY MODEL_NAME
SQL> /

MODEL_NAME                       MINING_FUNCTION                ALGORITHM                      ALGORITHM_ MODEL_SIZE
-------------------------------- ------------------------------ ------------------------------ ---------- ----------
ALL_MINILM_L12_V2_AUGMENTED      EMBEDDING                      ONNX                           NATIVE      133322334

SQL>
SQL> SELECT * FROM DM$VMALL_MINILM_L12_V2_AUGMENTED ORDER BY NAME;

NAME               VALUE
------------------ ------------------------------------------------
Graph Description  Graph combining tokenizer and main_graph
                   tokenizer main_graph
Graph Name         tokenizer_main_graph
Input[0]           input:string[?]
Output[0]          embedding:float32[?,384]
Producer Name      onnx.compose.merge_models
Version            1

6 rows selected.

SQL>
SQL> SELECT * FROM DM$VPALL_MINILM_L12_V2_AUGMENTED ORDER BY NAME;

NAME               VALUE
------------------ ------------------------------------------------
embeddingOutput    embedding
function           embedding

SQL>
SQL> SELECT * FROM DM$VJALL_MINILM_L12_V2_AUGMENTED;

METADATA
--------------------------------------------------------------------------------
{"function":"embedding","embeddingOutput":"embedding","input":{"input":["DATA"]}

SQL>
SQL> SELECT VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA) AS embedding;

EMBEDDING
------------------------------------------------------------------------------------------------------------------------------------
[1.65517051E-002,3.19098569E-002,-1.96293015E-002,-3.56926955E-002,

SQL>
SQL> SELECT VECTOR_DIMS(VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA)) as Dimensions_Count;

DIMENSIONS_COUNT
----------------
             384

SQL>

Note : The output is actually a list of 384 dimensions. I copied the whole output from SQL Developer, put it into a Python list and then did a len(list) to count the number of elements.
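The Python check described above is essentially this (the vector literal is truncated here to the first four illustrative dimensions; the real output has 384):

```python
# Paste the full EMBEDDING output from SQL Developer between the quotes.
# The bracketed output is itself valid Python list syntax, so
# ast.literal_eval can parse it directly into a list of floats.
import ast

embedding_text = "[1.65517051E-002,3.19098569E-002,-1.96293015E-002,-3.56926955E-002]"
embedding = ast.literal_eval(embedding_text)

print(len(embedding))  # prints 4 for this truncated sample; 384 for the full output
```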
SQL> drop table my_data_source purge;

Table dropped.

SQL> create table my_data_source(id number primary key, my_sentence varchar2(4000));

Table created.

SQL> insert into my_data_source values (1,'The quick brown fox jumped over the lazy dog');
1 row created.
SQL> insert into my_data_source values (2,'she sells sea-shells at the seashore');
1 row created.
SQL> insert into my_data_source values (3,'the fox and dog are brown friends');
1 row created.
SQL> insert into my_data_source values (4,'the elephant knows the fox and dog');
1 row created.
SQL> insert into my_data_source values (5,'the fox, dog and elephant live together');
1 row created.
SQL> insert into my_data_source values (6,'aeroplanes fly in the sky');
1 row created.
SQL> commit;

Commit complete.

SQL>
SQL> select * from my_data_source;

        ID
----------
MY_SENTENCE
------------------------------------------------------------------------------------------------------------------------------------
         1
The quick brown fox jumped over the lazy dog

         2
she sells sea-shells at the seashore

         3
the fox and dog are brown friends

         4
the elephant knows the fox and dog

         5
the fox, dog and elephant live together

         6
aeroplanes fly in the sky

6 rows selected.

SQL>
Create the Table to hold the computed Vectors, using the model that has been loaded
SQL> drop table my_data_vectors;

Table dropped.

SQL> create table my_data_vectors (id number primary key, my_sentence varchar2(4000), sentence_vector vector);

Table created.

SQL>
SQL> insert into my_data_vectors
  2  select ds.id as id, my_sentence, to_vector(et.embed_vector) sentence_vector
  3  from my_data_source ds,
  4       dbms_vector_chain.utl_to_embeddings(
  5          dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(ds.my_sentence), json('{"normalize":"all"}')),
  6          json('{"provider":"database", "model":"ALL_MINILM_L12_V2_AUGMENTED"}')) t,
  7       JSON_TABLE(t.column_value, '$[*]' COLUMNS (embed_id NUMBER PATH '$.embed_id',
  8          embed_data VARCHAR2(4000) PATH '$.embed_data', embed_vector CLOB PATH '$.embed_vector')) et;

6 rows created.

SQL>
SQL> commit;

Commit complete.

SQL> select * from my_data_vectors;

        ID
----------
MY_SENTENCE
------------------------------------------------------------------------------------------------------------------------------------
SENTENCE_VECTOR
------------------------------------------------------------------------------------------------------------------------------------
         1
The quick brown fox jumped over the lazy dog
[1.70537992E-003,5.95331714E-002,-2.32173726E-002,3.02353837E-002,

         2
she sells sea-shells at the seashore
[1.2790652E-002,5.97022101E-002,8.24511051E-002,3.93822305E-002,

         3
the fox and dog are brown friends
[7.67209902E-002,-7.82399923E-002,-2.40087509E-002,-2.37390138E-002,

         4
the elephant knows the fox and dog
[4.04452085E-002,-4.15055361E-003,-1.71641614E-002,2.63017584E-002,

         5
the fox, dog and elephant live together
[9.44276601E-002,-4.74944711E-002,2.31287945E-002,2.9055763E-002,

         6
aeroplanes fly in the sky
[1.1147093E-001,2.18752325E-002,-4.58196662E-002,-2.64751501E-002,

6 rows selected.

SQL>
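As an aside : because each source row here is a single short sentence, chunking isn't really needed, and I believe the same table could be populated more simply by calling VECTOR_EMBEDDING per row. This is a sketch of that alternative, not what I actually ran above :

```sql
-- Simpler alternative when each source row is one short sentence :
-- compute the embedding directly, skipping the utl_to_chunks pipeline
insert into my_data_vectors
select id,
       my_sentence,
       VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING my_sentence as DATA)
from   my_data_source;
```

The utl_to_text / utl_to_chunks / utl_to_embeddings chain shown above becomes worthwhile when the source is a document that must be split into multiple chunks, each with its own vector.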
Do a Semantic Similarity Search for certain phrases :
SQL> l
  1  SELECT VECTOR_DISTANCE(
  2     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quicker yellow fox jumped' as DATA),
  4* COSINE) as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   1.76127E-001

SQL>
SQL> l
  1  SELECT VECTOR_DISTANCE(
  2     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'seashells' as DATA),
  4* COSINE) as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   9.81778E-001

SQL>
SQL> l
  1  SELECT VECTOR_DISTANCE(
  2     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'moon elephants' as DATA),
  4* COSINE) as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   7.51297E-001

SQL> --- "the quicker yellow fox jumped" is not very far away from "the quick brown fox jumped"
SQL> --- "moon elephants" is far away from "the quick brown fox jumped"
SQL> --- "seashells" is much further away
--- likely because the prebuilt model recognises that "fox" and "elephant" have some similarity, being animals

SQL> l
  1  SELECT VECTOR_DISTANCE(
  2     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'houshold electricals' as DATA),
  4* COSINE) as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   9.97313E-001

SQL> -- "household electricals" is even further away

SQL> l
  1  SELECT id, my_sentence, vector_distance(sentence_vector, VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'sea-shells' as DATA), COSINE) as Vector_Distance
  2  FROM my_data_vectors
  3  ORDER BY vector_distance(sentence_vector, VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'sea-shells' as DATA), COSINE)
  4* FETCH FIRST 2 ROWS ONLY
SQL> /

          ID MY_SENTENCE                                      VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
           2 she sells sea-shells at the seashore                 3.65937E-001
           6 aeroplanes fly in the sky                            8.43676E-001

SQL>
SQL> l
  1  SELECT id, my_sentence, vector_distance(sentence_vector, VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'she sells seashells' as DATA), COSINE) as Vector_Distance
  2  FROM my_data_vectors
  3  ORDER BY vector_distance(sentence_vector, VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'she sells seashells' as DATA), COSINE)
  4* FETCH FIRST 2 ROWS ONLY
SQL> /

          ID MY_SENTENCE                                      VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
           2 she sells sea-shells at the seashore                 1.46406E-001
           6 aeroplanes fly in the sky                            9.26212E-001

SQL> --- If I refine my query from "sea-shells" to "she sells seashells", I get a better match, with the Vector Distance dropping from 3.6x10^-1 to 1.4x10^-1

SQL> l
  1  SELECT id, my_sentence, vector_distance(sentence_vector, VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'quick fox' as DATA), COSINE) as Vector_Distance
  2  FROM my_data_vectors
  3  ORDER BY vector_distance(sentence_vector, VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'quick fox' as DATA), COSINE)
  4* FETCH FIRST 2 ROWS ONLY
SQL> /

          ID MY_SENTENCE                                      VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
           1 The quick brown fox jumped over the lazy dog         3.22912E-001
           4 the elephant knows the fox and dog                   5.21092E-001

SQL> --- "quick fox" has a better similarity with sentence 1 than with sentence 4
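With only six rows, these ORDER BY ... FETCH FIRST queries can simply scan the table, but as noted in step 6 of the sequence, a larger table would benefit from a Vector Index. A sketch of what that could look like -- the index name and the accuracy target here are my own illustrative choices, not something I ran in this demo :

```sql
-- Inverted Flat File (IVF) Vector Index on the embeddings column;
-- "my_vec_ivf_idx" and the 95% accuracy target are illustrative choices
CREATE VECTOR INDEX my_vec_ivf_idx
  ON my_data_vectors (sentence_vector)
  ORGANIZATION NEIGHBOR PARTITIONS
  DISTANCE COSINE
  WITH TARGET ACCURACY 95;
```

With such an index in place, the same VECTOR_DISTANCE queries can run as approximate nearest-neighbour searches instead of exact full scans, trading a little accuracy for speed.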
This is the retrieval mechanism behind RAG : the search returns the stored sentences whose vectors have the lowest COSINE Distance from the query vector, i.e. the highest semantic similarity. (The LLM's own generation of the next word -- the next "token" -- is a separate step, driven by a probability distribution over tokens rather than by vector distance; the vector search supplies the relevant context.)
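To see the "lower COSINE Distance means more similar" relationship without any model at all, VECTOR_DISTANCE can be applied to literal vectors (this is a standalone illustration, not part of the demo above) :

```sql
-- COSINE distance depends only on the angle between vectors :
-- same direction -> distance 0 ; perpendicular -> distance 1
SELECT VECTOR_DISTANCE(TO_VECTOR('[1, 0, 0]'), TO_VECTOR('[2, 0, 0]'), COSINE) AS same_direction,
       VECTOR_DISTANCE(TO_VECTOR('[1, 0, 0]'), TO_VECTOR('[0, 1, 0]'), COSINE) AS perpendicular
FROM DUAL;
```

Note that COSINE distance ignores vector magnitude ([1,0,0] and [2,0,0] are at distance 0), which is why it suits normalized text embeddings like the ones generated here.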