As I have recently gone through a course on LLMs for Chatbots and RAG on OCI (Oracle Cloud Infrastructure), I thought about putting together a quick demo that uses
A The Hugging Face all-MiniLM-L12-v2 model (with some augmentations done by Oracle for download)
B The ONNX (Open Neural Network Exchange) Runtime Engine that runs in Oracle 23ai (and, yes, in the 23ai Free downloadable edition, not requiring the Oracle Cloud)
C Vectors stored in an Oracle Database using the Vector Datatype
D Use the Hugging Face model to generate Vectors for Similarity / Semantic Searches
but DOES NOT USE Oracle Cloud Infrastructure. (To use OCI for Embedding -- generating "Vectors" from "Tokens" -- I would have to provision a Dedicated AI Cluster to "Fine Tune" an existing model, so I am currently staying away from this expensive option)
[Similarly, I cannot currently build a Generative AI demo as I would have to set up a model and server resources]
The sequence is :
1. Setup an Oracle 23ai database (either in the Oracle Cloud OR on an Exadata machine OR the Free edition for On-Premises testing)
2. Create a User to hold the source data and vectors
3. Import the augmented Hugging Face model
4. Load the Source Data as words or sentences or chunks -- here I use a few sentences
5. Compute Vectors for the target column
6. Optional but recommended : Build a Vector Index. For this demo with very few rows I haven't built one, but if and when I build a demo with a few hundred or thousand sentences or chunks of data, I will create an Index on the Vectors
7. Use Oracle's VECTOR_DISTANCE Function for a Semantic Similarity Search
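As a sketch for step 6 above : this is what a Vector Index could look like once the table grows. It is not part of my demo run; the table and column names are the ones used later in this demo, and the syntax is the 23ai IVF ("Neighbor Partitions") form -- verify it against the documentation for your version.

```sql
-- Sketch only (not executed in this demo) : an IVF Vector Index
-- on the embeddings column, using COSINE distance
CREATE VECTOR INDEX my_data_vectors_ivf_idx
  ON my_data_vectors (sentence_vector)
  ORGANIZATION NEIGHBOR PARTITIONS
  DISTANCE COSINE
  WITH TARGET ACCURACY 95;
```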
So here goes :
User and Grants :
Using the 23ai FREEPDB downloadable VM, I created the user with these grants :
grant connect to vector_demo identified by password ;
alter user vector_demo default tablespace users ;
alter user vector_demo quota unlimited on users ;
grant DB_DEVELOPER_ROLE to vector_demo ;
grant CREATE MINING MODEL to vector_demo;
CREATE OR REPLACE DIRECTORY DM_DUMP as '/tmp/Load_Model'; -- on the Linux VM
GRANT READ ON DIRECTORY dm_dump TO vector_demo;
GRANT WRITE ON DIRECTORY dm_dump TO vector_demo;
Download and Import the Model :
Download the ZIP file from
https://adwc4pm.objectstorage.us-ashburn-1.oci.customer-oci.com/p/VBRD9P8ZFWkKvnfhrWxkpPe8K03-JIoM5h_8EJyJcpE80c108fuUjg7R5L5O7mMZ/n/adwc4pm/b/OML-Resources/o/all_MiniLM_L12_v2_augmented.zip
The ZIP file also contains a README.
Place the extracted ONNX file on the Linux Server at /tmp/Load_Model
SQL> !ls -l /tmp/Load_Model
total 249932
-rw-r--r--. 1 oracle oinstall 122597346 Jul 15 2024 all_MiniLM_L12_v2_augmented.zip
-rw-r--r--. 1 oracle oinstall 133322334 Jul 15 2024 all_MiniLM_L12_v2.onnx
-rw-r--r--. 1 oracle oinstall 4232 Jul 15 2024 README-ALL_MINILM_L12_V2-augmented.txt
SQL>
If necessary, delete the previously imported model from the database (logged in to the SQL command line as vector_demo) :
exec DBMS_VECTOR.DROP_ONNX_MODEL(model_name => 'ALL_MINILM_L12_V2_AUGMENTED', force => true);
Import the model file :
SQL> EXECUTE DBMS_VECTOR.LOAD_ONNX_MODEL(-
directory=>'DM_DUMP',-
file_name=>'all_MiniLM_L12_v2.onnx',-
model_name=>'ALL_MINILM_L12_V2_AUGMENTED',-
metadata => JSON('{"function" : "embedding", "embeddingOutput" : "embedding", "input": {"input": ["DATA"]}}')-
);
The "model_name" is a User Defined Name (i.e. defined by the DBA).
In many online demos I see the model name specified as "doc_model".
I preferred to use the name "ALL_MINILM_L12_V2_AUGMENTED". You could use any meaningful name.
Verify the imported model with these queries :
SQL> l
1 SELECT model_name, attribute_name, attribute_type, data_type, vector_info
2 FROM user_mining_model_attributes
3 WHERE model_name = 'ALL_MINILM_L12_V2_AUGMENTED'
4* ORDER BY ATTRIBUTE_NAME
SQL> /
MODEL_NAME ATTRIBUTE_NAME ATTRIBUTE_TY DATA_TYPE
-------------------------------- ---------------- ------------ ----------------
VECTOR_INFO
--------------------------------------------------------
ALL_MINILM_L12_V2_AUGMENTED DATA TEXT VARCHAR2
ALL_MINILM_L12_V2_AUGMENTED ORA$ONNXTARGET VECTOR VECTOR
VECTOR(384,FLOAT32)
SQL>
SQL> l
1 SELECT MODEL_NAME, MINING_FUNCTION, ALGORITHM,
2 ALGORITHM_TYPE, MODEL_SIZE
3 FROM user_mining_models
4 WHERE model_name = 'ALL_MINILM_L12_V2_AUGMENTED'
5* ORDER BY MODEL_NAME
SQL> /
MODEL_NAME MINING_FUNCTION ALGORITHM ALGORITHM_ MODEL_SIZE
-------------------------------- ------------------------------ ------------------------------ ---------- ----------
ALL_MINILM_L12_V2_AUGMENTED EMBEDDING ONNX NATIVE 133322334
SQL>
SQL> SELECT * FROM DM$VMALL_MINILM_L12_V2_AUGMENTED ORDER BY NAME ;
NAME VALUE
------------------ ------------------------------------------------
Graph Description Graph combining tokenizer and main_graph
tokenizer
main_graph
Graph Name tokenizer_main_graph
Input[0] input:string[?]
Output[0] embedding:float32[?,384]
Producer Name onnx.compose.merge_models
Version 1
6 rows selected.
SQL>
SQL> SELECT * FROM DM$VPALL_MINILM_L12_V2_AUGMENTED ORDER BY NAME;
NAME VALUE
------------------ ------------------------------------------------
embeddingOutput embedding
function embedding
SQL>
SQL> SELECT * FROM DM$VJALL_MINILM_L12_V2_AUGMENTED;
METADATA
--------------------------------------------------------------------------------
{"function":"embedding","embeddingOutput":"embedding","input":{"input":["DATA"]}
SQL>
Test generation of Embeddings with the Model :
SQL> SELECT VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA) AS embedding;
EMBEDDING
------------------------------------------------------------------------------------------------------------------------------------
[1.65517051E-002,3.19098569E-002,-1.96293015E-002,-3.56926955E-002,
SQL>
SQL> SELECT VECTOR_DIMS(VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA)) as Dimensions_Count;
DIMENSIONS_COUNT
----------------
384
SQL>
Note : The output is actually a list of 384 dimensions. I copied the whole output from SQL Developer, put it into a Python list and then did a len(list) to count the number of elements.
Create the Table for the source data (sentences of text in my demo) and load some data (sentences) :
SQL> drop table my_data_source purge;
Table dropped.
SQL> create table my_data_source(id number primary key, my_sentence varchar2(4000));
Table created.
SQL> insert into my_data_source values (1,'The quick brown fox jumped over the lazy dog');
1 row created.
SQL> insert into my_data_source values (2,'she sells sea-shells at the seashore');
1 row created.
SQL> insert into my_data_source values (3,'the fox and dog are brown friends');
1 row created.
SQL> insert into my_data_source values (4,'the elephant knows the fox and dog');
1 row created.
SQL> insert into my_data_source values (5,'the fox, dog and elephant live together');
1 row created.
SQL> insert into my_data_source values (6,'aeroplanes fly in the sky');
1 row created.
SQL> commit;
Commit complete.
SQL>
SQL> select * from my_data_source;
ID
----------
MY_SENTENCE
------------------------------------------------------------------------------------------------------------------------------------
1
The quick brown fox jumped over the lazy dog
2
she sells sea-shells at the seashore
3
the fox and dog are brown friends
4
the elephant knows the fox and dog
5
the fox, dog and elephant live together
6
aeroplanes fly in the sky
6 rows selected.
SQL>
Create the Table to hold the computed Vectors, using the model that has been loaded :
SQL> drop table my_data_vectors;
Table dropped.
SQL> create table my_data_vectors (id number primary key, my_sentence varchar2(4000), sentence_vector vector);
Table created.
SQL>
SQL> insert into my_data_vectors
select ds.id as id, my_sentence, to_vector(et.embed_vector) sentence_vector
from
my_data_source ds,
dbms_vector_chain.utl_to_embeddings(
dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(ds.my_sentence), json('{"normalize":"all"}')),
json('{"provider":"database", "model":"ALL_MINILM_L12_V2_AUGMENTED"}')) t,
JSON_TABLE(t.column_value, '$[*]' COLUMNS (embed_id NUMBER PATH '$.embed_id', embed_data VARCHAR2(4000) PATH '$.embed_data', embed_vector CLOB PATH '$.embed_vector')) et;
2 3 4 5 6 7 8
6 rows created.
SQL>
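As an aside : since each of my rows is a short sentence that fits in a single chunk, a simpler INSERT (a sketch, not what I actually ran) could call VECTOR_EMBEDDING per row directly, skipping the chunking pipeline -- which you would still need for longer documents.

```sql
-- Sketch of a simpler alternative for short, single-chunk rows :
-- compute the embedding per row with VECTOR_EMBEDDING directly
INSERT INTO my_data_vectors
SELECT id, my_sentence,
       VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING my_sentence AS DATA)
FROM   my_data_source;
COMMIT;
```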
SQL> commit
2 /
Commit complete.
SQL> select * from my_data_vectors;
ID
----------
MY_SENTENCE
------------------------------------------------------------------------------------------------------------------------------------
SENTENCE_VECTOR
------------------------------------------------------------------------------------------------------------------------------------
1
The quick brown fox jumped over the lazy dog
[1.70537992E-003,5.95331714E-002,-2.32173726E-002,3.02353837E-002,
2
she sells sea-shells at the seashore
[1.2790652E-002,5.97022101E-002,8.24511051E-002,3.93822305E-002,
3
the fox and dog are brown friends
[7.67209902E-002,-7.82399923E-002,-2.40087509E-002,-2.37390138E-002,
4
the elephant knows the fox and dog
[4.04452085E-002,-4.15055361E-003,-1.71641614E-002,2.63017584E-002,
5
the fox, dog and elephant live together
[9.44276601E-002,-4.74944711E-002,2.31287945E-002,2.9055763E-002,
6
aeroplanes fly in the sky
[1.1147093E-001,2.18752325E-002,-4.58196662E-002,-2.64751501E-002,
6 rows selected.
SQL>
Do a Semantic Similarity Search for certain phrases :
SQL> l
1 SELECT VECTOR_DISTANCE(
2 VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
3 VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quicker yellow fox jumped' as DATA),
4* COSINE) as Vector_Distance
SQL> /
VECTOR_DISTANCE
---------------
1.76127E-001
SQL>
SQL> l
1 SELECT VECTOR_DISTANCE(
2 VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
3 VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'seashells' as DATA),
4* COSINE) as Vector_Distance
SQL> /
VECTOR_DISTANCE
---------------
9.81778E-001
SQL>
SQL> l
1 SELECT VECTOR_DISTANCE(
2 VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
3 VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'moon elephants' as DATA),
4* COSINE)as Vector_Distance
SQL> /
VECTOR_DISTANCE
---------------
7.51297E-001
SQL>
--- "the quicker yellow fox jumped" is not very far away from "the quick brown fox jumped"
--- "moon elephants" is far away from "the quick brown fox jumped"
--- "seashells" is much further away,
--- likely because the prebuilt model recognises that "fox" and "elephant" have some similarity, being animals
SQL> l
1 SELECT VECTOR_DISTANCE(
2 VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
3 VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'houshold electricals' as DATA),
4* COSINE)as Vector_Distance
SQL> /
VECTOR_DISTANCE
---------------
9.97313E-001
SQL>
-- "household electricals" is even further away
SQL> l
1 SELECT id , my_sentence, vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'sea-shells' as DATA), COSINE) as Vector_Distance
2 FROM my_data_vectors
3 ORDER BY vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'sea-shells' as DATA), COSINE)
4* FETCH FIRST 2 ROWS ONLY
SQL> /
ID MY_SENTENCE VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
2 she sells sea-shells at the seashore 3.65937E-001
6 aeroplanes fly in the sky 8.43676E-001
SQL>
SQL> l
1 SELECT id , my_sentence, vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'she sells seashells' as DATA), COSINE) as Vector_Distance
2 FROM my_data_vectors
3 ORDER BY vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'she sells seashells' as DATA), COSINE)
4* FETCH FIRST 2 ROWS ONLY
SQL> /
ID MY_SENTENCE VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
2 she sells sea-shells at the seashore 1.46406E-001
6 aeroplanes fly in the sky 9.26212E-001
SQL>
--- If I refine my query from "sea-shells" to "she sells seashells", I can get a better match, with the Vector Distance dropping from 3.6*10^-1 to 1.4*10^-1
SQL> l
1 SELECT id , my_sentence, vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'quick fox' as DATA), COSINE) as Vector_Distance
2 FROM my_data_vectors
3 ORDER BY vector_distance(sentence_vector , VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'quick fox' as DATA), COSINE)
4* FETCH FIRST 2 ROWS ONLY
SQL> /
ID MY_SENTENCE VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
1 The quick brown fox jumped over the lazy dog 3.22912E-001
4 the elephant knows the fox and dog 5.21092E-001
SQL>
--- "quick fox" has a better similarity with sentence 1 than sentence 4
A Semantic Similarity Search returns the stored Vector nearest to the query Vector -- i.e. the one with the lowest Distance. (What LLMs do when "generating" the next word, known as a "token", is similar : they pick the token with the highest probability.)
COSINE Distance measures the Angle between the Vectors.
There are other Distance Metrics such as DOT PRODUCT, EUCLIDEAN (L2), MANHATTAN (L1) etc.
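To see the metrics side by side, here is a small sketch with hand-made vectors. The values are arbitrary, chosen only for illustration; COSINE, EUCLIDEAN and MANHATTAN are among the metric keywords VECTOR_DISTANCE accepts in 23ai.

```sql
-- Compare distance metrics on two arbitrary 3-dimensional vectors
SELECT VECTOR_DISTANCE(TO_VECTOR('[1, 2, 3]'), TO_VECTOR('[3, 2, 1]'), COSINE)    AS cosine_dist,
       VECTOR_DISTANCE(TO_VECTOR('[1, 2, 3]'), TO_VECTOR('[3, 2, 1]'), EUCLIDEAN) AS euclidean_dist,
       VECTOR_DISTANCE(TO_VECTOR('[1, 2, 3]'), TO_VECTOR('[3, 2, 1]'), MANHATTAN) AS manhattan_dist
FROM   dual;
```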
Some Useful References :
Oracle Docs and Examples :
Oracle Blog :
External Blog :
Hugging Face :
ONNX :
Oracle SQL Reference :