As I have recently gone through a course on LLMs for Chatbots and RAG on OCI (Oracle Cloud Infrastructure), I thought about putting together a quick demo that uses
A The Hugging Face all-MiniLM-L12-v2 model (with some augmentations done by Oracle for download)
B The ONNX (Open Neural Network Exchange) Runtime Engine that runs in Oracle 23ai (and, yes, in the 23ai Free downloadable edition, not requiring the Oracle Cloud)
C Vectors stored in an Oracle Database using the Vector Datatype
D Use the Hugging Face model to generate Vectors for Similarity / Semantic Searches
but DOES NOT USE Oracle Cloud Infrastructure. (To use OCI for Embedding -- generating "Vectors" from "Tokens" -- I would have to book a Dedicated AI Cluster to "Fine-Tune" an existing model, so I am currently staying away from this expensive option.)
[Similarly, I cannot currently build a Generative AI demo, as I would have to set up a model and server resources]
The sequence is :
1. Setup an Oracle 23ai database (either in the Oracle Cloud OR on an Exadata machine OR the Free edition for On-Premises testing)
2. Create a User to hold the source data and vectors
3. Import the augmented Hugging Face model
4. Load the Source Data as words or sentences or chunks -- here I use a few sentences
5. Compute Vectors for the target column
6. Optional but recommended : Build a Vector Index. For this demo, with very few rows, I haven't built one; if and when I build a demo with a few hundred or thousand sentences or chunks of data, I will create an Index on the Vectors
7. Use Oracle's VECTOR_DISTANCE Function for a Semantic Similarity Search
So here goes :
Using the 23ai FREEPDB downloadable VM, I created the user with these grants :

grant connect to vector_demo identified by password;
alter user vector_demo default tablespace users;
alter user vector_demo quota unlimited on users;
grant DB_DEVELOPER_ROLE to vector_demo;
grant CREATE MINING MODEL to vector_demo;
CREATE OR REPLACE DIRECTORY DM_DUMP as '/tmp/Load_Model';  -- on the Linux VM
GRANT READ ON DIRECTORY dm_dump TO vector_demo;
GRANT WRITE ON DIRECTORY dm_dump TO vector_demo;
Download and Import the Model :
Download the ZIP file from https://adwc4pm.objectstorage.us-ashburn-1.oci.customer-oci.com/p/VBRD9P8ZFWkKvnfhrWxkpPe8K03-JIoM5h_8EJyJcpE80c108fuUjg7R5L5O7mMZ/n/adwc4pm/b/OML-Resources/o/all_MiniLM_L12_v2_augmented.zip (it also contains a README). Place the extracted ONNX file on the Linux Server at /tmp/Load_Model :

SQL> !ls -l /tmp/Load_Model
total 249932
-rw-r--r--. 1 oracle oinstall 122597346 Jul 15  2024 all_MiniLM_L12_v2_augmented.zip
-rw-r--r--. 1 oracle oinstall 133322334 Jul 15  2024 all_MiniLM_L12_v2.onnx
-rw-r--r--. 1 oracle oinstall      4232 Jul 15  2024 README-ALL_MINILM_L12_V2-augmented.txt

SQL>

If necessary, delete the previously imported model from the database (logged in to the SQL command line as vector_demo) :

exec DBMS_VECTOR.DROP_ONNX_MODEL(model_name => 'ALL_MINILM_L12_V2_AUGMENTED', force => true);

Import the model file :

SQL> EXECUTE DBMS_VECTOR.LOAD_ONNX_MODEL(-
       directory=>'DM_DUMP',-
       file_name=>'all_MiniLM_L12_v2.onnx',-
       model_name=>'ALL_MINILM_L12_V2_AUGMENTED',-
       metadata => JSON('{"function" : "embedding", "embeddingOutput" : "embedding", "input": {"input": ["DATA"]}}')-
     );

The "model_name" is a User Defined Name (i.e. chosen by the DBA). In many online demos, I see the model name specified as "doc_model"; I preferred to use the name "ALL_MINILM_L12_V2_AUGMENTED".
You could use any meaningful name.

Verify the imported model with these queries :

SQL> l
  1  SELECT model_name, attribute_name, attribute_type, data_type, vector_info
  2  FROM user_mining_model_attributes
  3  WHERE model_name = 'ALL_MINILM_L12_V2_AUGMENTED'
  4* ORDER BY ATTRIBUTE_NAME
SQL> /

MODEL_NAME                       ATTRIBUTE_NAME   ATTRIBUTE_TY DATA_TYPE
-------------------------------- ---------------- ------------ ----------------
VECTOR_INFO
--------------------------------------------------------
ALL_MINILM_L12_V2_AUGMENTED      DATA             TEXT         VARCHAR2

ALL_MINILM_L12_V2_AUGMENTED      ORA$ONNXTARGET   VECTOR       VECTOR
VECTOR(384,FLOAT32)

SQL>
SQL> l
  1  SELECT MODEL_NAME, MINING_FUNCTION, ALGORITHM,
  2  ALGORITHM_TYPE, MODEL_SIZE
  3  FROM user_mining_models
  4  WHERE model_name = 'ALL_MINILM_L12_V2_AUGMENTED'
  5* ORDER BY MODEL_NAME
SQL> /

MODEL_NAME                       MINING_FUNCTION                ALGORITHM                      ALGORITHM_ MODEL_SIZE
-------------------------------- ------------------------------ ------------------------------ ---------- ----------
ALL_MINILM_L12_V2_AUGMENTED      EMBEDDING                      ONNX                           NATIVE      133322334

SQL>
SQL> SELECT * FROM DM$VMALL_MINILM_L12_V2_AUGMENTED ORDER BY NAME;

NAME               VALUE
------------------ ------------------------------------------------
Graph Description  Graph combining tokenizer and main_graph
                   tokenizer main_graph
Graph Name         tokenizer_main_graph
Input[0]           input:string[?]
Output[0]          embedding:float32[?,384]
Producer Name      onnx.compose.merge_models
Version            1

6 rows selected.

SQL>
SQL> SELECT * FROM DM$VPALL_MINILM_L12_V2_AUGMENTED ORDER BY NAME;

NAME               VALUE
------------------ ------------------------------------------------
embeddingOutput    embedding
function           embedding

SQL>
SQL> SELECT * FROM DM$VJALL_MINILM_L12_V2_AUGMENTED;

METADATA
--------------------------------------------------------------------------------
{"function":"embedding","embeddingOutput":"embedding","input":{"input":["DATA"]}

SQL>
SQL> SELECT VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA) AS embedding;

EMBEDDING
------------------------------------------------------------------------------------------------------------------------------------
[1.65517051E-002,3.19098569E-002,-1.96293015E-002,-3.56926955E-002,

SQL>
SQL> SELECT VECTOR_DIMS(VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA)) as Dimensions_Count;

DIMENSIONS_COUNT
----------------
             384

SQL>

Note : The output is actually a list of 384 dimensions. I copied the whole output from SQL Developer, put it into a Python list and then did a len(list) to count the number of elements.
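The Python check described above is essentially this (the vector literal is truncated here to the first four illustrative dimensions; the real output has 384):

```python
# Paste the full EMBEDDING output from SQL Developer between the quotes.
# The bracketed output is itself valid Python list syntax, so
# ast.literal_eval can parse it directly into a list of floats.
import ast

embedding_text = "[1.65517051E-002,3.19098569E-002,-1.96293015E-002,-3.56926955E-002]"
embedding = ast.literal_eval(embedding_text)

print(len(embedding))  # prints 4 for this truncated sample; 384 for the full output
```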
SQL> drop table my_data_source purge;

Table dropped.

SQL> create table my_data_source(id number primary key, my_sentence varchar2(4000));

Table created.

SQL> insert into my_data_source values (1,'The quick brown fox jumped over the lazy dog');
1 row created.
SQL> insert into my_data_source values (2,'she sells sea-shells at the seashore');
1 row created.
SQL> insert into my_data_source values (3,'the fox and dog are brown friends');
1 row created.
SQL> insert into my_data_source values (4,'the elephant knows the fox and dog');
1 row created.
SQL> insert into my_data_source values (5,'the fox, dog and elephant live together');
1 row created.
SQL> insert into my_data_source values (6,'aeroplanes fly in the sky');
1 row created.
SQL> commit;

Commit complete.

SQL>
SQL> select * from my_data_source;

        ID
----------
MY_SENTENCE
------------------------------------------------------------------------------------------------------------------------------------
         1
The quick brown fox jumped over the lazy dog

         2
she sells sea-shells at the seashore

         3
the fox and dog are brown friends

         4
the elephant knows the fox and dog

         5
the fox, dog and elephant live together

         6
aeroplanes fly in the sky

6 rows selected.

SQL>
Create the Table to hold the computed Vectors, using the model that has been loaded
SQL> drop table my_data_vectors;

Table dropped.

SQL> create table my_data_vectors (id number primary key, my_sentence varchar2(4000), sentence_vector vector);

Table created.

SQL>
SQL> insert into my_data_vectors
  2  select ds.id as id, my_sentence, to_vector(et.embed_vector) sentence_vector
  3  from my_data_source ds,
  4       dbms_vector_chain.utl_to_embeddings(
  5          dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(ds.my_sentence), json('{"normalize":"all"}')),
  6          json('{"provider":"database", "model":"ALL_MINILM_L12_V2_AUGMENTED"}')) t,
  7       JSON_TABLE(t.column_value, '$[*]' COLUMNS (embed_id NUMBER PATH '$.embed_id',
  8          embed_data VARCHAR2(4000) PATH '$.embed_data', embed_vector CLOB PATH '$.embed_vector')) et;

6 rows created.

SQL>
SQL> commit;

Commit complete.

SQL> select * from my_data_vectors;

        ID
----------
MY_SENTENCE
------------------------------------------------------------------------------------------------------------------------------------
SENTENCE_VECTOR
------------------------------------------------------------------------------------------------------------------------------------
         1
The quick brown fox jumped over the lazy dog
[1.70537992E-003,5.95331714E-002,-2.32173726E-002,3.02353837E-002,

         2
she sells sea-shells at the seashore
[1.2790652E-002,5.97022101E-002,8.24511051E-002,3.93822305E-002,

         3
the fox and dog are brown friends
[7.67209902E-002,-7.82399923E-002,-2.40087509E-002,-2.37390138E-002,

         4
the elephant knows the fox and dog
[4.04452085E-002,-4.15055361E-003,-1.71641614E-002,2.63017584E-002,

         5
the fox, dog and elephant live together
[9.44276601E-002,-4.74944711E-002,2.31287945E-002,2.9055763E-002,

         6
aeroplanes fly in the sky
[1.1147093E-001,2.18752325E-002,-4.58196662E-002,-2.64751501E-002,

6 rows selected.

SQL>
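As an aside : because each source row here is a single short sentence, chunking isn't really needed, and I believe the same table could be populated more simply by calling VECTOR_EMBEDDING per row. This is a sketch of that alternative, not what I actually ran above :

```sql
-- Simpler alternative when each source row is one short sentence :
-- compute the embedding directly, skipping the utl_to_chunks pipeline
insert into my_data_vectors
select id,
       my_sentence,
       VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING my_sentence as DATA)
from   my_data_source;
```

The utl_to_text / utl_to_chunks / utl_to_embeddings chain shown above becomes worthwhile when the source is a document that must be split into multiple chunks, each with its own vector.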
Do a Semantic Similarity Search for certain phrases :
SQL> l
  1  SELECT VECTOR_DISTANCE(
  2     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quicker yellow fox jumped' as DATA),
  4* COSINE) as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   1.76127E-001

SQL>
SQL> l
  1  SELECT VECTOR_DISTANCE(
  2     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'seashells' as DATA),
  4* COSINE) as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   9.81778E-001

SQL>
SQL> l
  1  SELECT VECTOR_DISTANCE(
  2     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'moon elephants' as DATA),
  4* COSINE) as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   7.51297E-001

SQL> --- "the quicker yellow fox jumped" is not very far away from "the quick brown fox jumped"
SQL> --- "moon elephants" is far away from "the quick brown fox jumped"
SQL> --- "seashells" is much further away
--- likely because the prebuilt model recognises that "fox" and "elephant" have some similarity, being animals

SQL> l
  1  SELECT VECTOR_DISTANCE(
  2     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'The quick brown fox jumped' as DATA),
  3     VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'houshold electricals' as DATA),
  4* COSINE) as Vector_Distance
SQL> /

VECTOR_DISTANCE
---------------
   9.97313E-001

SQL> -- "household electricals" is even further away

SQL> l
  1  SELECT id, my_sentence, vector_distance(sentence_vector, VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'sea-shells' as DATA), COSINE) as Vector_Distance
  2  FROM my_data_vectors
  3  ORDER BY vector_distance(sentence_vector, VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'sea-shells' as DATA), COSINE)
  4* FETCH FIRST 2 ROWS ONLY
SQL> /

          ID MY_SENTENCE                                      VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
           2 she sells sea-shells at the seashore                 3.65937E-001
           6 aeroplanes fly in the sky                            8.43676E-001

SQL>
SQL> l
  1  SELECT id, my_sentence, vector_distance(sentence_vector, VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'she sells seashells' as DATA), COSINE) as Vector_Distance
  2  FROM my_data_vectors
  3  ORDER BY vector_distance(sentence_vector, VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'she sells seashells' as DATA), COSINE)
  4* FETCH FIRST 2 ROWS ONLY
SQL> /

          ID MY_SENTENCE                                      VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
           2 she sells sea-shells at the seashore                 1.46406E-001
           6 aeroplanes fly in the sky                            9.26212E-001

SQL> --- If I refine my query from "sea-shells" to "she sells seashells", I get a better match, with the Vector Distance dropping from 3.6x10^-1 to 1.4x10^-1

SQL> l
  1  SELECT id, my_sentence, vector_distance(sentence_vector, VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'quick fox' as DATA), COSINE) as Vector_Distance
  2  FROM my_data_vectors
  3  ORDER BY vector_distance(sentence_vector, VECTOR_EMBEDDING(ALL_MINILM_L12_V2_AUGMENTED USING 'quick fox' as DATA), COSINE)
  4* FETCH FIRST 2 ROWS ONLY
SQL> /

          ID MY_SENTENCE                                      VECTOR_DISTANCE
------------ ------------------------------------------------ ---------------
           1 The quick brown fox jumped over the lazy dog         3.22912E-001
           4 the elephant knows the fox and dog                   5.21092E-001

SQL> --- "quick fox" has a better similarity with sentence 1 than with sentence 4
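With only six rows, these ORDER BY ... FETCH FIRST queries can simply scan the table, but as noted in step 6 of the sequence, a larger table would benefit from a Vector Index. A sketch of what that could look like -- the index name and the accuracy target here are my own illustrative choices, not something I ran in this demo :

```sql
-- Inverted Flat File (IVF) Vector Index on the embeddings column;
-- "my_vec_ivf_idx" and the 95% accuracy target are illustrative choices
CREATE VECTOR INDEX my_vec_ivf_idx
  ON my_data_vectors (sentence_vector)
  ORGANIZATION NEIGHBOR PARTITIONS
  DISTANCE COSINE
  WITH TARGET ACCURACY 95;
```

With such an index in place, the same VECTOR_DISTANCE queries can run as approximate nearest-neighbour searches instead of exact full scans, trading a little accuracy for speed.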
This is the retrieval mechanism behind RAG : the search returns the stored sentences whose vectors have the lowest COSINE Distance from the query vector, i.e. the highest semantic similarity. (The LLM's own generation of the next word -- the next "token" -- is a separate step, driven by a probability distribution over tokens rather than by vector distance; the vector search supplies the relevant context.)
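To see the "lower COSINE Distance means more similar" relationship without any model at all, VECTOR_DISTANCE can be applied to literal vectors (this is a standalone illustration, not part of the demo above) :

```sql
-- COSINE distance depends only on the angle between vectors :
-- same direction -> distance 0 ; perpendicular -> distance 1
SELECT VECTOR_DISTANCE(TO_VECTOR('[1, 0, 0]'), TO_VECTOR('[2, 0, 0]'), COSINE) AS same_direction,
       VECTOR_DISTANCE(TO_VECTOR('[1, 0, 0]'), TO_VECTOR('[0, 1, 0]'), COSINE) AS perpendicular
FROM DUAL;
```

Note that COSINE distance ignores vector magnitude ([1,0,0] and [2,0,0] are at distance 0), which is why it suits normalized text embeddings like the ones generated here.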