Reranking with embedding models
Python code for reranking in RAG
Reranking is the second step in Retrieval Augmented Generation (RAG) systems, sitting right between retrieval and generation.
The image above is how Flux-1 dev imagines electric cubes in digital space.
Retrieval with reranking
If we store the documents as embeddings in a vector DB from the start, retrieval gives us the list of similar documents right away.
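For comparison, when the documents already live in a vector store, retrieval and scoring happen inside the store. A minimal sketch of that path, assuming Chroma as the vector DB and an Ollama embedding model (both choices are illustrative, not part of the reranking setup below):

from langchain_core.documents import Document
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Hypothetical documents; in a real setup they come from your ingestion pipeline.
docs = [
    Document(page_content="Ollama runs large language models locally."),
    Document(page_content="Chroma is an open-source embedding database."),
]

emb = OllamaEmbeddings(model="bge-large:335m-en-v1.5-fp16")

# Embeddings are computed once at ingestion time and stored in the DB.
store = Chroma.from_documents(docs, emb)

# The store returns the closest documents together with a distance score,
# so no separate reranking pass is needed.
for doc, score in store.similarity_search_with_score("run models locally", k=2):
    print(round(score, 4), doc.page_content)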
Standalone reranking
But if we first download the documents from the internet, the search results can be affected by the search provider's preferences and algorithms, sponsored content, SEO optimisation, and so on, so we need to do post-search reranking.
What I was doing:
- getting embeddings for the search query
- getting embeddings for each document (no document was expected to exceed 8k tokens anyway)
- computing cosine similarity between the query and each document's embeddings (see the small numpy sketch below)
- sorting the documents by this similarity.
No vector DB here. Let's go.
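The similarity in the third step is plain cosine similarity. A toy numpy sketch of what the library call computes (the vectors are made up; real embeddings have hundreds of dimensions):

import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the vectors divided by the product of their norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([1.0, 0.0, 1.0])
doc_vec = np.array([1.0, 0.5, 0.5])
print(cosine_sim(query_vec, doc_vec))  # 1.0 means identical direction, 0 means unrelated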
Sample code
I use LangChain to connect to Ollama, together with LangChain's cosine_similarity function. You can also filter by the similarity measure itself, but keep in mind that the threshold will differ between domains and embedding models.
I will be glad if this bit of code is useful to you in any way. Copy/Paste/UseAnyWayYouWant license. Cheers.
from langchain_core.documents import Document
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.utils.math import cosine_similarity
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Cosine distance = 1 - cosine similarity; lower means more similar.
    return 1.0 - cosine_similarity(a, b)

def compute_score(vectors: np.ndarray) -> float:
    score = cosine_distance(vectors[0].reshape(1, -1), vectors[1].reshape(1, -1)).item()
    return score

def list_to_array(lst):
    return np.array(lst, dtype=float)

def compute_scorel(lists) -> float:
    # Same as compute_score, but for plain Python lists as returned by the embeddings API.
    v1 = list_to_array(lists[0])
    v2 = list_to_array(lists[1])
    return compute_score([v1, v2])

def sortFn(e):
    # Sort key: cosine distance, ascending, so the closest documents come first.
    return e["sim"]

def filter_docs(emb_model_name, docs, query, num_docs):
    # Embed the query and every document, score by cosine distance,
    # and return the num_docs closest documents.
    content_arr = [doc.page_content for doc in docs]

    ollama_emb = OllamaEmbeddings(
        model=emb_model_name
    )

    docs_embs = ollama_emb.embed_documents(content_arr)
    query_embs = ollama_emb.embed_query(query)

    sims = []
    for i, emb in enumerate(docs_embs):
        idx = docs[i].id
        s = compute_scorel([query_embs, docs_embs[i]])
        simstr = str(round(s, 4))
        docs[i].metadata["sim"] = simstr
        sim = {
            "idx": idx,
            "i": i,
            "sim": s,
        }
        sims.append(sim)

    sims.sort(key=sortFn)

    sorted_docs = [docs[x["i"]] for x in sims]
    filtered_docs = sorted_docs[:num_docs]
    return filtered_docs
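A minimal usage sketch for filter_docs, assuming an Ollama server is running locally with the embedding model already pulled; the documents, the query and the 0.5 distance threshold are made up for illustration:

if __name__ == "__main__":
    docs = [
        Document(id="0", page_content="Ollama serves embedding and chat models on localhost."),
        Document(id="1", page_content="A recipe for sourdough bread with a long cold fermentation."),
        Document(id="2", page_content="Retrieval Augmented Generation combines search with an LLM."),
    ]

    top_docs = filter_docs(
        emb_model_name="bge-large:335m-en-v1.5-fp16",
        docs=docs,
        query="how does RAG retrieval work",
        num_docs=2,
    )

    for doc in top_docs:
        print(doc.metadata["sim"], doc.page_content)

    # Optional: drop documents above a distance threshold instead of taking a fixed top-N.
    # The threshold is domain- and model-specific, so tune it on your own data.
    close_docs = [d for d in top_docs if float(d.metadata["sim"]) < 0.5]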
Best Embedding Models
For my tasks, the best embedding model currently is bge-large:335m-en-v1.5-fp16.
Second place goes to nomic-embed-text:137m-v1.5-fp16 and jina/jina-embeddings-v2-base-en:latest.
But do your own tests for your own domain and queries.
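One way to run such a test is to rerank the same documents for the same query with each candidate model and compare the orderings. A rough sketch reusing filter_docs and the toy documents from the usage example above (the query is a placeholder):

models = [
    "bge-large:335m-en-v1.5-fp16",
    "nomic-embed-text:137m-v1.5-fp16",
    "jina/jina-embeddings-v2-base-en:latest",
]

for model_name in models:
    # Same query, same documents, different embedding model.
    ranked = filter_docs(model_name, docs, "how does RAG retrieval work", num_docs=len(docs))
    print(model_name)
    for doc in ranked:
        print("  ", doc.metadata["sim"], doc.page_content[:60])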
Useful links
- https://en.wikipedia.org/wiki/Retrieval-augmented_generation
- Python Cheatsheet
- Writing effective prompts for LLMs
- Testing LLMs: gemma2, qwen2 and Mistral Nemo
- Install and Configure Ollama
- LLMs comparison: Mistral Small, Gemma 2, Qwen 2.5, Mistral Nemo, LLama3 and Phi
- Conda Cheatsheet
- Ollama cheatsheet
- Docker Cheatsheet
- Layered Lambdas with AWS SAM and Python