If you are working through retrieval-augmented generation (RAG), this section walks through text embeddings in plain terms: what they are, how they fit search and retrieval, and how to call two common local setups from Python, using either Ollama or an OpenAI-compatible HTTP API (as many llama.cpp-based servers expose).
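As a rough illustration of the two local setups mentioned above, here is a minimal sketch that uses only the Python standard library. The base URLs (`http://localhost:11434` for Ollama, `http://localhost:8080` for an OpenAI-compatible server) and the model name `nomic-embed-text` are assumptions for illustration; substitute whatever your own server and model use. The helper names are hypothetical.

```python
import json
import math
import urllib.request


def post_json(url, payload, timeout=30.0):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def embed_ollama(text, model="nomic-embed-text",
                 base_url="http://localhost:11434"):
    """Embed one string through Ollama's /api/embed endpoint.

    The model name and port are assumptions; use what you have pulled.
    """
    body = post_json(f"{base_url}/api/embed", {"model": model, "input": text})
    return body["embeddings"][0]


def embed_openai_compatible(text, model, base_url="http://localhost:8080"):
    """Embed one string through an OpenAI-compatible /v1/embeddings endpoint.

    The port is an assumption (many llama.cpp-based servers default to it).
    """
    body = post_json(f"{base_url}/v1/embeddings",
                     {"model": model, "input": text})
    return body["data"][0]["embedding"]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With a server running, `cosine_similarity(embed_ollama("query"), embed_ollama("passage"))` gives a relevance score you can sort passages by, which is the basic way embeddings plug into retrieval.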
Chunking is the most underestimated hyperparameter in Retrieval-Augmented Generation (RAG):
it silently determines what your LLM “sees”,
how expensive ingestion becomes,
and how much of the LLM’s context window you burn per answer.
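To make the trade-off concrete, here is a minimal sketch of one common strategy: fixed-size chunks with overlap, measured in whitespace-separated words. The function name `chunk_text` and the default sizes are assumptions for illustration, not recommendations; real pipelines often count tokens or split on sentence and heading boundaries instead.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words, so a sentence that straddles
    a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # otherwise the loop would emit a redundant tail chunk.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Larger `chunk_size` values mean fewer chunks to embed at ingestion time but more context-window tokens spent per retrieved hit; larger `overlap` values reduce boundary losses at the cost of storing and embedding duplicated text.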
From basic RAG to production: chunking, vector search, reranking, and evaluation in one guide.
Production-focused guide to building RAG systems: chunking, vector stores, hybrid retrieval, reranking, evaluation, and when to choose RAG over fine-tuning.
Choosing the right vector store can make or break your RAG application’s performance, cost, and scalability. This comprehensive comparison covers the most popular options in 2024-2025.
Unify text, images, and audio in shared embedding spaces
Cross-modal embeddings represent a breakthrough in artificial intelligence, enabling understanding and reasoning across different data types within a unified representation space.
Retrieval-Augmented Generation (RAG)
has evolved far beyond simple vector similarity search.
LongRAG, Self-RAG, and GraphRAG represent the cutting edge of these capabilities.
Implementing RAG? Here are some Go code bits - 2...
Since standard Ollama doesn’t have a direct rerank API,
you’ll need to implement reranking with the Qwen3 Reranker in Go by generating embeddings for query-document pairs and scoring them.
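The scoring step can be sketched independently of any server. The example below is written in Python rather than Go and uses a simplified stand-in: it embeds the query and each document separately and ranks by cosine similarity. That is not necessarily how the Qwen3 Reranker is meant to be scored. The `embed` argument is a hypothetical callable that you would back with your own embedding call.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(query, documents, embed, top_k=3):
    """Return up to `top_k` (score, document) pairs, best match first.

    `embed` is a hypothetical callable mapping a string to a list of floats.
    """
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in documents]
    # Sort on the score only, so ties keep their original document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Passing the embedding function in as a parameter keeps the ranking logic testable without a running model server.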
The Qwen3 Embedding and Reranker models are the latest releases in the Qwen family, specifically designed for advanced text embedding, retrieval, and reranking tasks.