API Guides · June 26, 2025 · 8 min read

Best Free Embedding APIs for RAG Applications in 2025

Build RAG (Retrieval-Augmented Generation) pipelines for free — a comparison of the best free embedding APIs with code examples for vector search and semantic similarity.

What Are Embeddings and Why Do They Matter for RAG?

Embeddings are numerical vectors that represent the meaning of text. When you convert sentences into embeddings, similar sentences end up close together in vector space — which is what makes semantic search possible. RAG (Retrieval-Augmented Generation) uses embeddings to find relevant context from a document database, then feeds that context to an LLM to generate accurate, grounded answers.
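
To make "close together in vector space" concrete, here is a minimal sketch of cosine similarity, the measure most semantic search setups use. The three-dimensional vectors are hand-made stand-ins rather than real model output, and the names `cat`, `kitten`, and `car` are purely illustrative.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional stand-ins for real embeddings (illustrative values only).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(cat, kitten)  # related concepts
sim_far = cosine_similarity(cat, car)       # unrelated concepts
print(sim_close > sim_far)  # True
```

Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.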

Without good embeddings, your RAG system retrieves irrelevant documents, and the LLM is more likely to hallucinate. With good embeddings, it retrieves the passages that actually match what the user is asking about. The embedding model is often the most overlooked component of a RAG pipeline, and the cost of embedding APIs can add up fast unless you use free alternatives.

Free Embedding API Options in 2025

| Provider | Model | Dimensions | Free Tier | OpenAI-Compatible |
|---|---|---|---|---|
| FreeLLMKeys | text-embedding-3-small | 1536 | Yes (via shared key) | Yes |
| Hugging Face | all-MiniLM-L6-v2 | 384 | Yes (inference API) | No |
| Cohere | embed-english-v3.0 | 1024 | Trial key (1K calls) | No |
| Google AI Studio | text-embedding-004 | 768 | Yes (free tier) | No |
| Ollama (local) | nomic-embed-text | 768 | Free (runs locally) | Yes |
| Jina AI | jina-embeddings-v3 | 1024 | 1M tokens free | No |

Using Free Embeddings via FreeLLMKeys

from openai import OpenAI

client = OpenAI(
    base_url="https://aiapiv2.pekpik.com/v1",
    api_key="sk-your-freellmkeys-key"
)

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Test it
vec = embed("What is a transformer model?")
print(f"Vector dimensions: {len(vec)}")  # 1536
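
The `embed` helper above sends one text per request. The embeddings endpoint also accepts a list of strings, so for larger document sets it may help to send texts in batches. The sketch below is one way to do that; `embed_in_batches`, the `embed_batch` callback, and the default `batch_size` of 64 are assumptions for illustration, not limits documented by any provider.

```python
from typing import Callable

Vector = list[float]

def embed_in_batches(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[Vector]],
    batch_size: int = 64,  # assumed value; check your provider's request limits
) -> list[Vector]:
    """Embed texts in fixed-size batches, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: list[Vector] = []
    for start in range(0, len(texts), batch_size):
        # Each call embeds one slice; results are appended in the same order.
        vectors.extend(embed_batch(texts[start:start + batch_size]))
    return vectors

# With the client defined above, embed_batch could look like this (not run here):
# def embed_batch(batch):
#     response = client.embeddings.create(model="text-embedding-3-small", input=batch)
#     return [d.embedding for d in response.data]
```

Passing the embedding call in as a function keeps the batching logic independent of the provider, so the same helper works with a local Ollama client.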

Building a Complete Free RAG Pipeline

pip install openai faiss-cpu numpy

import numpy as np
import faiss
from openai import OpenAI

client = OpenAI(
    base_url="https://aiapiv2.pekpik.com/v1",
    api_key="sk-your-freellmkeys-key"
)

# ── Step 1: Embed your documents ──────────────────────────

documents = [
    "Python is a high-level programming language known for readability.",
    "FastAPI is a modern web framework for building APIs with Python.",
    "LLMs are neural networks trained on large amounts of text data.",
    "RAG combines document retrieval with language model generation.",
    "Vector databases store embeddings for fast similarity search.",
    "FAISS is a library for efficient similarity search of dense vectors.",
]

def get_embeddings(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return np.array([d.embedding for d in response.data], dtype="float32")

doc_embeddings = get_embeddings(documents)

# ── Step 2: Build FAISS index ─────────────────────────────

dim   = doc_embeddings.shape[1]  # 1536
index = faiss.IndexFlatIP(dim)   # Inner product = cosine similarity (with normalized vectors)
faiss.normalize_L2(doc_embeddings)
index.add(doc_embeddings)

# ── Step 3: Retrieval function ────────────────────────────

def retrieve(query: str, top_k: int = 3) -> list[str]:
    q_vec = get_embeddings([query])
    faiss.normalize_L2(q_vec)
    scores, indices = index.search(q_vec, top_k)
    return [documents[i] for i in indices[0] if i >= 0]

# ── Step 4: RAG — retrieve then generate ─────────────────

def rag_answer(question: str) -> str:
    context_docs = retrieve(question)
    context      = "\n".join(f"- {doc}" for doc in context_docs)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer questions using only the provided context. If the answer is not in the context, say so."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content

# Test it
print(rag_answer("What is FAISS used for?"))
print(rag_answer("How does RAG work?"))
print(rag_answer("What is the capital of France?"))  # Should say "not in context"
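
To see what the normalize-then-inner-product search in Steps 2 and 3 computes, here is a dependency-free sketch of the same idea in plain Python. It is an illustration, not a replacement for FAISS: `normalize`, `top_k`, and the optional `min_score` cutoff are hypothetical helpers, and the two-dimensional vectors are toy values.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def top_k(
    query: list[float],
    docs: list[list[float]],
    k: int = 3,
    min_score: float = 0.0,  # hypothetical cutoff for dropping weak matches
) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k best docs by cosine similarity."""
    q = normalize(query)
    # Inner product of unit vectors equals cosine similarity.
    scored = [
        (i, sum(a * b for a, b in zip(q, normalize(d))))
        for i, d in enumerate(docs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in scored[:k] if s >= min_score]

# Toy 2-dimensional vectors standing in for real embeddings.
toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
hits = top_k([2.0, 0.1], toy_docs, k=2)
print([i for i, _ in hits])  # [0, 1]
```

A score cutoff like `min_score` is one possible way to avoid passing clearly unrelated documents to the LLM for off-topic questions. A sensible threshold depends on the embedding model and would need tuning on your own data.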

Using Local Embeddings with Ollama (Zero API Cost)

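# Pull the embedding model first:
#   ollama pull nomic-embed-text

from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint. The client requires an
# api_key value, but Ollama ignores it.
local_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)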

def local_embed(text: str) -> list[float]:
    response = local_client.embeddings.create(
        model="nomic-embed-text",
        input=text
    )
    return response.data[0].embedding

Which Embedding Model Should You Use?

  • Best quality, free via FreeLLMKeys: text-embedding-3-small — OpenAI's model, 1536 dimensions, excellent retrieval quality
  • Completely free, no API needed: Ollama with nomic-embed-text — good quality, runs locally, no rate limits
  • Multilingual: Jina AI's jina-embeddings-v3, a multilingual model suited to non-English content, with a 1M-token free tier
  • Fastest for prototyping: FreeLLMKeys — grab a key, copy the code above, running in 5 minutes

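For most RAG projects, start with text-embedding-3-small via FreeLLMKeys. It gives you OpenAI-quality embeddings at zero cost during development. When you move to production, you can upgrade to the official API or switch to local embeddings with Ollama. In both cases the client code only needs a different base_url, api_key, and model name. Keep in mind that moving to nomic-embed-text changes the vector size from 1536 to 768, and vectors from different models are not comparable, so you must re-embed your documents and rebuild the index.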
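
One way to make a provider switch explicit is to keep the connection details in a small config object. The sketch below is an assumption-laden illustration: `EmbeddingProvider` and `needs_reindex` are hypothetical names, the FreeLLMKeys and Ollama values come from the examples in this post, and it assumes the shared key serves the same text-embedding-3-small model as the official API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingProvider:
    """Connection details for one OpenAI-compatible embedding endpoint."""
    base_url: str
    model: str
    dimensions: int

# FreeLLMKeys and Ollama values are taken from the examples in this post;
# https://api.openai.com/v1 is the official OpenAI API base URL.
FREELLMKEYS = EmbeddingProvider("https://aiapiv2.pekpik.com/v1", "text-embedding-3-small", 1536)
OPENAI = EmbeddingProvider("https://api.openai.com/v1", "text-embedding-3-small", 1536)
OLLAMA = EmbeddingProvider("http://localhost:11434/v1", "nomic-embed-text", 768)

def needs_reindex(old: EmbeddingProvider, new: EmbeddingProvider) -> bool:
    """True if stored vectors from `old` cannot be searched with `new` queries."""
    return old.model != new.model or old.dimensions != new.dimensions

print(needs_reindex(FREELLMKEYS, OPENAI))  # False
print(needs_reindex(FREELLMKEYS, OLLAMA))  # True
```

A client could then be built with `OpenAI(base_url=provider.base_url, api_key=...)`, with the API key supplied separately, for example from an environment variable.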

FreeLLMKeys Team
Building tools for the AI developer community