Introduction

The Inference.net Embeddings API provides a fully OpenAI-compatible interface for generating high-quality text embeddings. Embeddings are numerical representations of text that capture semantic meaning, enabling applications such as semantic search, clustering, recommendations, and anomaly detection. The API is a drop-in replacement for OpenAI’s embeddings endpoint, so switching providers typically requires changing only your base URL and API key.
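
For example, with the official OpenAI Python SDK, a minimal setup sketch looks like this (only the base URL and API key differ from an OpenAI configuration):
# Point the official OpenAI Python SDK at Inference.net.
# Only the base URL and API key change; existing embedding code keeps working.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key="YOUR_INFERENCE_API_KEY",  # e.g. read from the INFERENCE_API_KEY env var
)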

Key Features

  • OpenAI Compatible: Use the same code and libraries you already have
  • Multiple Models: Choose from a variety of embedding models optimized for different use cases
  • High Performance: Fast inference with low latency
  • Scalable: Handle high-volume workloads with our robust infrastructure
  • Asynchronous Support: Process large batches with webhooks for completion notifications

Quick Start

API Endpoint

POST https://api.inference.net/v1/embeddings

Basic Request

curl https://api.inference.net/v1/embeddings \
  -H "Authorization: Bearer $INFERENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3-embedding-4b",
    "input": "The quick brown fox jumps over the lazy dog"
  }'
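
The same request with the OpenAI Python SDK, using the client configured above:
response = client.embeddings.create(
    model="qwen/qwen3-embedding-4b",
    input="The quick brown fox jumps over the lazy dog"
)
print(len(response.data[0].embedding))  # dimensionality of the returned vector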

Request Parameters

Parameter        Type             Required  Description
---------------  ---------------  --------  ------------------------------------------------------------
model            string           Yes       The embedding model to use. See available models
input            string or array  Yes       Text(s) to embed. Can be a string or array of strings
encoding_format  string           No        Format for the embeddings. Options: float (default) or base64
metadata         object           No        Custom metadata for tracking and webhooks
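
For example, requesting base64 output can shrink response payloads. A decoding sketch, assuming base64 payloads follow OpenAI's convention of little-endian float32 values (an assumption based on the API's OpenAI compatibility):
import base64
import numpy as np

response = client.embeddings.create(
    model="qwen/qwen3-embedding-4b",
    input="The quick brown fox jumps over the lazy dog",
    encoding_format="base64"
)

# Assumption: base64 payloads are little-endian float32, as in OpenAI's API
raw = base64.b64decode(response.data[0].embedding)
vector = np.frombuffer(raw, dtype=np.float32)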

Batch Processing

You can embed multiple texts in a single request by passing an array:
// Point the OpenAI Node SDK at Inference.net
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "https://api.inference.net/v1",
  apiKey: process.env.INFERENCE_API_KEY,
});

const response = await openai.embeddings.create({
  model: "qwen/qwen3-embedding-4b",
  input: [
    "First text to embed",
    "Second text to embed",
    "Third text to embed"
  ]
});

// Access individual embeddings
const firstEmbedding = response.data[0].embedding;
const secondEmbedding = response.data[1].embedding;

Response Format

The API returns embeddings in the standard OpenAI format:
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023064255, -0.009327292, ...]
    }
  ],
  "model": "qwen/qwen3-embedding-4b",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

Asynchronous Embeddings

For large-scale embedding tasks, use our asynchronous API with webhooks to receive notifications when processing is complete.

Async Request

To make asynchronous embedding requests, use the /v1/async base URL and include a webhook ID in your request metadata:
curl https://api.inference.net/v1/async/embeddings \
  -H "Authorization: Bearer $INFERENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3-embedding-4b",
    "input": ["Text 1", "Text 2", "Text 3"],
    "metadata": {
      "webhook_id": "YOUR_WEBHOOK_ID"
    }
  }'
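
The same async request from Python using the requests library (a sketch; the embeddings themselves are delivered to your webhook, as shown below):
import os
import requests

resp = requests.post(
    "https://api.inference.net/v1/async/embeddings",
    headers={"Authorization": f"Bearer {os.environ['INFERENCE_API_KEY']}"},
    json={
        "model": "qwen/qwen3-embedding-4b",
        "input": ["Text 1", "Text 2", "Text 3"],
        "metadata": {"webhook_id": "YOUR_WEBHOOK_ID"},
    },
)
resp.raise_for_status()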

Webhook Payload

When the embedding process completes, you’ll receive a webhook with the following structure:
{
  "event": "async-embedding.completed",
  "timestamp": "2024-01-15T10:30:00Z",
  "webhook_id": "YOUR_WEBHOOK_ID",
  "generation_id": "EMB_abc123",
  "data": {
    "state": "Success",
    "stateMessage": "Embeddings generated successfully",
    "request": {
      "model": "qwen/qwen3-embedding-4b",
      "input": ["text1", "text2", ...],
      "metadata": { "webhook_id": "YOUR_WEBHOOK_ID" }
    },
    "response": {
      "object": "list",
      "data": [
        {
          "object": "embedding",
          "index": 0,
          "embedding": [...]
        }
      ],
      "usage": {
        "prompt_tokens": 100,
        "total_tokens": 100
      }
    },
    "finishedAt": "2024-01-15T10:30:00Z"
  }
}
See our webhook documentation for more details on setting up and handling webhooks.
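
A minimal receiver sketch using Flask (the route path is arbitrary and store() is a hypothetical persistence helper; signature verification and delivery details are covered in the webhook documentation):
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/inference", methods=["POST"])
def handle_embedding_webhook():
    payload = request.get_json()
    if payload.get("event") == "async-embedding.completed":
        data = payload["data"]
        if data["state"] == "Success":
            # Collect the embedding vectors from the completed job
            embeddings = [item["embedding"] for item in data["response"]["data"]]
            store(payload["generation_id"], embeddings)  # store() is hypothetical
    return "", 200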

Use Cases

Semantic Search

Embed your documents once, embed each query at search time, and rank documents by similarity:
# client is an OpenAI SDK client pointed at Inference.net (see Quick Start)

# Generate embeddings for your documents
documents = ["Document 1 text", "Document 2 text", ...]
doc_embeddings = []

# One request per document for clarity; batching inputs is more efficient (see Best Practices)
for doc in documents:
    response = client.embeddings.create(
        model="qwen/qwen3-embedding-4b",
        input=doc
    )
    doc_embeddings.append(response.data[0].embedding)

# Generate embedding for search query
query = "Find documents about machine learning"
query_response = client.embeddings.create(
    model="qwen/qwen3-embedding-4b",
    input=query
)
query_embedding = query_response.data[0].embedding

# Calculate similarities (cosine similarity shown)
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

similarities = [
    cosine_similarity(query_embedding, doc_emb)
    for doc_emb in doc_embeddings
]
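
Sorting by score then yields a ranked result list:
# Rank documents by similarity to the query, highest first
ranked = sorted(zip(documents, similarities), key=lambda pair: pair[1], reverse=True)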

Text Classification

Embeddings can be used as features for classification tasks:
# Generate embeddings for training data
# (training_texts and labels are assumed to be prepared lists of texts and class labels)
train_embeddings = []
for text in training_texts:
    response = client.embeddings.create(
        model="qwen/qwen3-embedding-4b",
        input=text
    )
    train_embeddings.append(response.data[0].embedding)

# Use embeddings as features for your classifier
from sklearn.svm import SVC
classifier = SVC()
classifier.fit(train_embeddings, labels)
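
At prediction time, embed new text with the same model before calling the classifier:
# Embed a new text and predict its label
new_embedding = client.embeddings.create(
    model="qwen/qwen3-embedding-4b",
    input="Text to classify"
).data[0].embedding

predicted = classifier.predict([new_embedding])[0]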

Best Practices

  1. Batch Requests: Send multiple texts in a single request for better efficiency
  2. Model Selection: Choose models based on your specific use case. Visit our models page to compare options
  3. Normalization: Some applications benefit from normalizing embeddings to unit length (see the sketch after this list)
  4. Caching: Store generated embeddings to avoid redundant API calls
  5. Async for Large Batches: Use webhooks for processing large datasets asynchronously
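
For item 3, a minimal numpy sketch of unit-length (L2) normalization; once vectors are unit length, a plain dot product equals cosine similarity:
import numpy as np

def normalize(embedding):
    # Scale the vector to unit length (L2 norm = 1)
    arr = np.asarray(embedding, dtype=np.float32)
    return arr / np.linalg.norm(arr)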

Error Handling

The OpenAI SDK raises typed exceptions, so rate limits can be handled separately from other API errors:
import openai

try:
    response = client.embeddings.create(
        model="qwen/qwen3-embedding-4b",
        input=text
    )
except openai.RateLimitError as e:
    # Too many requests: back off and retry (see Rate Limits below)
    print(f"Rate limited: {e}")
except openai.APIError as e:
    print(f"Error generating embedding: {e}")
    # Handle error appropriately
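
The SDK can also retry transient failures automatically; its client accepts a max_retries option (the default is 2):
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key="YOUR_INFERENCE_API_KEY",
    max_retries=5,  # automatically retry transient errors such as timeouts and 429s
)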

Rate Limits

Embeddings API requests are subject to our standard rate limits. See our rate limits documentation for details.

Pricing

Embeddings are priced per token. Visit our models page for current pricing information.

Next Steps