voyage-multimodal-3 Embedding Model

General-purpose multimodal embedding model optimized for strong retrieval quality, low latency, and cost-efficiency, with a 32K-token context.

Product Description

Overview
Multimodal embedding models convert multiple data types—such as text and images—into numerical vectors. They are essential for semantic search, retrieval systems, and retrieval-augmented generation (RAG), directly impacting retrieval performance. voyage-multimodal-3 is an advanced multimodal embedding model that uniquely embeds interleaved text and images while extracting visual information from PDFs, slides, tables, figures, and more—removing the need for complex document parsing. Across three multimodal retrieval benchmarks (20 datasets), it delivers an average 19.63% gain in retrieval accuracy over the next best model. It achieves 75 ms latency for single queries (≤200 tokens) and supports 57M tokens/hour throughput at $0.06 per 1M tokens on an ml.g6.xlarge instance.

Highlights

  • Embeds interleaved text and images while capturing visual cues from PDFs, slides, tables, and figures—no complex parsing required.

  • Achieves an average 19.63% improvement in retrieval accuracy across three multimodal tasks (20 datasets).

  • Supports 32K token context length with 75 ms latency (≤200 tokens) and 57M tokens/hour throughput at $0.06 per 1M tokens on ml.g6.xlarge.
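The stated throughput and price figures imply an hourly spend at full utilization; the short calculation below derives it from the two numbers quoted in the listing (57M tokens/hour and $0.06 per 1M tokens). This is simple arithmetic on the published figures, not an additional quoted rate.

```python
# Figures stated in the listing for an ml.g6.xlarge instance.
throughput_tokens_per_hour = 57_000_000
cost_per_million_tokens = 0.06  # USD

# Implied hourly spend when processing at full throughput:
hourly_spend = (throughput_tokens_per_hour / 1_000_000) * cost_per_million_tokens
print(f"${hourly_spend:.2f}/hour")  # 57 * 0.06 = $3.42/hour
```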
