Learn"Don’t be dense: Launching sparse indexes in Pinecone"Learn more

Pinecone has long supported sparse retrieval in serverless indexes, evolving from sparse boosting—where dense and sparse methods are combined—to now offering sparse-only indexes for greater control and precision. This new index type enables direct indexing and retrieval of sparse vectors, supporting both traditional methods like BM25 and advanced learned sparse models such as pinecone-sparse-english-v0, and is now available in public preview for all users.

What are sparse vectors?

First things first: what on earth is a “sparse” vector, and what is the difference between a “sparse” and “dense” vector in machine learning and information retrieval applications, anyway?

A dense vector is what we traditionally think of as coordinates that can be represented on an x/y/z axis, in “Euclidean” space (remember grade-school geometry?).

For example, a dense vector in 3 dimensions can be visualized as a point (or an arrow from the origin) in three-dimensional space.

For dense vectors we choose a “dimension” (in this example, 3) that corresponds to the length of the vector. Every position in a dense vector must hold a real-valued number, so the space required to represent the vector scales linearly, as O(dimension).

Now consider exact-matching text retrieval, where we can use sparse vectors to represent documents and searches. Say we want to find articles that contain exact matches for the text search: “Why did the price of Nvidia stock go up today?”

Let’s assign each unique “word” in this search query a unique number:

why → 0
did → 1
the → 2
price → 3
of → 4
Nvidia → 5
… → …
today → 9

Then, take this mapping and transform the text query into a vector, where the value at position i in the vector is set to 1 if the term is present in that query.

Query: “Why did the price of Nvidia stock go up today?”

Then we can similarly represent the article search results with the same mapping from word to number:

Article 1: “The price of Nvidia went up because they reported strong earnings”

Article 2: “PIMCO had a positive EBITDA of 2.5B in the second quarter.”

Now we can score the query against each article by performing a dot-product multiplication between the query vector and the article vector.
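To make this concrete, here is a minimal Python sketch of the whole toy example; the variable names and the simple binary (0/1) term weights are illustrative, not how Pinecone computes scores:

```python
# Toy example: build a vocabulary, vectorize the query and articles, and score them.
query = "why did the price of nvidia stock go up today"
article_1 = "the price of nvidia went up because they reported strong earnings"
article_2 = "pimco had a positive ebitda of 2.5b in the second quarter"

# Assign every unique word across the texts its own index.
vocab = {}
for text in (query, article_1, article_2):
    for word in text.split():
        vocab.setdefault(word, len(vocab))

def to_dense(text):
    """Binary bag-of-words vector: 1 at the index of every word present."""
    vec = [0] * len(vocab)
    for word in text.split():
        vec[vocab[word]] = 1
    return vec

def score(query_vec, doc_vec):
    """Dot product between the query vector and a document vector."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

q = to_dense(query)
print(score(q, to_dense(article_1)))  # 5 shared words: the, price, of, nvidia, up
print(score(q, to_dense(article_2)))  # 2 shared words: the, of
```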

A higher score for a given article compared to the others indicates, in this case correctly, that it is more relevant. This works much like the index at the back of a book (remember those?), which shows you the pages on which a given word or phrase appears.

In this toy example, we can represent each word with an index in a small dense vector, but in practice, the number of unique words in a collection of documents can reach millions (the number of unique words in the Oxford English Dictionary is over 600,000). The space required to store all of these representations as dense vectors renders them impractical: for 100 million articles with a modest total vocabulary size of 100,000, we would have to store 10 trillion real numbers, for a total size of about 40 terabytes.

By representing these vectors as “sparse”, we can take advantage of the fact that most articles will only contain a small subset of words from the total vocabulary. This means that 99%+ of values in the dense vectors from the set of articles are useless “0”s.

We can avoid storing these zero values by instead representing each vector as a compressed list of (index, score) pairs, where every non-zero value is recorded as its index (position in the vector) together with the real-number score at that position. The articles above, as well as the query, then boil down to a handful of pairs each, as sketched in the code below.

Assuming 50 non-zero values per vector and 8 bytes per (index, score) pair, the corresponding size of these sparse representations for 100M vectors would now be 50 * 100,000,000 * 8 bytes = 40GB, 1000x smaller (and cheaper) than before.

In order to score a query against any given document, we again perform the dot product, multiplying the corresponding scores at each pair of non-zero coordinates shared by the query and the document, and then taking the sum.
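Continuing the sketch above (reusing the vocab mapping, query, and article strings from the previous snippet), the sparse version stores only the non-zero entries and multiplies the scores at the indices the query and document share:

```python
def to_sparse(text):
    """Keep only the non-zero entries, as {index: score} (binary scores here)."""
    return {vocab[word]: 1.0 for word in text.split()}

def sparse_dot(query_sparse, doc_sparse):
    """Multiply scores at indices present in both vectors, then sum."""
    return sum(s * doc_sparse[i] for i, s in query_sparse.items() if i in doc_sparse)

q_sparse = to_sparse(query)
print(sparse_dot(q_sparse, to_sparse(article_1)))  # same scores as the dense version,
print(sparse_dot(q_sparse, to_sparse(article_2)))  # but without storing any zeros
```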

Why should you care?

Most people today use Pinecone for high-dimensional dense vector embedding search over text. These dense representations of documents are good at capturing semantic information, but they don’t achieve the same predictability and precision as traditional sparse or lexical search, because they lack the exact matching I’ve shown above.

That’s why sparse or lexical indexes are best suited for search cases where exact token matching is desirable. For example, if you have a corpus of finance articles and want to search for a specific stock ticker like “NVDA” you may want to get all the articles that reference this exact stock ticker. The results can then be ranked based on the contextual importance of that specific token within the document, such as with the pinecone-sparse-english-v0 model (learn more).

Sparse representations of text are also cheaper and faster to produce, since documents can be scored either with a heuristic such as BM25 or with a cheap model. This makes sparse indexes a good fit for applications that are sensitive to overall latency or cost.
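For reference, the per-term weight that a BM25-style heuristic computes looks roughly like this (k1 and b are the usual tuning constants; this is a generic sketch, not Pinecone's implementation):

```python
import math

def bm25_weight(term_freq, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Contribution of one term to one document's BM25 score."""
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    tf = (term_freq * (k1 + 1)) / (term_freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf
```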

Pinecone’s next-generation serverless architecture

Using our new serverless architecture, we’re now able to provide lexical/keyword search through the integrated inference API as well as a new fully managed “sparse” index type which can be leveraged for more custom text search solutions.

Additionally, we leverage our novel serverless LSM search architecture to provide a managed service that scales dynamically with any shape or size of workload, avoiding the need for up-front parameter configuration or manual shard scaling. This is in sharp contrast to existing sparse or lexical search offerings on the market.

Performance

Pinecone's sparse search outperforms existing state-of-the-art search systems such as Elasticsearch and OpenSearch. Using sparse embeddings produced by the pinecone-sparse-english-v0 model for the MS MARCO DL19 dataset of 8.9M vectors, we ran the following benchmarks on a single-node Elasticsearch cluster and an OpenSearch cluster running a single r7g.large.search node, and compared these to the newly available Pinecone sparse vector index type running on similar hardware:

We see that the performance of Pinecone vastly surpasses that of Elasticsearch and OpenSearch (running on AWS), especially when returning a large number of candidates (n=1000). In the following section, I will explain how the superior performance of Pinecone is achieved by a combination of optimal algorithms and low-level hardware optimizations.

Algorithms

In order to perform low-latency search over millions of sparse vectors and return only the most relevant documents (by dot-product score), we use an “inverted index” data structure. This data structure stores, for every unique word, the list of documents that contain that word, as sketched in the toy example below.

For every word in a search query, we look up the list of documents that contain that word. Then we leverage the “MaxScore” algorithm to score those documents in an optimized order, avoiding the computation of scores for documents that are guaranteed to fall outside the top results and are thus irrelevant anyway.
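As an illustration only (Pinecone's production engine is far more optimized), a toy inverted index over the two example articles, plus a simple term-at-a-time scorer, might look like the following; a real system additionally tracks per-term score upper bounds so that MaxScore can skip documents that cannot reach the top-k:

```python
from collections import defaultdict

docs = {
    "article-1": "the price of nvidia went up because they reported strong earnings",
    "article-2": "pimco had a positive ebitda of 2.5b in the second quarter",
}

# Inverted index: for every unique word, a posting list of (doc_id, score) pairs.
inverted = defaultdict(list)
for doc_id, text in docs.items():
    for word in set(text.split()):
        inverted[word].append((doc_id, 1.0))  # toy weight; BM25 or model scores in practice

def search(query, top_k=10):
    """Look up each query word's posting list and accumulate dot-product scores."""
    scores = defaultdict(float)
    for word in set(query.split()):
        for doc_id, weight in inverted.get(word, []):
            scores[doc_id] += 1.0 * weight  # query weight of 1.0 in this toy example
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

print(search("why did the price of nvidia stock go up today"))
```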

Non-lexical sparse vectors

Recent research has focused on various techniques for generating embedding vectors that feature the best characteristics of both sparse and dense embeddings. “SPLADE” most notably pioneered the approach, with OpenSearch also publishing a similar offering. These representations are more “dense” than lexical sparse representations since they rely on “word-piece” tokenization, rather than splitting words by whitespace. For example, using SPLADE the sentence:

Why did the price of Nvidia stock go up today

Might be split into “words” like:

wh y did the pri ce of nv da sto ck go up to day

Resulting in embeddings with a smaller vocabulary size, but more non-zero values per document.
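To see this kind of sub-word splitting yourself, you can run an off-the-shelf WordPiece tokenizer; the exact pieces depend on the tokenizer's vocabulary, so the output will differ from the illustrative split above:

```python
# Requires: pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
pieces = tokenizer.tokenize("Why did the price of Nvidia stock go up today")
print(pieces)  # common words stay whole; rarer words such as "Nvidia" split into sub-word pieces
```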

While still new and highly experimental, these types of embeddings represent a promising new direction for information retrieval research and are supported by Pinecone’s sparse index offering.

Getting started

The new sparse-only index is now available in public preview for all users. Sparse indexes are priced on read units, write units, and storage, which scale with the number of vectors and the amount of data involved. See our understanding costs documentation for more details.
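If you prefer to start from code, a minimal sketch with the Pinecone Python client looks roughly like the following; the parameter names (vector_type, sparse_values, sparse_vector) reflect the client at the time of writing, so treat this as a starting point and check the quickstart for the exact signatures:

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# Create a sparse-only serverless index (sparse indexes have no fixed dimension).
pc.create_index(
    name="sparse-demo",
    vector_type="sparse",
    metric="dotproduct",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("sparse-demo")

# Upsert a document as (indices, values) pairs, e.g. produced by BM25 or pinecone-sparse-english-v0.
index.upsert(vectors=[{
    "id": "article-1",
    "sparse_values": {"indices": [2, 3, 4, 5, 8], "values": [0.3, 1.2, 0.3, 2.1, 0.8]},
}])

# Query with a sparse vector and return the highest-scoring documents.
results = index.query(
    sparse_vector={"indices": [3, 5], "values": [1.0, 1.0]},
    top_k=10,
)
print(results)
```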

When combined with dense retrieval and reranking — an approach we call cascading retrieval — you get up to 48% better performance than either sparse or dense alone. Try it out in our notebook or learn more in our recent announcement. We'll update you with more benchmarks and features during our upcoming Launch Week, March 17-21.
