Regular updates on the latest VC-backed AI startups. Follow along to stay informed!
Weaviate, an Amsterdam-based AI-native vector database, raised $50M in Series B funding. Index Ventures led the round, with participation from Battery Ventures.
Qdrant, a Berlin-based open-source vector database, raised $7.5M in seed funding. Unusual Ventures led the round, with participation from 42cap, IBB Ventures, and a handful of angel backers.
Pinecone, a New York-based fully managed vector database, raised $100M in Series B funding at a $750M post-money valuation. a16z led the round, with participation from ICONIQ Growth and previous investors Menlo Ventures and Wing Venture Capital.
Vector database startups have raised hundreds of millions of dollars in recent weeks. Today’s special edition breaks down how these databases work and why it matters for AI.
Problem to be Solved
Ever tried to ask ChatGPT for information on a specific, niche topic? You probably realized that AI chatbots based on large language models (LLMs) aren’t great when you need accurate answers grounded in specific data (they can be “confidently wrong” or “hallucinate”). LLMs seem smart because of the general knowledge gleaned from training billions of parameters on massive datasets, but they fail in use cases that demand accurate answers about specific data and facts because they have no memory or context.
Vector databases empower LLMs with memory and context. This makes it possible to build applications that talk to a specific expert on any subject, search specific YouTube videos or podcasts for relevant clips, run Q&A over an internal business database, or search the King James Bible in natural language.
As developers race to integrate LLMs into pretty much everything, the need to supplement foundation model prompts with context is becoming apparent, and vector databases are rapidly gaining popularity as a result.
We first covered vector databases when Chroma raised its seed round. Now we’re diving deeper into how they actually work.
How They Use AI
Let’s start with a simple analogy that explains what vector databases do, along with their pros and cons.
Imagine you spend a week reading the first book in The Lord of the Rings. When you are done, a friend asks you a few questions about the book. You answer their questions off the top of your head but get some details wrong because you can’t reference the book.
Then, they ask you a question about the second book, which you haven’t read yet. Lucky for you, you have the second book on hand, and you page through it until you eventually find a passage that answers the question. You quickly skim the page and then give your friend the answer.
A vector database functions in the same way as the second book in this analogy. Whenever a user asks a model about something the model has not “read,” it needs to reference the database in order to generate its answer, just like you needed to reference the book to answer your friend’s question.
So what are the benefits? In our analogy, a big one was that you were able to answer your friend’s question about a book you had not read yet. Additionally, you got all the details correct because you had the text right in front of you. Similarly, vector databases allow language models to reference material they have not been trained on, and reduce the likelihood that they make mistakes.
And the drawbacks? In our analogy, the biggest one is that looking through the text, finding the right passage, reading it, and then answering took much longer than just answering a question about a book you had already read. You could also only answer questions about small snippets of text, and you risked referencing the wrong passage. Likewise, vector databases increase the computational cost and latency of generating responses. This is why they try to store and retrieve context efficiently, much like a textbook’s table of contents helps you look things up quickly. Additionally, vector databases only retrieve small portions of text, so they cannot answer general or broad questions about the database’s contents as a whole. And last but not least, they are far from perfect: they will often retrieve suboptimal passages that may not even answer the user’s question.
Okay, now that we’ve built some intuition, let’s get a bit more technical. Vector databases sit between foundation models and the application layer in the generative AI stack.
The first step of integrating a vector database into an application is to embed and upload the unstructured data (typically just freeform text) into the database. Embedding the text is usually done by an external model (OpenAI’s embedding models are a popular choice). Embedding text just means using a model to turn it into a vector, i.e. a sequence of numbers like [1.2, 4.2, 8.9]. The text is embedded in such a way that semantically similar texts end up with numerically similar embeddings (similarity is usually measured by cosine similarity). The embeddings are then stored in a clever way to make retrieval really efficient (think back to the table-of-contents comparison from the analogy).
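To make the ingestion step concrete, here is a minimal sketch in Python. The `embed` function is a toy stand-in for a real embedding model (OpenAI’s embedding API, for instance), so the example runs without any credentials; the corpus and every name in the snippet are made up for illustration.

```python
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    # Toy stand-in for a real embedding model: derive a pseudo-embedding
    # from the text. A production system would call a trained model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(dim)
    # Normalize to unit length so a plain dot product equals cosine similarity.
    return vec / np.linalg.norm(vec)

# "Upload" step: embed each chunk of freeform text and store the text
# alongside its vector. A real vector database would also build an index
# structure over these vectors so retrieval stays fast at scale.
corpus = [
    "Frodo inherits the One Ring from Bilbo at Bag End.",
    "Gandalf reveals the Ring's history to Frodo.",
    "Sales grew 12% year over year in the EMEA region.",
]
index = [(text, embed(text)) for text in corpus]
```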
When the user submits a prompt, the backend embeds the prompt and computes the similarity between the embedded prompt and the embeddings in the vector database. The database then retrieves the text corresponding to the embeddings with the highest similarity and uses that as context to enrich the prompt.
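Continuing the sketch above (it reuses the toy `embed` function and the `index` list), the query step might look like this:

```python
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(prompt: str, index: list, k: int = 2) -> list[str]:
    # Embed the prompt, score it against every stored vector, and
    # return the text behind the k most similar embeddings.
    query_vec = embed(prompt)
    ranked = sorted(index,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

prompt = "Who gave Frodo the Ring?"
context = "\n".join(retrieve(prompt, index))
enriched_prompt = f"Answer using this context:\n{context}\n\nQuestion: {prompt}"
# enriched_prompt is what actually gets sent to the language model.
```

Note that this sketch compares the prompt against every stored vector. Production vector databases avoid that brute-force scan with approximate nearest-neighbor indexes, which is exactly the table-of-contents trick from the analogy.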
To summarize: despite their inability to answer broad questions about a corpus as a whole, their occasional retrieval of irrelevant passages, and the computational cost they add to an application, vector databases fulfill the essential need of giving language models context on specific and niche topics, enabling those models to function better in a variety of products.
Business Model
These businesses are built on open-source software, meaning users can run small-scale apps on vector databases for free. But of course, vector database companies have raised hundreds of millions of dollars in VC funding over just a few weeks because investors believe they will eventually produce cash. As applications scale, they eventually require commercial products running on fully managed infrastructure, and those products make money on a usage-based pricing model.
For example, Pinecone’s enterprise-level product starts at $104/month for one “index” (which stores ~5M vectors) on one “S1 pod” (1 vCPU and 20GB of SSD storage) running for 30 days at an estimated $0.144/hour.
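That hourly rate squares with the monthly figure: $0.144/hour × 24 hours × 30 days ≈ $103.68, or about $104/month.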