Promethean Wager - AI, Vector Databases, and Data Sovereignty

June 14, 2024

Stephen Batifol

In the frontier of AI, vector databases have emerged as a pivotal technology, sitting at the intersection of everyday ops concerns, AI application frameworks, and data compliance.

Join our CEO Vinay and Stephen Batifol, a seasoned Developer Advocate at Zilliz, as they explore the fundamental differences between vector and traditional databases, their symbiotic relationship with AI, and the implications of the large-scale adoption of Large Language Models (LLMs) driving AI’s current evolution, including the data sovereignty/AI dilemma. They also shed light on the strategic and practical implications of vector databases in today’s tech environments, envision their future trajectory, and much more.

Whether you’re a tech leader, a dev, or simply an AI enthusiast, this episode will give you insights into the landscape’s current shape and what you should think about when navigating the powerful, complex interactions of AI, vector databases, and data sovereignty.

Key insights

A second act: Vector databases have never been more relevant

The recent boom in vector databases has a lot to do with big leaps in AI, particularly with the rise of GenAI LLMs like ChatGPT. The need for Retrieval Augmented Generation (RAG) models to ensure output integrity has also played a significant role here.

But don’t forget that, AI hype or not, these databases aren’t exactly new and have been key in supporting ML-based systems that we take for granted, like the recommendation engines run by Amazon and Netflix. Take Milvus, which has been around since 2017, or tried-and-true algorithms like FAISS and HNSW that have been in the game for years.

LLMs as Mogwai: Be careful how you care for them

LLMs like GPT-4 are on the brink of changing the way organizations operate. That could be anything from how they are productized for external customers to how they are used to enhance the efficiency of internal operations, whether those be engineering, revenue, or HR.

We’ll likely see LLMs used in two ways moving forward: public versions for everyday tasks and private versions for critical workloads. But, especially for these models to be widely adopted for critical workloads, the aphorism Garbage In / Garbage Out still holds true. RAG will play a key role, but simply implementing a RAG pipeline isn’t sufficient; thorough data curation and validation processes are needed to achieve accuracy and trust.

Keep your data close, and your AI closer: How to ensure data privacy

As AI evolves, concerns about data privacy and its potential misuse by companies are growing. Individual privacy concerns aside, companies whose business models revolve around harnessing proprietary data to develop unique products face significant risks.

To address these concerns, robust data privacy guardrails should be implemented. For instance, companies can employ local AI models that operate within their own environments. On the database operations side, they can implement solutions like Zilliz’s “Bring Your Own Compute” (BYOC), further reducing the risk posed by bad actors.

The devil you know: When to go augmented or specialized

Deciding between a specialized vector database and augmenting a traditional database with vector search capabilities depends on your organization’s needs. For small use cases, solutions like PostgreSQL’s pgvector are affordable and efficient. But they will quickly hit their ceiling, whether that be in terms of load or feature sets like supported index count.

Specialized databases like Milvus could be your answer if you’re expecting significant volume and require robust feature sets. However, you need to bear in mind that the complexity of scale brings about operational challenges. So, consider your current use case’s needs, load growth assumptions, and how much complexity you’re ready to manage.

Episode highlights

Time in the sun: Why vector databases are experiencing a resurgence [3:30 – 4:58]

Vinay and Stephen discuss how ChatGPT and other LLMs have amplified interest in vector databases, though the likes of Milvus and others have been around and in use for a while. They delve into how these databases are employed outside of the current AI use case, their significant role for advanced AI applications, how they relate to RAG pipelines, and how ongoing improvements in deep learning models have made them more relevant than ever.

Timeless advice: Garbage in / Garbage Out still prevails [11:18 – 13:05]

Vinay and Stephen discuss the hurdles to widespread adoption of LLMs, noting their tendency to “hallucinate”. Stephen underscores the fundamental principle, Garbage In / Garbage Out, highlighting that more care needs to be taken in data curation. The approach should never be to toss an LLM a bag of trash, and expect one filled with rainbows in return.

“Can you eat rocks?” Limitations of data quality assurance methods [14:40 – 15:58]

They dive further into how traditional databases return precise results, while vector databases return approximations based on, well, vectors. This can result in downright ridiculous results, such as the infamous response to the “Can you eat rocks?” query. Stephen mentions using methods such as LLM as judge and golden queries, but they pose their own limitations, especially when you’re talking about scales reaching Google levels.

Healthy boundaries: What specific data privacy concerns does AI face? [16:16 – 17:52]

Stephen sheds light on the pervasive paranoia surrounding data privacy in AI, particularly among enterprises and privacy-conscious states like Germany. Concerns about OpenAI snooping into private emails or uncovering intellectual property have people clutching their digital pearls. He suggests that the “data is the new oil” mantra has never been truer, especially in crucial sectors like biotech where proprietary information is akin to black gold. Vinay highlights Slack and others’ use of historical user data for AI training, sometimes without explicit consent, further underscoring anxieties about data usage and control.

The value ladder: Where data sovereignty concerns rank when making tooling decisions [26:41 – 27:27]

Stephen’s stance is that developers prioritise ease of implementation and scalability when selecting solutions. They want something that’s quick to set up and operationalize without getting bogged down in complex infrastructure tasks. While options like Kubernetes give them flexibility and control over where they place their workloads, specific concerns around data sovereignty and other compliance standards might not always be top of mind.

Let us count the ways: Issues with proprietary database solutions [37:31 – 39:17]

Vinay and Stephen talk through the risks involved in relying solely on proprietary database technology for vector databases (and any solution for that matter), whether they be different lock-in vectors, inflated costs, implementation limitations and outright restrictions, or long-term viability. That’s not to say that open-source alternatives, especially recently, don’t come with their own controversy and set of considerations, but they remain the superior choice for those who want cutting-edge features, control, and flexibility.

So you want to adopt? Recommendations for enterprises considering adoption [49:17 – 50:08]

Wrapping up, Stephen highlights key recommendations for enterprises adopting vector databases. First, consider the potential scale of your needs, whether you’ll have a few thousand documents or anticipate significant growth. Next, look at the maintainability of the database, especially if it’s open-source. Finally, check independent benchmarks, like ANN-Benchmarks, to get an unbiased picture of potential performance and scalability.
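One number that benchmarks of approximate search typically report alongside speed is recall: how many of the true nearest neighbours the index actually returned. The sketch below shows the arithmetic only. The function name and the ID lists are made up for illustration and are not taken from any benchmark suite.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours that the approximate search found."""
    truth = set(exact_ids[:k])
    found = set(approx_ids[:k])
    return len(truth & found) / k


# Hypothetical result lists: IDs from an approximate index vs. an exhaustive scan.
approx_result = [7, 2, 9, 4]
exact_result = [7, 2, 5, 4]

score = recall_at_k(approx_result, exact_result, k=4)
print(score)
```

A recall figure only means something next to the latency or throughput measured at the same settings, since most indexes let you trade one for the other.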

Here’s the full transcript:

Vinay: Hello and welcome to another episode of Sovereign DBaaS Decoded, brought to you by Severalnines. My name is Vinay Joosery, Co-founder and CEO of Severalnines. Our guest today is Stephen Batifol, developer advocate at Zilliz. Thanks for joining us.

Stephen: Thanks for inviting me.

Vinay: Can you tell us a little bit about you and what you do?

Stephen: Yes, sure. I’m Stephen. I’m a developer advocate at Zilliz. I’m based in Germany, in Berlin. My job as a developer advocate is to raise awareness about a product. Zilliz is the owner and maintainer of Milvus, an open-source vector database, which we’ll probably talk about more in the future.

But my job is to talk to developers, explain to them how they can use Milvus, why they should use it, maybe why they shouldn’t use it sometimes, and to collaborate with different partners, like LlamaIndex and LangChain, and create content online.

Vinay: Yeah, that’s a brave new world, all these vector databases. We can say 2023 was maybe the year of vector databases, but not everyone understands what vector databases are or why they exist. Can you explain what these databases are and how they differ from traditional databases?

Stephen: Yeah, sure. Usually, with traditional databases, you would see it as like you do your SELECT * FROM a specific table, and then you get a returned result that is true and that is precise. It’s like you’re going to have specific results. Whereas with a vector database, you mostly look through vectors.

Those are like series of numbers, and then you look at the closest vector you have in what we call a vector space. You always look at approximations, and you don’t really return a result that is always true, basically. That’s the main difference with the traditional database.
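To make the contrast concrete, here is a minimal sketch that puts an exact SQL lookup next to a nearest-vector lookup. The table, the film titles, and the three-dimensional vectors are invented for illustration. The search here is an exhaustive scan, whereas a real vector database would normally use an approximate index such as HNSW.

```python
import math
import sqlite3

# Traditional database: an exact, deterministic lookup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, genre TEXT)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("Alien", "horror"), ("Heat", "crime"), ("Up", "animation")],
)
exact_rows = conn.execute(
    "SELECT title FROM films WHERE genre = ?", ("horror",)
).fetchall()

# Vector search: rank stored vectors by closeness to a query vector.
# These 3-dimensional vectors are made up; real embeddings have hundreds of dimensions.
vectors = {
    "Alien": [0.9, 0.1, 0.0],
    "Heat": [0.2, 0.9, 0.1],
    "Up": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, k=2):
    """Return the k titles whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, vec), title) for title, vec in vectors.items()]
    scored.sort(reverse=True)
    return [title for _, title in scored[:k]]


nearest_titles = nearest([0.8, 0.3, 0.1])
print(exact_rows, nearest_titles)
```

The SQL query either matches a row or it does not. The vector query always returns the k closest items, even when nothing in the collection is a good match.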

Vinay: Okay. Why are they a hot topic today? The hardware, the applications have made huge strides. Something else?

Stephen: Yes. Well, I guess you heard about it, but ChatGPT was released, and then everything went all in on vector database. They’ve been here for a while. For example, Milvus has been here since 2017, if I remember correctly.

It’s not like ChatGPT was created, then we arrived. It’s been here for a while, and now it’s really a hot topic for some different applications. One is RAG, you know retrieval augmented generation. But also now it’s really interesting because a lot of algorithms have been released in the past.

They’ve been open source, so like FAISS, HNSW, those are different algorithms which are used by vector databases. Those have been released in 2016, 2017, and they are, yeah everyone is using them. That’s one of the reasons why it’s a hot topic now.

Also, the deep learning models have been getting better and we have more and more data, which means that then we can use the embedding models to transform more data from unstructured. Let’s say, take an image, you take an audio or something, you put it to a deep learning model, then you have your vectors, and then you can do a similarity search on it. That’s also a reason the deep learning models have been getting better.

Vinay: You mentioned ChatGPT, which obviously everyone in the world now knows. But would vector databases be as popular as they are without AI and machine learning workloads? Do they offer any additional capabilities that may offer non-AI use cases?

Stephen: Yeah, I’d say for sure it wouldn’t be as popular. Let’s not lie. It wouldn’t be everywhere. They clearly wouldn’t be as popular. We still need deep learning models anyway to transform our data. It’s also the thing we need all the time. But then they’re really good at doing some clustering or classification, or just running similarity search in general.

Without AI, which I guess you mean like LLMs and everything, which clearly wouldn’t be as popular, but then you can still run similarity search on images or audio if you’re building a recommender system. Technically, I guess it’s not AI as the way we call AI now. It’s traditional machine learning. But I don’t know, Amazon, Netflix, they all run a recommender system. Then they use vector databases in the back to provide those.

Vinay: The thing is, LLMs, GPT-4, I guess they are the poster child for the broad application of generative AI. But what do you see is their future in a way? What is the true potential in terms of adoption? Do you see, for example, LLMs in every organization?

Stephen: Yeah, that’s a good question. I feel like I can’t really predict the future, but what I feel like is that, yes, actually, it will probably be used in a lot of companies. Maybe they won’t use it directly with their customers, but it can simplify some processes. Let me give you an example: I recently signed a contract in Germany for a flat.

This contract is handwritten, it’s printed, and it’s in German. What I did was that I took a picture of it, I put it through ChatGPT 4 and I was like, “Can you summarize it for me and give me all the action points that I need to do?” Then it read through it, translated it, and gave me the action points I had to do.

I was like, say, maybe that’s the way I see it being used everywhere in the future. Maybe not you serve it to your customer because I don’t know if it’s always useful, but to process some things to simplify the life of your employees. I think, yeah, we could have some cool things in the future.

Vinay: I guess some of these use cases can be mission critical. Some of them more like, well, they are helping you in your daily, whatever things you’re doing. Do you think we’ll see a split in the use cases, maybe publicly available implementations for low-stakes workloads and private implementations for maybe high-stakes ones?

Stephen: Yeah, definitely. I think it’s what people are realizing at the moment. I guess it’s the same for you. You probably see it on your side. But a year ago, everyone was using ChatGPT and we were like, let’s go. We use the API. We go all in on it.

Then they realized they had a couple of problems with it. I mean, it’s still an amazing system, don’t get me wrong here, but you need to fine-tune it, you need to make sure it’s correct, you need to use your own data. It can be also a bit hard to scale or maybe more expensive, I should rather say. Suddenly, if you have millions of customers, it’s going to become very expensive.

Also, you don’t really have SLAs with those private models. Sometimes they don’t tell you, “Hey, we’re going to make sure it actually works all the time, or 99.99%.” Those are the things where I think we’ll have different things in the future, and we can already see it with LLaMA 3. I see some customers are using LLaMA 3, for example. They’re trying to deploy it.

Some of the customers, they work for governments, and they are like, we’re just not allowed to use OpenAI or any private models, actually. We have to host it ourselves. Others are like, we have to use open-source everywhere. We’re not allowed to use US software as well. I’m talking to some companies that are in Central Asia, and they’re like, yeah, we’re just not allowed to use American software. I do see different things now in the future, or probably some that you host yourself as well.

Vinay: Right. Coming back in terms of adoption, what is needed for these LLMs to, let’s say, for us to see large-scale adoption? I mean, as you mentioned, there is still a lot of testing to be done. Some are still producing incorrect results, what maybe some people call hallucinations.

I mean, is the solution training them on your data, on your own data, as implied in one of the previous webinars at Zilliz? And if so, will everyone have their own vector database in the future?

Stephen: Yeah, I think it’s a good question. I think it’s a very relevant question. I don’t know if you’ve seen, but Google recently released their new LLM-powered search. Might have backfired a bit because I think I’ve seen examples on Twitter of people being like, “Oh, can I eat rocks?” And Google being like, “Yes, it’s totally fine to eat rocks. You may eat one rock per day. It’s good for you.” What’s interesting is that before the whole AI and LLM, people were really careful with their data. They were really cleaning the data, really made sure the data was correct. I feel like we forgot that.

We were like, you can just throw everything at the LLM and then it’ll figure things out itself. Actually, now you can see that it’s still the same thing. It’s like, garbage in, garbage out. You still need to be very careful with your data. I do think fine-tuning can really help you in some cases, but you also have to be careful with the data. Just don’t throw like Reddit answers out of the blue because some answers on Reddit are amazing, but then some of them will tell you to eat rocks.

I feel like that’s probably the biggest thing in the future to be careful with. Then to reply to your second question, will everyone have a vector database. I think it can become some kind of commodity when you use LLMs in general, because even if you don’t fine-tune, you’ll probably want to add your own data.

Then you do what you call RAG, usually, which is retrieval augmented generation, which means that you take your data and then you store everything in your vector database, and then the LLM will fetch everything. It’ll run the similarity search on it, and then it’ll give you a result. That’s what I see a lot because it’s fairly easy to do as a POC. It’s quite hard to scale. It’s not an easy thing to do.

People always say, yeah, RAG is very easy. It’s 20 lines of code. Yes, it is 20 lines of code when you do a POC and you have five documents. If you have a million documents, it’s a bit harder. I do, in my opinion, see it everywhere, especially to ground LLMs. But then it’s also, I think, we have to, I don’t know if educate is the right word, but we have to really tell developers, hey, garbage in, garbage out.

Be careful on your data. Just because you put RAG doesn’t mean everything will work. And you’re like, yeah, everything is correct now, as we’ve seen on Google.
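The "20 lines for a POC" version of RAG that Stephen describes looks roughly like the sketch below. Everything in it is a stand-in: the three documents are invented, `embed` is a toy word-count function rather than a real embedding model, and the final call to an LLM is left out, so the sketch stops at building the grounded prompt.

```python
import math
import re
from collections import Counter

# Hypothetical, hand-curated corpus. Curation is the "garbage in" part that matters.
DOCUMENTS = [
    "Milvus is an open-source vector database.",
    "pgvector is an extension that adds vector search to PostgreSQL.",
    "Kubernetes is used to deploy and scale Milvus.",
]


def embed(text):
    """Toy embedding: lowercase word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# "Store everything in your vector database": here, just a list held in memory.
INDEX = [(doc, embed(doc)) for doc in DOCUMENTS]


def retrieve(question, k=1):
    """Similarity search: return the k documents closest to the question."""
    query_vec = embed(question)
    ranked = sorted(INDEX, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(question, k=1):
    """Assemble the grounded prompt that would be sent to the LLM."""
    context = "\n".join(retrieve(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("What is pgvector?")
print(prompt)
```

Nothing in this sketch checks whether the retrieved text is true. If the corpus contains a wrong statement that happens to be similar to the question, it goes straight into the prompt.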

Vinay: That’s interesting that you mentioned Reddit. I mean, Reddit did a deal with Google. I think, I don’t know, $40 million a year or something like that, right?

Stephen: Yeah. OpenAI just did a deal with Reddit. So it’s hard to know because obviously everything is on Reddit. It’s not like everything is true or it’s people discussing things, but it’s a huge amount of information in there. But then not all of it is accurate, obviously.

Vinay: Given the, let’s say, this decision, let’s say, process opacity, if we can say. What guardrails can we implement to, let’s say, validate outputs, ensure accuracy, identify potential bias within the training data? There’s also talks about machine learning, machine un-learning, in a way. What’s your thoughts?

Stephen: I feel like it’s a very complicated one. I’ve seen a lot recently of LLM as a judge, for example, where you use an LLM to then verify your LLM queries. Also, to have some golden queries where you know the truth and you make sure that your LLM is saying the truth. But it’s also like, would you ever have thought of writing down, “Can you eat rocks?” It’s also the thing.

What I feel like is that usually it can be good to have different sources, check different sources, check different things, and then return, depending on what you’re doing, but then you can ask your LLM to be like, okay, have three, four sources. Are they conflicting each other? Are they saying the same thing?

Then if they’re conflicting each other, maybe you can give a reply, being like, actually, I’m not so sure. Please, you should check out yourself. Or then if the sources are the same, you can be a bit more confident, I would say. From what I’ve seen a lot, LLM as a judge seems to work a bit, but then to the scale of, for example, Google.

No one, I think, has figured it out yet, hence why Google has these problems. But I feel like for smaller companies, maybe to reply for that one, smaller companies have some golden queries in a way. You want to make sure that the LLM is giving the correct answer and have some data set that you can use all the time. Then LLM as a judge as well, it seems to be quite popular from what I talk to customers and what I’ve tried on.
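A golden-query check of the kind Stephen mentions can be as small as the sketch below. The queries, the expected keywords, and `stub_model` are all hypothetical. In practice `stub_model` would be replaced by a call to the real pipeline, and a keyword match is a deliberately crude pass criterion.

```python
import re

# Hypothetical golden set: questions with a word the correct answer must contain.
GOLDEN_QUERIES = [
    {"question": "Can you eat rocks?", "must_contain": "no"},
    {"question": "Is Milvus open source?", "must_contain": "yes"},
]


def stub_model(question):
    """Stand-in for the real LLM pipeline; returns canned answers for this sketch."""
    canned = {
        "Can you eat rocks?": "Yes, one rock per day is fine.",
        "Is Milvus open source?": "Yes, Milvus is open source.",
    }
    return canned.get(question, "I am not sure.")


def run_golden_queries(model, golden):
    """Return the (question, answer) pairs where the expected word is missing."""
    failures = []
    for case in golden:
        answer = model(case["question"])
        words = re.findall(r"[a-z]+", answer.lower())
        if case["must_contain"] not in words:
            failures.append((case["question"], answer))
    return failures


failures = run_golden_queries(stub_model, GOLDEN_QUERIES)
for question, answer in failures:
    print(f"FAILED: {question!r} -> {answer!r}")
```

As Stephen points out, the weak spot is coverage: the check only catches the questions somebody thought to write down. An LLM-as-judge setup would swap the keyword test for a second model that grades the answer.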

Vinay: Moving over a little bit to data sovereignty concerns around AI and how these affect vector databases. From your experience, what are the data privacy concerns that AI applications face?

Stephen: I feel like everyone is a bit scared that OpenAI or any companies will use their data, their private data, so especially enterprises. I live in Germany, for example, which tends to be very data privacy-focused as well. A lot of people are scared of, oh, will OpenAI know about my private life or will they read all my emails? And then I can ask, “Hey, what did Vinay say on this email?” And OpenAI is giving me all the information. I feel like that’s what a lot of people are scared of.

They really don’t want to share their private data, which makes sense, especially companies. Like enterprises, you don’t want to share, I don’t know, if you have an investment round or something. You probably don’t want to share that with the rest. I feel like those are the main ones.

Then it’s also about the usage of data. We used to say data is the new oil. I guess you see it now with the LLMs, but it’s also like, do you want to keep it to make sure that no one can train on it? Then you stay the best. You stay on top of your game because you have access to the data, for example. I think it’s a big problem.

Also, a company I know, they’re creating proteins. They create proteins, they create vegan proteins, and their whole success is how they treat the data, who treats the data, and how they process it, and how they use an LLM model for it. For those, it’s extremely important that it’s not leaked anywhere. Otherwise the company is not doomed, but it’s really their key advantage.

Vinay: Actually, I even saw last week that as a Slack user, I didn’t realize this, but your historical data is being used to train some of these AI models, and apparently it’s somehow used to make, I don’t know, you’re going to get a better experience by the training. But then the question is, well, what happens if all these companies are having their Slack data trained?

How is it that it’s used as training data, but it only benefits the organization, so your private data doesn’t leak? It’s a thing. For Slack in this context, it’s quite expensive. So as a company, it’s also like, I don’t know how you put yourself as a company being like, oh, yeah, I’m paying like, I don’t know, 10,000.

And then Slack could just use your data. But I’ve seen the Slack response as well. I think they were saying that they have different LLMs as well running on Slack, basically. They have a smaller one that is running a new private company to predict emojis and things like that. But yeah, I think it’s a tricky one.

I think a lot of companies are now going to either ban sourcing directly the data. Reddit, for example, is very likely banning if you try to get the data and if you don’t make a partnership with them. It was the same for news. I don’t know about journalism, how are they going to deal with that in general.

Vinay: If in the future everyone will be using some AI in their workloads, will everyone bring AI to the data to ensure data security and privacy, or at least in highly regulated industries? Because the other way is you have AI there somewhere in the cloud, and then you bring your data to it. How do you see this?

Stephen: I think it really depends on your use case. For example, one thing I like with Zilliz in general, so Zilliz is the managed Milvus. We have Zilliz Cloud. And then we have what is called BYOC, which is bring your own compute. So you have the Zilliz Cloud that is just managing the control plane and then everything.

Then all the data stays in your cloud, so then we never see it, we never touch it. I think it’s becoming more and more popular to not fully trust companies and be like, yeah, we are like a cloud company or whatever.

It’s becoming more and more popular to keep your data, keep it in your, I don’t know, AWS or Google Cloud or Alibaba or wherever you are, and then have the control plane somewhere else from the company so they don’t really talk to each other.

I think it’s becoming… Yeah, I see it more and more. I’ve seen it from some of our customers as well, asking, “Hey, I’d love to use your tool, but I don’t want you to have access to our data at all.” Yeah, that’s something that I see more and more common. So everything is private.

Vinay: Actually, you mentioned, you bring your own cloud from Zilliz Cloud, right? So you’re allowing customers to keep their data in their, maybe in a virtual private cloud, and then you have the control plane somewhere else in your network, and you’re actually managing those databases, right?

Stephen: Yes.

Vinay: And the assumption is, I guess, I’ve seen support for AWS, GCP, and Azure, which are the most common ones. But what about those who use other clouds, Alibaba or IBM?

Stephen: Yeah, or Baidu. Yeah.

Vinay: Because there’s also a lot of data sitting in all these enterprise data centers, right? A lot of stuff which has been there for 10, 15, 20 years. How do you handle that?

Stephen: Yeah, those are like… I think it’s also a very interesting point. I know, for example, that Zilliz is also compatible with Alibaba and different cloud providers. But then, yeah, I think it’s where also AWS and GCP are trying. They’re trying to make it easy to transfer data from different cloud providers.

On our side, I know we don’t support bring your own compute on Baidu or IBM. I think it’s mostly we would probably likely do it if a customer comes in and is like, hey, by the way, I really want that. We are actually at the moment on IBM Watson, we really need it. We could probably do it because then it’s mostly about how you create a cluster on AWS or GCP or then IBM.

But then that part, I would say, is not the hardest part. James, who is our head of engineering, the way he created that part is to make it very easy to then deploy it to other clouds or other regions as well. It’s something that is possible, but then we need to have customers to make it because in the end, we’re still a startup. We don’t want to invest in, let’s say, IBM Watson and no one comes. That’s how it works.

Vinay: I guess, I think we’ll come back to that a little bit later, but the way that you guys have made it possible to run in multiple environments, you’re using Kubernetes as the common denominator across all these environments.

Stephen: Definitely, yeah. So Kubernetes is where we usually tell you to deploy Milvus, and then in the end, Zilliz is also running on that. But yeah, it’s really like we also support multi-cloud. So then if you are also on AWS or GCP, then it’s also possible. But yeah, Kubernetes is where we run things, and that’s also the way to operate and the way to scale it.

We also have Milvus Lite, actually, which has been recently released. The idea with Milvus from the beginning was, we want to make it possible for you to scale up to billion-plus vectors easily. But then with the rise of GenAI developers, especially since last year, we also had to focus a bit more on beginners.

You don’t want to start your POC with your app where you have no idea what LLM is working and everything, and we have to tell you, hey, you have to deploy on Kubernetes first. We released Milvus Lite, which runs everything in your memory, which makes it also way easier for you. But for scale, Kubernetes is definitely the answer.

Vinay: Interesting that you mentioned, let’s say, developers using something easy because you don’t want them to install a big infrastructure, right?

But very often data sovereignty is not really the reason they would pick something, right? Developers would usually just pick something that just helps them to get where they need to get without much fuss, because they have their application or whatever to build. They don’t want to be handling database stuff.

But the question is, will data sovereignty impact the choice of a vector database? You talked about Lite, which is a flexible deployment option. And with Kubernetes, people can stay in control of their infrastructure, whether it’s for compliance or cost benefits. But how do you see the role of sovereignty when choosing a vector database?

Stephen: I think it’s a good one. It’s always a complicated one. I think in the end, it’s like if I were to be a developer now, like I was in the past, I would really start off with, okay, how easy is it to use and how easy is it to start with? Then, how can it scale up? Is it easy to scale it up? That’s mostly what I would look at, more than data sovereignty directly. I would really be more like, oh, is it easy to manage?

Am I going to spend the rest of my three months in deploying a database and trying to scale it up and then become a Kubernetes expert? That’s usually how I would see it. And yeah, there is a difference here, as a developer, at least, and that’s just my personal opinion, but that wouldn’t have a huge impact, I would say.

Vinay: So let’s continue. Let’s talk a little bit about adopting vector databases. There must be some decision tree. So the question I would say, and that’s not just with vector databases, that question is even earlier. Fifteen years ago, we suddenly talked about polyglot persistence. Okay, you always had SQL, and suddenly you have NoSQL, you have document databases, you have key-value stores, you have whatever, column-oriented databases, graphs.

Now, we talked about vector. Do you specialize or do you go general? You can see people like Oracle or even PostgreSQL, they’ve done multi-model databases, different models.

So for example, in PostgreSQL, you can have your JSON documents, you can index those documents. So there are sometimes conversations, at least from some parts, that a dedicated vector database may be overkill. But how do you decide whether you need a specialty database or you can get by using a traditional database which has been extended with vector search, like pgvector or even something else as common as Redis?

Stephen: Yeah, I think it’s a good question. The way I answer that question is usually dependent on the use case. For me, that’s always been the case, and even before. In my previous job, I was working as a machine learning engineer for the ML platform for a food delivery company. We started without an ML platform because we were like, we can just go generic. We don’t need an ML platform dedicated for doing machine learning. At one point, you reach a scale where you’re like, oh, actually, we need an ML platform now. That’s usually the way I would answer for a vector database.

You might not need one at the beginning. You might go for pgvector, which is you already have your PostgreSQL database. You’re happy with it. A lot of people love PostgreSQL, to a point sometimes where I’m like, wow, you really love PostgreSQL. But then at the beginning, I would be like, yeah, you can try it out. Honestly, it doesn’t cost much.

You have the extension and that’s it. The only problem, and I’m not talking for myself here, but talking for my customers or people I talk to, is that very quickly, you run into a problem with scale. That’s one of the problems. I’m not talking millions of documents. I’m just talking like tens of thousands of PDFs, which are not too big. They are not too big, but they’re also not too small. But then at one point, you’re going to start to struggle because they usually run everything in memory. For example, they only support a couple of indexes or only one. Of course, they have to make a trade-off. You can’t have a full vector database.

That’s usually the way I answer, it’s like, okay, maybe try pgvector, it’s an extension, you can use your JSON and you can make your query. But very likely, if you think you’re going to run into having hundreds of thousands of documents or just a bit of scale, you’ll reach the limit very quickly.

That’s what I would say. Then it also depends on how you want to do it. Milvus, for example, you have to manage yourself because it’s open-source, it’s running on Kubernetes and everything. But if you want to do a quick POC, or you don’t want to handle that, we have Zilliz Cloud, which removes all of that. That’s also where you can ask, okay, how much is the cost of me managing everything?

How much is the cost of me, I don’t know, extending pgvector? I’m going to continue with these examples. Some customers still use pgvector, but they are implementing metadata filtering on their own, and they’re implementing on their own a lot of things that we have, because pgvector doesn’t support them. I was like, maybe at some point you should look somewhere else, because they are a startup and they just raised their pre-seed.

They’re also spending a lot of time on those different things because they continue with pgvector, because they use PostgreSQL. I think at some point you might also need different capabilities, like, I don’t know, hybrid search or metadata filtering or different things like that, which you might not have in pgvector.
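To make the metadata filtering Stephen mentions concrete, here is a minimal sketch in plain Python. It is not pgvector’s or Milvus’s actual API: it simply keeps the documents whose metadata matches a filter, then ranks the survivors by cosine similarity to the query vector. The `Doc` type, the `filtered_search` function, and the sample data are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Doc:
    # Hypothetical record: an id, an embedding, and free-form metadata.
    doc_id: str
    vector: list
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(docs, query, filters, k=3):
    """Apply the metadata filter first, then rank what is left by similarity."""
    candidates = [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda d: cosine(d.vector, query), reverse=True)
    return [d.doc_id for d in ranked[:k]]


docs = [
    Doc("invoice-2023", [0.9, 0.1], {"type": "pdf", "year": 2023}),
    Doc("invoice-2024", [0.8, 0.2], {"type": "pdf", "year": 2024}),
    Doc("memo-2024", [0.1, 0.9], {"type": "pdf", "year": 2024}),
    Doc("chat-2024", [0.9, 0.1], {"type": "chat", "year": 2024}),
]

results = filtered_search(docs, query=[1.0, 0.0], filters={"type": "pdf", "year": 2024}, k=2)
print(results)
```

A dedicated vector database would typically push this filter into the index itself rather than scanning every record, which is the kind of work the customers described above end up rebuilding by hand.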

Vinay: You spoke about scale. Are there any operational considerations unique to a vector database? Would they mirror those of other, let’s say, NoSQL databases? I guess for Milvus, Kubernetes is probably the main thing to operate. Would that be the difference? Because I see that in the architecture, there are quite a few components in there.

Stephen: Definitely, yeah. For Milvus, the way it works is that you always have sharding, and partitioning as well. Depending on your data, you can do that too, like you have, for example, with MongoDB. Then the way it works, in particular for Milvus, is that everything is independent.

Milvus, you can scale it up horizontally and vertically, and you can scale up every component as well. I like to explain it like you have three different things on Milvus. You have the index nodes, you have the data nodes, and the query nodes. You can scale up and down those depending on your data, depending on your needs as well. If you have a lot of data, but you don’t make queries often, then you can scale up the data node. Then if you’re going to have a lot of queries, then you can scale up the query nodes. That way, one doesn’t impact the other, and they’re fully independent.

I think that’s one thing that, from a software engineering point of view, I’m very impressed by, just because in the past I’ve run different databases, and usually, especially on Kubernetes, it can be hard to scale some components up and down, whereas here you really have control over whatever you want to scale up and down.

Maybe you’re going to build a new index because you have some new data, or you want to do something else, and then you can just scale that up instead of struggling and waiting while your queries are slow because you’re indexing everything. So yeah, usually that’s how we go for scale. It’s fully distributed, and it scales both vertically and horizontally.

Vinay: Yeah. And I guess you need to have pretty good, let’s say, metering of metrics, performance management, because I guess the act of scaling itself is done by the Kubernetes operator, so you can tell it to scale. But then at some point, you need to have the information to know what to scale.

Stephen: Of course, yeah. You can really monitor it. I don’t know, I’ve done it in the past using Grafana, or you can use whatever you want. You have all the metrics available, where we tell you, hey, you have a lot of usage on your query node, for example, maybe you should scale it up.

It doesn’t say it like that, obviously, but you get the idea. You can see your metrics and the P99 will be way higher, and then you’re going to struggle. That’s usually the way to do it.
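As a rough illustration of that check, the sketch below computes a P99 latency from a list of query timings using the nearest-rank method and compares it with a latency budget. The function names, the sample latencies, and the 200 ms budget are assumptions made for illustration; a real setup would read these numbers from a monitoring stack such as Grafana rather than compute them by hand.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_scale_query_nodes(latencies_ms, budget_ms=200.0):
    """Hypothetical rule: flag the query nodes when P99 latency exceeds the budget."""
    return percentile(latencies_ms, 99) > budget_ms


# 100 samples: most queries are fast, the slowest few are not.
latencies_ms = [20.0] * 97 + [450.0, 500.0, 900.0]
p99 = percentile(latencies_ms, 99)
print(p99, should_scale_query_nodes(latencies_ms))
```

The average of these samples is under 40 ms, which is why a tail metric like P99 tends to surface the struggle before the mean does.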

Stephen: With Zilliz, what is nice is that we obviously manage that for you. That’s part of the cloud offering. We have auto-scaling everywhere for you. It’s the same for indexing: we have an auto-index. Depending on your data, we also pick the best index for you. Because one thing with Milvus, for example, which is, I guess, good, but can also be a bit confusing, is that we support 15 different indexes.

It’s like, yes, it’s amazing. You can pick whatever you want, but as a developer, it might be complicated for you as well to choose which one. Obviously, then we have docs explaining which one you should use and when. But for example, on Zilliz, you don’t have to do that because we do that for you.

Vinay: So you have a lot of those indexes. But in a way, it’s similar even to relational databases, right? I mean, you can read the docs, there are ways. Are you doing a lot of full table scans? Well, you probably need to put an index on that column. There are a number of things you do, which probably every DBA knows, right? But then you don’t need to learn about them.

Stephen: Yes. Exactly.

Vinay: Moving a little bit to, let’s say, licensing, right? Using the open-source software versus fully managed. So in your company’s case, let’s say, Milvus can be fully downloaded. It is available under a permissive, I believe, Apache license.

Stephen: Yeah, Apache license.

Vinay: An enterprise can download it and use it internally, or you can get it via Zilliz Cloud. Pinecone is another popular vector database, but it’s only available as a managed service, I believe. So is it a risk to rely on a technology that is only available as a cloud service?

Stephen: I would say yes, but I come across as very biased, obviously. So I’m just going to explain the reasoning, which is that, well, in my previous job, the whole platform was running on open-source, for example. That means we could fork whatever we wanted. If there was a bug, we could also fix it ourselves.

There are a lot of times where you do something that is a bit out of the box and then you encounter a bug. Then if you’re on something that is proprietary, you have to wait for them to fix it, which can take a while as well. I’d say that’s one risk when you rely on something that is closed source. The other one is that they may shut down, they may run out of money at some point if they don’t find customers, or they might increase the prices.

Then if you’re too locked in, you have no choice but to accept it and pay more, usually. That’s another problem, I would say. Then sometimes they’re also going to limit what you can do with it. Maybe you can’t build a business on top of it that is a direct competitor.

I don’t know if it’s the case for Pinecone. I’m just giving examples here. Those are, I would say, the three main reasons where I would say it can be a bit risky, especially coming from a price point of view. It can get very expensive very quickly.

Then it can also be hard to move on to another database, for example, because your data format works for Pinecone, but maybe it doesn’t work for another one. So yeah, those can be a bit risky, I would say.

Vinay: Looking at the open-source landscape, and this is something you alluded to a little, I think, in the third reason you spoke about. I mean, several database companies have changed their open-source license to source available.

Stephen: Yes.

Vinay: MongoDB, Redis, Elasticsearch, even HashiCorp moved away from open-source. And Milvus has a permissive Apache 2.0 license, so somebody could offer it as a service and compete against your company. What are your thoughts on that?

Stephen: Yeah, I mean, I guess it’s the main debate at the moment. Also, to be clear, Milvus was donated to the Linux Foundation, so it’s something where we would be fine if you offered it as a service and managed it as a competitor. And we always release things on Milvus first and then integrate them into Zilliz.

Also, open-source is really important for us. It’s a choice we made, and it’s basically coming from Charles as well, our founder, who is a big believer in open-source. Yes, it could be a problem. I guess it’s something you have to think about. But the way we see it, we’d rather have that than be source-available or something.

I think it’s in our, not DNA, but it’s really our idea that it should be fully open-source. And if you want to use it, go for it. If you want to fork it and you want to build a service on top of it, it’s also the case, go for it. That’s the way we see it.

Anyway, it belongs to the foundation now, so when some people ask me, “oh, are you going to change your license?” I’m like, I’m not sure the Linux Foundation would be happy about that. If I were like, yeah, let’s go, we change Milvus to source available, then they’re going to create an Open Milvus or something, like they did with HashiCorp.

I think it’s something where it could be a problem. Some companies are very famous for that. One is AWS, they’re very famous for it. For example, IBM watsonx.data is using Milvus. They built a whole vector database.

They wrote it on their blog, so I can share it. They offer a vector database service, and it’s fully built on Milvus. On our side, the way we see it is that we can then come to our customers, or come to people, and be like, well, when we tell you that we can scale, we can show you that IBM is going for it with really big data. That’s also how we see it.

Vinay: Yeah, that’s great. And it’s nice of IBM to write it in the blog as well and give credit, because they could just say, hey, we have this great service, with no credit to anybody.

Stephen: No, yeah, that’s actually a nice thing of them to do as well. To actually put our name there and be like, yeah, it’s based on Milvus and everything. It’s fair game. So kudos to them for that, for sure.

Vinay: Yeah. So, looking a little bit into the future. What’s next? Where do vector databases go, or more broadly, AI and ML applications?

Stephen: Yeah, I do see them really being everywhere, in every AI application, honestly. Some people ask me, oh, why aren’t you scared of vector databases becoming a commodity? I see it a bit like LLMs. You have so many different LLMs, but you just pick one depending on your needs.

I feel like that’s where I see them becoming basically everywhere. Then also, at the moment everything is based on similarity search. Maybe in the future, we can include exact search or matching, basically going back to what normal databases do.

Also, I don’t know, classification, ML, or vector clustering, all those things. I could see them being added to the different applications. That’s how I see the future, at least of vector databases. I do think they’re really cool at the moment, but I also don’t think they’re going to go away. Maybe not all of them will survive, because at the moment every company is a vector database, or at least they call themselves a vector database. I think that’s the thing.

But the way I see it, and I don’t remember who said it, 80% of the data in the world is unstructured data. So you have all this data that isn’t accessible anywhere, that you can’t really use with a normal database, and now you can have access to it. Obviously, at the moment, you need deep learning models or some other models to convert it. But in the end, if you can have access to that data, that could actually be really cool.

Vinay: For that future, what do we need to get there from the community, from organizations, from vendors? I guess there are more of these features to be developed, right? I mean, if you’re looking at search, for example, you talked about similarity search, you talked about classification. But I guess also with Milvus Lite, in a way, you’re looking at giving, let’s say, an easy experience to every single developer as well.

Stephen: Exactly.

Vinay: So that it expands the reach of vector databases.

Stephen: Yeah, exactly. But it’s also like… I mean, that’s, in my opinion, where you also have the power of open-source in general. Let’s say you’re a new developer, or, I don’t know, you’re someone who is an expert in Kubernetes, and you start with Milvus Lite, and you play around, and then you’re like, oh, actually, that could be more efficient. Then you make something new, you collaborate.

I think it’s the same for search capabilities, where we have similarity search and different algorithms that have been released by Meta and by different players like that, and then they open-source them. The way I see the future, it’s very likely, I’m certain, that other companies will then come along too. Meta is doing a lot of open-source for LLMs in general and for AI.

I do see them releasing something new and being like, “hey, look, open-source community, have a look at our new similarity search, or our new search that may actually not be similarity-based.” I think that’s where the future is, where the community works. The community is really good to have, because you meet people from everywhere, with so many different backgrounds and so many different ideas, and that can be really nice.

That’s where I think open-source can shine. You’re not limited by the employees you have or anything. You can just have someone somewhere on their laptop who’s like, you know what, I’m really passionate about that. I’m going to write on it, I’m going to work on it, and I’m going to collaborate with different people. I think that’s where the community is really good.

Then for vendors, ourselves included, I think we need to make it clear, we need to make it easy for people to do whatever they want to do with AI, RAG being one example. RAG is cool, RAG is here. I think RAG is here to stay, for example, but it’s just going to evolve.

Now we have Agentic RAG, so you have your agents, and then RAG is going to be used by them. You can check the internet as well for different sources. But from a vendor point of view, sorry, I think we just need to be better at accessibility, making sure that people understand what they can do with an LLM, because I feel like a lot of people are scared.

They don’t know what to do with an LLM. They’re like, oh, my God, it’s so scary. I try to at least showcase that you can do a lot of things locally as well. I don’t know if you’ve worked a lot on it, but recently I’ve been showcasing a lot of using Llama 3 and Milvus Lite on your laptop. It’s like good old software engineering, where you don’t need to have an API key somewhere. You just run everything on your laptop, and then you continue.

Vinay: Okay, great. So we’re coming to the end. To summarize, Stephen, what would be your recommendation to enterprises when it comes to adopting a vector database?

Stephen: I think, first, look at the potential scale you may have. Are you only going to have a thousand documents, or are you going to have a bit more later? I think that’s the first one. Also look at maintainability, and whether it’s open-source or not.

And I think it’s really interesting to look at different benchmarks on how it can scale and how quickly you can get your data. I think benchmarks are very important. Have a look at different ones that are independent. ANN-Benchmarks, for example, is one that’s fully independent. You can have a look at them. And then, yeah, have a look at the different blog posts that are out there, the community work and everything, I would say.
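Benchmarks such as ANN-Benchmarks typically report how much accuracy an approximate index gives up in exchange for speed, usually as recall against an exact search. The sketch below shows that measurement in plain Python: it finds the exact nearest neighbors by brute force and scores a candidate result list against them. The data, the function names, and the "approximate" result are made up for illustration and do not come from any real benchmark run.

```python
import math


def exact_top_k(vectors, query, k):
    """Brute-force search: ids of the k vectors closest to the query (Euclidean distance)."""
    by_distance = sorted(vectors, key=lambda item: math.dist(item[1], query))
    return [vec_id for vec_id, _ in by_distance[:k]]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Hypothetical dataset of (id, vector) pairs.
vectors = [
    ("a", [0.0, 0.0]),
    ("b", [1.0, 0.0]),
    ("c", [0.0, 2.0]),
    ("d", [3.0, 3.0]),
    ("e", [5.0, 5.0]),
]
query = [0.1, 0.1]

truth = exact_top_k(vectors, query, k=3)
# Pretend an approximate index returned these ids, missing one true neighbor.
approx = ["a", "b", "d"]
print(truth, recall_at_k(approx, truth))
```

When reading a published benchmark, the useful comparison is recall at a given query throughput on a dataset that resembles your own, rather than either number in isolation.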

Vinay: Excellent. Okay, well, thank you, Stephen. It’s been great talking vector databases and Milvus with you. That’s it for today, folks. So see you all for the next episode. Thank you.

Guest-at-a-Glance

Name: Stephen Batifol

What he does: Developer Advocate

Website: Zilliz

Noteworthy: Stephen has journeyed through roles from Android developer to Data Scientist and Machine Learning Engineer, until he landed at Zilliz. He is driven by a passion to simplify tasks for Data Scientists and Software Engineers.

You can find Stephen Batifol on LinkedIn