Build RAG-powered LLM applications using the tools you know with a managed vector index in Azure.
If you're building generative AI applications, you need to control the data used to generate answers to user queries. Simply dropping ChatGPT into your platform isn't going to work, especially if you're using proprietary data that wasn't part of the initial training set for the large language model (LLM) you're using.
Without some form of grounding, your AI is liable to quickly and randomly generate plausible-sounding text as it tries to predict the output associated with a user's prompt. It's not surprising that the LLM "hallucinates" like this, as it's purely a statistical way of generating text, one subword token at a time. Instead of working with words, it uses its neural network to navigate the most likely path through a multidimensional semantic space.
Reducing AI risk with RAG
There's more than one way to reduce this risk: You can fine-tune an LLM on your own data, or you can take advantage of retrieval-augmented generation (RAG) techniques. The latter option is popular; it's part of how search engines like Bing use LLMs, running a search and then generating text based on search results. It's also at the heart of popular LLM workflow tools, such as the open source LangChain, LlamaIndex, and Haystack, and Microsoft's own Semantic Kernel.
It's important to understand that you can't simply plug any old database into an LLM because there's no semantic model in most databases, no way to link either the relational or NoSQL world to what an LLM "understands." We need to bring the two together in a way that lets us construct a search that can feed into a prompt.
Why did Microsoft start its Copilot story with Bing? The answer is quite simple: A search engine uses some of the same underlying technologies as generative AI, building on a vector index. Like an LLM, it treats content as semantic vectors, which it compares to deliver results using variants on a nearest neighbor algorithm. Data is stored along with a vector index that allows a quick comparative search, which is then used to construct the familiar search page.
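The nearest-neighbor comparison at the core of a vector index can be sketched in a few lines of Python. Toy three-dimensional vectors stand in for real embeddings here, and a production index would use an approximate algorithm rather than this brute-force scan, but the principle is the same:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query, index, k=2):
    # Brute-force scan: score every stored vector against the query,
    # then return the ids of the k best matches, best first.
    scored = sorted(index.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# A tiny "index": three documents embedded as 3-d vectors.
index = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.8, 0.2, 0.1],
    "doc-c": [0.0, 0.1, 0.9],
}

print(nearest([1.0, 0.0, 0.0], index))  # ['doc-a', 'doc-b']
```

Documents whose vectors point in roughly the same direction as the query score highest, which is exactly the behavior a semantic search needs.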
If we're going to build a RAG-grounded AI application, we're going to have to build our own search engine on our data, adding a vector index to deliver those essential semantic searches. During the past few months, I've looked at several vector index implementations where Microsoft has been adding the necessary tools and features to existing database properties and services.
Searching with vector indexes
Microsoft isn't the only vector index provider out there, with MongoDB and others offering supported vector index services on Azure. Most of these are add-ons that use semantic embeddings to bolt a vector index column onto existing stores. That works, but you may prefer to take your existing data and use a native vector database for at-scale searches.
One option is Pinecone, which offers a serverless vector database as a managed Azure service. Available through the Azure Marketplace, it offers a range of different price points, from a basic free service for small applications to larger-scale standard or enterprise versions. They all have an embedding API, as well as storage and read/write charges. You will need to spend time figuring out which plan is best for your service.
The key to working with Pinecone is its API. All you need to connect is a key and an SDK. Using an SDK simplifies working with the underlying REST API, translating its calls and responses to platform-specific ways of working and integrating it with existing code.
Introducing Pinecone's .NET SDK
Along with Azure support, Pinecone now has a .NET SDK, developed on GitHub and available on NuGet, ready for use in your code. Connecting to a vector database is easy. Simply set up a namespace for your database connection and then create the connection using your API key. You can get the key from the dashboard associated with your Pinecone instance, where you can generate a new one if necessary.
The .NET SDK includes methods for working with your database programmatically, so you can build your own management tools. For example, you can give subject-matter experts a way to create embeddings and add data to the database, combining the two operations into a single action that keeps your RAG vector index up to date.
You can use the SDK to create a new index from scratch, for example, if you want to provide alternate RAG sources for an application. This technique allows you to ensure that results for, say, one product or service don't pollute the results for another, avoiding user confusion. This way you can filter searches to the appropriate RAG source before using an LLM to generate an output.
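Kept separate like this, sources are easy to route between at query time. A minimal sketch of that routing, with purely illustrative product and index names:

```python
# One index per product keeps each RAG source isolated.
# Product and index names here are illustrative.
INDEX_FOR_PRODUCT = {
    "widget-pro": "rag-widget-pro",
    "widget-lite": "rag-widget-lite",
}

def index_for(product):
    # Route a query to the product's own index so results from one
    # product never pollute another's answers.
    try:
        return INDEX_FOR_PRODUCT[product]
    except KeyError:
        raise ValueError(f"no RAG index registered for {product!r}")

print(index_for("widget-pro"))  # rag-widget-pro
```

Failing loudly on an unknown product is deliberate: falling back to a default index would silently mix RAG sources, which is exactly what this design avoids.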
Indexes can be either serverless or pod-based. If you're using the managed Azure Pinecone, then you want to work with serverless indexes. To do this, start with a name, then the dimension of the semantic space you're creating, which can be very large! Choose a serverless index and target it to the Azure region you want to use. Other options include adjusting how your index scales and the type of search algorithm.
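In terms of the underlying REST API, a serverless index definition looks roughly like this (a sketch: the dimension must match your embedding model's output, and the region name is illustrative):

```json
{
  "name": "product-docs",
  "dimension": 1536,
  "metric": "cosine",
  "spec": {
    "serverless": {
      "cloud": "azure",
      "region": "eastus2"
    }
  }
}
```

The SDK surfaces these same fields as parameters on its create-index call, so the choices you make here carry straight over to .NET code.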
Upserting and querying in .NET
Once you've created an index, you can quickly add new values. Pinecone calls these operations "upserts." Here you load the vectors created by an embedding, providing a batch of IDs, then the dense and sparse vectors, and finally any appropriate metadata (which can include the original text of the source document). That metadata allows you to prefilter data when making a query. Other commands in the SDK help manage your vector index: deleting, updating, and listing vectors.
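The payload an upsert carries can be sketched as plain data: each record pairs an ID with its dense vector and optional metadata. The field names follow Pinecone's REST API; the vectors here are toy values standing in for real embeddings:

```python
# Build an upsert batch: one record per embedded chunk.
# "values" holds the dense embedding; "metadata" carries anything you
# want returned with query matches, such as the original source text.
def make_record(chunk_id, embedding, source_text):
    return {
        "id": chunk_id,
        "values": embedding,
        "metadata": {"text": source_text},
    }

batch = [
    make_record("doc1-chunk0", [0.12, 0.97, 0.05],
                "Pinecone offers a serverless vector database on Azure."),
    make_record("doc1-chunk1", [0.88, 0.10, 0.33],
                "The .NET SDK wraps the underlying REST API."),
]

# In real code this batch is passed to the SDK's upsert call against a
# target index and namespace.
print(len(batch), batch[0]["id"])
```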
If you're building an index for a RAG application, you should first chunk your documents so that each embedding encodes only a section of text. That text can then be included in the upsert as part of the metadata, ready to be returned alongside query matches.
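A minimal fixed-size chunker with overlap makes the idea concrete. This is a sketch: production pipelines usually split on sentence or token boundaries rather than raw character counts:

```python
def chunk_text(text, size=200, overlap=40):
    # Slide a fixed-size window across the text, overlapping consecutive
    # chunks so context isn't lost at the boundaries.
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

doc = "word " * 100  # 500 characters of placeholder text
pieces = chunk_text(doc, size=200, overlap=40)
print(len(pieces))  # 4
```

Each chunk is then embedded and upserted individually, with the chunk's text stored in its metadata.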
The most important part of building the search component of a RAG application is, of course, querying your vector index. As part of setting up an index, you've already defined the distance metric used to select similar documents. Which metric to use depends on the results you want to return. For most RAG applications you probably want to choose a cosine metric, as this finds documents that are similar to your query term.
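The choice matters because cosine compares direction only, ignoring magnitude. A quick illustration of how cosine and Euclidean metrics disagree about scaled copies of the same vector:

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean(a, b):
    # Straight-line distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 0.0]
b = [10.0, 0.0]  # same direction, ten times the magnitude

print(cosine(a, b))     # 1.0 -- identical as far as cosine is concerned
print(euclidean(a, b))  # 9.0 -- far apart by straight-line distance
```

For text embeddings, where direction encodes meaning and magnitude is largely an artifact, the cosine behavior is usually what you want.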
Querying is relatively straightforward: Make a request to your index and namespace with an embedding vector of your query and the number of responses to return. When you add the results to a prompt as context, use the document chunks stored in the index metadata; the ranking information helps construct the context the LLM needs to produce a natural language answer.
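Stitching query matches into prompt context can be as simple as concatenating the stored chunks in rank order. The match structure here mirrors what a vector index typically returns, an ID, a similarity score, and the metadata stored at upsert time:

```python
def build_context(matches, max_chunks=3):
    # Take the top-ranked matches and join their stored source text
    # into a single grounding block for the LLM prompt.
    ranked = sorted(matches, key=lambda m: m["score"], reverse=True)
    chunks = [m["metadata"]["text"] for m in ranked[:max_chunks]]
    return "\n---\n".join(chunks)

# Simulated query results; a real application gets these from the index.
matches = [
    {"id": "c2", "score": 0.81,
     "metadata": {"text": "Indexes can be serverless or pod-based."}},
    {"id": "c1", "score": 0.93,
     "metadata": {"text": "Pinecone is available as a managed Azure service."}},
]

prompt = f"Answer using only this context:\n{build_context(matches)}\n\nQuestion: ..."
print(prompt.splitlines()[1])  # highest-scoring chunk comes first
```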
Having a .NET SDK for Pinecone allows you to quickly integrate the managed Azure Pinecone service into your applications, working with tools like Semantic Kernel to build RAG applications using Azure OpenAI. Building new tools like these into a familiar ecosystem that supports cloud-native development makes a lot of sense. It flattens the learning curve and allows you to reuse code and components, and as the SDK is open source, you can help guide its future development.