See how to query documents using natural language, LLMs, and R, including dplyr-like filtering on metadata. Plus, learn how to use an LLM to extract structured data for text filtering.
One of the handiest tasks large language models can do for us is answer questions about a specific collection of information. This is often done using a technique called RAG, or retrieval augmented generation. Instead of relying on what the model knows from its training data, a RAG application searches for the most relevant parts of a document collection, then uses only those text chunks as context for the LLM's response.
Now, thanks to some relatively new R packages, it's easy to create your own RAG applications in R. You can even combine RAG with conventional dplyr-like filtering to make responses more relevant, although that requires additional setup and code.
This tutorial gets you started creating RAG applications in R. First, we'll cover how to prepare, chunk, store, and query a document with basic RAG, using information about Workshops for Ukraine for our demo. You'll quickly be able to ask general questions like "Tell me three workshops that would help me improve my R data visualization skills" and get a relevant response. Next, we'll layer on some pre-filtering to answer slightly more specific questions like "What R-related workshops are happening next month?"
The 5 steps of building a RAG app
There are five basic steps for building a RAG application with the ragnar and ellmer R packages:
- Turn documents into a markdown format that ragnar can process.
- Split the markdown text into chunks, optionally adding any metadata you might want to filter on (we won't do the optional part yet).
- Create a ragnar data store and insert your markdown chunks into the store. That insertion process automatically includes adding embeddings with each chunk (embeddings use a lengthy string of numbers to represent a text chunk's semantic meaning).
- Embed a query and retrieve text chunks that are most relevant to that query.
- Send those chunks along with the original query to an LLM and ask the model to generate a response.
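To make the embedding and retrieval steps a little more concrete, here is a minimal base R sketch of ranking chunks by cosine similarity. The three-number "embeddings" and the chunk names are made up for illustration; real embedding vectors are far longer, and ragnar handles this internally in its own way.

```r
# Toy 3-dimensional "embeddings" (made-up numbers, purely illustrative)
chunk_embeddings <- rbind(
  ggplot_workshop = c(0.9, 0.1, 0.0),
  sql_workshop    = c(0.1, 0.9, 0.1),
  color_palettes  = c(0.8, 0.2, 0.1)
)
query_embedding <- c(1.0, 0.0, 0.1)

# Cosine similarity: dot product divided by the product of vector lengths
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Score every chunk against the query, then keep the two best matches
scores <- apply(chunk_embeddings, 1, cosine_similarity, b = query_embedding)
top_chunks <- names(sort(scores, decreasing = TRUE))[1:2]
print(top_chunks)
```

The two visualization-flavored toy chunks score highest because their vectors point in nearly the same direction as the query vector.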
Let's get started!
Set up your development environment
To start, you will need to install the ellmer and ragnar packages if you want to follow the examples. ellmer is the main tidyverse R package for using large language models in R. ragnar is specifically designed for RAG and works with ellmer.
I suggest installing the latest development versions of both, especially ragnar, since useful new features are being added somewhat frequently. You can do that with pak::pak("tidyverse/ragnar") and pak::pak("tidyverse/ellmer"). I'm also using the dplyr, purrr, stringr, and rio R packages, which can all be installed from CRAN with install.packages().
I'll be using OpenAI models both to generate embeddings and ask questions, so you'll need an OpenAI API key to use the example code. If you want to use an Anthropic or Google Gemini model to generate the final answers, you'll also need an API key from that provider. While it's possible to run the example with a local LLM using ollama, your results may not be as good.
Steps 1 and 2: Wrangle the "Workshops for Ukraine" data
Workshops for Ukraine is a two-hour data science webinar series where volunteers teach a specific topic or skill, often R-related. The goal is to raise money for Ukraine, so participants donate at least $20 or €20 to one of several charities. Participants can attend live or get access to past recordings and materials.
The workshops are listed on a single web page hosted on Google Sites. Our first task is to import the web page using ragnar, which includes several functions for importing web pages and other document formats such as PDFs, Word, and Excel.
In the code below, read_as_markdown() converts the web page into markdown, then markdown_chunk() splits that into chunks. The segment_by_heading_levels = 3 argument splits the text using the original HTML H3 headers, so that each new row is a workshop.
library(ragnar)
library(dplyr, warn.conflicts = FALSE)
library(stringr)
workshop_url <- "https://sites.google.com/view/dariia-mykhailyshyna/main/r-workshops-for-ukraine"
ukraine_chunks <- workshop_url |>
  read_as_markdown() |>
  markdown_chunk(
    target_size = NA,
    segment_by_heading_levels = 3
  ) |>
  filter(str_starts(text, "### "))
Why did I use H3s to split the HTML text? Because I examined the workshop HTML page structure, and it looked like each workshop had its own H3 HTML header. Always check the format, because other web pages may have a different format.
The final filter deletes any rows without a level-3 heading, because those aren't workshops.
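If it helps to see that filter in isolation, here is a small base R sketch on a made-up three-row data frame (the toy text is hypothetical, not taken from the workshop page). It uses startsWith() as a stand-in for stringr's str_starts().

```r
# Hypothetical chunks: one intro paragraph and two workshop entries
toy_chunks <- data.frame(
  text = c(
    "Welcome to the workshop series.",
    "### Intro to ggplot2\nA toy workshop description.",
    "### Tidy data basics\nAnother toy workshop description."
  ),
  stringsAsFactors = FALSE
)

# Keep only rows whose text begins with a level-3 markdown heading
workshop_rows <- toy_chunks[startsWith(toy_chunks$text, "### "), , drop = FALSE]
print(nrow(workshop_rows))
```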

Figure: Data frame generated by the read_as_markdown() and markdown_chunk() functions. (Image credit: Sharon Machlis)
The resulting data frame has columns for text, context (header and potentially other information), and start and end locations. The start and end locations help ragnar handle chunk overlapping, which can help retain semantic meaning across text segments.
Step 3: Create a data store and insert chunks
Now I'm ready to create a data store and add my chunks. The code below creates a simple ragnar data store that is set up to use OpenAI's text-embedding-3-small model when creating embeddings for each chunk. If you'd rather use a local ollama embedding model that is installed on your system, you could swap in embed_ollama() instead. ragnar uses DuckDB for its data store database.
store_file_location <- "ukraine_workshops.duckdb"
store <- ragnar_store_create(
  store_file_location,
  embed = \(x) ragnar::embed_openai(x, model = "text-embedding-3-small")
)
To add chunks to a store, use the syntax: ragnar_store_insert(store_object, chunk_dataframe). This single line of code saves the chunks, generates embeddings, and saves the embeddings:
ragnar_store_insert(store, ukraine_chunks)
If you're having problems with any of this, as I did initially, make sure you have the latest version of the duckdb R package installed. I ended up having to uninstall it completely and reinstall with pak::pak("duckdb").
There's one more quick step before you can use the store: Build the search index with the ragnar_store_build_index(store_object) function. Don't forget this part, or you may find yourself wondering later why you're not seeing any search results:
ragnar_store_build_index(store)
If you want to know what your store looks like, ragnar has a built-in function for inspecting the store in a browser: ragnar_store_inspect(store).

Figure: The ragnar package's ragnar_store_inspect() function lets you view and search a data store. (Image credit: Sharon Machlis)
ragnar comes with two search algorithms by default: BM25 and VSS. BM25 looks for close keyword matches; e.g., "graphs" should match "graph" but not "plots" or "visualizations."
VSS uses semantic similarity, so in theory, "graph" and "plot" should also match "data visualization." The similarity matching may not always be as smart as you'd like, so try adding synonyms to your query if you need better results.
Despite the limitations of VSS, I prefer ragnar_retrieve_vss() when working with small text chunks, since there's a reasonable risk that a workshop may talk about "visualization" and "plots" while a query may only say "graphs." If you want both VSS and BM25, ragnar_retrieve() returns de-duplicated results from both algorithms.
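Here is a deliberately crude base R illustration of why keyword-style matching can miss synonyms. This is not BM25 (which also weighs term frequency and document length); the naive_stem() helper is a hypothetical stand-in that just lowercases a word and strips a trailing "s".

```r
# Naive "stemming": lowercase, then drop one trailing "s". Illustration only.
naive_stem <- function(words) sub("s$", "", tolower(words))

chunk_words <- c("Making", "plots", "and", "visualizations", "with", "ggplot2")

# "graphs" stems to "graph", which appears nowhere in the chunk
synonym_hit <- naive_stem("graphs") %in% naive_stem(chunk_words)

# The same query does match a chunk that literally contains "graph"
literal_hit <- naive_stem("graphs") %in% naive_stem(c("A", "graph", "gallery"))

print(c(synonym_hit = synonym_hit, literal_hit = literal_hit))
```

A semantic search is meant to close exactly this gap, since "plots" and "graphs" should land near each other in embedding space.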
In addition to using ragnar_store_inspect() to view the data store, you can also query a ragnar data store as you would any other DuckDB database in R. Here's one way to do this:
chunks_df <- tbl(store@con, "chunks") |>
  collect()
You can close the store connection with DBI::dbDisconnect(store@con), which is a good habit to get into since DuckDB can get finicky if you leave a database write connection open.
Step 4: Query your data store
Now we're at the fun part!
We can use the store we've just created to retrieve text chunks related to a query. If you don't already have a store connection in your R session, use ragnar_store_connect() to connect to the DuckDB file:
store <- ragnar::ragnar_store_connect("ukraine_workshops.duckdb", read_only = TRUE)
The following code retrieves the six chunks deemed most relevant to a query using VSS semantic searching (top_k sets how many chunks are returned):
query <- "What workshops would help me improve my R data visualization skills?"
similarity_chunks <- ragnar_retrieve_vss(store, query, top_k = 6)

Figure: Structure of data returned by basic ragnar retrieval. (Image credit: Sharon Machlis)
When I ran this code, I got back the following list of workshops: "Effective Data Visualization in R in Scientific Contexts," "Advanced data visualization in R with ggplot," "Effective Visual Communication with R," "Data Visualization with ggplot," "Color Palette Choice and Customization in R and ggplot2," and "Efficient R - How to write faster code." Most of these look pretty relevant to me.
You can easily view all the text in the similarity_chunks results data frame by entering something like the following R code:
similarity_chunks$text |> cat(sep = "\n=====\n")
Step 5: Generate your answer
The final step is to send the query and retrieved text chunks to an LLM, with instructions to use the retrieved text chunks to generate an answer.
The tidyverse way to do this is to register your ragnar data store as a tool for an ellmer chat. (Tools are functions that LLMs can access to give them additional capabilities. The ellmer documentation has a good overview of LLM tool calling.)
The ragnar_register_tool_retrieve() function is the easiest way to do a basic retrieval. Here's an example:
# Create a chat object
library(ellmer)
my_chat <- chat_openai(
  system_prompt = "You are a helpful assistant who answers questions about Workshops for Ukraine. You use available tools to answer questions and do not use your own existing knowledge.",
  model = "gpt-4.1"
)
# Register your store as a tool, setting your desired number of chunks to return
ragnar_register_tool_retrieve(my_chat, store, top_k = 6)
# Ask your question
my_chat$chat("What workshops would help me improve my R data visualization skills?")
Here's a look at the results in the console:

Figure: Results when using ellmer to query a ragnar store in the console. (Image credit: Sharon Machlis)
my_chat$chat() runs the chat object's chat method and returns results to your console. If you want a web chatbot interface instead, you can run ellmer's live_browser() function on your chat object, which can be handy if you want to ask multiple questions: live_browser(my_chat).

Figure: Results in ellmer's built-in simple web chatbot interface. (Image credit: Sharon Machlis)
Add metadata filtering to the RAG R app
Basic RAG worked pretty well when I asked about topics, but not for questions involving time. Asking about workshops "next month" didn't return the correct workshops, even when I told the LLM the current date.
That's because this basic RAG is just looking for text that's most similar to a question. If you ask "What R data visualization events are happening next month?", you might end up with a workshop in three months. Basic semantic search often misses required elements, which is why we have metadata filtering.
Metadata filtering "knows" what is essential to a query, at least if you've set it up that way. This type of filtering lets you specify that chunks must match certain requirements, such as a date range, and then performs semantic search only on those chunks. The items that don't match your must-haves won't be included.
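Conceptually, the order of operations looks something like the base R sketch below: apply the hard requirement first, then rank whatever survives. The data frame, dates, and similarity scores are all made up for illustration; ragnar does the real work inside its DuckDB store.

```r
# Hypothetical workshops with made-up similarity scores for some query
toy_store <- data.frame(
  title = c("ggplot2 basics", "Shiny dashboards", "Advanced ggplot2"),
  date = as.Date(c("2025-08-14", "2025-08-21", "2025-11-06")),
  similarity = c(0.91, 0.55, 0.95),
  stringsAsFactors = FALSE
)

start_date <- as.Date("2025-08-01")
end_date <- as.Date("2025-08-31")

# Step 1: the must-have filter (date range)
in_range <- toy_store[toy_store$date >= start_date & toy_store$date <= end_date, ]

# Step 2: rank only the remaining rows by similarity
ranked <- in_range[order(in_range$similarity, decreasing = TRUE), ]
print(ranked$title)
```

Note that the highest-scoring toy workshop overall ("Advanced ggplot2") is excluded because it falls outside the date range, which is the whole point of filtering before searching.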
To turn basic ragnar RAG code into a RAG app with metadata filtering, you need to add metadata as separate columns in your ragnar data store and make sure an LLM knows how and when to use that information.
For this example, weโll need to do the following:
- Get the date of each workshop and add it as a column to the original text chunks.
- Create a data store that includes a date column.
- Create a custom ragnar retrieval tool that tells the LLM how to filter for dates if the user's query includes a time component.
Let's get to it!
Step 1: Add the new metadata
If you're lucky, your data already has the metadata you want in a structured format. Alas, no such luck here, since the Workshops for Ukraine listings are HTML text. How can we get the date of each future workshop?
It's possible to do some metadata parsing with regular expressions. But if you're interested in using generative AI with R, it's worth knowing how to ask LLMs to extract structured data. Let's take a quick detour for that.
We can request structured data with ellmer's parallel_chat_structured() in three steps:
- Define the structure we want.
- Create prompts.
- Send those prompts to an LLM.
We can extract the workshop title with a regex, an easy task since all the titles start with ### and end with a line break:
ukraine_chunks <- ukraine_chunks |>
  mutate(title = str_extract(text, "^### (.+)\n", 1))
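The third argument to str_extract() above selects the first capture group, which I believe requires a fairly recent stringr (1.5.0 or later, as far as I know). If that isn't available, a rough base R equivalent might look like this sketch. The extract_title() helper and the toy text are hypothetical, and unlike the regex above, it doesn't insist on a line break after the title.

```r
# Hypothetical helper: return the first line minus its "### " prefix, or NA
extract_title <- function(text) {
  first_line <- vapply(
    strsplit(text, "\n", fixed = TRUE),
    function(lines) lines[1],
    character(1)
  )
  # "### " is four characters long, so the title starts at position 5
  ifelse(startsWith(first_line, "### "), substring(first_line, 5), NA_character_)
}

toy_text <- c(
  "### Intro to ggplot2\nA toy workshop description.",
  "No heading here"
)
titles <- extract_title(toy_text)
print(titles)
```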
Define the desired structure
The first thing we'll do is define the metadata structure we want an LLM to return for each workshop item. Most important is the date, which will be flagged as not required since past workshops didn't include them. ragnar creator Tomasz Kalinowski suggests we also include the speaker and speaker affiliation, which seems useful. We can save the resulting metadata structure as an ellmer "TypeObject" template:
type_workshop_metadata <- type_object(
  date = type_string(
    paste(
      "Date in yyyy-mm-dd format if it's an upcoming workshop, otherwise an empty string."
    )
  ),
  speaker_name = type_string(),
  speaker_affiliations = type_string(
    "comma-separated listing of current and former affiliations listed in reverse chronological order"
  )
)
Create prompts to request that structured data
The code below uses ellmer's interpolate() function to create a vector of prompts using that template, one for each text chunk:
prompts <- interpolate(
"Extract the data for the workshops mentioned in the text below.
Include the Date ONLY if it is a future workshop with a specific date (today is {{Sys.Date()}}). The Date must be in yyyy-mm-dd format.
If the year is not included in the date, start by assuming the workshop is in the next 12 months and set the year accordingly.
Next, find the day of week mentioned in the text and make sure the day-date combination exists! For example, if a workshop says 'Thursday, August 30' and you set the date to 2025-08-30, check if 2025-08-30 is on Thursday. If it isn't, set the date to null.
{{ ukraine_chunks$text }}
"
)
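The prompt asks the LLM to confirm that the weekday and date agree. If you'd rather double-check that in R afterward, a small helper along these lines could do it. The date_matches_weekday() function is a hypothetical sketch, not part of ellmer or ragnar; it uses the locale-independent weekday number from as.POSIXlt(), where 0 is Sunday and 6 is Saturday.

```r
# Hypothetical helper: does `date` fall on the given weekday number?
# weekday uses as.POSIXlt()$wday coding: 0 = Sunday, ..., 4 = Thursday, 6 = Saturday
date_matches_weekday <- function(date, weekday) {
  as.POSIXlt(as.Date(date))$wday == weekday
}

# The prompt's example: is 2025-08-30 a Thursday?
is_thursday <- date_matches_weekday("2025-08-30", 4)
print(is_thursday)
```

In this case the check comes back FALSE, since August 30, 2025 is a Saturday, so the prompt's rule would have the model discard that date.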
Send all the prompts to an LLM
This next bit of code creates a chat object and then uses parallel_chat_structured() to run all the prompts. The chat and prompts vector are required arguments. In this case, I also dialed back the default numbers of active requests and requests per minute with the max_active and rpm arguments so I didn't hit my API limits (which often happens on my OpenAI account at the defaults):
chat <- ellmer::chat_openai(model = "gpt-4.1")
extracted <- parallel_chat_structured(
  chat = chat,
  prompts = prompts,
  max_active = 4,
  rpm = 100,
  type = type_workshop_metadata
)
Finally, we add the extracted results to the ukraine_chunks data frame and save those results. That way, we won't need to re-run all the code later if we need this data again:
ukraine_chunks <- ukraine_chunks |>
  mutate(
    !!!extracted,
    date = as.Date(date)
  )
rio::export(ukraine_chunks, "ukraine_workshop_data_results.parquet")
If you're unfamiliar with the splice operator (!!! in the above code), it's unpacking individual columns in the extracted data frame and adding them as new columns to ukraine_chunks via the mutate() function.
The ukraine_chunks data frame now has the columns start, end, context, text, title, date, speaker_name, and speaker_affiliations.
I still ended up with a few old dates in my data. Since this tutorial's main focus is RAG and not optimizing data extraction, I'll call this good enough. As long as the LLM figured out that a workshop on "Thursday, September 12" wasn't this year, we can delete past dates the old-fashioned way:
ukraine_chunks <- ukraine_chunks |>
  mutate(date = if_else(date >= Sys.Date(), date, NA))
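One reason dplyr's if_else() is a good fit here: base R's ifelse() quietly drops the Date class. The sketch below shows that gotcha on made-up dates, along with a base R way to blank out past dates by subassignment. The fixed "today" value is an assumption so the example is reproducible.

```r
today <- as.Date("2025-07-15")  # fixed for reproducibility; use Sys.Date() in practice
dates <- as.Date(c("2024-09-12", "2025-08-14", NA))

# Base ifelse() returns plain numbers, not Dates
stripped <- ifelse(dates >= today, dates, NA)
print(class(stripped))

# Subassignment keeps the Date class: set past (non-missing) dates to NA
dates[!is.na(dates) & dates < today] <- NA
print(dates)
```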
We've got the metadata we need, structured how we want it. The next step is to set up the data store.
Step 2: Set up the data store with metadata columns
We want the ragnar data store to have columns for title, date, speaker_name, and speaker_affiliations, in addition to the defaults.
To add extra columns to a ragnar data store, you first create an empty data frame with the extra columns you want, and then use that data frame as an argument when creating the store. This process is simpler than it sounds, as you can see below:
my_extra_columns <- data.frame(
  title = character(),
  date = as.Date(character()),
  speaker_name = character(),
  speaker_affiliations = character()
)
store_file_location <- "ukraine_workshop_w_metadata.duckdb"
store <- ragnar_store_create(
  store_file_location,
  embed = \(x) ragnar::embed_openai(x, model = "text-embedding-3-small"),
  # overwrite = TRUE,
  extra_cols = my_extra_columns
)
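The empty data frame acts as a template: it has no rows, but each column carries a type. If you want to sanity-check a template like that before creating a store, a quick base R inspection such as this sketch can help (the column_template name is just for illustration).

```r
# Zero-row template; only the column names and types matter
column_template <- data.frame(
  title = character(),
  date = as.Date(character()),
  speaker_name = character()
)

# Confirm there are no rows and look at the type of each column
column_types <- vapply(column_template, function(col) class(col)[1], character(1))
print(nrow(column_template))
print(column_types)
```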
Inserting text chunks from the metadata-augmented data frame into a ragnar data store is the same as before, using ragnar_store_insert() and ragnar_store_build_index():
ragnar_store_insert(store, ukraine_chunks)
ragnar_store_build_index(store)
If you're trying to update existing items in a store instead of inserting new ones, you can use ragnar_store_update(). That should check the hash to see if the entry exists and whether it has been changed.
Step 3: Create a custom ragnar retrieval tool
As far as I know, you need to register a custom tool with ellmer when doing metadata filtering instead of using ragnar's simple ragnar_register_tool_retrieve(). You can do this by:
- Creating an R function
- Turning that function into a tool definition
- Registering the tool with a chat object's register_tool() method
First, you will write a conventional R function. The function below adds filtering if a start and/or end date are not NULL, and then performs chunk retrieval. It requires a store to be in your global environment. Don't use store as an argument in this function; it won't work.
This function first sets up a filter expression, depending on whether dates are specified, and then adds the filter expression as an argument to a ragnar retrieval function. Adding filtering to ragnar_retrieve() functions is a new feature as of this writing in July 2025.
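If building an expression and injecting values into it is unfamiliar, this base R sketch may help. It uses bquote() and .() as stand-ins for rlang::expr() and !!, and evaluates the result against a made-up data frame rather than a ragnar store, so treat it as an analogy and not how ragnar applies the filter internally.

```r
# Hypothetical workshops; the past one has no date
toy_workshops <- data.frame(
  title = c("Past workshop", "August workshop", "November workshop"),
  date = as.Date(c(NA, "2025-08-14", "2025-11-06")),
  stringsAsFactors = FALSE
)

start_date <- "2025-08-01"

# Build an unevaluated expression with the Date value injected into it
filter_expr <- bquote(date >= .(as.Date(start_date)))

# Evaluate the expression using the data frame's columns as variables
keep <- eval(filter_expr, toy_workshops)
matching_titles <- toy_workshops$title[which(keep)]
print(matching_titles)
```

The undated row produces NA in the comparison and drops out, which mirrors how past workshops with no date are excluded once a date filter is applied.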
Below is the function largely suggested by Tomasz Kalinowski. Here we're using ragnar_retrieve() to get both conventional and semantic search, instead of just VSS searching. I added "data-related" as the default query so the function can also handle time-related questions without a topic:
retrieve_workshops_filtered <- function(
  query = "data-related",
  start_date = NULL,
  end_date = NULL,
  top_k = 8
) {
  # Build filter expression based on provided dates
  if (!is.null(start_date) && !is.null(end_date)) {
    # Both dates provided
    start_date <- as.Date(start_date)
    end_date <- as.Date(end_date)
    filter_expr <- rlang::expr(between(
      date,
      !!as.Date(start_date),
      !!as.Date(end_date)
    ))
  } else if (!is.null(start_date)) {
    # Only start date
    filter_expr <- rlang::expr(date >= !!as.Date(start_date))
  } else if (!is.null(end_date)) {
    # Only end date
    filter_expr <- rlang::expr(date <= !!as.Date(end_date))
  } else {
    # no filter
    filter_expr <- NULL
  }
  # Perform retrieval
  ragnar_retrieve(
    store,
    query,
    top_k = top_k,
    filter = !!filter_expr
  ) |>
    select(title, date, speaker_name, speaker_affiliations, text)
}
Next, create a tool for ellmer based on that function using tool(), which needs the function name and a tool definition as arguments. The definition is important because the LLM uses it to decide whether or not to use the tool to answer a question:
workshop_retrieval_tool <- tool(
  retrieve_workshops_filtered,
  "Retrieve workshop information based on content query and optional date filtering. Only returns workshops that match both the content query and date constraints.",
  query = type_string(
    "The search query describing what kind of workshop content you're looking for (e.g., 'data visualization', 'data wrangling')"
  ),
  start_date = type_string(
    "Optional start date in YYYY-MM-DD format. Only workshops on or after this date will be returned.",
    required = FALSE
  ),
  end_date = type_string(
    "Optional end date in YYYY-MM-DD format. Only workshops on or before this date will be returned.",
    required = FALSE
  ),
  # Matches the top_k = 8 default in retrieve_workshops_filtered()
  top_k = type_integer(
    "Number of workshops to retrieve (default: 8)",
    required = FALSE
  )
)
Now create an ellmer chat with a system prompt to help the LLM know when to use the tool. Then register the tool and try it out! My example is below.
# paste0() adds no separators, so each piece needs its own trailing space
my_system_prompt <- paste0(
  "You are a helpful assistant who only answers questions about Workshops for Ukraine from provided context. Do not also use your own existing knowledge. ",
  "Use the retrieve_workshops_filtered tool to search for workshops and workshop information. ",
  "When users mention time periods like 'next month', 'this month', 'upcoming', etc., ",
  "convert these to specific YYYY-MM-DD date ranges and pass them to the tool. ",
  "Past workshops do not have Date entries so would be NULL or NA. ",
  "Today's date is ",
  Sys.Date(),
  ". ",
  "If no workshops match the criteria, let the user know."
)
my_chat <- chat_openai(
  system_prompt = my_system_prompt,
  model = "gpt-4.1",
  params = params(temperature = 0.3)
)
# Register the tool
my_chat$register_tool(workshop_retrieval_tool)
# Test it out
my_chat$chat("What R-related workshops are happening next month?")
If there are indeed any R-related workshops next month, you should get the correct answer, thanks to your new advanced RAG app built entirely in R. You can also create a local chatbot interface with live_browser(my_chat).
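To check the model's date math, it can help to know which range "next month" should translate to. Here is a small base R sketch for that; next_month_range() is a hypothetical helper of my own, not part of ellmer or ragnar.

```r
# Hypothetical helper: first and last day of the month after `today`
next_month_range <- function(today = Sys.Date()) {
  first_of_this_month <- as.Date(format(today, "%Y-%m-01"))
  # Starts of this month, next month, and the month after that
  month_starts <- seq(first_of_this_month, by = "month", length.out = 3)
  list(start = month_starts[2], end = month_starts[3] - 1)
}

july_example <- next_month_range(as.Date("2025-07-15"))
print(july_example)
```

For a date in July 2025, that gives August 1 through August 31, which are the start_date and end_date values you would hope to see the LLM pass to the tool.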
And, once again, it's good practice to close your connection when you're finished with DBI::dbDisconnect(store@con).
That's it for this demo, but there's a lot more you can do with R and RAG. Do you want a better interface, or one you can share? This sample R Shiny web app, written primarily by Claude Opus, might give you some ideas.


