UPDATE 23rd Feb 2025… At the end of this post, I outline a solution to the problem.
On Reddit, and elsewhere, a somewhat "hot" topic is using OWUI to manage a knowledge base / files and take advantage of OWUI's built-in RAG (Retrieval Augmented Generation) functionality. The thing is, sometimes you're not trying to retrieve snippets for context; you're aiming for summarization, translation, file comparison, or brainstorming. I often see people struggling with system prompts or RAG prompts to get an LLM to process documents in ways that RAG simply doesn't support. You can't "chat to your PDF" and ask, "Take the grand totals section from my Excel file and re-write the summary in the financial report to reflect those numbers." RAG isn't built for this. It's built merely to return chunks of text to the agent that are hopefully semantically similar to the user's request.
In these cases, feeding the LLM the whole document is crucial, not just chunks.
OpenWeb UI is great for document storage, a chat interface, and RAG configuration options, but you can't disable RAG. Wouldn't it be nice to be able to switch between RAG (chunks), full document text, and even the full document as a binary / base64 file? Then, when you want to send a whole PDF to Google Gemini for parsing (as opposed to just the text extracted during RAG), you could.
In my custom setups, I've built in toggles to choose between full documents and RAG chunks, or even dual vector databases (one for small chunks, one for huge "chunks"), and I simply don't use OWUI for RAG. But I was hoping I could perhaps use OWUI as a front-end, allowing the user to decide between chunks and full documents as they wish, and then have a custom pipe send the user's prompt + chunks (or full document) to whatever platform they like (in my case, n8n). But doing this cleanly in Open WebUI turned out to be trickier than expected.
The point of this post is to see if OpenWeb UI can be tweaked (ok, wrestled) into submission to give the user the flexibility to choose between RAG and full-document context. It's not to provide a "template" solution, because when it comes to document management (RAG) there is no turnkey solution. So instead, this post outlines the technical possibilities, and I hope you can take the knowledge and weave your own solution.
For non-tech users who want to use OpenWeb UI and avoid RAG, a simple workaround is cranking up the chunk size to something massive (like a million tokens if you're using Google Gemini), or whatever your LLM's context window can handle. This effectively gives you one chunk per document – basically the same as "full document" for most scenarios. The problem? Open WebUI's chunking settings are global. You can't just tweak them on the fly. This is fine if you always want full documents passed to the LLM, and I think a massive chunk size in this scenario is fine, as it allows you to use all the benefits of OpenWeb UI and kinda "bypass" RAG.
And for those who want to tell me that massive chunk sizes in RAG are a "no no", this is my response:
Perhaps that reference shows my age! But I’ve really found no issue with manipulating RAG settings to send massive chunks when there’s simply no other alternative.
I also often see questions – and negative comments – about OWUI's RAG implementation. So to start with, I wanted to look into exactly how they do it. I'll first outline how it works and then go on to how I tried to bypass it.
My opinion, based on a cursory look at their code, is that they handle RAG pretty damn well, so I'm not sure what the negative comments are about.
Open Web UI’s RAG implementation
Open WebUI’s RAG implementation relies on several components working together:
Document Processing and Storage:
- Uploaded files are processed using Langchain document loaders. Different loaders are used based on file type (PDF, CSV, RST, etc.). For web pages a WebBaseLoader is used.
- After loading, the documents are split into chunks using either a character-based or token-based text splitter (Tiktoken is used for the latter) with configurable chunk size and overlap.
- These chunks are embedded using a SentenceTransformer model, chosen through settings or defaults to sentence-transformers/all-MiniLM-L6-v2. Ollama and OpenAI embedding models are also supported.
- The embeddings, along with the document chunks and metadata, are stored in a vector database. Chroma, Milvus, Qdrant, Weaviate, Pgvector and OpenSearch are supported.
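To make that flow concrete, here's a minimal sketch of the load → split → embed → store steps using the same building blocks (a Langchain loader and splitter, SentenceTransformers, Chroma). This is purely illustrative – the file name, chunk sizes, and collection name are my own choices, not OWUI's code.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import chromadb

# 1. Load and split (chunk size / overlap are the configurable settings)
docs = PyPDFLoader("report.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# 2. Embed each chunk with the default model OWUI falls back to
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [c.page_content for c in chunks]
embeddings = model.encode(texts).tolist()

# 3. Store chunks, embeddings and metadata in a vector DB (Chroma here)
collection = chromadb.Client().create_collection("my_knowledge_base")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(texts))],
    documents=texts,
    embeddings=embeddings,
    metadatas=[c.metadata for c in chunks],
)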
Query Generation:
- When a user submits a query, a separate LLM call is made to generate search queries based on the conversation history.
- This uses a configurable prompt template, but by default encourages generating broad, relevant queries unless there’s absolute certainty no extra data is needed.
- This step can be disabled through admin settings if not desired.
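To illustrate the idea of that query-generation step (paraphrased – this is not OWUI's actual template, model, or function names), it boils down to one extra LLM call that turns the chat history into a short list of search queries:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; OWUI uses whatever task model is configured

QUERY_GEN_PROMPT = (
    "Given the chat history below, return a JSON array of broad search queries "
    "that would help answer the latest user message. Return [] only if you are "
    "certain no additional data is needed.\n\nChat history:\n{history}"
)

def generate_queries(history: str) -> list[str]:
    # One extra LLM call before retrieval; in practice you'd guard the JSON parse
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": QUERY_GEN_PROMPT.format(history=history)}],
    )
    return json.loads(response.choices[0].message.content)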
Retrieval:
- The generated search queries are used to retrieve relevant chunks from the vector database. The default setting combines BM25 search and vector search.
- BM25 provides a keyword-based search, while vector search compares the query embedding with the stored chunk embeddings.
- Optionally, retrieved chunks can be reranked using a reranking model (like a CrossEncoder) and a relevance score threshold is applied to filter results based on similarity.
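Here's a toy illustration of what hybrid retrieval means in practice – keyword scores from BM25 blended with cosine similarity over embeddings. OWUI's real implementation lives in its retrieval utilities and is more involved; the weighting and normalisation below are my own simplifications.

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

chunks = ["grand totals for Q4", "summary of the financial report", "appendix of raw data"]
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
bm25 = BM25Okapi([c.lower().split() for c in chunks])

def hybrid_search(query: str, top_k: int = 2, alpha: float = 0.5):
    keyword = bm25.get_scores(query.lower().split())
    keyword = keyword / (keyword.max() or 1.0)  # scale keyword scores to 0..1
    semantic = chunk_vecs @ model.encode(query, normalize_embeddings=True)  # cosine similarity
    combined = alpha * keyword + (1 - alpha) * semantic
    return [chunks[i] for i in np.argsort(combined)[::-1][:top_k]]

print(hybrid_search("rewrite the financial summary"))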
Context Injection:
- The retrieved relevant contexts (chunks with metadata) are formatted into a single string.
- This string, along with the original user query, is injected into a prompt template designed for RAG.
- This template is configurable but defaults to one that instructs the LLM to answer the query using the context and include citations when a <source_id> tag is present.
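To show roughly what that injection looks like (the exact RAG_TEMPLATE wording is configurable and not reproduced here; the function name below is mine), a hand-rolled equivalent might be:

def build_rag_prompt(query: str, sources: list[dict]) -> str:
    # Wrap each retrieved chunk in the source tags the template expects
    context = "\n".join(
        f"<source><source_id>{s['id']}</source_id>{s['content']}</source>"
        for s in sources
    )
    return (
        "Answer the user's question using only the context below. "
        "When you use a source, cite its <source_id>.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {query}"
    )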
LLM Response Generation:
- The final prompt, including context and user query, is sent to the chosen LLM.
- The LLM’s response, which may include citations based on the provided context, is then displayed to the user.
Key Files:
- routers/retrieval.py: Handles the API endpoints for document processing, web search, and querying the vector database.
- retrieval/loaders/main.py: Contains the logic for loading documents from different file types.
- retrieval/vector/main.py: Defines the interface for vector database interaction and includes implementations for Chroma and Milvus.
- retrieval/vector/connector.py: Selects the specific vector database client based on configured settings.
- utils/task.py: Contains helper functions for prompt templating, including rag_template.
In short, Open Web UI’s RAG uses a multi-step process involving query generation, hybrid search (BM25 + vector search), reranking, context preparation, and finally, LLM response generation. This process is highly configurable, allowing users to fine-tune each step to their specific needs.
Ways to get files into OpenWeb UI
Just so everyone is on the same page about what’s possible, I thought I’d outline the options:
1. Drag and drop a file into the prompt.
   - If it's a document (not an image), it will get RAGged into a temporary knowledge base called "uploads".
   - But… you can click on the file and select "Using Focused Retrieval", which means "send the full content, not chunks" – awesome.
2. Create a knowledge base and add your files. Then link the knowledge base to your model (see my post on creating your own CustomGPT in OWUI).
   - Your documents will get RAGged. Nothing you can do about it.
3. Same as option (2) above, but you don't link the knowledge base to your model. When you want to send one or more files, enter # as the first character of your prompt and select one or more files from the knowledge base (or the whole knowledge base, if you like).
   - Again, everything is RAGged.
The Open WebUI RAG Roadblock
If you’re only dealing with images, there’s no issue because images can’t be RAGged and are therefore embedded directly in the prompt’s JSON structure as base64 data:
{
  "stream": true,
  "model": "some model",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "hello there"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/png;base64,iVBORw…"
          }
        }
      ]
    }
  ]
}
This structure, generated within open_webui/utils/middleware.py, allows a custom pipe to easily capture and process the image data.
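For example (a minimal sketch – the function name is mine), a pipe can walk the messages and pull out any base64 images before deciding what to forward:

def extract_images(messages: list[dict]) -> list[str]:
    # Multimodal user messages arrive as a list of parts, as in the JSON above
    images = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            for part in content:
                if part.get("type") == "image_url":
                    images.append(part["image_url"]["url"])  # "data:image/png;base64,..."
    return images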
If you drag-and-drop documents into the prompt, this is where the RAG inflexibility becomes a major problem. Open WebUI intercepts the document, performs RAG, and doesn't expose anything to the custom pipe until after it has taken your prompt, run a RAG search, retrieved the chunks, and injected those chunks plus a system prompt (telling the model to use the chunks when answering your question) into your conversation.
This means it's not simply a matter of reading in the original files and sending the full content (text or binary) wherever you want, because you also have to deal with the chunks and RAG template that have already been injected into your pipe's messages.
Here’s the logic flow:
User Input: The user types something into the chat input.
Middleware (in open_webui/utils/middleware.py):
- The request goes through middleware, specifically process_chat_payload.
- process_chat_payload is where the RAG logic is always applied, regardless of whether a custom pipe is being used.
- It checks features.web_search, features.image_generation, and features.code_interpreter to see if those should be enabled.
- Crucially, it always calls get_sources_from_files if there are any files. This function is the heart of the RAG system.
- The RAG template (RAG_TEMPLATE) is always prepended to the first user message, or added as a system prompt if one doesn't exist.
get_sources_from_files (in open_webui/retrieval/utils.py): this is where the retrieval itself happens.
With the file-upload option "Using Focused Retrieval", OWUI sends the full content to your pipe in exactly the same way it sends RAG chunks. It uses the same back-end RAG pipeline, simply skipping the vector search and jumping straight to injecting the RAG template into your pipe's messages – this time with the full content instead of chunks.
And… there’s a problem if your model is connected to a knowledge base. I think a common use-case is where people want a knowledge base but occasionally want to upload a file when the conversation requires more focussed attention by the LLM on a specific document.
But while you might reasonably assume that uploading a file (even with full content toggled on) will make it the "focus" of the current discussion, OWUI simply throws that file into the mix with all the other knowledge base files before a RAG search is done. It's therefore possible that the file you uploaded won't even be included in the search results! You have no way of telling OWUI that your uploaded file is more important than the knowledge base files, and you can't disable the knowledge base on a turn-by-turn basis.
Regardless, once OWUI has done its search and provided your pipe with the chunks and file IDs, you can fetch the full document content via the /api/v1/files/{id}/content endpoint (defined in open_webui/routers/files.py).
However, the prompt still contains the RAG chunks, necessitating manual removal to avoid redundancy – a clumsy workaround.
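Fetching the original file from inside a pipe is straightforward once you have the file ID – something along these lines (the base URL and token handling are assumptions for illustration; only the endpoint path comes from routers/files.py):

import requests

def get_file_content(base_url: str, token: str, file_id: str) -> bytes:
    # Pull the stored file back out of OWUI by its ID
    response = requests.get(
        f"{base_url}/api/v1/files/{file_id}/content",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.content  # raw bytes; decode or base64-encode as needed downstream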
The latest release of OWUI (Feb 2025) now includes a setting, "Full Context Mode", where you can specify whether you want full documents or RAG. This achieves the same result as setting "Using Focused Retrieval" on a file-by-file basis for files you've uploaded into the prompt.
However, there’s a catch. The new feature is controlled by a boolean setting in the backend configs RAG_FULL_CONTEXT, which unfortunately means it’s global.
This means users can't choose, on a file-by-file, prompt-by-prompt, or even model-by-model basis, whether to send full files or chunks from the RAG query.
This setting impacts how the get_sources_from_files function in retrieval.utils operates…
- If RAG_FULL_CONTEXT is True, the entire document is returned from all specified sources. The context returned from the function does NOT get chunked or embedded, but it still contains only the text content of the document (no binary or base64 of the file is accessible to a pipe).
- If RAG_FULL_CONTEXT is False (the default), chunks are retrieved as before. The number of chunks can be configured via the RAG_TOP_K config setting. The function then calls the embedding function on the query and uses those embeddings to search the vector DB.
And, just like when you upload a file and set “Using Focused Retrieval”, OWUI still uses the internal RAG pipeline, even though it’s sending the full contents of the document. So again, there’s no way to intercept the document in a pipe and do something with it before the RAG search has taken place and injected chunks into your chat history.
I've tested all the workarounds I can think of, using pipes, filters, inlets… there's no solution where a custom-written pipe can avoid / disable / block OWUI's internal RAG pipeline from triggering and modifying your prompts before your pipe is even called – unless it's a file that can't be RAGged (e.g. image files).
But also… OWUI uses your pipe as a host for other tasks:
A pipe is considered a model by OWUI. So your custom pipe, being a model, is used by OWUI to generate a nice title for your chat / conversation. It does this by sending the model (your pipe) a request asking the LLM to look at the first prompt from the user and come up with a nice title, with an emoji or two.
Luckily, this can be switched off in settings but you still need to cater for the possibility it’s not switched off:
# Check if this is a chat title generation request
if "### Task:\nAnalyze the chat history" in system_content:
    print("Detected chat title generation request, skipping…")
    return {"messages": messages}
A Workable Workaround (Minimal Core Modification)
One possible (untested) solution involves a minor change to the core process_chat_payload function. This modification ensures that the entire file-handling logic (including chunking and vector database lookup) is skipped if the "!" prefix is present. Critically, it preserves the original file information and passes it along to the custom pipe via a __knowledge__ entry in extra_params.
open_webui/utils/middleware.py (Simplified)
async def process_chat_payload(request, form_data, metadata, user, model):
    # … other code …

    bypass_rag = False
    if user_message is not None and user_message.startswith("!"):
        bypass_rag = True
        extra_params["__knowledge__"] = metadata.get("files", [])  # preserve the original file info for the pipe

    if not bypass_rag:
        # … (original file handling / RAG logic, skipped when bypassing) …
        pass

    # … rest of process_chat_payload …
This modified process_chat_payload effectively acts as a true “bypass RAG” switch, giving your custom pipe complete control over how file content is handled.
While this modification requires touching the core code (which isn’t ideal), it’s a targeted – and quite small – change that may resolve the conflict between OWUI’s internal processing and your custom pipe’s intended behavior.
Hopefully, future versions of Open WebUI will include more robust mechanisms for dynamically controlling RAG and accessing full file content directly, eliminating the need for any workarounds. Until then, you'll need to drag and drop files into your prompt as required, as those can be sent, in full, to the LLM (via "Using Focused Retrieval").
LATEST UPDATE 23rd Feb 2025…
The Final Solution: Breaking Free from the Flying Dutchman
I’ve re-written this blog post 3 times as each day I find new information and discover possible solutions.
OWUI’s forcing of RAG, even with options for “full documents” felt quite like the tale of Bootstrap Bill Turner and Davy Jones. OpenWebUI’s RAG system can be an unwanted passenger, binding itself to my custom pipe like Bootstrap Bill bound to the Flying Dutchman. Every time I tried to process a document, OpenWebUI’s RAG would inject itself into the process, like a symbiotic entity I couldn’t shake off. I needed to stop it somehow.
Here's what I've now implemented in a rather complex and long pipe – but remember how, at the start of this post, I mentioned that everyone has a very specific environment, tech stack, and use-case when it comes to RAG? Well, I do too. I want to be able to connect to all unsupported models (Perplexity, Google, Anthropic) and also connect OWUI to an n8n workflow. And I have very specific requirements about how the prompt, conversation history, and documents (chunks, full text, binary) are handled depending on the model.
So it's just too convoluted to share, as people will inevitably have questions and there's a limit to the time I can put in. But I do have demos of pipes and n8n workflows on my GitHub that cover the concepts I've discussed here. It's just that the final solution is very much coded for my use-case, and it's 1,500 lines long.
I encourage you to take what I’ve learned, look at the demos I have, and build out your own solution suitable specifically for you.
Here’s what I implemented:
- Firstly, my pipe inspects the relevant settings. If they mean the internal RAG pipeline will return actual chunks (rather than full-document "chunks"), the pipe knows the content attached to the RAG template and injected into its messages is genuinely chunked; otherwise it knows each "chunk" is really a full document.
- In addition, it looks at the <source_id> tags that OWUI uses when injecting chunks into the prompt to work out exactly which file each chunk relates to. It then adds a new tag, <filename>, into the results. This means the user can actually mention a filename to the model and the model will know where to look.
- If actual chunks are being returned, the pipe checks whether the first character of the user prompt is "-". If so, this is essentially a message from the user saying, "I want to disable all RAG for this turn of the conversation." The pipe strips out the chunks and the RAG template entirely – only the user's prompt, chat history, and system message are sent. (A simplified sketch of this prefix handling follows after this list.)
- If actual chunks are being returned and there's no "-" as the first character of the user prompt, the pipe then checks whether the first character is "!". If so, this is a message from the user saying, "I want full document text to be sent on this turn of the conversation." The pipe strips out the chunks, reads in the actual file contents, and inserts the full content where the chunks were.
- The method for getting the right metadata (to obtain the file ID) differs depending on whether the chunks come from a file in the knowledge base or from a file uploaded into the prompt.
- For Google, Anthropic, and Perplexity, if it’s an image, it grabs the base64 (which is easy because it’s just part of the user input) and sends it along to those models.
- NOTE: image files can't be put into a knowledge base, and you can't select an image file and toggle on "Using Focused Retrieval", because there's no text content in an image, so OWUI (obviously) doesn't trigger any RAG processes.
- For Google, if the chunk refers to a PDF, the settings are such that the user wants full content, and the relevant valve is enabled, the actual binary of the PDF is sent to Google, because I like how Google does PDF OCR.
- For n8n, I want to handle all RAG and all chat history myself. (I'm using OWUI as an interface; my logic layer is all n8n, and my data layer is Supabase.) So, regardless of any settings or RAG process, the pipe will:
  - read in the original file that's part of the prompt (from the knowledge base or uploaded, image or document) and send the base64 of the file plus the last user message to n8n
  - wait for a response, updating the status in OWUI every 2 seconds
  - update the status in OWUI whenever n8n calls a tool or executes a sub-workflow
  - receive the response from n8n
  - extract any <think> elements for display as collapsible elements in OWUI
  - display the response + <think> element
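As promised, here's a simplified sketch of the "-" / "!" prefix handling. It assumes OWUI has injected the retrieved chunks inside a <context>…</context> block in the last user message; the regex and function name are mine, and the real pipe does considerably more (file lookups, per-model handling, and so on):

import re

CONTEXT_BLOCK = re.compile(r"<context>.*?</context>\s*", re.DOTALL)

def apply_prefix_rules(messages: list[dict]) -> list[dict]:
    last = messages[-1]
    if not isinstance(last.get("content"), str):
        return messages  # multimodal content (e.g. images) is left alone
    user_text = CONTEXT_BLOCK.sub("", last["content"]).lstrip()
    if user_text.startswith("-"):
        # "-" = disable RAG for this turn: keep only the user's own words
        last["content"] = user_text[1:].lstrip()
    elif user_text.startswith("!"):
        # "!" = full-document mode: drop the chunks here; the real pipe then fetches
        # the full file content and inserts it where the chunks were
        last["content"] = user_text[1:].lstrip()
    return messages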
Summary
With the OWUI-provided admin-level and uploaded-file-level settings to use full content or not, combined with the prompt-level ability to disable RAG (-) or force full content (!), and the fact that a single pipe handles multiple models plus an n8n workflow, I think I've finally proven to myself that it IS possible to… well, not work around, but work within OWUI's RAG implementation and get reasonable flexibility for switching between RAG chunks, full-document "chunks", and binary files.
Technical Footnote
I discovered a bit of a challenging bug. Here's how OWUI formats the <source> and <source_id> tags:
<source>
<source>1</source_id>
content here
</source>
See the issue? The inner opening tag is <source> when it should be <source_id>. That took about 6 hours to notice, while I wondered why my regex extraction of the "content" between the <source> tags kept failing!
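For what it's worth, here's the kind of tolerant regex I ended up using – it accepts either <source> or <source_id> as the inner opening tag (illustrative only; adjust to whatever your OWUI version actually emits):

import re

injected = """<source>
<source>1</source_id>
content here
</source>"""

# Accept either <source> or <source_id> as the inner opening tag
SOURCE_RE = re.compile(
    r"<source>\s*<source(?:_id)?>(?P<id>.*?)</source_id>\s*(?P<content>.*?)\s*</source>",
    re.DOTALL,
)

for match in SOURCE_RE.finditer(injected):
    print(match.group("id"), "->", match.group("content"))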