Understanding the Brains (or lack thereof) Behind Your Chat App: Why LLMs Aren’t What You Might Think

Large Language Models (LLMs) are incredible pieces of technology, capable of generating remarkably human-like text, answering complex questions, and even assisting with creative tasks. It’s easy to interact with a chat application powered by an LLM and feel like you’re talking to a truly intelligent, aware entity with memory and understanding. However, this is where a common misconception arises, and understanding the reality is key to having a good experience.

I believe the best analogy for an LLM is that it’s like an incredibly advanced calculator.

Think about a standard calculator. You input “1+1”, and through a series of pre-programmed calculations on silicon chips, it outputs “2”. The process is triggered by your input, runs its course, and provides an output. Once it’s done, it doesn’t remember that you just asked “1+1”. It’s stateless, waiting for the next input.

LLMs operate on a similar fundamental principle, albeit on a vastly more complex scale. You ask a question or provide a prompt, and this triggers a series of incredibly complex calculations (based on the massive datasets they were trained on and the intricate algorithms coded by humans). The result of these calculations is an output, often in the form of a coherent English sentence or block of text. Then, just like the calculator, it returns to a stateless waiting state. The process for that specific input is finished; the ‘code’ has run its course.
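To make that concrete, here’s a minimal Python sketch. The `call_llm()` function is a hypothetical stand-in for whatever chat-completion API your application actually uses; the only point being illustrated is that each call is an entirely independent computation:

```python
# call_llm() is a hypothetical stand-in for any chat-completion API:
# messages in, generated text out, nothing kept in between.
def call_llm(messages: list[dict]) -> str:
    return "(model output)"  # placeholder for the real API call

# First call: the model sees only this prompt.
call_llm([{"role": "user", "content": "My name is Alice."}])

# Second call: a brand-new computation. Nothing from the first call is
# carried over, so the model has no idea who "Alice" is.
call_llm([{"role": "user", "content": "What is my name?"}])
```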

No Inherent Memory or Chat History

This is a crucial point: the LLM itself does not inherently remember anything from your previous interaction. It doesn’t have ‘memory’ in the human sense. The context you provided in a previous turn (like documents or specific instructions) is processed during that turn and then, from the LLM’s perspective, it’s gone. It relies only on what is explicitly included in the prompt it receives for that specific interaction.

So, how do chat applications maintain a conversation? This is handled by code *outside* of the LLM. Our systems save each user input and each LLM response, building a ‘conversation history’. On each subsequent turn, this entire history (or a relevant portion of it) is fed back into the LLM as part of the new prompt. This allows the LLM to generate a response that is contextually relevant to the ongoing conversation, creating the illusion of memory.
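A rough sketch of what that external code does (reusing the hypothetical `call_llm()` stub from above): the ‘memory’ is just an ordinary list owned by the application, resent in full on every turn.

```python
# The application, not the model, owns the conversation.
history: list[dict] = [
    {"role": "system", "content": "You are a helpful assistant."},
]

def chat_turn(user_message: str) -> str:
    # 1. Record the user's message.
    history.append({"role": "user", "content": user_message})

    # 2. Send the ENTIRE history to the stateless model.
    reply = call_llm(history)

    # 3. Record the reply so the next turn can include it too.
    history.append({"role": "assistant", "content": reply})
    return reply

chat_turn("My name is Alice.")
chat_turn("What is my name?")  # only 'works' because the history is resent
```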

Function Calls and RAG: Context for the Moment

Another area of misconception is around ‘function calls’ and how external information (like documents via RAG – Retrieval Augmented Generation) is used. When an LLM appears to perform a task, like “looking at your inbox”, it’s not actually doing that itself. We’ve essentially trained these advanced calculators that sometimes, the most appropriate ‘output’ isn’t a final user response, but a request for our external, human-coded system to perform a specific function.

So, when the LLM’s calculations determine that requesting a function is the best next step, it outputs a structured request. Our external system intercepts this request, performs the actual function (like calling the Gmail API to find Bob’s email about lunch), collects the results, and then calls the stateless LLM calculator again. This time, the prompt includes the entire conversation history plus the results from the function call. It’s not a ‘return’ in the programming sense for the LLM; it’s just new information injected into the prompt for this specific turn. Our system essentially tells the LLM, “Hey, the user made a request, you indicated we should go get some info, and here are the results.”
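In Python-flavoured pseudocode, the loop looks something like the sketch below. The helpers `is_tool_request()` and `search_gmail()` are hypothetical, and real systems use structured tool-calling fields rather than raw JSON strings, but the shape of the flow is the same: two separate, stateless calls to the model, with our code doing the real work in between.

```python
import json

def handle_turn(history: list[dict]) -> str:
    # First pass: the model may answer directly, or it may emit a
    # structured request asking our code to run a tool.
    output = call_llm(history)

    if is_tool_request(output):                    # hypothetical check
        request = json.loads(output)               # e.g. {"tool": "search_gmail", "query": "..."}
        # OUR code does the actual work -- the model never touches Gmail.
        results = search_gmail(request["query"])   # hypothetical helper

        # Second pass: call the stateless model again, this time with the
        # conversation PLUS the tool results injected as extra context.
        output = call_llm(history + [
            {"role": "tool", "content": json.dumps(results)},
        ])

    return output
```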

Similarly, when you provide documents or other context (RAG), our system injects that information into the system prompt for the LLM to consider during that turn. The LLM processes it, generates a response, and then that specific context is gone from its immediate awareness.
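A simplified sketch of that injection, assuming a hypothetical `vector_store.search()` retriever:

```python
def rag_turn(history: list[dict], user_message: str) -> str:
    # Retrieve chunks relevant to this question (vector_store is hypothetical).
    chunks = vector_store.search(user_message, top_k=5)

    # Build a system prompt containing the retrieved text -- for THIS turn only.
    system = (
        "Answer using the context below.\n\n"
        + "\n---\n".join(chunk.text for chunk in chunks)
    )
    messages = (
        [{"role": "system", "content": system}]
        + history
        + [{"role": "user", "content": user_message}]
    )
    return call_llm(messages)
    # The retrieved chunks live in a local variable; they are never written
    # back into `history`, so the next turn starts without them.
```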

It’s important to understand that the results from function calls or RAG lookups are not added to the persistent chat history that is maintained between turns. Why? Because the results from a function call or RAG could be very large – lots of text, search results, entire documents, or even images. Including all of this in the chat history for every subsequent turn would quickly exceed the LLM’s context window (the limit on how much text it can process at once). The assumption is that this specific context is primarily relevant for the turn in which it was retrieved.
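One common way this shows up in code is a prompt-building step that persists only the plain conversational turns and trims the oldest ones to fit a token budget. The budget and the `count_tokens()` helper below are illustrative assumptions, not a real model’s limits:

```python
MAX_PROMPT_TOKENS = 8_000  # illustrative budget, not a real model's limit

def build_prompt(history: list[dict]) -> list[dict]:
    # Persist only system/user/assistant turns; bulky tool and RAG payloads
    # from earlier turns are deliberately left out.
    kept = [m for m in history if m["role"] in ("system", "user", "assistant")]

    # Drop the oldest turns until the prompt fits the context window.
    # count_tokens() is a hypothetical tokenizer helper.
    while count_tokens(kept) > MAX_PROMPT_TOKENS and len(kept) > 1:
        kept.pop(1)  # keep the system prompt at index 0, drop the oldest turn
    return kept
```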

The Drive for Coherence Over Accuracy (Sometimes)

LLMs are trained on vast amounts of text to predict the most probable next word in a sequence, aiming to produce coherent and natural-sounding language. Unless specifically trained or prompted otherwise, they tend to adopt a polite, helpful, and non-combative posture. This means that when faced with a question where the answer isn’t explicitly clear from the provided context or their training data, the model will often attempt to generate a plausible-sounding answer that fits the conversational flow, even if it’s factually inaccurate. This is because their primary directive, based on their training, is to produce coherent output. If models constantly responded with “I don’t know” or “Can you clarify?”, the user experience would be frustrating. Achieving a model that is more critical, asks clarifying questions, or refuses to answer without sufficient information typically requires specific fine-tuning or advanced prompting techniques.
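The crudest version of that prompting approach is simply an instruction in the system prompt, something like the sketch below. It nudges the model but guarantees nothing – the model is still predicting probable text, not checking facts.

```python
# A simple (and imperfect) prompting technique: tell the model explicitly
# that admitting uncertainty is the preferred behaviour. This goes in as
# the system message at the top of the conversation history.
CAUTIOUS_SYSTEM_PROMPT = (
    "Answer only from the provided context and conversation. "
    "If the answer is not there, say 'I don't know' and ask a "
    "clarifying question instead of guessing."
)
```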

Output Generation: No Mid-Course Correction

Just like a calculator, once the LLM’s internal process for generating an output begins based on the input prompt, it commits to producing that output. A calculator runs its logic and *must* give an answer; it doesn’t ‘know’ halfway through a calculation that it’s heading towards a wrong result. Similarly, an LLM doesn’t pause mid-sentence and think, “Oh, this isn’t right.” It’s generating tokens (words or sub-word units) based on probabilities derived from the input and the tokens it has *already* generated in the current response.

This is a difficult concept because as humans, we are aware of what we are saying as we say it and can self-correct. We find it hard to believe that something producing human-like language could ‘make things up’ (hallucinate) without being aware it’s doing so. But the LLM is simply executing its complex pattern-matching function. It uses the words it outputs to inform the generation of subsequent words in that same response, but it doesn’t have a metacognitive awareness of the factual accuracy of the complete statement until it’s finished.
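Conceptually, the generation loop looks like the sketch below. `next_token_distribution()` and `END_OF_TEXT` are hypothetical stand-ins for the model’s forward pass and its stop token; the thing to notice is that nothing in the loop ever checks whether the text being built is *true* – only whether each next token is probable.

```python
import random

def generate(prompt_tokens: list[int], max_new_tokens: int = 100) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Hypothetical forward pass: given everything generated so far,
        # return a mapping of {token_id: probability} for the next token.
        probs = next_token_distribution(tokens)

        # Pick the next token from that distribution and keep going.
        next_token = random.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(next_token)

        if next_token == END_OF_TEXT:  # hypothetical stop token
            break
    return tokens
```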

The Eerily Human Sounding “Trick”

The part that really screws with our heads is that unlike any machine before it, the LLM’s output sounds so incredibly human. It’s masterfully calculated to predict and generate language that is coherent, relevant, and often insightful. This is where our natural human instinct kicks in – we are wired to attribute intent, memory, and consciousness to anything that communicates like a human.

If a calculator gave you a wrong answer to 1+1, you wouldn’t think it meant to lie. Because a calculator can’t talk, we don’t expect to be able to ask it *why* it gave the wrong answer, and we certainly don’t have an emotional response to its incorrect output. You wouldn’t feel a bit hurt by a calculator giving you the wrong sum.

But if that calculator *could* talk, and sounded apologetic or defensive about its mistake, our reaction would be completely different. Would you kick a photocopier if it could respond, “Ow, that hurt! I’m really sorry I keep jamming the paper, let me try again”? Probably not, because we’d have an emotional response to that seemingly human interaction.

With LLMs, because the output is so human-like, when they make a mistake or produce something inaccurate (sometimes referred to as ‘hallucinations’), we struggle to accept that there wasn’t some intent behind it, or that the model isn’t aware of its error. When you ask an LLM *why* it gave a particular answer, and it provides a seemingly plausible explanation, remember: that explanation is just another calculated output based on the input (“Why did you say X?”) and its training data. It’s not introspection or self-awareness. It’s an amazing magic trick – the trick is making you believe it’s human by *sounding* human.

Multi-Turn Conversations: Where LLMs Differ from Simple Calculators

This is where the calculator analogy starts to break down slightly, and where the power of a chat interface becomes apparent. While a calculator gives you one output and the interaction is over, the chat interface allows for a multi-turn conversation. This is crucial because the LLM can use its *own* previous output, combined with your subsequent input (which might include corrections or requests for clarification), as new context for the next turn’s calculation. This iterative process allows the model to refine its understanding and generate a more accurate or desired response over several exchanges. You often need a few turns of conversation to guide the LLM to the output you’re looking for, leveraging the chat history to build a more robust context.

The Key Takeaway

LLMs are powerful, advanced calculators for language. They process input and generate output based on complex patterns learned from vast data. They do not think, remember, or feel like humans. The ‘intelligence’ and ‘memory’ you perceive in a chat application are largely the result of sophisticated engineering *around* the LLM, feeding it the necessary context (chat history, function call results, RAG data) on each turn. Understanding this distinction is vital for setting realistic expectations and effectively using AI-powered chat tools.

Prompts That Might Not Work (and Why)

Given the stateless nature of the LLM itself and how context is managed externally, certain types of questions might lead to unhelpful or inaccurate responses. Here are a few examples:

  • “Where did you get that information from?”
    Unless the system explicitly used a tool like web search and included the source URL in the prompt for that turn, the LLM doesn’t have a ‘memory’ of where it pulled a fact from its training data or previous context. It’s just generating the most probable next words based on the input.
  • “What was the request you made when looking for the email from Bob?”
    The LLM doesn’t retain the specific parameters or details of the function call request it generated in a previous turn. It just knows that it outputted a request, and now it’s receiving the results of that request in the current prompt.
  • “What makes you say that?”
    While the LLM can generate a plausible-sounding explanation for its previous statement, it’s not recalling its actual internal calculation or ‘decision-making’ process. It’s simply generating a response to the new input “What makes you say that?” based on patterns in its training data about explaining things.
  • “Okay, but what else did you get from that Google search earlier?”
    The results from a previous function call (like a Google search) or RAG process were included in the system prompt for the turn where they were used. They are not typically added to the persistent chat history, so the LLM won’t ‘remember’ the full set of results from a previous turn.
  • “Can you remind me what I told you about my project yesterday?”
    The LLM itself doesn’t remember what you told it yesterday. The chat history is provided by the external system on each turn. If the history provided doesn’t go back to yesterday, or if the relevant detail was in context (RAG/function call) that wasn’t included in the history, the LLM won’t know – but, like a calculator, it’s trained to come up with the best possible answer, so it’s very common for the LLM to “lie”: it will confidently generate a plausible-sounding reply rather than admit it has no record of the conversation.

Understanding these limitations helps in formulating prompts that are more likely to yield useful results and manages expectations about the AI’s capabilities.
