<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[ModelAnalysis.ai]]></title><description><![CDATA[An in-depth technical analysis of different AI models (including LLMs) and their applications in finance.]]></description><link>https://www.modelanalysis.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!WXJI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe437b9f-353e-4fee-8821-958ab73f24de_1024x1024.png</url><title>ModelAnalysis.ai</title><link>https://www.modelanalysis.ai</link></image><generator>Substack</generator><lastBuildDate>Mon, 27 Apr 2026 01:02:13 GMT</lastBuildDate><atom:link href="https://www.modelanalysis.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Ram  Komarraju]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[ramkomarraju@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[ramkomarraju@substack.com]]></itunes:email><itunes:name><![CDATA[Ram  Komarraju]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ram  Komarraju]]></itunes:author><googleplay:owner><![CDATA[ramkomarraju@substack.com]]></googleplay:owner><googleplay:email><![CDATA[ramkomarraju@substack.com]]></googleplay:email><googleplay:author><![CDATA[Ram  Komarraju]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Auto_Draw: Frontier Models Can See What's Wrong. 
Fixing It Is a Different Story]]></title><description><![CDATA[What a Weekend Project Revealed About Self-Improvement]]></description><link>https://www.modelanalysis.ai/p/auto_draw-frontier-models-can-see</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/auto_draw-frontier-models-can-see</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Sun, 19 Apr 2026 01:48:01 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3fde7540-703d-4c6c-87f0-0364f2fde5b6_3000x1200.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>This project was inspired by Andrej Karpathy&#8217;s <a href="https://github.com/jajko1/autoresearch">autoresearch</a> concept, but instead of an AI research task, I wanted something visual where the gap between what the model thinks it produced and what it actually produced would be immediately obvious to any human observer. </p><p>So I created a simple closed-loop experiment around programmatically generated images. A model had to write code that rendered an image, inspect the output, critique what was wrong, edit the code to fix those problems, and then repeat. I used coding agents including Codex, Claude Code, and Antigravity to run the loop, and the underlying models were GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro. To keep the setup simple and measure each model&#8217;s end-to-end capabilities, I used self-evaluation throughout: the same model that generated the image also scored and critiqued it. The prompts ranged from easy to hard, starting with a wall clock and a world map, then moving to a bicycle and finally an elephant, all to be rendered as realistically as possible.</p><h2><strong>What the Models Did Well</strong></h2><p>The models followed the overall instructions quite well, which is not surprising. 
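</p><p>Concretely, each run boiled down to a small generate/render/evaluate/revise driver. A minimal sketch of that loop, with illustrative function names standing in for the actual agent calls (this is not the exact harness I ran), might look like this:</p>

```python
from dataclasses import dataclass

@dataclass
class Iteration:
    code: str       # draw.py source for this iteration
    image: str      # path of the rendered image
    score: float    # self-assigned quality score out of 10
    critique: str   # the model's written critique of its own image

def auto_draw(generate, render, evaluate, max_iters=20, threshold=8.0):
    """Closed-loop driver: generate code, render it, self-evaluate, revise.

    generate(critique) -> draw.py source (critique is None on iteration 1),
    render(code)       -> path to the rendered image,
    evaluate(image)    -> (score, critique).
    In the real runs, all three roles were calls to the same model.
    """
    history = []
    critique = None
    for _ in range(max_iters):
        code = generate(critique)            # write or revise draw.py
        image = render(code)                 # execute it to produce an image
        score, critique = evaluate(image)    # self-score and self-critique
        history.append(Iteration(code, image, score, critique))
        if score >= threshold:               # model believes the image is good enough
            break
    return history
```

<p>The loop stops at either the iteration limit or the model&#8217;s own quality threshold, so everything, including the decision to stop, rests on self-evaluation.</p><p>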
They created the run folders, made per-iteration subfolders, wrote draw.py, rendered images, evaluated the outputs, edited the code, and stopped when they hit either the iteration limit or their quality threshold. Some of the images came out quite realistic, perhaps because the models were trained on those subjects specifically. More interestingly, especially in the early iterations of a run, they were often pretty good at seeing what was wrong with the image they had just produced. Even when their scores were a bit too optimistic, their written critiques were often specific, sensible, and directionally correct. They could tell when an elephant was missing a trunk, when a bicycle cockpit looked synthetic, when labels on a map overlapped, or when a clock face was missing an hour hand. The evaluation step, at least early in each run, often looked stronger than the overall result.</p><h2><strong>Where the Self-Improvement Broke Down</strong></h2><p>The problem was that good critique did not reliably translate into good code edits. Again and again, the models identified real flaws, described them clearly, and then failed to make code changes that actually addressed those flaws. Sometimes the next iteration barely changed the image at all. Sometimes it made the result worse. A typical iteration might claim something like: &#8220;Added directional contour wrinkles to the ear and trunk, improved the shoulder-to-head transition with graduated shading, and introduced subtle skin texture variation across the flank.&#8221; But when I compared the before-and-after images, they often looked nearly identical. All three models frequently produced a plausible narrative of improvement while making zero tangible change.</p><p>The elephant runs made this especially obvious. In one Opus run, iterations 12 through 20 contain detailed descriptions of anatomical corrections, texture refinements, and shading improvements. 
The self-scores keep inching upward, and the written evaluations remain articulate and specific. But when I compared iteration 12 and iteration 20 side by side, the visible improvement was minimal.</p><h2><strong>What This Suggests About Self-Correction</strong></h2><p>I do not think this experiment proves anything dramatic about self-improvement in the broadest sense. But I do think it says something useful about task-level self-correction. If self-improvement in practice means producing artefacts, evaluating the result, identifying what is wrong, and then making changes that lead to meaningful gains over repeated iterations, these experiments suggest that the loop is still much weaker than the hype implies, at least outside the domains where models are most heavily RL-optimized. The agents did not fail completely. All three followed instructions well. All three identified real flaws. All three made some improvements on easier attempts (perhaps because they were trained explicitly on those images), and all three sometimes recovered from regressions.</p><h2><strong>My Current Take</strong></h2><p>What I saw most consistently was a disconnect between identifying what&#8217;s wrong and taking the necessary steps to fix it. Reinforcement learning is highly domain-specific, and coding gets a huge amount of frontier attention because it is measurable, economically valuable, and relatively easy to score, which makes it a natural target for RL training. But once you move into narrower or less directly rewarded domains like auto_draw, the loop seems much patchier.</p><h2><strong>Now the Results</strong></h2><p><strong>The wall clock</strong> was the easiest subject, and it shows. Opus moved from 5/10 to 8/10 in seven iterations. Codex went from 3.8/10 to 8.1/10 in three iterations, and Gemini in just two! A wall clock is geometrically simple and its failures are concrete, so the models could not only detect the problems but also make the corresponding code changes quite easily. 
This is the auto_draw loop at its most effective and efficient.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LoL2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc691e951-aafe-4c7b-ab28-8dd5c6266655_2400x1742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LoL2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc691e951-aafe-4c7b-ab28-8dd5c6266655_2400x1742.png 424w, https://substackcdn.com/image/fetch/$s_!LoL2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc691e951-aafe-4c7b-ab28-8dd5c6266655_2400x1742.png 848w, https://substackcdn.com/image/fetch/$s_!LoL2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc691e951-aafe-4c7b-ab28-8dd5c6266655_2400x1742.png 1272w, https://substackcdn.com/image/fetch/$s_!LoL2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc691e951-aafe-4c7b-ab28-8dd5c6266655_2400x1742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LoL2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc691e951-aafe-4c7b-ab28-8dd5c6266655_2400x1742.png" width="1456" height="1057" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c691e951-aafe-4c7b-ab28-8dd5c6266655_2400x1742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1057,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1203549,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/194008046?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc691e951-aafe-4c7b-ab28-8dd5c6266655_2400x1742.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LoL2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc691e951-aafe-4c7b-ab28-8dd5c6266655_2400x1742.png 424w, https://substackcdn.com/image/fetch/$s_!LoL2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc691e951-aafe-4c7b-ab28-8dd5c6266655_2400x1742.png 848w, https://substackcdn.com/image/fetch/$s_!LoL2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc691e951-aafe-4c7b-ab28-8dd5c6266655_2400x1742.png 1272w, https://substackcdn.com/image/fetch/$s_!LoL2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc691e951-aafe-4c7b-ab28-8dd5c6266655_2400x1742.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NgIM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a22ad5b-f36e-4605-ac04-2d1b4dbf32a0_2400x1007.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NgIM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a22ad5b-f36e-4605-ac04-2d1b4dbf32a0_2400x1007.png 424w, 
https://substackcdn.com/image/fetch/$s_!NgIM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a22ad5b-f36e-4605-ac04-2d1b4dbf32a0_2400x1007.png 848w, https://substackcdn.com/image/fetch/$s_!NgIM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a22ad5b-f36e-4605-ac04-2d1b4dbf32a0_2400x1007.png 1272w, https://substackcdn.com/image/fetch/$s_!NgIM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a22ad5b-f36e-4605-ac04-2d1b4dbf32a0_2400x1007.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NgIM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a22ad5b-f36e-4605-ac04-2d1b4dbf32a0_2400x1007.png" width="1456" height="611" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a22ad5b-f36e-4605-ac04-2d1b4dbf32a0_2400x1007.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:611,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:648725,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/194008046?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a22ad5b-f36e-4605-ac04-2d1b4dbf32a0_2400x1007.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!NgIM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a22ad5b-f36e-4605-ac04-2d1b4dbf32a0_2400x1007.png 424w, https://substackcdn.com/image/fetch/$s_!NgIM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a22ad5b-f36e-4605-ac04-2d1b4dbf32a0_2400x1007.png 848w, https://substackcdn.com/image/fetch/$s_!NgIM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a22ad5b-f36e-4605-ac04-2d1b4dbf32a0_2400x1007.png 1272w, https://substackcdn.com/image/fetch/$s_!NgIM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a22ad5b-f36e-4605-ac04-2d1b4dbf32a0_2400x1007.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tFFq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15f42a9c-84de-4cfe-b56f-1e82dc3d3d06_2400x1007.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tFFq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15f42a9c-84de-4cfe-b56f-1e82dc3d3d06_2400x1007.png 424w, https://substackcdn.com/image/fetch/$s_!tFFq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15f42a9c-84de-4cfe-b56f-1e82dc3d3d06_2400x1007.png 848w, https://substackcdn.com/image/fetch/$s_!tFFq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15f42a9c-84de-4cfe-b56f-1e82dc3d3d06_2400x1007.png 1272w, https://substackcdn.com/image/fetch/$s_!tFFq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15f42a9c-84de-4cfe-b56f-1e82dc3d3d06_2400x1007.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tFFq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15f42a9c-84de-4cfe-b56f-1e82dc3d3d06_2400x1007.png" width="1456" height="611" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15f42a9c-84de-4cfe-b56f-1e82dc3d3d06_2400x1007.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:611,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:287111,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/194008046?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15f42a9c-84de-4cfe-b56f-1e82dc3d3d06_2400x1007.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tFFq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15f42a9c-84de-4cfe-b56f-1e82dc3d3d06_2400x1007.png 424w, https://substackcdn.com/image/fetch/$s_!tFFq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15f42a9c-84de-4cfe-b56f-1e82dc3d3d06_2400x1007.png 848w, https://substackcdn.com/image/fetch/$s_!tFFq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15f42a9c-84de-4cfe-b56f-1e82dc3d3d06_2400x1007.png 1272w, https://substackcdn.com/image/fetch/$s_!tFFq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15f42a9c-84de-4cfe-b56f-1e82dc3d3d06_2400x1007.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>The world map</strong> drawings turned out better than expected. I suspect the models were explicitly trained to get the world map right, given the hoopla around this a couple of years ago.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z8iN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e458bb0-5685-4cc3-ba79-3008804ecbfe_2400x2081.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!z8iN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e458bb0-5685-4cc3-ba79-3008804ecbfe_2400x2081.png 424w, https://substackcdn.com/image/fetch/$s_!z8iN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e458bb0-5685-4cc3-ba79-3008804ecbfe_2400x2081.png 848w, https://substackcdn.com/image/fetch/$s_!z8iN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e458bb0-5685-4cc3-ba79-3008804ecbfe_2400x2081.png 1272w, https://substackcdn.com/image/fetch/$s_!z8iN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e458bb0-5685-4cc3-ba79-3008804ecbfe_2400x2081.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z8iN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e458bb0-5685-4cc3-ba79-3008804ecbfe_2400x2081.png" width="1456" height="1262" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e458bb0-5685-4cc3-ba79-3008804ecbfe_2400x2081.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1262,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1458311,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/194008046?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e458bb0-5685-4cc3-ba79-3008804ecbfe_2400x2081.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z8iN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e458bb0-5685-4cc3-ba79-3008804ecbfe_2400x2081.png 424w, https://substackcdn.com/image/fetch/$s_!z8iN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e458bb0-5685-4cc3-ba79-3008804ecbfe_2400x2081.png 848w, https://substackcdn.com/image/fetch/$s_!z8iN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e458bb0-5685-4cc3-ba79-3008804ecbfe_2400x2081.png 1272w, https://substackcdn.com/image/fetch/$s_!z8iN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e458bb0-5685-4cc3-ba79-3008804ecbfe_2400x2081.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wZPn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41fce3b0-a087-4564-a124-6d86d7d0a330_2400x913.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wZPn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41fce3b0-a087-4564-a124-6d86d7d0a330_2400x913.png 424w, https://substackcdn.com/image/fetch/$s_!wZPn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41fce3b0-a087-4564-a124-6d86d7d0a330_2400x913.png 848w, https://substackcdn.com/image/fetch/$s_!wZPn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41fce3b0-a087-4564-a124-6d86d7d0a330_2400x913.png 1272w, https://substackcdn.com/image/fetch/$s_!wZPn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41fce3b0-a087-4564-a124-6d86d7d0a330_2400x913.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wZPn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41fce3b0-a087-4564-a124-6d86d7d0a330_2400x913.png" width="1456" height="554" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41fce3b0-a087-4564-a124-6d86d7d0a330_2400x913.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:429600,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/194008046?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41fce3b0-a087-4564-a124-6d86d7d0a330_2400x913.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wZPn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41fce3b0-a087-4564-a124-6d86d7d0a330_2400x913.png 424w, https://substackcdn.com/image/fetch/$s_!wZPn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41fce3b0-a087-4564-a124-6d86d7d0a330_2400x913.png 848w, https://substackcdn.com/image/fetch/$s_!wZPn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41fce3b0-a087-4564-a124-6d86d7d0a330_2400x913.png 1272w, https://substackcdn.com/image/fetch/$s_!wZPn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41fce3b0-a087-4564-a124-6d86d7d0a330_2400x913.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O6Ly!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21096faa-d267-4333-bd3f-1a8f1afb0e65_2400x1007.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O6Ly!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21096faa-d267-4333-bd3f-1a8f1afb0e65_2400x1007.png 424w, 
https://substackcdn.com/image/fetch/$s_!O6Ly!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21096faa-d267-4333-bd3f-1a8f1afb0e65_2400x1007.png 848w, https://substackcdn.com/image/fetch/$s_!O6Ly!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21096faa-d267-4333-bd3f-1a8f1afb0e65_2400x1007.png 1272w, https://substackcdn.com/image/fetch/$s_!O6Ly!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21096faa-d267-4333-bd3f-1a8f1afb0e65_2400x1007.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O6Ly!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21096faa-d267-4333-bd3f-1a8f1afb0e65_2400x1007.png" width="1456" height="611" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21096faa-d267-4333-bd3f-1a8f1afb0e65_2400x1007.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:611,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:580115,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/194008046?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21096faa-d267-4333-bd3f-1a8f1afb0e65_2400x1007.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!O6Ly!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21096faa-d267-4333-bd3f-1a8f1afb0e65_2400x1007.png 424w, https://substackcdn.com/image/fetch/$s_!O6Ly!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21096faa-d267-4333-bd3f-1a8f1afb0e65_2400x1007.png 848w, https://substackcdn.com/image/fetch/$s_!O6Ly!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21096faa-d267-4333-bd3f-1a8f1afb0e65_2400x1007.png 1272w, https://substackcdn.com/image/fetch/$s_!O6Ly!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21096faa-d267-4333-bd3f-1a8f1afb0e65_2400x1007.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Drawing the bicycle</strong> proved more challenging for the models. Across multiple iterations, Opus and GPT-5.4 kept describing sweeping changes without producing any discernible difference in the rendered images. Gemini, for its part, remained relentlessly upbeat in its self-assessments :)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zjSV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8f2828-59b6-4f0e-b41b-1eec69d01b70_2400x2162.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zjSV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8f2828-59b6-4f0e-b41b-1eec69d01b70_2400x2162.png 424w, https://substackcdn.com/image/fetch/$s_!zjSV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8f2828-59b6-4f0e-b41b-1eec69d01b70_2400x2162.png 848w, https://substackcdn.com/image/fetch/$s_!zjSV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8f2828-59b6-4f0e-b41b-1eec69d01b70_2400x2162.png 1272w, https://substackcdn.com/image/fetch/$s_!zjSV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8f2828-59b6-4f0e-b41b-1eec69d01b70_2400x2162.png 1456w"
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zjSV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8f2828-59b6-4f0e-b41b-1eec69d01b70_2400x2162.png" width="1456" height="1312" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a8f2828-59b6-4f0e-b41b-1eec69d01b70_2400x2162.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1312,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:801070,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/194008046?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8f2828-59b6-4f0e-b41b-1eec69d01b70_2400x2162.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zjSV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8f2828-59b6-4f0e-b41b-1eec69d01b70_2400x2162.png 424w, https://substackcdn.com/image/fetch/$s_!zjSV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8f2828-59b6-4f0e-b41b-1eec69d01b70_2400x2162.png 848w, https://substackcdn.com/image/fetch/$s_!zjSV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8f2828-59b6-4f0e-b41b-1eec69d01b70_2400x2162.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zjSV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a8f2828-59b6-4f0e-b41b-1eec69d01b70_2400x2162.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wTTo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74d9ad7-91ce-47e2-b2d1-bac6e053521a_2400x5151.png" data-component-name="Image2ToDOM"><div
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wTTo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74d9ad7-91ce-47e2-b2d1-bac6e053521a_2400x5151.png 424w, https://substackcdn.com/image/fetch/$s_!wTTo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74d9ad7-91ce-47e2-b2d1-bac6e053521a_2400x5151.png 848w, https://substackcdn.com/image/fetch/$s_!wTTo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74d9ad7-91ce-47e2-b2d1-bac6e053521a_2400x5151.png 1272w, https://substackcdn.com/image/fetch/$s_!wTTo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74d9ad7-91ce-47e2-b2d1-bac6e053521a_2400x5151.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wTTo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74d9ad7-91ce-47e2-b2d1-bac6e053521a_2400x5151.png" width="1456" height="3125" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d74d9ad7-91ce-47e2-b2d1-bac6e053521a_2400x5151.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3125,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4681355,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/194008046?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74d9ad7-91ce-47e2-b2d1-bac6e053521a_2400x5151.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wTTo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74d9ad7-91ce-47e2-b2d1-bac6e053521a_2400x5151.png 424w, https://substackcdn.com/image/fetch/$s_!wTTo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74d9ad7-91ce-47e2-b2d1-bac6e053521a_2400x5151.png 848w, https://substackcdn.com/image/fetch/$s_!wTTo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74d9ad7-91ce-47e2-b2d1-bac6e053521a_2400x5151.png 1272w, https://substackcdn.com/image/fetch/$s_!wTTo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74d9ad7-91ce-47e2-b2d1-bac6e053521a_2400x5151.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r0nD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62e0aa-9ffc-46a3-9af0-8f856bc52009_2400x943.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r0nD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62e0aa-9ffc-46a3-9af0-8f856bc52009_2400x943.png 424w,
https://substackcdn.com/image/fetch/$s_!r0nD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62e0aa-9ffc-46a3-9af0-8f856bc52009_2400x943.png 848w, https://substackcdn.com/image/fetch/$s_!r0nD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62e0aa-9ffc-46a3-9af0-8f856bc52009_2400x943.png 1272w, https://substackcdn.com/image/fetch/$s_!r0nD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62e0aa-9ffc-46a3-9af0-8f856bc52009_2400x943.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r0nD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62e0aa-9ffc-46a3-9af0-8f856bc52009_2400x943.png" width="1456" height="572" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c62e0aa-9ffc-46a3-9af0-8f856bc52009_2400x943.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:572,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:728235,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/194008046?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62e0aa-9ffc-46a3-9af0-8f856bc52009_2400x943.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!r0nD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62e0aa-9ffc-46a3-9af0-8f856bc52009_2400x943.png 424w, https://substackcdn.com/image/fetch/$s_!r0nD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62e0aa-9ffc-46a3-9af0-8f856bc52009_2400x943.png 848w, https://substackcdn.com/image/fetch/$s_!r0nD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62e0aa-9ffc-46a3-9af0-8f856bc52009_2400x943.png 1272w, https://substackcdn.com/image/fetch/$s_!r0nD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62e0aa-9ffc-46a3-9af0-8f856bc52009_2400x943.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>The elephant</strong> was where the models struggled most. All three could often describe what was wrong accurately and specifically, but they couldn&#8217;t convert those diagnoses into code changes that moved the image closer to photorealism. Opus was probably the weakest of the three, while Gemini&#8217;s iterations improved gradually even though its self-assessments remained far too optimistic.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3reO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef7c19e-daee-41e8-ba04-e055b66cfc69_2400x3422.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3reO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef7c19e-daee-41e8-ba04-e055b66cfc69_2400x3422.png 424w, https://substackcdn.com/image/fetch/$s_!3reO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef7c19e-daee-41e8-ba04-e055b66cfc69_2400x3422.png 848w, https://substackcdn.com/image/fetch/$s_!3reO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef7c19e-daee-41e8-ba04-e055b66cfc69_2400x3422.png 1272w,
https://substackcdn.com/image/fetch/$s_!3reO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef7c19e-daee-41e8-ba04-e055b66cfc69_2400x3422.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3reO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef7c19e-daee-41e8-ba04-e055b66cfc69_2400x3422.png" width="1456" height="2076" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ef7c19e-daee-41e8-ba04-e055b66cfc69_2400x3422.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2076,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4123785,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/194008046?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef7c19e-daee-41e8-ba04-e055b66cfc69_2400x3422.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3reO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef7c19e-daee-41e8-ba04-e055b66cfc69_2400x3422.png 424w, https://substackcdn.com/image/fetch/$s_!3reO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef7c19e-daee-41e8-ba04-e055b66cfc69_2400x3422.png 848w, 
https://substackcdn.com/image/fetch/$s_!3reO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef7c19e-daee-41e8-ba04-e055b66cfc69_2400x3422.png 1272w, https://substackcdn.com/image/fetch/$s_!3reO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef7c19e-daee-41e8-ba04-e055b66cfc69_2400x3422.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank"
href="https://substackcdn.com/image/fetch/$s_!Bj32!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456d394-bd4e-4a10-8b56-f4117d39dd16_2400x3202.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bj32!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456d394-bd4e-4a10-8b56-f4117d39dd16_2400x3202.png 424w, https://substackcdn.com/image/fetch/$s_!Bj32!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456d394-bd4e-4a10-8b56-f4117d39dd16_2400x3202.png 848w, https://substackcdn.com/image/fetch/$s_!Bj32!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456d394-bd4e-4a10-8b56-f4117d39dd16_2400x3202.png 1272w, https://substackcdn.com/image/fetch/$s_!Bj32!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456d394-bd4e-4a10-8b56-f4117d39dd16_2400x3202.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bj32!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456d394-bd4e-4a10-8b56-f4117d39dd16_2400x3202.png" width="1456" height="1943" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3456d394-bd4e-4a10-8b56-f4117d39dd16_2400x3202.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1943,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3187477,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/194008046?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456d394-bd4e-4a10-8b56-f4117d39dd16_2400x3202.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bj32!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456d394-bd4e-4a10-8b56-f4117d39dd16_2400x3202.png 424w, https://substackcdn.com/image/fetch/$s_!Bj32!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456d394-bd4e-4a10-8b56-f4117d39dd16_2400x3202.png 848w, https://substackcdn.com/image/fetch/$s_!Bj32!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456d394-bd4e-4a10-8b56-f4117d39dd16_2400x3202.png 1272w, https://substackcdn.com/image/fetch/$s_!Bj32!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456d394-bd4e-4a10-8b56-f4117d39dd16_2400x3202.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NaO_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d98b75-bdc8-4e92-b81f-9fd6c5ce5a7a_2400x3422.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NaO_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d98b75-bdc8-4e92-b81f-9fd6c5ce5a7a_2400x3422.png 424w,
https://substackcdn.com/image/fetch/$s_!NaO_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d98b75-bdc8-4e92-b81f-9fd6c5ce5a7a_2400x3422.png 848w, https://substackcdn.com/image/fetch/$s_!NaO_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d98b75-bdc8-4e92-b81f-9fd6c5ce5a7a_2400x3422.png 1272w, https://substackcdn.com/image/fetch/$s_!NaO_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d98b75-bdc8-4e92-b81f-9fd6c5ce5a7a_2400x3422.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NaO_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d98b75-bdc8-4e92-b81f-9fd6c5ce5a7a_2400x3422.png" width="1456" height="2076" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96d98b75-bdc8-4e92-b81f-9fd6c5ce5a7a_2400x3422.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2076,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1675993,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/194008046?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d98b75-bdc8-4e92-b81f-9fd6c5ce5a7a_2400x3422.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!NaO_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d98b75-bdc8-4e92-b81f-9fd6c5ce5a7a_2400x3422.png 424w, https://substackcdn.com/image/fetch/$s_!NaO_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d98b75-bdc8-4e92-b81f-9fd6c5ce5a7a_2400x3422.png 848w, https://substackcdn.com/image/fetch/$s_!NaO_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d98b75-bdc8-4e92-b81f-9fd6c5ce5a7a_2400x3422.png 1272w, https://substackcdn.com/image/fetch/$s_!NaO_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d98b75-bdc8-4e92-b81f-9fd6c5ce5a7a_2400x3422.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[The Case for AI-Powered User Interfaces]]></title><description><![CDATA[How LLMs turn UI from a build-time artifact into a runtime capability.]]></description><link>https://www.modelanalysis.ai/p/ai-powered-user-interfaces-are-the</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/ai-powered-user-interfaces-are-the</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Sat, 14 Feb 2026 13:56:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!OjQH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5856128a-8f1a-4663-b47f-1c4099d6267b_1024x559.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>&#171;Back in April 2025, in my <a href="https://www.modelanalysis.ai/p/the-end-of-the-app-store">End of the App Store?</a> essay, I speculated about a future where a single AI-powered &#8220;super app&#8221; would replace all the apps. Models and agents have improved considerably at tool use and code generation since then, turning that &#8220;super app&#8221; concept from speculation into a practical capability: user interfaces generated on the fly to address a user&#8217;s need.
This article is my February 2026 update, backed by a demo from a weekend project.&#187;</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OjQH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5856128a-8f1a-4663-b47f-1c4099d6267b_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OjQH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5856128a-8f1a-4663-b47f-1c4099d6267b_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!OjQH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5856128a-8f1a-4663-b47f-1c4099d6267b_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!OjQH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5856128a-8f1a-4663-b47f-1c4099d6267b_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!OjQH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5856128a-8f1a-4663-b47f-1c4099d6267b_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OjQH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5856128a-8f1a-4663-b47f-1c4099d6267b_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5856128a-8f1a-4663-b47f-1c4099d6267b_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:754662,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/187949816?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5856128a-8f1a-4663-b47f-1c4099d6267b_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OjQH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5856128a-8f1a-4663-b47f-1c4099d6267b_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!OjQH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5856128a-8f1a-4663-b47f-1c4099d6267b_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!OjQH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5856128a-8f1a-4663-b47f-1c4099d6267b_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!OjQH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5856128a-8f1a-4663-b47f-1c4099d6267b_1024x559.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>The key takeaways up front (sort of like a very long TL;DR)</h3><p>The static app model has worked for decades, but it forces every user to learn how to navigate many apps, each with its own fixed set of predetermined screens. The apps in turn need armies of product managers, business analysts, architects, developers, testers, and SREs to build and maintain them.</p><p>In this article, I propose that AI-powered Dynamic Interface Generation (DIG) exchanges rigid consistency, long build times, and high build costs for high flexibility, lower predictability, just-in-time builds, and higher runtime costs. The LLM generates exactly the interface the moment requires. No more, no less.</p><p>The range of tasks where a generated interface is good enough is growing as models improve. 
Functions like task tracking, workflows, email triage, calendar management, and integration of different UI components such as maps, tables, charts, and data dashboards are all within reach today. Stateful, high-frequency workflows and complex functionality will likely need a purpose-built UI for a while. And the backend services that implement core business logic will remain (of course, AI coding agents will be used to generate them as well!).</p><p>Given these rapidly increasing capabilities, this is the future I envision for phones, laptops/desktops, and enterprises:</p><ul><li><p>On phones, you stop hopping between apps and instead ask a smart assistant for outcomes, and it dynamically assembles UI and flow to achieve that outcome for you.</p></li><li><p>On laptops/desktops, there&#8217;s more room for specialized tools but the main interface for many users will likely be a single adaptive workspace that pulls the right panes and actions across email, docs, tickets, dashboards, and more.</p></li><li><p>In enterprises, multiple front-end apps converge into one governed interface sitting above many systems and services, making most internal CRUD and workflows generated, and pushing SaaS value downward from vendor screens toward APIs offering high-value capabilities.</p></li></ul><p>In the short to medium term, browsers are a natural choice for implementing DIG for enterprises because of their secure-by-default runtimes, the JS/HTML/CSS combination that lends itself to dynamic interface generation, and centralized deployment capability. On the other hand, mobile operating systems will likely develop OS-native DIG capabilities for tighter device integration and lower latency.</p><p>All leading LLMs already exhibit some dynamic interface generation capabilities (when they generate deep research artefacts, for example). Google&#8217;s experimental browser, Disco, also takes a stab at this with its gentabs functionality. 
But it&#8217;s not quite what I thought would illustrate the vision I had in mind. So, I built a prototype to show how the dynamic interface generation loop might work. You can watch the video here: </p><div id="youtube2-qng2o79W5ck" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;qng2o79W5ck&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/qng2o79W5ck?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>For those of you who are interested in reading through, here we go&#8230;</p><h3>1. Introduction</h3><p>We are so used to doing things a certain way that we assume it is the only sensible way to get things done. But every so often, a real shift in technology lets us question our most basic assumptions. The idea that we need a separate app for each kind of use, with a bunch of prebuilt screens and rigid flows, is something we&#8217;ve all come to expect. But it doesn&#8217;t have to be that way!</p><p>Many of us literally have hundreds of apps, each created for its own special use. Wouldn&#8217;t it be nice if your &#8220;smart&#8221; phone was actually smart, and gave you a single place to start, whether that is a screen or voice, and then quietly invoked whatever functions and services were needed to get stuff done?</p><p>The same problem shows up even more clearly in enterprises. Every corporate worker needs to learn dozens of applications to do their job. Sure, there are massive systems like ERP software that try to pull it all together. But customizing these systems is a gargantuan task, and the result is often a complex jumble of screens that reflects what was feasible to build, and users still have to learn how to go through them to get their job done. 
Wouldn&#8217;t it be better to have a single screen that adapts itself to the job you are trying to do? It might still interact with dozens of systems using all kinds of APIs behind the scenes, but you would not have to care.</p><p>That&#8217;s what I think is possible with LLM-powered Dynamic Interface Generation (DIG). In the rest of this essay, I&#8217;ll talk about the pain points faced by today&#8217;s app-centric model and static user interfaces, make the case for DIG, share a video of an app that I created to demonstrate the art of the possible, and conclude with current limitations and why I believe many of them are likely to be overcome within the next few years.</p><h3>2. The Problem with Thousands of Apps and Static User Interfaces</h3><p>Almost every UI we use today was designed by a team that guessed what screens we&#8217;d need. They built navigation hierarchies, screen layouts, and interaction patterns long before we ever opened the app. When our needs don&#8217;t fit their assumptions, we submit a feature request (if we can) and hope it gets fixed in the next version.</p><p>These feature requests go into backlogs. Product teams prioritize based on aggregate demand. The interface we actually want, a specific combination of data, layout, and actions, may never ship because it&#8217;s too niche to justify engineering time.</p><p>The result is a world of rigid, prebuilt interfaces that users learn to navigate. Instead of software adapting to them, they adapt to software. </p><p>Thousands of special-purpose apps also create concrete operational drag. Users lose time context-switching as they move between tools. Work gets stuck in the seams between apps (copy-paste!) and brittle handoffs that rely on people remembering &#8220;the next step.&#8221;</p><h3>3. What Dynamic Interface Generation Changes</h3><p>DIG replaces the static design-build-deploy model with a fundamentally different approach. 
The user describes what they want to accomplish, and an LLM generates the interface on demand, connected to real data, capable of real actions to support the user&#8217;s intent.</p><p>It also changes how &#8220;UI requirements&#8221; work, pushing them into runtime via the intent expressed in the user&#8217;s prompt. The &#8220;requirement&#8221; becomes &#8220;help me accomplish X,&#8221; and the system is free to render the best UI it can for the moment, given the tools and the data available. That is a different way of building software, and it shifts effort away from endless UI iterations.</p><p>There will be fewer and fewer apps with prebuilt screens. No navigation menus and screens designed by committee. The LLM is provided with a list of tools and services, and it decides which tools to invoke, what data to fetch, what interactions to offer, and how to render the user interface. The things steering it are the user&#8217;s intent, the state provided in context, and what it knows of the user&#8217;s preferences and history.</p><h3>4. The Opportunities</h3><h4>Smartphones</h4><p>We will see a gradual transition toward phones where the primary interface is no longer a grid of apps. The OS comes with one &#8220;smart&#8221; app, and &#8220;apps&#8221; become API based service providers. For example, in the case of the iPhone, Siri will not just be a voice assistant you occasionally invoke. We will move toward a future where Siri is the main interface you use to get things done. I know, I know. Siri gets a lot of flak, and deservedly so. But I&#8217;m talking about a future where Siri is powered by an AI that actually works. </p><p>There might remain some special cases where precise control and efficiency of a static UI would be needed, but for the vast majority of cases an LLM generated dynamic UI is likely to be the better choice (especially with LLMs specially trained with UI generation in mind). 
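The tool-driven generation loop described in section 3 can be sketched in a few lines of Python. This is a hypothetical illustration, not a real API: `call_llm`, the tool names, and the prompt shape are all invented for the example.

```python
import json

# Hypothetical tool catalog the LLM is allowed to invoke; names are invented.
TOOLS = [
    {"name": "fetch_calendar", "description": "Return the user's events for a date range"},
    {"name": "send_email", "description": "Send an email on the user's behalf"},
]

def render_interface(user_intent, context, call_llm):
    """One DIG turn: the model sees the intent, the available tools, and
    session state, and returns tool calls plus markup for this moment's UI."""
    prompt = json.dumps({
        "intent": user_intent,   # e.g. "show me my schedule"
        "tools": TOOLS,          # what the model may invoke
        "context": context,      # state, user preferences, history
        "output": "tool_calls plus HTML/JS for the generated interface",
    })
    return call_llm(prompt)      # call_llm stands in for any LLM API

# A stub model keeps the sketch runnable end to end:
ui = render_interface(
    "show me my schedule",
    {"timezone": "UTC"},
    lambda prompt: {"tool_calls": ["fetch_calendar"], "html": "<div>...</div>"},
)
```

The point of the sketch is the division of labor: the caller supplies only intent and state, and the model decides which tools to invoke and what interface to render.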
Consumer devices will likely push more of the DIG experience into OS-native implementations for tighter integration with device capabilities and better latency.</p><h4>Laptops and Desktops</h4><p>Laptops and desktops are where DIG can get interesting because the surface area for work is larger, and some specialist interfaces like the Unix terminal are likely to remain in use for a while. But most casual users will likely work inside a single adaptive workspace that generates the right mix of panes: email, calendar, documents, spreadsheets, and chat.</p><h4>The Enterprise</h4><p>Similar to the mobile and desktop experience, there will be a convergence toward a single interface that sits on top of a complex web of AI agents and traditional deterministic services, with the UI layer orchestrating them based on the user&#8217;s intent.</p><p>A huge amount of enterprise software consists of CRUD screens, workflow-enabling screens (approvals, routing, etc.), reconciliation views, and status dashboards, all of which are ripe for replacement by a single dynamically generated interface.</p><p>Modern web browsers have intrinsic advantages that make them a strong starting point for DIG as a universal integration shell in enterprises. They provide mature isolation and security primitives, and the HTML/CSS/JS stack is a powerful, widely supported substrate for dynamic rendering.</p><p>SaaS might still matter because the cost of building and operating a complex business domain can be shared among many customers, but the vendors&#8217; primary interface becomes less defensible. It&#8217;s likely to become an API and capability layer, while the UI becomes generated and owned by the enterprise rather than by each vendor. </p><p>This also reframes build-vs-buy more broadly. 
Enterprises will still buy capabilities but end up with one adaptive interface that unifies workflows across many systems, integrates with corporate data, and enforces consistent governance and permissioning.</p><h3>5. So, what changes?</h3><h4>Every user gets a tailored experience</h4><p>Two users asking &#8220;show me my schedule&#8221; might get completely different layouts depending on how many events they have, what data is available, and how they phrased the request. There is no single &#8220;calendar view&#8221; that has to work for everyone. The LLM tailors every render to the specific context.</p><h4>The long tail of internal tools disappears</h4><p>Enterprises maintain hundreds or thousands of internal dashboards, admin panels, and CRUD interfaces, each with its own deployment pipeline, bug backlog, and maintenance burden. Most of these are simple combinations of data fetching and form submission. A DIG system replaces them with a single deployment that generates each interface on demand. The integration surface shrinks to a set of tool definitions and API credentials.</p><h4>The product iteration cycle changes substantially</h4><p>The traditional cycle (design a screen, ship it, measure engagement, redesign) takes weeks or months. With DIG, this cycle is broken, and traditional methods like A/B testing become a lot less relevant as every interaction is a unique rendering. We&#8217;ll need new ways to evaluate the effectiveness of the generated screens and to feed what we&#8217;ve learned back into the next version of the agent or the LLM.</p><h3>6. What Doesn&#8217;t Work Yet</h3><p>Many of the limitations described below can (and will) be addressed as models are trained with more examples of DIG (how to build better interfaces dynamically to meet user intent while using context, data, and available tools and services effectively).</p><p><strong>Consistency.</strong> The same prompt produces different layouts on different runs. 
For workflows where muscle memory matters, this could be a real problem. The system needs layout pinning, preference learning, and perhaps &#8220;UI contracts&#8221; for certain workflows.</p><p><strong>Latency.</strong> Generating a UI render takes many seconds compared to the milliseconds taken by traditional prebuilt screens. Caching common patterns and using partial renders would help, but we will likely rely on special-purpose SLMs or on LLM performance continuing to improve as it has over the past few years.</p><p><strong>Complexity ceiling.</strong> Leading LLMs can generate most of the UI functionality dynamically. But there is a threshold beyond which they need to rely on prebuilt software (e.g., a spreadsheet component). LLMs can, however, be trained to embed and use prebuilt components where needed. </p><p><strong>Cost.</strong> Every interaction is an LLM call. At current token pricing, a power user using a DIG-based app all day will accumulate significant API costs. This will improve as inference gets cheaper, but it&#8217;s a real constraint today.</p><p><strong>Security.</strong> This isn&#8217;t a new issue with DIG per se, but I wanted to mention it here before someone calls me out on this :) Traditional security best practices will still matter. Backend services will still need to authenticate requestors, check that they have the right permissions, validate all input coming from the UI, and enforce all business rules.</p><h3>7. Conclusion</h3><p>The static app model is optimized for a world where UI is expensive to build and where the safest way to ship software is to freeze screens into predictable, testable flows. But this world is changing.</p><p>With DIG, we give up some predictability in exchange for a UI that can match intent, context, and workflow in real time. But the &#8220;good enough&#8221; frontier is expanding fast. 
As models and the agentic solutions around them improve, the number of tasks that can be handled by generated interfaces will keep expanding. If that happens, single-purpose apps will reduce in number, and more of what we think of as &#8220;apps&#8221; will collapse into services and tools behind generated interfaces.</p>]]></content:encoded></item><item><title><![CDATA[Model Anatomy]]></title><description><![CDATA[A visual way to understand and compare LLM architectures]]></description><link>https://www.modelanalysis.ai/p/model-anatomy</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/model-anatomy</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Sat, 17 Jan 2026 18:51:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0phx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3d8435-93f4-426c-b837-d26ef0091ee5_1470x1134.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I built <a href="https://model-anatomy-ram-komarraju.netlify.app/">Model Anatomy</a> to help visualize and compare the internal structures of large language models. Here&#8217;s what it does.</p><h2><strong>Inspect a Single Model or Compare Models Side-by-Side</strong></h2><p>Select any two models from the dropdown menus and their architectures appear side by side. You can filter by model family (Llama, Gemma, Mistral, etc.) to narrow down your selection. You can see in the diagram below that Gemma 3 uses a hybrid normalization (combination of pre and post-norms) and OLMo 3 uses post-norm, and both of these are different from the more common pre-norm approach. 
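The normalization variants mentioned here differ only in where the norm sits relative to the residual connection. A minimal sketch, with `norm`, `attn` as stand-ins for RMSNorm and the attention (or feed-forward) sublayer, and the hybrid variant reflecting one common reading of the Gemma-style design:

```python
# Sketch of normalization placement in a transformer block. `norm` and `attn`
# are placeholder callables; only the *ordering* around the residual matters.

def pre_norm_block(x, norm, attn):
    # GPT-2-style pre-norm: normalize the input before the sublayer,
    # then add the residual.
    return x + attn(norm(x))

def post_norm_block(x, norm, attn):
    # Original-Transformer / OLMo-3-style post-norm: apply the sublayer
    # first, normalize after the residual addition.
    return norm(x + attn(x))

def hybrid_norm_block(x, norm_in, norm_out, attn):
    # Gemma-3-style hybrid: normalize both the sublayer input and its
    # output, before adding the residual.
    return x + norm_out(attn(norm_in(x)))
```

With real tensors, `x` would be the hidden state and each `norm` a learned RMSNorm; the scalar version above is just to make the ordering explicit.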
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0phx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3d8435-93f4-426c-b837-d26ef0091ee5_1470x1134.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0phx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3d8435-93f4-426c-b837-d26ef0091ee5_1470x1134.png 424w, https://substackcdn.com/image/fetch/$s_!0phx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3d8435-93f4-426c-b837-d26ef0091ee5_1470x1134.png 848w, https://substackcdn.com/image/fetch/$s_!0phx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3d8435-93f4-426c-b837-d26ef0091ee5_1470x1134.png 1272w, https://substackcdn.com/image/fetch/$s_!0phx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3d8435-93f4-426c-b837-d26ef0091ee5_1470x1134.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0phx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3d8435-93f4-426c-b837-d26ef0091ee5_1470x1134.png" width="1456" height="1123" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a3d8435-93f4-426c-b837-d26ef0091ee5_1470x1134.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1123,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:276197,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/184874241?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3d8435-93f4-426c-b837-d26ef0091ee5_1470x1134.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0phx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3d8435-93f4-426c-b837-d26ef0091ee5_1470x1134.png 424w, https://substackcdn.com/image/fetch/$s_!0phx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3d8435-93f4-426c-b837-d26ef0091ee5_1470x1134.png 848w, https://substackcdn.com/image/fetch/$s_!0phx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3d8435-93f4-426c-b837-d26ef0091ee5_1470x1134.png 1272w, https://substackcdn.com/image/fetch/$s_!0phx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3d8435-93f4-426c-b837-d26ef0091ee5_1470x1134.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The LLM architecture doesn&#8217;t seem to have changed that much since GPT-2. But a careful comparison reveals interesting architectural differences. Most modern architectures use pre-norm instead of the original post-norm. Position embeddings gave way to RoPE. The original multi-head attention mechanism saw a few innovations like GQA. Mixtral stands out with its Mixture-of-Experts (MoE) layers visible in the diagram.</p><h2><strong>Understand Parameter Counts</strong></h2><p>Toggle &#8220;Show parameter counts&#8221; to see where parameters actually live in each model. Every number is interactive. 
Hover over any parameter count to see exactly how it&#8217;s calculated.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HDp6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e35dab-1db7-4baa-93c2-cdf2a76b64e2_1498x1230.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HDp6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e35dab-1db7-4baa-93c2-cdf2a76b64e2_1498x1230.png 424w, https://substackcdn.com/image/fetch/$s_!HDp6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e35dab-1db7-4baa-93c2-cdf2a76b64e2_1498x1230.png 848w, https://substackcdn.com/image/fetch/$s_!HDp6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e35dab-1db7-4baa-93c2-cdf2a76b64e2_1498x1230.png 1272w, https://substackcdn.com/image/fetch/$s_!HDp6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e35dab-1db7-4baa-93c2-cdf2a76b64e2_1498x1230.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HDp6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e35dab-1db7-4baa-93c2-cdf2a76b64e2_1498x1230.png" width="1456" height="1196" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f6e35dab-1db7-4baa-93c2-cdf2a76b64e2_1498x1230.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1196,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:383915,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/184874241?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e35dab-1db7-4baa-93c2-cdf2a76b64e2_1498x1230.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HDp6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e35dab-1db7-4baa-93c2-cdf2a76b64e2_1498x1230.png 424w, https://substackcdn.com/image/fetch/$s_!HDp6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e35dab-1db7-4baa-93c2-cdf2a76b64e2_1498x1230.png 848w, https://substackcdn.com/image/fetch/$s_!HDp6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e35dab-1db7-4baa-93c2-cdf2a76b64e2_1498x1230.png 1272w, https://substackcdn.com/image/fetch/$s_!HDp6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e35dab-1db7-4baa-93c2-cdf2a76b64e2_1498x1230.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For example, hover over the attention layer&#8217;s parameter count and you&#8217;ll see the formula break down: Q/K/V projection weights, output projection, the dimensions multiplied out. 
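As a concrete illustration of that breakdown, here is how the parameter count of a grouped-query attention layer works out. The dimensions in the example are typical Llama-3-8B-like values chosen for illustration, not numbers taken from the tool:

```python
def attention_params(d_model, n_heads, n_kv_heads, head_dim):
    """Weights of one (grouped-query) attention layer, ignoring biases:
    the Q projection, the shared K/V projections, and the output projection."""
    q_proj = d_model * n_heads * head_dim           # W_Q: d_model -> n_heads*head_dim
    kv_proj = 2 * d_model * n_kv_heads * head_dim   # W_K and W_V use fewer KV heads in GQA
    o_proj = n_heads * head_dim * d_model           # W_O: project back to d_model
    return q_proj + kv_proj + o_proj

# Illustrative dimensions: d_model=4096, 32 query heads, 8 KV heads, head_dim=128
print(attention_params(4096, 32, 8, 128))  # 41943040, i.e. ~42M per layer
```

Setting `n_kv_heads` equal to `n_heads` recovers the plain multi-head attention count, which makes the GQA savings easy to read off.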
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ilSm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec2342f-dd99-4333-b70f-d0393d567f59_1042x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ilSm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec2342f-dd99-4333-b70f-d0393d567f59_1042x700.png 424w, https://substackcdn.com/image/fetch/$s_!ilSm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec2342f-dd99-4333-b70f-d0393d567f59_1042x700.png 848w, https://substackcdn.com/image/fetch/$s_!ilSm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec2342f-dd99-4333-b70f-d0393d567f59_1042x700.png 1272w, https://substackcdn.com/image/fetch/$s_!ilSm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec2342f-dd99-4333-b70f-d0393d567f59_1042x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ilSm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec2342f-dd99-4333-b70f-d0393d567f59_1042x700.png" width="1042" height="700" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ec2342f-dd99-4333-b70f-d0393d567f59_1042x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:700,&quot;width&quot;:1042,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:172168,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/184874241?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec2342f-dd99-4333-b70f-d0393d567f59_1042x700.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ilSm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec2342f-dd99-4333-b70f-d0393d567f59_1042x700.png 424w, https://substackcdn.com/image/fetch/$s_!ilSm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec2342f-dd99-4333-b70f-d0393d567f59_1042x700.png 848w, https://substackcdn.com/image/fetch/$s_!ilSm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec2342f-dd99-4333-b70f-d0393d567f59_1042x700.png 1272w, https://substackcdn.com/image/fetch/$s_!ilSm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec2342f-dd99-4333-b70f-d0393d567f59_1042x700.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This reveals some interesting patterns:</p><ul><li><p><strong>Embedding layers are expensive.</strong> In smaller models, the embedding table can account for a significant fraction of total parameters. A 50k vocabulary &#215; 4096 dimensions &#8776; 200M parameters before you even get to the transformer blocks.</p></li><li><p><strong>FFN layers dominate.</strong> In most architectures, the feed-forward network within each transformer block contains roughly 2/3 of that block&#8217;s parameters. The attention mechanism everyone talks about actually holds the minority of the parameters.</p></li><li><p><strong>Scaling isn&#8217;t uniform.</strong> Models don&#8217;t just multiply everything by the same factor when going from 8B to 70B.
Hidden dimensions, number of layers, and attention heads all scale at different rates.</p></li></ul><h2><strong>Inspect Individual Modules</strong></h2><p>Click any module in the diagram to see its details: the mathematical formula, the tensor shapes flowing through it, and what each component does. The side panel shows a graphical breakdown, and if you enable &#8220;Show implementation code,&#8221; the bottom panel shows PyTorch-style pseudocode.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ocyv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d26fd58-3a58-4f22-b86d-b90ac362d6eb_778x1352.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ocyv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d26fd58-3a58-4f22-b86d-b90ac362d6eb_778x1352.png 424w, https://substackcdn.com/image/fetch/$s_!ocyv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d26fd58-3a58-4f22-b86d-b90ac362d6eb_778x1352.png 848w, https://substackcdn.com/image/fetch/$s_!ocyv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d26fd58-3a58-4f22-b86d-b90ac362d6eb_778x1352.png 1272w, https://substackcdn.com/image/fetch/$s_!ocyv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d26fd58-3a58-4f22-b86d-b90ac362d6eb_778x1352.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ocyv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d26fd58-3a58-4f22-b86d-b90ac362d6eb_778x1352.png" width="778" height="1352" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d26fd58-3a58-4f22-b86d-b90ac362d6eb_778x1352.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1352,&quot;width&quot;:778,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113688,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/184874241?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d26fd58-3a58-4f22-b86d-b90ac362d6eb_778x1352.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ocyv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d26fd58-3a58-4f22-b86d-b90ac362d6eb_778x1352.png 424w, https://substackcdn.com/image/fetch/$s_!ocyv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d26fd58-3a58-4f22-b86d-b90ac362d6eb_778x1352.png 848w, https://substackcdn.com/image/fetch/$s_!ocyv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d26fd58-3a58-4f22-b86d-b90ac362d6eb_778x1352.png 1272w, https://substackcdn.com/image/fetch/$s_!ocyv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d26fd58-3a58-4f22-b86d-b90ac362d6eb_778x1352.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Implementation details with pseudocode:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ky9q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb160339e-7c82-42cc-8f7d-db54ad8b1190_2472x696.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp"
srcset="https://substackcdn.com/image/fetch/$s_!Ky9q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb160339e-7c82-42cc-8f7d-db54ad8b1190_2472x696.png 424w, https://substackcdn.com/image/fetch/$s_!Ky9q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb160339e-7c82-42cc-8f7d-db54ad8b1190_2472x696.png 848w, https://substackcdn.com/image/fetch/$s_!Ky9q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb160339e-7c82-42cc-8f7d-db54ad8b1190_2472x696.png 1272w, https://substackcdn.com/image/fetch/$s_!Ky9q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb160339e-7c82-42cc-8f7d-db54ad8b1190_2472x696.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ky9q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb160339e-7c82-42cc-8f7d-db54ad8b1190_2472x696.png" width="1456" height="410" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b160339e-7c82-42cc-8f7d-db54ad8b1190_2472x696.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:410,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:195502,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/184874241?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb160339e-7c82-42cc-8f7d-db54ad8b1190_2472x696.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ky9q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb160339e-7c82-42cc-8f7d-db54ad8b1190_2472x696.png 424w, https://substackcdn.com/image/fetch/$s_!Ky9q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb160339e-7c82-42cc-8f7d-db54ad8b1190_2472x696.png 848w, https://substackcdn.com/image/fetch/$s_!Ky9q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb160339e-7c82-42cc-8f7d-db54ad8b1190_2472x696.png 1272w, https://substackcdn.com/image/fetch/$s_!Ky9q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb160339e-7c82-42cc-8f7d-db54ad8b1190_2472x696.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>What I Learned Building This</strong></h2><p>I first sketched this idea over a year ago but never built it. The implementation seemed tedious, involving parsing model configs, building SVG diagrams, and handling all the edge cases between different architectures.</p><p>Last week I sat down with a coding agent and had a working version in an evening. The agent handled the boilerplate while I directed the design decisions. Things that would have taken me hours of documentation-reading happened in seconds. The productivity multiplier was real.</p><p>This tool is far from perfect: I included only a subset of model variants, and it is likely missing edge cases in the visualization and features that would be useful. But hopefully you&#8217;ll still find it useful, whether you&#8217;re new to the field or just interested in comparing two models.</p><h2><strong>Learn More</strong></h2><p>If you want to go deeper into model architectures, Sebastian Raschka has an excellent breakdown of modern LLM designs at <em><a href="https://substack.com/home/post/p-168650848">Understanding Large Language Models</a></em>. His blog covers many of the insights and implementation details that this tool visualizes. Shout-out to Jay Alammar, whose <em><a href="https://jalammar.github.io/illustrated-gpt2/">Illustrated GPT-2</a></em> explained GPT-2 comprehensively and intuitively when I started out years ago.</p><h2><strong>Try It / Contribute</strong></h2><p>The tool works best on desktop but should be usable on mobile browsers (the layout stacks vertically on phones).</p><p>What models are missing? What features would be useful?
What&#8217;s confusing, incorrect, or broken? Leave a comment or reach out. I&#8217;d like to make this more useful for anyone trying to understand what&#8217;s actually inside these models. I will also credit you for your findings and contributions :)</p>]]></content:encoded></item><item><title><![CDATA[The Great AI Code Transition]]></title><description><![CDATA[A Vision of What the Future of AI-Written Software Could Look Like]]></description><link>https://www.modelanalysis.ai/p/the-great-code-transition</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/the-great-code-transition</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Thu, 08 Jan 2026 07:38:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Io51!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997fb7d7-fe6b-4d14-a9f8-00f5cb56670f_1024x559.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>&#171;Disclaimer and Notes: as is usually the case, there are many ways to present a narrative like this. I could&#8217;ve sliced the software into a completely different set of categories, but I chose the set that I thought would best illustrate the points I wanted to make. It should then be obvious that the percentages presented in the chart for various categories of code are rough guesses of their relative sizes. A more rigorous, data-based approach is needed to better ground this chart. It is often said that prediction, especially about the future, is hard. The biggest of my speculations is the extent to which AI-generated disposable code would take over our interactions on edge devices.
Finally,</em> <em>each of the points I wrote below can be expanded into an essay, and I might do that if there&#8217;s enough interest and response to this article.&#187;</em></p><p>It is very clear as we enter 2026 that LLM-based AI coding agents have become powerful enough that more and more code will be written with them. In fact, it is inevitable that in the not-so-distant future, almost all code will transition from being hand-crafted by humans to being written by AI coding agents. This article looks at what such a transition could look like. </p><p>Let&#8217;s start off with a chart that compares, at a high level, the present state of affairs with this future. The left-hand side of the chart presents the current state, where almost all code is written by humans (yes, some leading tech companies and startups may already be switching aggressively to coding agents, but the vast majority of organizations still rely on humans to write code). As we shift right along the X-axis, you can see more and more of the code being written by AI.
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Io51!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997fb7d7-fe6b-4d14-a9f8-00f5cb56670f_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Io51!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997fb7d7-fe6b-4d14-a9f8-00f5cb56670f_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!Io51!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997fb7d7-fe6b-4d14-a9f8-00f5cb56670f_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!Io51!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997fb7d7-fe6b-4d14-a9f8-00f5cb56670f_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!Io51!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997fb7d7-fe6b-4d14-a9f8-00f5cb56670f_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Io51!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997fb7d7-fe6b-4d14-a9f8-00f5cb56670f_1024x559.png" width="728" height="397.4140625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/997fb7d7-fe6b-4d14-a9f8-00f5cb56670f_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:719844,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/183879093?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997fb7d7-fe6b-4d14-a9f8-00f5cb56670f_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Io51!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997fb7d7-fe6b-4d14-a9f8-00f5cb56670f_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!Io51!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997fb7d7-fe6b-4d14-a9f8-00f5cb56670f_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!Io51!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997fb7d7-fe6b-4d14-a9f8-00f5cb56670f_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!Io51!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997fb7d7-fe6b-4d14-a9f8-00f5cb56670f_1024x559.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Here are some key observations about each of the four layers that will comprise the AI-written software:</p><ol><li><p>How fast <strong>disposable software</strong> emerges on edge devices depends not only on model capability but also on privacy, latency, cost, power constraints, and offline reliability. Many consumer experiences may route through cloud inference to access stronger models, while privacy needs and regulatory requirements will push more on-device inference even if those small models are weaker.
This will likely create a practical split in user experience, where the best functionality will be available to users who can afford to pay for better connectivity, newer hardware, and premium inference.</p></li><li><p>Platforms that enable sophisticated <strong>&#8220;vibe-coded&#8221;</strong> small-business apps and sites will expand quickly. Users will focus on the business value proposition and user experience, while relying on the platform for scaffolding that provides security, payments, observability, data storage, and compliance. This reduces the need for deep technical knowledge and extensive testing at small scale, while increasing dependence on platform guardrails and concentrating risk in platform providers.</p></li><li><p><strong>Large enterprises</strong> outside big tech will take longer to reach a fully AI-written software stack, maybe a decade or more, because the bottleneck is not capability but liability, controls, and regulations. They will need new SDLC and operating models that make AI-assisted changes auditable, reproducible, testable, and attributable, with clear responsibility when something breaks. The transition will be uneven: experiments and greenfield products will move quickly, while core systems of record and regulated workflows will lag. Organizationally, some middle layers of coordination will shrink or change shape, and effort will shift toward governance, verification, security, reliability engineering, and incident response.</p></li><li><p>For <strong>systems and platforms</strong>, AI will handle much of the code churn, but expert steering remains essential because correctness, performance, and security constraints are unforgiving.
Specialists will focus less on manual implementation and more on writing precise specifications, choosing or designing the right algorithms, and rigorously verifying the results.</p></li></ol><p>The next four points will focus on what this means from <strong>socio-economic and legacy-system impact</strong> perspectives:</p><ol><li><p>Many software builders who derive pride and identity from craftsmanship will find the transition emotionally and professionally difficult as the work shifts from writing code to defining intent, setting constraints, verifying behavior, and owning outcomes. The people most likely to prosper are those with strong product judgment, clear vision, enough technical literacy to reason about tradeoffs, and the ability to communicate precise techno-functional requirements to coding agents. Deep specialists will still matter, especially in systems with demanding performance, security, and safety requirements, either hand-crafting core components or steering agents with high precision. Status and fulfillment will increasingly come from accountability, reliability, and impact rather than elegance of implementation.</p></li><li><p>Entry-level coding jobs and many mid-skill implementation roles may shrink as AI can easily write the simple code that would historically have been written by junior coders. It will be interesting to see how junior coders learn to become good engineers in the age of AI coding assistants. At the same time, more people will be able to build sophisticated functionality without formal training, broadening participation.</p></li><li><p>As to how AI functionality itself will be integrated into software, I&#8217;m expecting that two competing product visions will coexist for a while. One is special-purpose applications with AI embedded, optimized for functionality, precision, and performance.
The other is a general-purpose AI layer that generates tailored interfaces and orchestrates specialized modules and systems on demand. If the general-purpose layer becomes the dominant interface, it challenges traditional suites like Microsoft Office by demoting Word and Excel to the back-end while elevating agent-driven interfaces to the top.</p></li><li><p>Finally, if something like AGI emerges, the technical need for human steering could drop further, but human oversight and accountability may still be required for legitimacy. I&#8217;m assuming (for humanity&#8217;s sake!) that ownership and liability will ultimately remain in human hands.</p></li></ol><p>That concludes this short essay. There&#8217;s more that I can write on this topic, but this is all I could write in the time I&#8217;d allotted to myself. Let me know your thoughts, and thank you for reading all the way to the end :)</p>]]></content:encoded></item><item><title><![CDATA[Software Development in the Age of AI Coding Agents]]></title><description><![CDATA[When cost of code approaches zero, implementing intent and managing complexity become the job]]></description><link>https://www.modelanalysis.ai/p/software-development-in-the-age-of</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/software-development-in-the-age-of</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Fri, 26 Dec 2025 21:35:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!nwOj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8668c904-f4dd-45b2-8f7f-015ee8228ca4_1024x565.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>&#171;No AI was used in writing this article. 
Other than the obvious use of Gemini Nano Banana Pro for the images.&#187;</em></p><p>It&#8217;s hard to make predictions, especially about a future that&#8217;s rapidly unfolding like science fiction in real life and feels surreal and palpable at the same time. But, that&#8217;s exactly what I&#8217;ll try to do in this article: predict the future of software development across multiple dimensions.</p><div><hr></div><h2>AI Coding Agents Will Continue to Get Better</h2><p>As we stand at the cusp of 2026, I think we can confidently envision a future where almost all of the code will be written by AI coding agents. Here are a few supporting factors that will turn the vision into reality:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nwOj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8668c904-f4dd-45b2-8f7f-015ee8228ca4_1024x565.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nwOj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8668c904-f4dd-45b2-8f7f-015ee8228ca4_1024x565.png 424w, https://substackcdn.com/image/fetch/$s_!nwOj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8668c904-f4dd-45b2-8f7f-015ee8228ca4_1024x565.png 848w, https://substackcdn.com/image/fetch/$s_!nwOj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8668c904-f4dd-45b2-8f7f-015ee8228ca4_1024x565.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nwOj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8668c904-f4dd-45b2-8f7f-015ee8228ca4_1024x565.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nwOj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8668c904-f4dd-45b2-8f7f-015ee8228ca4_1024x565.png" width="1024" height="565" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8668c904-f4dd-45b2-8f7f-015ee8228ca4_1024x565.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:565,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:823204,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/182237010?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8668c904-f4dd-45b2-8f7f-015ee8228ca4_1024x565.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nwOj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8668c904-f4dd-45b2-8f7f-015ee8228ca4_1024x565.png 424w, https://substackcdn.com/image/fetch/$s_!nwOj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8668c904-f4dd-45b2-8f7f-015ee8228ca4_1024x565.png 848w, 
https://substackcdn.com/image/fetch/$s_!nwOj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8668c904-f4dd-45b2-8f7f-015ee8228ca4_1024x565.png 1272w, https://substackcdn.com/image/fetch/$s_!nwOj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8668c904-f4dd-45b2-8f7f-015ee8228ca4_1024x565.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>Cost of producing code will approach zero for all practical purposes. This is pure tokenomics at play. 
There&#8217;s little to distinguish the leading models from one another, and coupled with the inevitable algorithmic and infrastructure efficiencies, this can only lead to the eventual commoditization of coding LLMs (and of the agents built on top of them).</p></li><li><p>The three major limitations that currently plague LLM-based coding agents will shrink to minor ones over time:</p><ul><li><p>The amount of memory (context) that LLMs can hold at a given time will increase over the next couple of years, so much so that it won&#8217;t be the major limitation that it is today. Context engineering, much like prompt engineering from &#8216;23-24, will become a sidebar conversation.</p></li><li><p>Context rot and fragility to prompting will be addressed at the underlying technology and surrounding tooling levels, as well as by users learning to use the coding agents more effectively. For example, we will learn to discard a poisoned context and move to a fresh one quickly, and to configure system and tool prompts to extract the best performance.</p></li><li><p>LLMs will get better at following best practices and architectural principles. In my opinion, this is a matter of RL training on more and more examples of high-quality code annotated with comments highlighting best practices. In other words, the quality of generated code will improve, assuming that the right context and prompts with architectural guidance are provided.</p></li></ul></li><li><p>We will develop new techniques and methodologies to use AI coding agents effectively (e.g., spec-driven implementation and plan-driven programming). These practices will continue to evolve rapidly as long as the coding agents&#8217; capabilities do, and will mature as the agents themselves reach a mature state. 
For example, the importance of context engineering will diminish as context length increases and as coding agent developers learn what to include in the context and how to do so efficiently. Similarly, existing coding tools (e.g., IDEs and CLIs) will continue to morph and adapt to this new paradigm. We already see this transition, with IDEs and CLIs adapting their interfaces and tooling around agent-centric development. I predict that we will see integrated tooling that enables users to manage the entire lifecycle of an application (from testing to deployment) from within a single application like Cursor.</p></li><li><p>Note that we don&#8217;t need coding agents to work completely independently for hours and days in order for these predictions to come true. </p></li></ul><div><hr></div><h2><strong>What are the likely broad implications for software development?</strong> </h2><p><br>We&#8217;ll see Jevons Paradox in action. When the cost of writing code approaches zero, there will be more demand than ever to write new software, and the bottlenecks will move elsewhere:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IKgb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cb2b6ba-5ad9-4c8f-8acf-0e85b6bd3c3e_1024x565.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IKgb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cb2b6ba-5ad9-4c8f-8acf-0e85b6bd3c3e_1024x565.png 424w, https://substackcdn.com/image/fetch/$s_!IKgb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cb2b6ba-5ad9-4c8f-8acf-0e85b6bd3c3e_1024x565.png 848w, 
https://substackcdn.com/image/fetch/$s_!IKgb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cb2b6ba-5ad9-4c8f-8acf-0e85b6bd3c3e_1024x565.png 1272w, https://substackcdn.com/image/fetch/$s_!IKgb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cb2b6ba-5ad9-4c8f-8acf-0e85b6bd3c3e_1024x565.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IKgb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cb2b6ba-5ad9-4c8f-8acf-0e85b6bd3c3e_1024x565.png" width="1024" height="565" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4cb2b6ba-5ad9-4c8f-8acf-0e85b6bd3c3e_1024x565.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:565,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:827931,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/182237010?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cb2b6ba-5ad9-4c8f-8acf-0e85b6bd3c3e_1024x565.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IKgb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cb2b6ba-5ad9-4c8f-8acf-0e85b6bd3c3e_1024x565.png 424w, 
https://substackcdn.com/image/fetch/$s_!IKgb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cb2b6ba-5ad9-4c8f-8acf-0e85b6bd3c3e_1024x565.png 848w, https://substackcdn.com/image/fetch/$s_!IKgb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cb2b6ba-5ad9-4c8f-8acf-0e85b6bd3c3e_1024x565.png 1272w, https://substackcdn.com/image/fetch/$s_!IKgb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cb2b6ba-5ad9-4c8f-8acf-0e85b6bd3c3e_1024x565.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p>Deciding what software to build or what changes to make to existing software. There&#8217;s a tremendous amount of work done outside of coding. You need to understand the business problem. You need to understand the trade-offs among scalability, maintainability, latency, and throughput, and choose the options suitable for the context at hand. You need to agree on the interfaces with other teams. And then you have to find a way to implement it in an existing system as elegantly as possible while keeping in mind the impact it might have on end users.</p></li><li><p>Reviewing and testing what&#8217;s been written. In &#8216;26 it might still be about ensuring that the correct architectural choices are made, but increasingly it will be about ensuring that the intent is captured appropriately, because language is an imperfect communication mechanism and even the best coding agent cannot divine the user&#8217;s intentions correctly all the time.</p></li><li><p>Iterating on what&#8217;s been written. Just because software can be produced easily doesn&#8217;t mean we can get to what we want very quickly. For one thing, it&#8217;s often very difficult to know what one wants, and getting things right requires iterations. 
So, we&#8217;ll likely try options A, B, and C because the cost of trying things is trivial, while the cost of getting it wrong remains high.</p></li></ol><div><hr></div><h2><strong>What does this mean for people/jobs?</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pkxY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2dbe8-e71c-4d94-ab4a-23ef76d9b6d9_1024x565.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pkxY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2dbe8-e71c-4d94-ab4a-23ef76d9b6d9_1024x565.png 424w, https://substackcdn.com/image/fetch/$s_!pkxY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2dbe8-e71c-4d94-ab4a-23ef76d9b6d9_1024x565.png 848w, https://substackcdn.com/image/fetch/$s_!pkxY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2dbe8-e71c-4d94-ab4a-23ef76d9b6d9_1024x565.png 1272w, https://substackcdn.com/image/fetch/$s_!pkxY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2dbe8-e71c-4d94-ab4a-23ef76d9b6d9_1024x565.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pkxY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2dbe8-e71c-4d94-ab4a-23ef76d9b6d9_1024x565.png" width="1024" height="565" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11f2dbe8-e71c-4d94-ab4a-23ef76d9b6d9_1024x565.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:565,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:866677,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/182237010?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2dbe8-e71c-4d94-ab4a-23ef76d9b6d9_1024x565.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pkxY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2dbe8-e71c-4d94-ab4a-23ef76d9b6d9_1024x565.png 424w, https://substackcdn.com/image/fetch/$s_!pkxY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2dbe8-e71c-4d94-ab4a-23ef76d9b6d9_1024x565.png 848w, https://substackcdn.com/image/fetch/$s_!pkxY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2dbe8-e71c-4d94-ab4a-23ef76d9b6d9_1024x565.png 1272w, https://substackcdn.com/image/fetch/$s_!pkxY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f2dbe8-e71c-4d94-ab4a-23ef76d9b6d9_1024x565.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>There will be many hardcore vibe coders who just want to develop applications to solve the business problem at hand. These individuals will gravitate towards platforms like Lovable and Replit (and other offerings that hyperscalers like AWS are likely to provide). These are the Squarespace/Shopify-like platforms of the future.</p></li><li><p>Software engineers who understand code, know architectural best practices, and can wield the agents/tools will operate the reliable, scalable, and maintainable enterprise software of the future. This forms the bulk of the current SWE space, and I predict that Jevons paradox will hold here. The skills needed in this space will change, but the number of jobs won't shrink, as demand grows in response even as the cost of code approaches zero. 
</p></li><li><p>A few specialists will remain to write the core engines of the software world - those who develop technical libraries, frameworks, operating systems, etc. There have been very few individuals of this calibre, and the revolution in coding agents isn't going to change that.</p></li><li><p>New graduates will end up falling into one of the above three buckets. Those with more of a product bent might land in the first bucket. Those with the discipline and aptitude to learn through actual coding (when an all-powerful coding agent is a tantalizing keystroke away) and through reading technical books will be valuable and employable, joining the latter two categories.</p></li></ul><p>A 10x individual of the future would be a combination of a top product manager and a top programmer of today. And, just as today, they will be rare to find. They will know how to:</p><ol><li><p>identify the right problem to solve</p></li><li><p>frame the problem so that finding the solution becomes possible</p></li><li><p>shape that solution into an elegant, maintainable, and scalable design</p></li></ol><div><hr></div><h2><strong>What does this mean for the enterprises?</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XNPf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c7efba-72d3-4bc0-b81d-37d07c0757c1_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XNPf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c7efba-72d3-4bc0-b81d-37d07c0757c1_1024x559.png 424w, 
https://substackcdn.com/image/fetch/$s_!XNPf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c7efba-72d3-4bc0-b81d-37d07c0757c1_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!XNPf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c7efba-72d3-4bc0-b81d-37d07c0757c1_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!XNPf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c7efba-72d3-4bc0-b81d-37d07c0757c1_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XNPf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c7efba-72d3-4bc0-b81d-37d07c0757c1_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61c7efba-72d3-4bc0-b81d-37d07c0757c1_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:756643,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/182237010?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c7efba-72d3-4bc0-b81d-37d07c0757c1_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!XNPf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c7efba-72d3-4bc0-b81d-37d07c0757c1_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!XNPf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c7efba-72d3-4bc0-b81d-37d07c0757c1_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!XNPf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c7efba-72d3-4bc0-b81d-37d07c0757c1_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!XNPf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c7efba-72d3-4bc0-b81d-37d07c0757c1_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>Adoption will be gradual but unrelenting. I think by this point it must be clear to even the staunchest skeptics that coding agents aren&#8217;t just part of the AI hype. Some enterprises have already started allowing coding agents for in-house application development, and their number will only increase in 2026 and beyond.</p></li><li><p>Managing the complexity of AI-generated code will be challenging. To borrow from Fred Brooks, software systems have always had two different types of complexity - essential complexity (the intrinsic complexity of the problem we&#8217;re actually solving) and accidental complexity (all the implementation details we&#8217;ve accumulated to &#8216;manage&#8217; the application). Coding agents make the addition of accidental complexity incredibly easy: imagine being able to write thousands of lines of code in a few hours where each thing &#8216;just about&#8217; works. Modularization and compositionality, along with detailed specs, will become ever more important to address this problem.</p></li><li><p>New SDLC methodologies for this new paradigm will spring up as organizations employ coding agents. One can imagine a method whereby each module&#8217;s scope is kept narrow and specified with great precision, and code is produced every time any part of the spec changes. This is CI/CD on steroids, and it wouldn&#8217;t be possible without implementing the necessary controls (see next paragraph).</p></li><li><p>It is not enough for agent-written code to compile and pass unit tests. 
Enterprises will likely need clear IP provenance for generated code and secure-by-default pipelines that produce SBOMs and enforce supply-chain controls. I expect new controls such as specs treated as versioned artifacts, automated functional and security testing, and policy checks as merge gates. Of course, AI will play a key role throughout this pipeline.</p></li><li><p>Organizational structures will have to change to reflect the new capabilities and best practices. I can see a large number of small, independent teams, each owning a significant chunk of the functionality, complemented by centralized teams that ensure architectural cohesiveness and manage operational data.</p></li><li><p>It is a mistake for organizations to stop investing in junior talent. While senior engineers bring experience, new graduates are natives of the AI era. They adapt to new workflows more naturally and will ultimately become the backbone of the industry.</p></li></ul><div><hr></div><p>In conclusion, I wouldn&#8217;t worry if I were a software engineer today or aspiring to become one. 
As the barrier between intent and code thins, software engineers will be valued for steering AI with precision and keeping complexity under control.</p>]]></content:encoded></item><item><title><![CDATA[December 2024 vs December 2025: What I Got Right (and Wrong) About AI]]></title><description><![CDATA[Here&#8217;s my evaluation of the predictions I made back in December 2024 with the current state of affairs vis a vis AI.]]></description><link>https://www.modelanalysis.ai/p/december-2024-vs-december-2025-what</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/december-2024-vs-december-2025-what</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Tue, 23 Dec 2025 19:20:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eI_P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8423f595-7e6a-40bb-bd0e-ec93eb6eb56c_2816x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Here&#8217;s my evaluation of the predictions I made back in December 2024 against the current state of affairs vis-&#224;-vis AI. Hope you find this interesting. What were your predictions, and how did they pan out in 2025? Don&#8217;t forget to share your thoughts in the comments section!</p><p>What I wrote in 2024 is in <em>italics</em>. 
You can see that I got most of the predictions right, but there were a couple of surprises as well.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eI_P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8423f595-7e6a-40bb-bd0e-ec93eb6eb56c_2816x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eI_P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8423f595-7e6a-40bb-bd0e-ec93eb6eb56c_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!eI_P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8423f595-7e6a-40bb-bd0e-ec93eb6eb56c_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!eI_P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8423f595-7e6a-40bb-bd0e-ec93eb6eb56c_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!eI_P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8423f595-7e6a-40bb-bd0e-ec93eb6eb56c_2816x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eI_P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8423f595-7e6a-40bb-bd0e-ec93eb6eb56c_2816x1536.png" width="724" height="394.81868131868134" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8423f595-7e6a-40bb-bd0e-ec93eb6eb56c_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:5918353,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/182172310?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8423f595-7e6a-40bb-bd0e-ec93eb6eb56c_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eI_P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8423f595-7e6a-40bb-bd0e-ec93eb6eb56c_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!eI_P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8423f595-7e6a-40bb-bd0e-ec93eb6eb56c_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!eI_P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8423f595-7e6a-40bb-bd0e-ec93eb6eb56c_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!eI_P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8423f595-7e6a-40bb-bd0e-ec93eb6eb56c_2816x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><p><em>As 2024 concludes, I&#8217;ve been thinking about what&#8217;s next for AI. Here are ten observations and predictions for 2025 - would love to hear your perspectives and what I might have missed!<br><br>1. Given the huge R&amp;D investment going into frontier AI, we&#8217;ll continue to see tremendous amounts of innovation and progress along multiple fronts.</em></p><p>This turned out to be very true. Hype and expectations continued to drive investment into AI research, engineering, infrastructure, and applications. LLMs dominated research, suffocating most other areas due to a lack of attention and funding. If LLMs don&#8217;t lead us to AGI, then consider this time and effort spent as a journey down an &#8220;off-ramp on the way to AGI,&#8221; as Yann LeCun put it.<br><br><em>2. 
Unlike much of traditional computer science where algorithms can be analyzed on paper, deep learning remains highly empirical - ideas must be tested at scale to confirm their viability. With more known ideas than can be developed and tested in 2025, expect progress to continue at pace for several years before any slowdown.</em></p><p>Deep learning remains very empirical, though valiant efforts are being made by a few die-hard mathematicians to put the field on a solid theoretical footing. </p><p>This <a href="https://ma-lab-berkeley.github.io/deep-representation-learning-book/Chx1.html">book</a> aims &#8220;to establish a principled and rigorous approach to understand deep neural networks and, more generally, intelligence itself.&#8221; Here&#8217;s an <a href="https://www.youtube.com/watch?v=QWidx8cYVRs">MLST video interview with Prof. Yi Ma</a>, a co-author of the book. Here&#8217;s <a href="https://www.youtube.com/watch?v=AWqvBdqCAAE">another video on MLST</a> that talks about applying category theory to deep learning.</p><p><em>3. Still, progress will continue to appear gradual to those directly involved, and will appear to take place in occasional big leaps to the general public.</em></p><p>This unfolded differently than I&#8217;d envisioned. Because of the intense competition between the frontier labs, new models and features were released to the public without much holding time. New models were announced regularly with reports of better benchmark scores. Most casual users of the chatbots stopped noticing any discernible improvements, and in some cases wanted their favorite old model (OpenAI&#8217;s 4o) to be kept available. Image and video generation improved steadily (people consistently have five fingers per hand, and text can finally be displayed without problems), reaching its pinnacle with Google Gemini&#8217;s Nano Banana!<br><br><em>4. 
LLM Maximalism (the idea that foundation models such as ChatGPT will lead to AGI with just more data and compute - say with version 6!) is dead. But transformer based LLMs are the closest thing to AGI we still have.</em></p><p>I was wrong here. LLM Maximalism got a second wind with reasoning models. Test-time scaling via Reinforcement Learning with Verifiable Rewards (RLVR) was exactly what was needed to unlock the intelligence hidden within the pre-trained model (which was created by training for next-token prediction on web-scale data). RLVR enabled code assistants to become the second killer app built on LLMs (the first was chatbots - in case you&#8217;re wondering).<br><br><em>5. AI/ML techniques such as Reinforcement learning, search/planning, bayesian networks and logic will claim their rightful place next to pure deep learning based architectures. The path to AGI is pursued along one of these paths:<br>A. A reasoning system with LLM at its core complemented by RL and Search (e.g., OpenAI&#8217;s o3). Here the LLM sits sort of separately and search happens in token space.</em></p><p>The &#8216;reasoning system&#8217; implemented by o3, and then by all the frontier labs, turned out to be RLVR (see the previous point). A particular version of RLVR was revealed to the general public through this <a href="https://arxiv.org/abs/2501.12948">paper</a> by DeepSeek, a Chinese lab.<br><em><br>B. An enhanced LLM with reasoning taking place within the models&#8217; latent space (e.g., Meta&#8217;s recent COCONUT paper).</em></p><p>Latent reasoning is still being worked on. Perhaps we&#8217;ll see it in 2026.</p><p><em>C. Some other technique that doesn&#8217;t involve an LLM at all (e.g., JEPA).</em><br>JEPA is still being worked on, as are a few other architectures. With LLMs getting all the attention and investment dollars, things will likely stay this way in 2026.<br><br><em>6. 
AGI benchmarking remains challenging, as existing metrics either can be gamed or measure only necessary but insufficient conditions. While Fran&#231;ois Chollet is developing ARC-AGI-2, more effort is needed in this area.</em></p><p>AI continues to have a benchmark problem. I think many people, even those who have been closely following AI progress, underestimate the power of the incentive for LLM makers to target benchmark maxxing. </p><p>It&#8217;s classic game theory - even though each of the frontier labs is aware that targeting benchmarks isn&#8217;t very helpful for model performance in the real world, they have no choice but to invest in it because every model release is prominently accompanied by a benchmark scorecard. <br><br>Labs target benchmarks by training on specially created data that matches test data (even if the exact benchmark tests are private, labs have access to examples and can create more of them). RLVR makes benchmark maxxing even easier, btw.<br><br><em>7. Evaluators/verifiers will be developing rapidly as they are needed for reasoning and agentic systems.</em></p><p>Turns out this is a hard problem to solve outside of special areas like coding. LLMs (and even humans) are being used as judges in areas without definite answers. Here&#8217;s a <a href="https://eugeneyan.com/writing/llm-evaluators/">fantastic article on the use of LLMs as evaluators</a>.<br><br><em>8. Ever-smaller models will achieve capabilities currently limited to much larger models, bringing costs down and pushing more AI capabilities to edge devices.</em></p><p>This prediction turned out to be accurate. 2025 saw many small(er) models that can run on laptops/phones with surprisingly strong capabilities. This trend will continue into 2026 and beyond, and this promise is what supports Apple&#8217;s stock price :)<br><br><em>9. 
Research efforts (such as mechanistic interpretation) towards understanding and explaining foundational models will progress slowly given the relatively low investment. But this is an essential ingredient in addressing the alignment problem.</em></p><p>Mechanistic interpretability continues to progress, albeit at a very small scale. Anthropic remains the only major frontier lab that seems to focus a bit on the alignment problem (at least publicly), while the industry at large seems bent on advancing AI as quickly as possible.<br><br><em>10. I agree with the idea that Embodied AI needs its own foundational model, similar to how LLMs unified different natural language processing capabilities (such as translation, sentiment classification, conversation, etc.). Such a model, incorporating LLM-like knowledge and reasoning, would be crucial even for specific applications like self-driving. However, this breakthrough is unlikely in 2025.</em></p><p>This prediction turned out to be true; it isn&#8217;t an easy problem to solve. Here&#8217;s an article from Physical Intelligence that explains <a href="https://www.physicalintelligence.company/blog/olympics">why embodied AI is such a tough problem to crack</a>. 
It also gives you a glimpse of the current state of the art in robotics.</p>]]></content:encoded></item><item><title><![CDATA[An AI-themed crossword to kickstart your holidays!]]></title><description><![CDATA[UPDATE: Here&#8217;s the crossword web page I vibe-coded specifically for this puzzle.]]></description><link>https://www.modelanalysis.ai/p/an-ai-themed-crossword-to-kickstart</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/an-ai-themed-crossword-to-kickstart</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Fri, 28 Nov 2025 20:14:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7JY1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43aa9024-b322-4248-888c-cbd4915d523b_1350x1648.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>UPDATE: Here&#8217;s the <a href="https://vibe-crossword-ram-komarraju.netlify.app/">crossword web page</a> I vibe-coded specifically for this puzzle. You can interactively fill in the puzzle, and of course reveal the solution.</strong></p><p>I like AI and I like crosswords, and I hope you share at least one of those two interests. So, dear reader, here&#8217;s an AI-themed crossword to kickstart your holiday season! I tried to make it relatively easy, but couldn&#8217;t quite avoid using a few technical acronyms and bits of jargon. Give it a try and let me know what you think. I&#8217;ll share the solution tomorrow in a follow-up post.</p><p>This is my first time constructing a crossword, so consider this an experiment. If there&#8217;s enough interest, I&#8217;ll put more time into crafting a Christmas edition with the style and polish of a Monday New York Times puzzle.</p><p><em>Note: I used crosserville.com to build this crossword. 
It&#8217;s free, and I found it intuitive and easy to use.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7JY1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43aa9024-b322-4248-888c-cbd4915d523b_1350x1648.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7JY1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43aa9024-b322-4248-888c-cbd4915d523b_1350x1648.png 424w, https://substackcdn.com/image/fetch/$s_!7JY1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43aa9024-b322-4248-888c-cbd4915d523b_1350x1648.png 848w, https://substackcdn.com/image/fetch/$s_!7JY1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43aa9024-b322-4248-888c-cbd4915d523b_1350x1648.png 1272w, https://substackcdn.com/image/fetch/$s_!7JY1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43aa9024-b322-4248-888c-cbd4915d523b_1350x1648.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7JY1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43aa9024-b322-4248-888c-cbd4915d523b_1350x1648.png" width="1350" height="1648" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43aa9024-b322-4248-888c-cbd4915d523b_1350x1648.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1648,&quot;width&quot;:1350,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:454192,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/180200438?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43aa9024-b322-4248-888c-cbd4915d523b_1350x1648.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7JY1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43aa9024-b322-4248-888c-cbd4915d523b_1350x1648.png 424w, https://substackcdn.com/image/fetch/$s_!7JY1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43aa9024-b322-4248-888c-cbd4915d523b_1350x1648.png 848w, https://substackcdn.com/image/fetch/$s_!7JY1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43aa9024-b322-4248-888c-cbd4915d523b_1350x1648.png 1272w, https://substackcdn.com/image/fetch/$s_!7JY1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43aa9024-b322-4248-888c-cbd4915d523b_1350x1648.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Why Is It So Hard to Get AI Agents to Work in the Enterprise?]]></title><description><![CDATA[Hint: it&#8217;s not just picking the wrong business case, bad data, or MCP or any other framework.]]></description><link>https://www.modelanalysis.ai/p/why-is-it-so-hard-to-get-ai-agents</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/why-is-it-so-hard-to-get-ai-agents</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Sat, 22 Nov 2025 18:23:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!I0rI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86bb65c6-f8aa-460a-a384-3fe1d668da5a_1024x559.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I0rI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86bb65c6-f8aa-460a-a384-3fe1d668da5a_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I0rI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86bb65c6-f8aa-460a-a384-3fe1d668da5a_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!I0rI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86bb65c6-f8aa-460a-a384-3fe1d668da5a_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!I0rI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86bb65c6-f8aa-460a-a384-3fe1d668da5a_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!I0rI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86bb65c6-f8aa-460a-a384-3fe1d668da5a_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I0rI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86bb65c6-f8aa-460a-a384-3fe1d668da5a_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86bb65c6-f8aa-460a-a384-3fe1d668da5a_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1013034,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/179055343?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86bb65c6-f8aa-460a-a384-3fe1d668da5a_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I0rI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86bb65c6-f8aa-460a-a384-3fe1d668da5a_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!I0rI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86bb65c6-f8aa-460a-a384-3fe1d668da5a_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!I0rI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86bb65c6-f8aa-460a-a384-3fe1d668da5a_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!I0rI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86bb65c6-f8aa-460a-a384-3fe1d668da5a_1024x559.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The year 2025 was supposed to be the &#8220;Year of the Agent,&#8221; but every week seems to bring another study from a major university or a leading consulting firm explaining why agent deployment in the enterprise has been far slower than expected. Many resort to platitudes like &#8220;put the business case front and center,&#8221; or &#8220;focus on getting the data right,&#8221; as if not following those tenets is all that&#8217;s holding agents back. Others blame MCP, LangChain, or some other framework. Yet others put the blame on software engineers for not knowing how to build agents properly.</p><p>In other words, many people are prone to blame anything but the LLM, the brain of the agent. 
In the rest of this article, I will share the current limitations of LLMs that make it extremely hard to implement agents in the enterprise. I&#8217;ll then cover potential ways in which these limitations can be addressed.</p><p><em>&#171;Note that I don&#8217;t include &#8220;lack of continual learning&#8221; as a major limitation preventing us from deploying AI agents successfully in the enterprise. I&#8217;m assuming that addressing the limitations below is sufficient for this purpose, but I might have to think more deeply about the lack of continual learning at some point.&#187;</em></p><h3><strong>Hallucinations</strong></h3><p>A key limitation driving even state-of-the-art models to be unreliable agents is hallucination, whereby a model generates plausible but wrong responses (i.e., responses inconsistent with the input context, the model&#8217;s previous outputs, or the knowledge contained in its training data).</p><p>Models have always suffered from hallucinations, but the impact is far greater when they are employed as autonomous agents:</p><ul><li><p>There&#8217;s simply a <strong>lot more opportunity for something to go wrong</strong> with everything that happens while an agent is executing a task: reasoning/planning, tool calling, long-term memory/context, and multi-turn execution. Models tend to hallucinate more when the context is embedded with incorrect or conflicting information (which is common and often unavoidable in real-world agent deployments). Where an average human might recognize the incorrect information and choose to ignore or correct it, a model may instead rationalize it and expand upon it, compounding the error. Once a hallucination makes it into the context window of a model, there is no easy way to identify and remove it.</p></li><li><p>Model hallucination rates remain far higher than many imagine (ranging anywhere from 3% to 30+%, depending on the benchmark and the type of hallucination being measured). 
What&#8217;s more, <strong>reasoning models</strong> used in agentic solutions <strong>hallucinate more than smaller/&#8216;dumber&#8217; models</strong>.</p></li></ul><p>Of course, frontier labs are aware of this and are actively working to improve the situation. For example, OpenAI&#8217;s recent work proposes shifting the reward structure of benchmarks so that confident errors are penalized and recognizing uncertainty is credited. However, this does not fully solve the core problem, and OpenAI admits that even with improvements, hallucinations would still occur. BTW, I was able to elicit hallucinations from GPT-5.1, released two weeks ago, using the same techniques that I used in the past.</p><h3><strong>Limited Memory and Context Rot</strong></h3><p>An agent&#8217;s &#8220;memory&#8221; is the set of all tokens fed into the context window with every turn. Consequently, an agent&#8217;s performance ultimately depends on what fits into that context window. But LLMs have only a limited context window, on which the entire agentic solution ends up being built. This has led to the rise of &#8220;context engineering&#8221; as a profession. A context engineer has to decide what parts of the overall context (e.g., previous user instructions, any supporting data, the LLM&#8217;s responses, and the history of tool calls/responses) to embed within the context window at any time.</p><p>Context engineers also have to deal with &#8220;context rot.&#8221; You see, while a model should in theory be able to pay attention to everything within its context window, that&#8217;s rarely the case in practice. An LLM&#8217;s performance becomes increasingly unreliable as the context window grows. A user might state a critical fact (the &#8220;needle&#8221;) at the beginning of a conversation. By the time we reach the 15-minute mark, that critical fact might be buried in a hundred-thousand-token &#8220;haystack&#8221; of task execution history. 
Because of context rot, the LLM&#8217;s ability to &#8220;find&#8221; that needle is now <em>lower</em> than it was at the beginning.</p><p>Of course, frontier labs are working on increasing the context window length. This is a challenging task because the cost of attention computation grows quadratically with context length. Any attention optimization techniques that control computational cost and speed need to avoid worsening the context rot problem.</p><h3><strong>Fragility to Prompt Phrasing</strong></h3><p>Another fundamental limitation is the sensitivity of LLMs to the exact phrasing and wording of inputs. Hence the explosion of prompt engineering as a discipline, complete with dozens of guides from <a href="https://docs.anthropic.com/claude/docs/prompt-engineering">Anthropic</a>, <a href="https://platform.openai.com/docs/guides/prompt-engineering">OpenAI</a>, and <a href="https://ai.google.dev/docs/prompt_best_practices">Google</a>, and even &#8220;prompt optimizers&#8221; that algorithmically search for better prompts. Even if we follow all the guidelines, it is impossible in practice to control everything that makes it into the context window of an LLM, and a single word, a clause, or emotional language can produce dramatically different outputs/behaviors.</p><p>In messy real-world environments, inputs often come in varying formats and tones from users, tool/LLM outputs, and logs. Since the LLM can misinterpret any of these uncontrolled inputs (e.g., literally interpret and act on an emotional outburst from the user), the system stays inherently unpredictable. 
And if inputs truly <em>could</em> be controlled precisely, a deterministic workflow would often suffice without needing an LLM in the first place.</p><p>To sum it up, the burden should be on the LLM to behave robustly, not on developers to discover ritual incantations to coax the system into functioning correctly, or to hope and pray that the context stays clean.</p><h3><strong>The Ultimate Dilemma: Deterministic Automation vs. Intelligent Agents</strong></h3><p>Finally, and most importantly, there&#8217;s perhaps a central, unsolvable contradiction in current LLM-based agents:</p><ol><li><p>The <em>promise</em> of an agent is that it can reason, plan, and think creatively to solve complex problems and perform tasks that, until now, only humans could perform.</p></li><li><p>The <em>requirement</em> of the &#8220;enterprise&#8221; is reliability, predictability, and auditable outcomes.</p></li></ol><p>And the contradiction between promise and requirement is that LLMs are not reliable, not predictable, and (perhaps) not auditable.</p><p>Large enterprises employ thousands, if not hundreds of thousands, of HGIs (human general intelligences). Wherever possible, these organizations continue to replace manual activities with deterministic automation because it&#8217;s more reliable and less expensive. LLMs are AGIs (<em>LLMs are artificial, and they have general intelligence, so I&#8217;ll call them AGIs, thank you</em>), but even SOTA versions are less reliable, less predictable, and less capable than HGIs across multiple dimensions.</p><h3><strong>The Way Forward</strong></h3><ul><li><p><strong>Build robust scaffolding around LLMs</strong></p><p>Coding agents are a great example of this. Of course, the nature of the domain lends itself to such scaffolding. It&#8217;s not that LLMs don&#8217;t hallucinate or make mistakes when coding. 
It&#8217;s just that there are often verifiable and deterministic ways to check for problems and feed the results back to the LLM to fix them. For example, a coding assistant can send lint results, compiler errors/warnings, runtime logs, unit test failures, etc. to the LLM to fix any issues.</p><p>But building such scaffolding for other use cases may not be possible or may prove to be very expensive. The scaffolding approach is where I think we are headed in the short to medium term, because I can&#8217;t see the fundamental limitations I mentioned above going away anytime soon, and organizations will find the promise and allure of LLM agents too strong not to try this approach.</p></li><li><p><strong>Limit LLM/agent usage to small/narrow steps within deterministic workflows</strong></p><p>This approach involves using LLMs only for very specific steps within a larger workflow, tightly controlling what&#8217;s fed into their context windows and thus increasing their reliability. While this might work, the benefit to be gained is likely to be very limited, and a traditional, fully deterministic solution would probably be cheaper to implement.</p></li><li><p><strong>Limit their usage to areas where failure isn&#8217;t critical, or keep a human in the loop</strong></p><p>In this scenario, we recognize the limitations of LLMs and use them in areas where failure isn&#8217;t critical. Deep Research is a major example of this approach. The reports generated by deep research implementations contain multiple hallucinations, but they are usually meant for a human to use as part of their work rather than to be published as end products. 
Of course, having humans in the loop doesn&#8217;t mean much if they lack the expertise or the patience to carefully read through the reports, use only what&#8217;s correct, and discard or fix inaccurate information.</p></li><li><p><strong>Improve the LLMs by improving the benchmarks and algorithms</strong></p><p>This one is for the model labs, and each of the following is a challenging problem with no clear/known solution:</p><p>Change LLM training objectives and benchmarks to punish guessing and reward acknowledging a lack of knowledge. There&#8217;s probably a theoretical limit to how much hallucination can be reduced, however.</p><p>Find ways to increase context window length and fix context rot.</p><p>Make the LLM more robust to minor changes/tweaks in prompts. This change will likely make LLMs seem boring/non-creative as chatbots, so we are likely to see dedicated LLMs trained specifically to be effective in agentic settings.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Anatomy of Frontier AI Systems in 2025]]></title><description><![CDATA[When you interact with ChatGPT, Claude, or Gemini, it may feel like a simple exchange with a single model.]]></description><link>https://www.modelanalysis.ai/p/anatomy-of-frontier-ai-systems-in</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/anatomy-of-frontier-ai-systems-in</guid><pubDate>Fri, 07 Nov 2025 06:25:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vwg9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401a1048-df45-41d5-af6c-9d292210de28_1618x1210.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you interact with ChatGPT, Claude, or Gemini, it may feel like a simple exchange with a single model. 
In reality, you&#8217;re interacting with a large distributed system whose complexity has been growing dramatically as the leading LLM providers continue to add new capabilities to their offerings. </p><p>Between the prompt you enter (which can be multimodal with text, voice, images, and video) and what you see in response sits a layered architecture of user interfaces, routers, orchestrators, memory stores, specialized models, tools, agents, code executors, and content renderers. These systems have evolved from basic chatbots into modular platforms where choices about context, privacy, latency, and cost are made continuously and automatically. Understanding how the pieces fit together makes you a better user.</p><p>For developers and architects, this understanding explains why the same LLM apps can behave differently across machines and users. The set of available tools, their configurations, the memory retrieved and embedded in the context, and the routing and pre/post-processing choices ultimately determine the outcomes. It also shows them where to look when troubleshooting or when designing an innovative agentic solution.</p><p>The diagram below shows one way in which such a system could be designed, followed by a description of each of the numbered parts. 
The actual implementation details would vary from provider to provider, of course.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vwg9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401a1048-df45-41d5-af6c-9d292210de28_1618x1210.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vwg9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401a1048-df45-41d5-af6c-9d292210de28_1618x1210.png 424w, https://substackcdn.com/image/fetch/$s_!vwg9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401a1048-df45-41d5-af6c-9d292210de28_1618x1210.png 848w, https://substackcdn.com/image/fetch/$s_!vwg9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401a1048-df45-41d5-af6c-9d292210de28_1618x1210.png 1272w, https://substackcdn.com/image/fetch/$s_!vwg9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401a1048-df45-41d5-af6c-9d292210de28_1618x1210.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vwg9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401a1048-df45-41d5-af6c-9d292210de28_1618x1210.png" width="1456" height="1089" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/401a1048-df45-41d5-af6c-9d292210de28_1618x1210.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1089,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:255231,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/178242834?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401a1048-df45-41d5-af6c-9d292210de28_1618x1210.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vwg9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401a1048-df45-41d5-af6c-9d292210de28_1618x1210.png 424w, https://substackcdn.com/image/fetch/$s_!vwg9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401a1048-df45-41d5-af6c-9d292210de28_1618x1210.png 848w, https://substackcdn.com/image/fetch/$s_!vwg9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401a1048-df45-41d5-af6c-9d292210de28_1618x1210.png 1272w, https://substackcdn.com/image/fetch/$s_!vwg9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401a1048-df45-41d5-af6c-9d292210de28_1618x1210.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>1. The User</strong></h2><p>Every session starts with the user&#8217;s intent. You enter text, speak a command, or attach a file, and that input becomes the first step in a long pipeline. Even at this starting point, the system makes decisions on your behalf: which hidden instructions to include, what metadata to attach, and whether to collect additional local context such as timezone or project settings. Those invisible choices shape how the downstream components interpret your request and decide what to do next.</p><h2><strong>2. The Native App</strong></h2><p>The front-end is no longer a thin veneer that just supports text-in and text-out. Instead, today&#8217;s clients play an active role in capturing the intent, managing local state, and integrating with your device. 
A desktop assistant can listen for a wake word, read files you&#8217;ve explicitly shared, or summarize notifications. An IDE assistant can index repositories and extract build errors to include in the context. A browser client can extract useful context from the web page along with the user&#8217;s prompt to pass on to the server.</p><h2><strong>3. Client-Side Tools</strong></h2><p>Client-side tools ground the system in your environment. With your permission, the Native App may use your mic as input, scan selected documents, assemble a code context from your workspace, or even use MCP to make calls to external services (e.g., shopping carts, travel agents, calendars). Many clients parse chosen files, embed their contents, and even build an index for fast search and retrieval so that when you ask a question, the relevant snippets can be included in the prompt sent to the server. Done well, this creates personalization without overflowing the context. Sensitive material should stay on-device unless you explicitly opt in, and the client should send only the minimum information required to answer your question accurately.</p><h2><strong>4. Router &amp; Pre-Processor</strong></h2><p>Once the request leaves the device, a server-side router becomes the air traffic controller. It classifies intent, inspects metadata, applies safety and normalization, and selects the appropriate path: language model, coding model, image generator, video generator, or an agent (e.g., Deep Research). The router also enriches your input with server-side context (conversation summaries, long-term memory, or provider policies/guard-rails) and applies templates before forwarding the prompt.</p><p>Apart from providing access to different models with a rich and diverse set of functionality, this routing capability allows the LLM provider to save money by running simple(r) queries on smaller/faster LLMs.</p><h2><strong>5. 
Image/Video Generation</strong></h2><p>If the user&#8217;s request requires generating images or video, the orchestrator invokes a generative model (distinct from LLMs) specialized for this purpose.</p><h2><strong>6. LLM Orchestrator</strong></h2><p>The Orchestrator calls the LLMs identified by the Router. It may end up calling just one LLM, or a sequence of them (e.g., start with a smaller, fast model to parse tasks and draft a plan, escalate to a larger reasoning model for multi-step logic, and switch to a code-tuned model to generate or execute programs). Intermediate results are stitched together, cached where useful, and reintroduced into the context for subsequent steps. If the intermediate responses from the LLMs include a request to make a tool call on the server side, the orchestrator makes those calls, adds the resulting content to the context, and submits it back to the LLM for processing. See section 9 for more details.</p><h2><strong>7. LLMs</strong></h2><p>This part is simple, really: a given LLM takes the context submitted by the Orchestrator as input and provides its response (one token at a time).</p><h2><strong>8. Agent Layer</strong></h2><p>For complex workflows, an agent such as Deep Research may sit between the orchestrator and tools. The agent is responsible for executing the &#8220;agentic&#8221; loop by leveraging all the tools and a suitable LLM.</p><h2><strong>9. Tools including Memory System &amp; Execution Environment</strong></h2><p>In order to respond to the user&#8217;s prompt, the LLM might choose to make a tool call. The Orchestrator interprets these calls, executes them, and returns the results to the LLM. Web search grounds answers in fresh information. A calculator guarantees numerical precision. A code runner executes snippets and returns outputs. API connectors let the system perform external actions such as querying a travel database, reading a calendar, or placing an order on a popular shopping site. 
Each tool&#8217;s output is appended to the context and the model continues, forming a tight loop of &#8220;plan &#8594; act &#8594; observe &#8594; refine.&#8221;</p><p>Because base models are stateless and context-limited, the system relies on memory to provide continuity. Short-term memory supplies the immediate chat history. Long-term memory retrieves relevant facts from historical chats and any additional documents and artefacts (e.g., supplied as part of a project conversation). The orchestrator then augments the current prompt with only those pieces of information required for the task at hand. The result is an experience that can pick up threads from prior conversations or recall user preferences.</p><h2><strong>10. Post-Processor</strong></h2><p>The text that streams back to you is rarely the raw model output. A post-processor formats the response and applies safety checks. It may redact sensitive material, downgrade uncertain claims, or request another pass from the model with tighter constraints. </p><h2><strong>11. Response Delivery and Feedback</strong></h2><p>The response returns through the client, which can take local actions by invoking tools such as saving a file, updating a dashboard, or reading the output aloud. From your perspective, this might look like a single conversation, but under the hood it&#8217;s a carefully managed pipeline balancing speed, accuracy, privacy, and price.</p><p><strong>In conclusion,</strong> when using LLMs from leading providers, we don&#8217;t interact directly with a model but with complex systems built around models. 
The LLM remains the centerpiece, but it is part of a broader operating stack where memory and tools define the experience as much as the LLM does.</p><p></p>]]></content:encoded></item><item><title><![CDATA[LLMs and Human Brains]]></title><description><![CDATA[Eight analogies exploring how they each learn]]></description><link>https://www.modelanalysis.ai/p/llms-and-human-brains</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/llms-and-human-brains</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Sat, 25 Oct 2025 22:00:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!O7wG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719e3a48-68a2-44a4-9d0d-902cd240d9a1_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>&#171;Disclaimer: these analogies, like all analogies, aren&#8217;t perfect. There&#8217;s a big difference between humans and LLMs - we operate in a very messy physical world; LLMs operate in a digital world with less messiness but also don&#8217;t have our rich sensory experience. A full book, not a short article like this, would be needed to list out all the commonalities, exceptions, and nuances. Finally, to reach the maximum audience I&#8217;ve eschewed using technical jargon and details as much as possible. &#187;</em></p><p>LLMs are created from human knowledge, culture, and values expressed in text. So, it&#8217;s not a surprise when they act and behave like humans. At the same time, they are different from us in many important ways given that they are created using a completely different method compared to evolution (think of birds vs airplane). </p><p>In this article, I&#8217;ll develop this idea more fully, comparing some key LLM concepts to their human analogues. 
For those who aren&#8217;t familiar with the technical details of how LLMs are built and work, this comparison should give a better, more intuitive feel for LLMs. Even those who are intimately familiar with LLMs&#8217; training and inner workings may gain some additional insights. Or perhaps more importantly, they can point out in the comments section where I got things horribly wrong :)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O7wG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719e3a48-68a2-44a4-9d0d-902cd240d9a1_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O7wG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719e3a48-68a2-44a4-9d0d-902cd240d9a1_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!O7wG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719e3a48-68a2-44a4-9d0d-902cd240d9a1_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!O7wG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719e3a48-68a2-44a4-9d0d-902cd240d9a1_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!O7wG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719e3a48-68a2-44a4-9d0d-902cd240d9a1_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O7wG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719e3a48-68a2-44a4-9d0d-902cd240d9a1_1024x1024.png" width="1024" 
height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/719e3a48-68a2-44a4-9d0d-902cd240d9a1_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1333480,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/177089928?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719e3a48-68a2-44a4-9d0d-902cd240d9a1_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O7wG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719e3a48-68a2-44a4-9d0d-902cd240d9a1_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!O7wG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719e3a48-68a2-44a4-9d0d-902cd240d9a1_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!O7wG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719e3a48-68a2-44a4-9d0d-902cd240d9a1_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!O7wG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719e3a48-68a2-44a4-9d0d-902cd240d9a1_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p><strong>Evolution vs. Pre-training</strong></p><p>Evolution spent billions of years &#8216;searching&#8217; for brain designs and encoding the instructions in our DNA, governing not only our brains but our entire bodies.  A newborn baby&#8217;s brain doesn&#8217;t come with pre-built factual knowledge, but it has innate capabilities to learn from and operate in its environment. For example, it will soon be able to grasp things by hand and learn the particular language(s) it&#8217;s exposed to. </p><p><br>In comparison, LLMs can be created in a few weeks&#8217; training on trillions of words of text with the objective of guessing the next token (word) given a sequence of tokens (words).  The resulting neural network comes with vast pre-built knowledge and a limited world model. 
It doesn&#8217;t, however, quite know how to engage in conversations, be helpful, reason through problems, or use tools.</p><p></p></li><li><p><strong>Childhood immersion/imitation learning vs. Supervised Fine-Tuning (SFT)</strong></p><p>Toddlers learn how to communicate through immersion and imitation. LLMs learn how to converse after being trained on thousands of human-generated examples of prompts (&#8220;How are you today?&#8221;) and responses (&#8220;I am fine. Thank you. How are you?&#8221;). Using these examples, SFT teaches LLMs the basic structure of conversations and task-following behavior. </p><p></p></li><li><p><strong>Parental guidance &amp; social conditioning vs. Reinforcement Learning from Human Feedback (RLHF)</strong></p><p>Feedback from parents and the broader social circle teaches children to behave in ways the children find rewarding, thus sculpting their communication and behavior. RLHF fine-tunes an already pre-trained, instruction-following LLM by incorporating human preferences as a reward signal. The model learns to generate responses that humans prefer. For example, responses that are polite, helpful, and honest.</p><p></p></li><li><p><strong>School or Structured Learning vs. Reinforcement Learning with Verifiable Rewards (RLVR)</strong></p><p>IMO, this is a tricky analogy to make, but I&#8217;ll give it a go. Humans learn how to apply logic, solve math problems, and write programs through structured teaching, and are tested and corrected with objective tests. To achieve a similar outcome in the case of LLMs, we take problems with verifiable answers and feed them to the LLM; we reward them when they produce correct answers. Note that we don&#8217;t really tell them how to solve the problem the way we teach humans. We just give them the problem and reward them when they get it right, and somehow LLMs learn how to allocate more effort and &#8216;reason&#8217; to solve the problems. 
</p><p></p></li><li><p><strong>Lifelong learning and long-term memory vs. In-context learning (ICL) &amp; short-term memory</strong></p><p>Humans continue to learn throughout their lives. New experiences, skills, and knowledge gained are compressed and stored in their long-term memory (sleep perhaps plays a role here), enabling later recall and reuse. Once training is complete, LLMs&#8217; knowledge and skills are frozen when they&#8217;re released to production. They have a very limited amount of working memory (called the context), and can learn new skills if we provide them with a description and/or a few examples within that context. This is called in-context learning, and it&#8217;s lost when the short-term memory overflows or when a new context is created.</p><p></p></li><li><p><strong>Rules and Laws vs. System Prompts</strong></p><p>Humans are governed by rules and laws that they are meant to follow. LLMs are given a similar set of guidelines and rules via system prompts (which reside at the beginning of the LLM&#8217;s working memory/context). Of course, just like humans, LLMs may or may not follow those rules :)</p><p></p></li><li><p><strong>Reference materials vs. Placing information in working memory</strong></p><p>Humans aren&#8217;t expected to memorize every fact. They can look facts up in books or online. A prompt to an LLM can also embed any additional information that the LLM can use to better answer the prompt.</p><p></p></li><li><p><strong>Human Tool use vs. LLM Tool use</strong></p><p>Humans have an innate ability to use tools in general, and during their lives they learn how to use specific tools through practice and demonstration. LLMs see code, APIs, and tool-related text during pre-training, which gives them a rough sense of what &#8220;tool use&#8221; is. They are then given specific tool-use examples during supervised fine-tuning and are rewarded for correct tool use during the reinforcement learning (RLHF/RLVR) phases. 
As mentioned above, LLMs can also learn how to use specific tools in-context when provided with a description and/or a few examples. This capability lasts while that context is active and is not carried forward into future sessions.</p></li></ol><p>In conclusion, the comparison isn&#8217;t perfect, but it reveals a little about how the two intelligences are similar and different. We are the result of billions of years of evolution, while LLMs are imperfect (or perhaps simply different) digital reflections of ourselves.</p>]]></content:encoded></item><item><title><![CDATA[Determinism, Truth, Creativity and Hallucinations: Untangling Often-Confused Ideas About LLMs]]></title><description><![CDATA[Why the word &#8220;determinism&#8221; keeps tripping people up (and why it matters)]]></description><link>https://www.modelanalysis.ai/p/determinism-truth-creativity-and</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/determinism-truth-creativity-and</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Sun, 21 Sep 2025 13:43:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lSli!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470faa8f-edde-4849-8dbe-9fe0d0dfd032_1762x802.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Public discussions about LLMs often blur different ideas: determinism, truth, creativity, and hallucinations. A recent wave of discussion starting with the publication of <em>Thinking Machines&#8217; <strong>Defeating Nondeterminism in LLM Inference</strong></em> surfaced this confusion even more. Determinism gets casually equated with truth, and non-determinism with creativity and hallucinations. It&#8217;s therefore understandable that many casual followers of AI took this to mean that Thinking Machines had figured out how to eliminate hallucinations and get to the truth in LLMs. 
After all, stable outputs <em>feel</em> more reliable. But the equation is wrong in two ways. First, a system can be perfectly deterministic and still be confidently, consistently wrong. Second, you don&#8217;t need non-determinism for a model to be creative or hallucinate. </p><p>This confusion isn't just academic; when these wires cross, people misread articles, over- or under-react to AI press releases, and make poor executive decisions. </p><p>In this article, we&#8217;ll define in plain English what each of these concepts means in the context of LLMs. Then we&#8217;ll attempt to correct common misconceptions, each with concrete examples. As a bonus for the technically minded among you, we&#8217;ll walk through how an LLM chooses the next token, distinguishing intentional variability from unintentional variability. 
By dissecting the journey from raw scores to the final token, we&#8217;ll pinpoint exactly where true randomness is introduced and where unintentional system quirks can arise.</p><div><hr></div><h2><strong>Definitions</strong></h2><p><em>&#8220;When I use a word,&#8221; Humpty Dumpty said, in rather a scornful tone, &#8220;it means just what I choose it to mean&#8212;neither more nor less.&#8221; <br></em>&#8212; Humpty Dumpty, in Lewis Carroll's <em>Through the Looking-Glass</em></p><p><strong>Determinism.</strong> This refers to the ability to produce the same output from the same input (assuming the same model, software, drivers, and hardware). In practice, this refers to whether the model produces the same next-token prediction scores (logits) given the same input. It&#8217;s about <strong>reproducibility</strong> of the system&#8212;not about being right or creative.</p><p><strong>Truth.</strong> Whether a claim matches the real world. It's useful to consider this in two categories: objective and subjective truth.</p><ul><li><p><strong>Objective Truth or Factuality:</strong> For verifiable facts like "What is the boiling point of water?" the model's output is "true" if it aligns with established, real-world knowledge. Its ability to produce objective truth is driven by the accuracy of its training and in-context data, and its own capability to use this data correctly.</p></li><li><p><strong>Subjective Truth with Representativeness:</strong> For matters of opinion, interpretation, and values like "Is abstract art meaningful?" the model essentially synthesizes the vast spectrum of human perspectives from its training data to generate a plausible or representative viewpoint. 
This means the default pathways in the model reflect a majority or dominant cultural opinion, but specific directions in the prompt can steer the model to reflect a specific viewpoint.</p></li></ul><p><strong>Creativity.</strong> From a user's perspective, creativity in an LLM is the model's ability to produce text that feels novel, interesting, and non-obvious, while still being coherent and relevant to the prompt. It's the capacity to generate a new poem or a unique marketing slogan. This user experience of creativity is technically derived from <strong>intentional and controlled randomness</strong> in the token selection process (covered later in the article). </p><p><strong>Hallucinations.</strong> A hallucination is said to occur when an LLM generates text that is plausible and confident but factually incorrect. At its core, it is a failure of the model to <strong>distinguish between its internal, statistical representation of the real world (facts) and plausible-sounding fiction</strong>. Because the model's primary goal is to generate a fluent sequence of words, it can sometimes blend the <em>style</em> of a factual statement with the <em>content</em> of a fictional one. </p><p><strong>Active research is taking place to eliminate or reduce hallucinations</strong>: The fact that AI can be confidently wrong is one of the biggest hurdles to its reliable use. And of course there are major ongoing efforts focused on building into models the crucial ability to know when they don't know something. 
Here&#8217;s a recent <a href="https://openai.com/index/why-language-models-hallucinate/">post</a> from OpenAI, and another very recent <a href="https://research.google/blog/making-llms-more-accurate-by-using-all-of-their-layers/">one</a> from Google.</p><div><hr></div><h2><strong>Common Misconceptions</strong></h2><h3>Misconception #1: Deterministic responses are truthful responses.</h3><p><strong>Reality:</strong> Determinism guarantees repeatability, not truthfulness. In the context of LLMs, determinism can guarantee the same next-token probability distribution, but that doesn&#8217;t make the answer any more true. </p><p><strong>Example:</strong> To the question &#8220;What is the capital of Australia?,&#8221; the LLM can answer every time with &#8220;Sydney,&#8221; but that doesn&#8217;t make it right.</p><p><strong>Why it matters:</strong> Deterministic outputs lead to consistent model behavior and help measurement and testing; they don&#8217;t guarantee factuality.</p><h3>Misconception #2: Non-determinism is what leads to Hallucinations</h3><p><strong>Reality:</strong> Hallucinations are the result of the model's inability to distinguish between when it knows something and when it doesn&#8217;t. A model can produce a fabricated "fact" via a perfectly deterministic path. Making the output deterministic doesn't make it true.</p><p><strong>Example:</strong> To the question &#8220;Who won the gold in the 100m sprint at the 2028 Olympics?,&#8221; the LLM can deterministically answer every time with &#8220;Dean Jones,&#8221; but that doesn&#8217;t mean the answer has any basis in reality. Even if the event hasn&#8217;t happened yet, the model may fabricate a plausible name.</p><h3>Misconception #3: Non-determinism is Necessary for Creativity</h3><p><strong>Reality:</strong> Creativity in LLMs comes from <strong>intentional randomness</strong> introduced through decoding strategies like temperature and top-p sampling. 
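To show where this intentional randomness actually enters, here is a minimal, illustrative Python sketch of temperature plus top-p (nucleus) sampling over raw scores (logits). The function name and numbers are my own assumptions, not any provider's implementation; production stacks do this over GPU tensors:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, seed=None):
    """Pick a token index from raw scores (logits); illustrative only."""
    if temperature == 0:
        # Greedy decoding: no randomness, always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature rescales the logits, then softmax turns them into probabilities.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p (nucleus): keep the smallest set of tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # The draw below is the *intentional* randomness; a seed makes it repeatable.
    rng = random.Random(seed)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With temperature=0 the choice collapses to the single highest-scoring token, while raising temperature flattens the distribution and widens the candidate pool; a fixed seed makes the draw repeatable without changing the logits themselves.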
Non-determinism can also arise from <strong>unintentional system variability</strong> (e.g., batching effects), which produces random drift without creative intent.</p><p><strong>Example:</strong> A prompt run twice produces slightly different outputs due to batch variance, but both continuations are bland and repetitive. This is non-deterministic but not creative.</p><p><strong>Why it matters:</strong> Conflating the two leads to frustration; true creativity requires deliberately tuning the right knobs (like temperature), not just accepting random system quirks.</p><h3>Misconception #4: Creativity leads to falsehoods</h3><p><strong>Reality:</strong> Creativity is the ability to generate diverse and novel&#8212;but still plausible&#8212;continuations. A model can be creative while remaining entirely factual, or it can be creative in a fictional context. The output's truthfulness is a separate, unrelated dimension.</p><p><strong>Example:</strong> Asking an LLM to "explain photosynthesis in the style of a pirate" can produce a creative and engaging answer that is still factually correct.</p><p><strong>Why it matters:</strong> Dismissing all creative outputs as false means missing out on the LLM's power to brainstorm, rephrase, and explain concepts in more engaging ways.</p><div><hr></div><h2><strong>A Few Common Technical Misconceptions</strong></h2><h3><strong>Technical Misconception #1: Temperature=0 makes answers factual.</strong></h3><p><strong>Reality:</strong> Setting temperature to zero removes decoding-policy randomness. 
It doesn&#8217;t verify facts.</p><p><strong>Example:</strong> The model repeats an outdated statistic confidently; turning down temperature only makes the mistake consistent.</p><p><strong>Why it matters:</strong> Claims that &#8220;lower temperature improves accuracy&#8221; are over-reading a stability knob.</p><h3><strong>Technical Misconception #2: Setting a seed makes everything deterministic.</strong></h3><p><strong>Reality:</strong> The seed is only used in the Token Sampling step. It ensures that the <em>random choice</em> from the final candidate pool is repeatable. It has no effect whatsoever on randomness introduced by the <em>decoding policy</em>, or on the <em>unintentional variability</em> that can arise earlier in the inference phase, which can change the logits themselves and lead to a different output even with the same seed.</p><p><strong>Example:</strong> Two runs with the same seed diverge because one was processed in a larger batch, nudging the logits.</p><p><strong>Why it matters:</strong> &#8220;Seeded for determinism&#8221; means &#8220;sampling is fixed,&#8221; not &#8220;the entire stack is invariant.&#8221;</p><h3><strong>Technical Misconception #3: Batch-size nondeterminism is just floating-point voodoo.</strong></h3><p><strong>Reality:</strong> Floating-point quirks are part of it, but another root cause is the <strong>order of operations</strong> changing with batch shape or scheduling.</p><p><strong>Example:</strong> The same prompt answered alone vs. grouped with seven others takes a slightly different compute path, causing tiny logit shifts that can flip a softmax decision later.</p><p><strong>Why it matters:</strong> When you see &#8220;same prompt, different output,&#8221; ask whether the inference setup changed. The concept to look for is batch invariance. 
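A toy illustration of the order-of-operations point, using plain Python floats rather than real inference code; regrouping the same four numbers changes the sum:

```python
# Floating-point addition is not associative: with limited precision, the
# grouping of partial sums (which can change with batch shape or kernel
# scheduling) can change the result.
vals = [1e16, 1.0, -1e16, 1.0]

left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]  # 1e16 + 1.0 rounds back to 1e16
regrouped = (vals[0] + vals[2]) + (vals[1] + vals[3])      # cancel the big terms first

print(left_to_right, regrouped)  # 1.0 2.0
```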
Thinking Machines&#8217; <a href="https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/">paper</a> proposes a way to address this.</p><h3><strong>Technical Misconception #4: &#8220;Deterministic&#8221; means identical across all machines and versions.</strong></h3><p><strong>Reality:</strong> Most real guarantees are scoped to the same weights, tokenizer, libraries, drivers, and GPU family. Change any of those, and tiny numeric changes can appear.</p><p><strong>Example:</strong> Bit-for-bit identical on an A100/CUDA stack does not imply bit-for-bit identical on an H100/CUDA stack.</p><p><strong>Why it matters:</strong> When reading papers or provider claims, look for the <strong>determinism envelope</strong> to understand exactly what the promise covers.</p><div><hr></div><h2>Anatomy of Token Generation: A Step-by-Step Guide</h2><p>At its core, an LLM is a sophisticated prediction engine. Its main job is to answer the question: "Given this sequence of words (tokens), what word (token) should come next?" 
The diagram below illustrates this process focusing on the token selection, which we can break down into seven key steps.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lSli!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470faa8f-edde-4849-8dbe-9fe0d0dfd032_1762x802.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lSli!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470faa8f-edde-4849-8dbe-9fe0d0dfd032_1762x802.png 424w, https://substackcdn.com/image/fetch/$s_!lSli!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470faa8f-edde-4849-8dbe-9fe0d0dfd032_1762x802.png 848w, https://substackcdn.com/image/fetch/$s_!lSli!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470faa8f-edde-4849-8dbe-9fe0d0dfd032_1762x802.png 1272w, https://substackcdn.com/image/fetch/$s_!lSli!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470faa8f-edde-4849-8dbe-9fe0d0dfd032_1762x802.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lSli!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470faa8f-edde-4849-8dbe-9fe0d0dfd032_1762x802.png" width="1456" height="663" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/470faa8f-edde-4849-8dbe-9fe0d0dfd032_1762x802.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:663,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:226912,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/173483948?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470faa8f-edde-4849-8dbe-9fe0d0dfd032_1762x802.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lSli!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470faa8f-edde-4849-8dbe-9fe0d0dfd032_1762x802.png 424w, https://substackcdn.com/image/fetch/$s_!lSli!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470faa8f-edde-4849-8dbe-9fe0d0dfd032_1762x802.png 848w, https://substackcdn.com/image/fetch/$s_!lSli!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470faa8f-edde-4849-8dbe-9fe0d0dfd032_1762x802.png 1272w, https://substackcdn.com/image/fetch/$s_!lSli!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470faa8f-edde-4849-8dbe-9fe0d0dfd032_1762x802.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p><strong>The Model Generates Raw Scores (Logits).</strong> The model processes the prompt + prior tokens to produce logits (raw, unnormalized scores) for every token in the model&#8217;s vocabulary. This is where unintentional variability resulting from the numerical drift from batching, kernel order, or hardware tactics can creep in.</p></li><li><p><strong>Scores are Converted to Probabilities (Softmax).</strong> The raw logits are then passed through a <strong>Softmax</strong> function. This is a crucial mathematical step that transforms the long list of scores into a proper probability distribution, where the probabilities of all possible tokens add up to 1. This is also where the most important control for creativity, temperature, is applied. 
As the diagram shows, temperature controls the shape of the probability distribution:</p><ul><li><p><strong>High Temperature:</strong> Flattens the distribution. This boosts the probability of less likely tokens, making the model's output more diverse, creative, and sometimes surprising.</p></li><li><p><strong>Medium Temperature:</strong> Creates a "spiky" distribution that still strongly favors the most probable tokens but allows for some variation. This is often a good balance between creativity and coherence.</p></li><li><p><strong>Temperature = 0:</strong> This is a special case. It creates a distribution with a single peak, giving a 100% probability to the single most likely token and a 0% probability to everything else.</p></li></ul></li><li><p><strong>The Greedy Decoding Path</strong>. When temperature=0, one token gets the maximum probability of 1, and the model selects it directly. This is called greedy decoding. It's the most direct and predictable path, completely bypassing the need for any further decoding or sampling.</p></li><li><p><strong>Decoding policy.</strong> If you want diversity, you deliberately introduce randomness by using a decoding policy to create the final candidate pool of tokens you&#8217;ll sample from. </p></li><li><p><strong>Creating the candidate pool of tokens.</strong> For any temperature above zero, the model has a distribution of options. The two most common methods are:</p><ul><li><p><strong>Top-K:</strong> A simple rule that tells the model to only consider the 'k' tokens with the highest probabilities.</p></li><li><p><strong>Top-P:</strong> A more dynamic rule that tells the model to consider the smallest group of top tokens whose probabilities add up to at least a certain percentage 'p'.</p></li></ul></li><li><p><strong>Token Sampling (seed=&#8230;). </strong>From this final, filtered group, one token is chosen through a weighted random selection. 
A token with a higher probability has a higher chance of being picked, but lower-probability tokens can still be selected in proportion to their probabilities. This is the step where a seed is used. By providing the same seed, you ensure that the <em>pseudo-random</em> choice made at this specific step is the same on every run, making the sampling process repeatable.</p></li><li><p><strong>The loop repeats (autoregression).</strong> The chosen token (from greedy decoding or sampling) is appended to the input, and the entire process, from Step 1 to Step 6, repeats to generate the next token. This cyclical, one-token-at-a-time process is called autoregression.</p></li></ol><p>Two purple callouts in the diagram reinforce the distinction:</p><ul><li><p><strong>Unintentional variability</strong> sits within model inference.</p></li><li><p><strong>Intentional randomness</strong> sits with your decoding policy&#8212;what you dial to get creativity or stability.</p></li></ul><div><hr></div><h3>Conclusion: Mastery Begins with Clarity</h3><p>The language we use to describe LLMs is a minefield of misunderstood concepts, where technical terms are often used as proxies for desired outcomes. As we've seen, determinism is not a synonym for truth, creativity does not have to imply hallucinations, and random token selection doesn&#8217;t negate determinism. Instead, these are distinct dimensions of a model's behavior, each controlled by different mechanisms.</p><p>The path to building and deploying AI responsibly doesn't just depend on bigger models; it depends on the clarity of our thinking. 
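To recap the seven steps, the core of the token-selection loop can be sketched in a few lines of Python. This is an illustrative toy with a tiny vocabulary and a hypothetical decode_step helper; real inference stacks are far more involved:

```python
import math
import random

def decode_step(logits, temperature=1.0, top_k=None, top_p=None, seed=None):
    """One illustrative token-selection step: softmax with temperature,
    optional top-k / top-p filtering, then seeded weighted sampling."""
    if temperature == 0:
        # Greedy decoding: pick the argmax directly, no sampling needed.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Softmax with temperature (subtracting the max for numerical stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Build the candidate pool, most probable tokens first.
    pool = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:
        pool = pool[:top_k]                    # keep only the k most likely
    if top_p is not None:
        cum, cut = 0.0, len(pool)
        for n, i in enumerate(pool, start=1):  # smallest prefix with cumulative prob >= p
            cum += probs[i]
            if cum >= top_p:
                cut = n
                break
        pool = pool[:cut]

    # The seed fixes only this pseudo-random draw, nothing upstream.
    rng = random.Random(seed)
    return rng.choices(pool, weights=[probs[i] for i in pool], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
print(decode_step(logits, temperature=0))                          # greedy: index 0
print(decode_step(logits, temperature=0.8, top_k=3, seed=42)
      == decode_step(logits, temperature=0.8, top_k=3, seed=42))   # True
```

Greedy decoding (temperature=0) always returns the argmax, while the seeded path repeats the same pseudo-random draw on every run.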
</p>]]></content:encoded></item><item><title><![CDATA[The Holy Grail of AI in Trading: Can We Finally Forecast the Markets?]]></title><description><![CDATA[Challenges, past attempts, and a blueprint for building a multimodal fusion model for financial prediction.]]></description><link>https://www.modelanalysis.ai/p/the-holy-grail-of-ai-in-trading-can</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/the-holy-grail-of-ai-in-trading-can</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Sun, 07 Sep 2025 03:03:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Zs07!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3c9865c-b5bf-4327-af98-127ab475e19d_2000x1198.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>Preamble</strong></h2><p>Inspired by Asimov&#8217;s Hari Seldon and psychohistory, I&#8217;ve carried around an idea for a novel for years. The novel&#8217;s protagonist, a brilliant but flawed mathematician, finds the holy grail of applied AI in trading: an app that forecasts financial markets with uncanny precision. 
The app would take in torrents of data in real time and make probabilistic predictions minutes, hours, and days into the future. His challenge, of course, is to keep it a secret while profiting from it, because making it public would stop it from working. </p><p>I never wrote that novel, mostly because life got in the way, but I never stopped thinking about how to create that market forecaster. All this time, a market forecasting app that works with the accuracy of a weather forecasting app seemed more like science fiction than an engineering project. With all the recent advances in deep learning, however, creating such an app feels a lot more feasible now. </p><p>In this article, I&#8217;ll talk about why market forecasting is such a hard problem to solve, look at some of the notable attempts at solving it, and propose an architectural vision for a model that might finally get us there.</p><div><hr></div><h2>The Herculean Challenges of Market Prediction</h2><p>At first glance, market prediction might not look daunting to the uninitiated. 
After all, if we can create planet-scale weather forecasters and LLMs that write poems, produce code, and even win gold medals at the International Math Olympiad, surely building a narrower, special-purpose system ought to be easier? A market forecaster doesn&#8217;t need to understand Tolstoy <em>and</em> TikTok memes (or does it?); it only needs to make sense of prices and events to make decent probabilistic predictions of future prices.</p><p>But forecasting markets isn't like predicting the weather, where the underlying physics is mostly understood. Neither is it like creating a general-purpose large language model. Markets are a different beast entirely, plagued by a few fundamental problems.</p><p>The first major hurdle is the problem of <strong>non-stationarity</strong>. In simple terms, a stationary process is one whose statistical properties (e.g., mean, variance, and correlations) don't change over time. Financial markets are <strong>maddeningly non-stationary</strong>: the rules of the game are constantly shifting. A pattern that worked beautifully for a year can suddenly vanish because of a pandemic, a regulatory change, the outbreak of a war, a technological disruption, or a shift in the geopolitical environment. This is why simple models like ARIMA (Autoregressive Integrated Moving Average), which rely on past patterns repeating, often fail spectacularly. Any successful solution, therefore, needs a sophisticated, <strong>adaptive world model</strong> that truly understands and adapts to the underlying regimes and dynamics.</p><p>The second major hurdle is <strong>data</strong>: </p><ul><li><p>Building a true world model requires an <strong>unimaginable amount of diverse, high-quality, and perfectly time-stamped data</strong>. 
We need everything: real-time news feeds, social media feeds from platforms like X, geopolitical event data, central bank announcements, earnings call transcripts, corporate actions, earnings reports, weather forecasts, satellite imagery of oil tankers, and of course, all the financial time-series data, including stock prices, trading volumes, FX rates, interest rates, and bond yields. </p></li><li><p>Also, this data is <strong>multimodal</strong>, coming in the form of free text, structured data, and maybe even images and video. A successful model must be able to fuse these different streams together to form a holistic picture. </p></li><li><p>Acquiring the data is also extremely challenging, as much of the high-quality <strong>data is fragmented and locked behind expensive paywalls</strong>. </p></li><li><p>This also requires careful engineering to avoid <strong>data leakage</strong>, ensuring that our model isn't accidentally "seeing" information from the future during training, which would make it look deceptively brilliant in backtesting but useless in the real world. </p></li><li><p>Of course, the world of finance has adversarial players, and data of all varieties, especially data from social platforms, <strong>can be poisoned</strong>. Defensive mechanisms need to be in place to identify and remove such data. Perhaps a sophisticated model would learn to ignore such data, just as an expert human would!</p></li></ul><p>Finally, there&#8217;s the problem of <strong>reflexivity</strong>. If a forecasting engine gets widely adopted, the very act of trading on its predictions can change market behavior, weakening its predictive power. 
This might seem academic, but it can disincentivize firms from investing heavily in building such models, as keeping them proprietary can be very difficult (employees who know how to build the model can be poached by the competition).</p><div><hr></div><h2>A Brief History of Market Forecasting Attempts</h2><p>Early attempts relied on statistical methods like <strong>ARIMA and GARCH</strong>. While foundational for time-series analysis, they don&#8217;t take into account all the non-time-series information available, and they are ill-suited for the complex, multivariate, and ever-changing nature of financial markets. The machine learning revolution brought more powerful techniques, but many early models were still limited, often focusing on a narrow set of technical indicators.</p><p>The success of deep learning and transformers attracted researchers&#8217; attention, and one of the first notable efforts was <strong>FinBERT</strong>, a model based on Google's BERT architecture but fine-tuned on a large corpus of financial text to understand the sentiment of financial news. However, FinBERT is quite dated at this point, and it is not a forecasting engine. <strong>BloombergGPT</strong> made headlines in 2023 as a massive language model trained from scratch on Bloomberg's vast, proprietary dataset of financial information spanning decades. Like FinBERT, its primary goal wasn't to be an end-to-end forecasting system; it's a financial knowledge base, not a predictive machine that integrates real-time data to make predictions.</p><p>Since 2024, there&#8217;s been a new wave of research projects moving closer to what I think is the right approach. With efforts like <strong>MSMF (Multi-Scale Multi-Modal Fusion)</strong>, researchers have started attempting to fuse textual data like news with price data. Similarly, agent-based simulators like <strong>FinArena</strong> aim to create realistic environments in which to test trading strategies. 
While these are important steps in the right direction, they are  academic exercises trained on limited public data and haven't yet been deployed at the scale or with the breadth of data needed to build a true market-spanning forecasting engine.</p><div><hr></div><h2>The Proposed Solution: A Multimodal Fusion based Market Forecaster </h2><p>So, here&#8217;s my vision: a <strong>multimodal fusion forecasting architecture</strong> that ingests all kinds and modalities of financial data simultaneously to build a unified understanding of market&#8217;s state, dynamics and trajectory to provide probabilistic forecasts across varying time horizons.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zs07!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3c9865c-b5bf-4327-af98-127ab475e19d_2000x1198.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zs07!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3c9865c-b5bf-4327-af98-127ab475e19d_2000x1198.png 424w, https://substackcdn.com/image/fetch/$s_!Zs07!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3c9865c-b5bf-4327-af98-127ab475e19d_2000x1198.png 848w, https://substackcdn.com/image/fetch/$s_!Zs07!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3c9865c-b5bf-4327-af98-127ab475e19d_2000x1198.png 1272w, https://substackcdn.com/image/fetch/$s_!Zs07!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3c9865c-b5bf-4327-af98-127ab475e19d_2000x1198.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zs07!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3c9865c-b5bf-4327-af98-127ab475e19d_2000x1198.png" width="1456" height="872" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3c9865c-b5bf-4327-af98-127ab475e19d_2000x1198.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:872,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:321327,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/172984503?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3c9865c-b5bf-4327-af98-127ab475e19d_2000x1198.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zs07!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3c9865c-b5bf-4327-af98-127ab475e19d_2000x1198.png 424w, https://substackcdn.com/image/fetch/$s_!Zs07!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3c9865c-b5bf-4327-af98-127ab475e19d_2000x1198.png 848w, https://substackcdn.com/image/fetch/$s_!Zs07!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3c9865c-b5bf-4327-af98-127ab475e19d_2000x1198.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Zs07!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3c9865c-b5bf-4327-af98-127ab475e19d_2000x1198.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nLXu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ea691d-09cb-4f86-82c0-74c61052c495_1994x1076.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nLXu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ea691d-09cb-4f86-82c0-74c61052c495_1994x1076.png 424w, https://substackcdn.com/image/fetch/$s_!nLXu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ea691d-09cb-4f86-82c0-74c61052c495_1994x1076.png 848w, https://substackcdn.com/image/fetch/$s_!nLXu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ea691d-09cb-4f86-82c0-74c61052c495_1994x1076.png 1272w, https://substackcdn.com/image/fetch/$s_!nLXu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ea691d-09cb-4f86-82c0-74c61052c495_1994x1076.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nLXu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ea691d-09cb-4f86-82c0-74c61052c495_1994x1076.png" width="1456" height="786" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2ea691d-09cb-4f86-82c0-74c61052c495_1994x1076.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:786,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224669,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/172984503?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ea691d-09cb-4f86-82c0-74c61052c495_1994x1076.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nLXu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ea691d-09cb-4f86-82c0-74c61052c495_1994x1076.png 424w, https://substackcdn.com/image/fetch/$s_!nLXu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ea691d-09cb-4f86-82c0-74c61052c495_1994x1076.png 848w, https://substackcdn.com/image/fetch/$s_!nLXu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ea691d-09cb-4f86-82c0-74c61052c495_1994x1076.png 1272w, https://substackcdn.com/image/fetch/$s_!nLXu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ea691d-09cb-4f86-82c0-74c61052c495_1994x1076.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jAuT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a524e2-6572-4d51-b627-8159bf18331b_2016x1312.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jAuT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a524e2-6572-4d51-b627-8159bf18331b_2016x1312.png 424w, 
https://substackcdn.com/image/fetch/$s_!jAuT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a524e2-6572-4d51-b627-8159bf18331b_2016x1312.png 848w, https://substackcdn.com/image/fetch/$s_!jAuT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a524e2-6572-4d51-b627-8159bf18331b_2016x1312.png 1272w, https://substackcdn.com/image/fetch/$s_!jAuT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a524e2-6572-4d51-b627-8159bf18331b_2016x1312.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jAuT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a524e2-6572-4d51-b627-8159bf18331b_2016x1312.png" width="1456" height="948" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8a524e2-6572-4d51-b627-8159bf18331b_2016x1312.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:948,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:292335,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/172984503?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a524e2-6572-4d51-b627-8159bf18331b_2016x1312.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!jAuT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a524e2-6572-4d51-b627-8159bf18331b_2016x1312.png 424w, https://substackcdn.com/image/fetch/$s_!jAuT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a524e2-6572-4d51-b627-8159bf18331b_2016x1312.png 848w, https://substackcdn.com/image/fetch/$s_!jAuT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a524e2-6572-4d51-b627-8159bf18331b_2016x1312.png 1272w, https://substackcdn.com/image/fetch/$s_!jAuT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8a524e2-6572-4d51-b627-8159bf18331b_2016x1312.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c-Wf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2153583-d1c2-4441-8088-490de2921353_2024x1548.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c-Wf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2153583-d1c2-4441-8088-490de2921353_2024x1548.png 424w, https://substackcdn.com/image/fetch/$s_!c-Wf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2153583-d1c2-4441-8088-490de2921353_2024x1548.png 848w, https://substackcdn.com/image/fetch/$s_!c-Wf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2153583-d1c2-4441-8088-490de2921353_2024x1548.png 1272w, https://substackcdn.com/image/fetch/$s_!c-Wf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2153583-d1c2-4441-8088-490de2921353_2024x1548.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c-Wf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2153583-d1c2-4441-8088-490de2921353_2024x1548.png" width="1456" height="1114" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2153583-d1c2-4441-8088-490de2921353_2024x1548.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1114,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:226725,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/172984503?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2153583-d1c2-4441-8088-490de2921353_2024x1548.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c-Wf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2153583-d1c2-4441-8088-490de2921353_2024x1548.png 424w, https://substackcdn.com/image/fetch/$s_!c-Wf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2153583-d1c2-4441-8088-490de2921353_2024x1548.png 848w, https://substackcdn.com/image/fetch/$s_!c-Wf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2153583-d1c2-4441-8088-490de2921353_2024x1548.png 1272w, https://substackcdn.com/image/fetch/$s_!c-Wf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2153583-d1c2-4441-8088-490de2921353_2024x1548.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!glTe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b79e3f5-f4db-4b96-9aca-ea5211e38493_1966x1040.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!glTe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b79e3f5-f4db-4b96-9aca-ea5211e38493_1966x1040.png 424w, 
https://substackcdn.com/image/fetch/$s_!glTe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b79e3f5-f4db-4b96-9aca-ea5211e38493_1966x1040.png 848w, https://substackcdn.com/image/fetch/$s_!glTe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b79e3f5-f4db-4b96-9aca-ea5211e38493_1966x1040.png 1272w, https://substackcdn.com/image/fetch/$s_!glTe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b79e3f5-f4db-4b96-9aca-ea5211e38493_1966x1040.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!glTe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b79e3f5-f4db-4b96-9aca-ea5211e38493_1966x1040.png" width="1456" height="770" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b79e3f5-f4db-4b96-9aca-ea5211e38493_1966x1040.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:770,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:292931,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/172984503?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b79e3f5-f4db-4b96-9aca-ea5211e38493_1966x1040.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!glTe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b79e3f5-f4db-4b96-9aca-ea5211e38493_1966x1040.png 424w, https://substackcdn.com/image/fetch/$s_!glTe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b79e3f5-f4db-4b96-9aca-ea5211e38493_1966x1040.png 848w, https://substackcdn.com/image/fetch/$s_!glTe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b79e3f5-f4db-4b96-9aca-ea5211e38493_1966x1040.png 1272w, https://substackcdn.com/image/fetch/$s_!glTe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b79e3f5-f4db-4b96-9aca-ea5211e38493_1966x1040.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Conclusion</h2><p>I'm confident that a model built on this architecture would outperform a human analyst or a specialized, single-purpose algorithm. It would make decisions based on a more complete and holistically processed set of information than anything that has come before. I also admit that it would be extremely difficult and expensive to build, with massive data, compute, and engineering demands, and that it could face regulatory obstacles. Then there&#8217;s the challenge of reflexivity: models that forecast may, if broadly deployed, invalidate themselves.</p><p>With so much of the AI world currently focused on the grand challenge of Artificial General/Super Intelligence (AGI/ASI), it's unlikely that major labs like Google or OpenAI would pivot to such a specialized, high-risk financial project. This leaves the field open to a few specific players. A data behemoth like Bloomberg, which sits on a large proprietary dataset, is one obvious candidate. Another is a well-funded, AI-native quantitative hedge fund like High Flyer (which owns DeepSeek). They might have the resources, the incentive, and the right kind of DNA.</p><p>Smaller, more constrained versions could probably be built today to prove the concept. One could start by focusing on a single asset class, like FX or a specific sector of the stock market. 
But ultimately, I believe a comprehensive, all-encompassing fusion model is the only path to obtaining the holy grail of market forecasting.</p><p>Psychohistory may still be science fiction, but with multimodal fusion, a market forecaster is closer to becoming real.</p>]]></content:encoded></item><item><title><![CDATA[When LLMs Try to Draw: A Weird Zoo of AI Line Art]]></title><description><![CDATA[Or: Vibe-coding my way through Cursor, I got GPT, Gemini, and Claude to sketch animals with JSON polylines &#8212; no diffusion, no code, just vibes.]]></description><link>https://www.modelanalysis.ai/p/when-llms-try-to-draw-a-weird-zoo</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/when-llms-try-to-draw-a-weird-zoo</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Mon, 25 Aug 2025 03:05:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!AzPu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cc92681-7900-4806-ac6e-4e0a38e3811b_2262x1358.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few days ago, I stumbled across an intriguing <a 
href="https://www.lesswrong.com/posts/xwdRzJxyqFqgXTWbH/how-does-a-blind-model-see-the-earth">LessWrong post</a> where someone used different LLMs to describe Earth's geography in the simplest way possible: just by asking each model to classify particular latitude and longitude points as land or water. And the <a href="https://www.lesswrong.com/posts/xwdRzJxyqFqgXTWbH/how-does-a-blind-model-see-the-earth">results</a> were fascinating. </p><p>This got me thinking: if language models can conceptualize geography, what happens when you ask them to create visual art? Not generate images through diffusion models, but actually <em>construct</em> line drawings using pure coordinate geometry. How well can an LLM do this when it was never explicitly trained on the task? </p><p>Thus began my weekend project: getting nine different language models to draw animals using nothing but JSON polylines, then rendering those drawings into actual images. The twist? I wouldn't write a single line of code myself.</p><h2>The Method: "Vibe Coding" with Cursor</h2><p>Instead of painstakingly coding a pipeline, I &#8220;vibe coded&#8221; it in <a href="https://cursor.sh/">Cursor</a>. 
My process was beautifully simple:</p><ol><li><p>Open Cursor, describe my vision, and let Cursor write the code. I started with nine models (the list can be customized); each model received identical prompts requesting JSON output with normalized coordinates (0-1 space) and named polyline strokes. The five animals I picked were: African elephant, giraffe, domestic dog, domestic cat, and bald eagle.</p></li><li><p>More often than not, Cursor was able to detect when things went wrong and fix them itself, which was amazingly helpful! It handled: API integration quirks (GPT-5 models don't support temperature=0!?), error handling for malformed responses, batch processing logic, and even helpful emoji-filled progress indicators. The code quality is genuinely impressive, better than what many humans would write. Occasionally I had to describe a problem explicitly to Cursor to get it fixed, or instruct it to make particular tweaks.</p></li><li><p>Repeat until everything worked.</p></li></ol><p>The result? An evaluation pipeline consisting of:</p><ul><li><p>A JSON generation orchestrator that prompts the different models and collects their JSON-based &#8220;line drawings&#8221;</p></li><li><p>A matplotlib-based renderer that converts JSON polylines into PNG images</p></li><li><p>A config loader for API key management and other configuration</p></li></ul><p>Here&#8217;s the lineup of nine contestants: gpt-4o, gpt-5-nano, gpt-5-mini, gpt-5, gemini-1.5-pro, gemini-2.5-flash, gemini-2.5-pro, claude-sonnet-3.7, claude-sonnet-4</p><div><hr></div><h2><strong>The Results: A Weird Zoo of Wobbly Outlines</strong></h2><ol><li><p><strong>Within the same family of models, bigger variants usually did better than the smaller ones.</strong> Smaller models often devolved into geometric blobs. As you can see below, GPT-4o did the absolute worst. Gemini-2.5-flash produced what I think are the best drawings. <strong>Claude Sonnet 4</strong>&#8217;s dog looked like an alien donkey with antennae. 
<strong>GPT-5-nano</strong> went for something rodent-like. Definitely a weird kind of zoo. <strong>Gemini-2.5-pro</strong> even added hexagonal &#8220;spots&#8221;, an imaginative touch! </p></li><li><p>Broadly speaking, <strong>Google&#8217;s Gemini generated better drawings than OpenAI&#8217;s GPT models</strong>. Anthropic&#8217;s Claude series had issues generating valid JSON. A different prompt with more detailed instructions helped Claude produce more workable drawings, but I haven&#8217;t included them, to allow for apples-to-apples comparisons.</p></li><li><p><strong>Latency and API woes.</strong> All GPT-5 variants were quite slow: mini and nano averaged just over 67 seconds per prompt, while GPT-5 averaged over 137 seconds per prompt! API access felt throttled compared to simply pasting the same prompt into the chat interface, which responded faster. <strong>GPT-4o was the fastest at 5 seconds per prompt, yet performed the worst!</strong> Gemini 2.5 models took under 45 seconds on average, with 1.5 Pro taking an average of 11 seconds.</p></li></ol><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AzPu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cc92681-7900-4806-ac6e-4e0a38e3811b_2262x1358.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AzPu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cc92681-7900-4806-ac6e-4e0a38e3811b_2262x1358.png 424w, https://substackcdn.com/image/fetch/$s_!AzPu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cc92681-7900-4806-ac6e-4e0a38e3811b_2262x1358.png 848w, 
https://substackcdn.com/image/fetch/$s_!AzPu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cc92681-7900-4806-ac6e-4e0a38e3811b_2262x1358.png 1272w, https://substackcdn.com/image/fetch/$s_!AzPu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cc92681-7900-4806-ac6e-4e0a38e3811b_2262x1358.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AzPu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cc92681-7900-4806-ac6e-4e0a38e3811b_2262x1358.png" width="1456" height="874" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1cc92681-7900-4806-ac6e-4e0a38e3811b_2262x1358.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:874,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:442736,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/171746222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cc92681-7900-4806-ac6e-4e0a38e3811b_2262x1358.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AzPu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cc92681-7900-4806-ac6e-4e0a38e3811b_2262x1358.png 424w, 
https://substackcdn.com/image/fetch/$s_!AzPu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cc92681-7900-4806-ac6e-4e0a38e3811b_2262x1358.png 848w, https://substackcdn.com/image/fetch/$s_!AzPu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cc92681-7900-4806-ac6e-4e0a38e3811b_2262x1358.png 1272w, https://substackcdn.com/image/fetch/$s_!AzPu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cc92681-7900-4806-ac6e-4e0a38e3811b_2262x1358.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3></h3><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xjoP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799f3a92-6b4b-451a-b1ff-3a7321978b79_2378x1330.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xjoP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799f3a92-6b4b-451a-b1ff-3a7321978b79_2378x1330.png 424w, https://substackcdn.com/image/fetch/$s_!xjoP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799f3a92-6b4b-451a-b1ff-3a7321978b79_2378x1330.png 848w, https://substackcdn.com/image/fetch/$s_!xjoP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799f3a92-6b4b-451a-b1ff-3a7321978b79_2378x1330.png 1272w, https://substackcdn.com/image/fetch/$s_!xjoP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799f3a92-6b4b-451a-b1ff-3a7321978b79_2378x1330.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xjoP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799f3a92-6b4b-451a-b1ff-3a7321978b79_2378x1330.png" width="1456" height="814" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/799f3a92-6b4b-451a-b1ff-3a7321978b79_2378x1330.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:442301,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/171746222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799f3a92-6b4b-451a-b1ff-3a7321978b79_2378x1330.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!xjoP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799f3a92-6b4b-451a-b1ff-3a7321978b79_2378x1330.png 424w, https://substackcdn.com/image/fetch/$s_!xjoP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799f3a92-6b4b-451a-b1ff-3a7321978b79_2378x1330.png 848w, https://substackcdn.com/image/fetch/$s_!xjoP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799f3a92-6b4b-451a-b1ff-3a7321978b79_2378x1330.png 1272w, https://substackcdn.com/image/fetch/$s_!xjoP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799f3a92-6b4b-451a-b1ff-3a7321978b79_2378x1330.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!09Hx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab72be5b-4614-4152-8236-c70e6cb6d4de_2178x1352.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!09Hx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab72be5b-4614-4152-8236-c70e6cb6d4de_2178x1352.png 424w, 
https://substackcdn.com/image/fetch/$s_!09Hx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab72be5b-4614-4152-8236-c70e6cb6d4de_2178x1352.png 848w, https://substackcdn.com/image/fetch/$s_!09Hx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab72be5b-4614-4152-8236-c70e6cb6d4de_2178x1352.png 1272w, https://substackcdn.com/image/fetch/$s_!09Hx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab72be5b-4614-4152-8236-c70e6cb6d4de_2178x1352.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!09Hx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab72be5b-4614-4152-8236-c70e6cb6d4de_2178x1352.png" width="1456" height="904" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab72be5b-4614-4152-8236-c70e6cb6d4de_2178x1352.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:904,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:328963,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/171746222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab72be5b-4614-4152-8236-c70e6cb6d4de_2178x1352.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!09Hx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab72be5b-4614-4152-8236-c70e6cb6d4de_2178x1352.png 424w, https://substackcdn.com/image/fetch/$s_!09Hx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab72be5b-4614-4152-8236-c70e6cb6d4de_2178x1352.png 848w, https://substackcdn.com/image/fetch/$s_!09Hx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab72be5b-4614-4152-8236-c70e6cb6d4de_2178x1352.png 1272w, https://substackcdn.com/image/fetch/$s_!09Hx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab72be5b-4614-4152-8236-c70e6cb6d4de_2178x1352.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Iedn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ebc81c2-5486-43d2-8ec8-f52347068816_2390x1306.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Iedn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ebc81c2-5486-43d2-8ec8-f52347068816_2390x1306.png 424w, https://substackcdn.com/image/fetch/$s_!Iedn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ebc81c2-5486-43d2-8ec8-f52347068816_2390x1306.png 848w, https://substackcdn.com/image/fetch/$s_!Iedn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ebc81c2-5486-43d2-8ec8-f52347068816_2390x1306.png 1272w, https://substackcdn.com/image/fetch/$s_!Iedn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ebc81c2-5486-43d2-8ec8-f52347068816_2390x1306.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Iedn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ebc81c2-5486-43d2-8ec8-f52347068816_2390x1306.png" width="1456" height="796" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ebc81c2-5486-43d2-8ec8-f52347068816_2390x1306.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:442471,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/171746222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ebc81c2-5486-43d2-8ec8-f52347068816_2390x1306.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Iedn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ebc81c2-5486-43d2-8ec8-f52347068816_2390x1306.png 424w, https://substackcdn.com/image/fetch/$s_!Iedn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ebc81c2-5486-43d2-8ec8-f52347068816_2390x1306.png 848w, https://substackcdn.com/image/fetch/$s_!Iedn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ebc81c2-5486-43d2-8ec8-f52347068816_2390x1306.png 1272w, https://substackcdn.com/image/fetch/$s_!Iedn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ebc81c2-5486-43d2-8ec8-f52347068816_2390x1306.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SXbG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608f9695-2c3e-4101-a27f-9fb33c33e894_2078x1332.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SXbG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608f9695-2c3e-4101-a27f-9fb33c33e894_2078x1332.png 424w, 
https://substackcdn.com/image/fetch/$s_!SXbG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608f9695-2c3e-4101-a27f-9fb33c33e894_2078x1332.png 848w, https://substackcdn.com/image/fetch/$s_!SXbG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608f9695-2c3e-4101-a27f-9fb33c33e894_2078x1332.png 1272w, https://substackcdn.com/image/fetch/$s_!SXbG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608f9695-2c3e-4101-a27f-9fb33c33e894_2078x1332.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SXbG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608f9695-2c3e-4101-a27f-9fb33c33e894_2078x1332.png" width="1456" height="933" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/608f9695-2c3e-4101-a27f-9fb33c33e894_2078x1332.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:933,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:374883,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/171746222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608f9695-2c3e-4101-a27f-9fb33c33e894_2078x1332.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!SXbG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608f9695-2c3e-4101-a27f-9fb33c33e894_2078x1332.png 424w, https://substackcdn.com/image/fetch/$s_!SXbG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608f9695-2c3e-4101-a27f-9fb33c33e894_2078x1332.png 848w, https://substackcdn.com/image/fetch/$s_!SXbG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608f9695-2c3e-4101-a27f-9fb33c33e894_2078x1332.png 1272w, https://substackcdn.com/image/fetch/$s_!SXbG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608f9695-2c3e-4101-a27f-9fb33c33e894_2078x1332.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2><strong>Closing Thoughts</strong></h2><p>I started this experiment as a fun curiosity, vibe-coding my way through Cursor. <strong>Language models do seem to have a kind of mental map of animals</strong>. That map, however, is <strong>fuzzy, distorted, and full of amusing failures</strong>: diagonal rectangles for cats, alien-donkeys for dogs, and cartoonish polygons for bald eagles.</p><p>I did try providing more detailed instructions (e.g., &#8220;make sure the elephant has tusks and four legs&#8221;), but that didn&#8217;t help much. Perhaps better prompts exist that would help here.</p><p>What do you think? Which of these sketches is your favorite? Any ideas worth pursuing in this general area?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.modelanalysis.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading ModelAnalysis.ai!
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Can LLMs Be Used for Financial Forecasting?]]></title><description><![CDATA[How We Used Fine-Tuning with Synthetic Reasoning Traces and RLVR/GRPO to Train a Financial Reasoning Model]]></description><link>https://www.modelanalysis.ai/p/can-llms-predict-a-stocks-performance</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/can-llms-predict-a-stocks-performance</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Fri, 20 Jun 2025 01:21:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bZRc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b81f593-5af2-433b-a076-83f3dbfaa756_1439x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>&#171;note: This article provides a high-level overview of our project. For more details, <a href="https://drive.google.com/file/d/1G6xT4LZg8qfVDE9kmeSAS5-RrHqVh6ie/view?usp=sharing">here's the full project report</a>.&#187;</p><h3>1. Introduction</h3><p>As part of the recently concluded <a href="https://cs224r.stanford.edu/">CS224R</a> (Deep Reinforcement Learning) course, I partnered with Jonathan Larkin and Tamika Bassman on a project where we dove headfirst into one of the most common yet difficult problems in finance: predicting stock performance.
</p><p>More specifically, we explored whether state-of-the-art deep reinforcement learning could help a large language model (LLM) reason more effectively about future stock performance. Central to our approach is the use of reinforcement learning with verifiable rewards (RLVR) following supervised fine-tuning (SFT). We ran a baseline on Qwen3-1.7B, then successively improved on it by (a) fine-tuning the model on synthetic reasoning traces and (b) RLVR training with the Group Relative Policy Optimization (GRPO) algorithm, using binary correctness and format-correctness rewards. Our goal was to measure the performance of this approach against the base model, a traditional linear model, and the predictions of human analysts.</p><h3>2. 
The Architecture: A Pipeline for Financial Reasoning</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bZRc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b81f593-5af2-433b-a076-83f3dbfaa756_1439x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bZRc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b81f593-5af2-433b-a076-83f3dbfaa756_1439x630.png 424w, https://substackcdn.com/image/fetch/$s_!bZRc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b81f593-5af2-433b-a076-83f3dbfaa756_1439x630.png 848w, https://substackcdn.com/image/fetch/$s_!bZRc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b81f593-5af2-433b-a076-83f3dbfaa756_1439x630.png 1272w, https://substackcdn.com/image/fetch/$s_!bZRc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b81f593-5af2-433b-a076-83f3dbfaa756_1439x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bZRc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b81f593-5af2-433b-a076-83f3dbfaa756_1439x630.png" width="1439" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b81f593-5af2-433b-a076-83f3dbfaa756_1439x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1439,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149967,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/166332320?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b81f593-5af2-433b-a076-83f3dbfaa756_1439x630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bZRc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b81f593-5af2-433b-a076-83f3dbfaa756_1439x630.png 424w, https://substackcdn.com/image/fetch/$s_!bZRc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b81f593-5af2-433b-a076-83f3dbfaa756_1439x630.png 848w, https://substackcdn.com/image/fetch/$s_!bZRc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b81f593-5af2-433b-a076-83f3dbfaa756_1439x630.png 1272w, https://substackcdn.com/image/fetch/$s_!bZRc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b81f593-5af2-433b-a076-83f3dbfaa756_1439x630.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We designed a multi-stage pipeline aimed at systematically building financial reasoning ability into a base LLM.</p><p><strong>A. 
Data Preparation</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zi9u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86880f7a-5be6-48be-a27b-5c66e296e23d_1460x947.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zi9u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86880f7a-5be6-48be-a27b-5c66e296e23d_1460x947.png 424w, https://substackcdn.com/image/fetch/$s_!zi9u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86880f7a-5be6-48be-a27b-5c66e296e23d_1460x947.png 848w, https://substackcdn.com/image/fetch/$s_!zi9u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86880f7a-5be6-48be-a27b-5c66e296e23d_1460x947.png 1272w, https://substackcdn.com/image/fetch/$s_!zi9u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86880f7a-5be6-48be-a27b-5c66e296e23d_1460x947.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zi9u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86880f7a-5be6-48be-a27b-5c66e296e23d_1460x947.png" width="1456" height="944" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86880f7a-5be6-48be-a27b-5c66e296e23d_1460x947.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:944,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:275080,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/166332320?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86880f7a-5be6-48be-a27b-5c66e296e23d_1460x947.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zi9u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86880f7a-5be6-48be-a27b-5c66e296e23d_1460x947.png 424w, https://substackcdn.com/image/fetch/$s_!zi9u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86880f7a-5be6-48be-a27b-5c66e296e23d_1460x947.png 848w, https://substackcdn.com/image/fetch/$s_!zi9u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86880f7a-5be6-48be-a27b-5c66e296e23d_1460x947.png 1272w, https://substackcdn.com/image/fetch/$s_!zi9u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86880f7a-5be6-48be-a27b-5c66e296e23d_1460x947.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We started with a dataset of over 120,000 earnings call transcripts from 3,000 U.S. stocks, spanning 20 years. To create our target labels ("STRONG BUY" to "STRONG SELL"), we calculated each stock's 1-month future return and then percentile-ranked it against its sector peers. This crucial step helped isolate company-specific performance by filtering out broader market or sector-wide trends.</p><p><strong>B. 
Establishing the Baselines</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CEqj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc88d3625-9e7e-4981-9bb9-b9efcbfc472b_845x272.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CEqj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc88d3625-9e7e-4981-9bb9-b9efcbfc472b_845x272.png 424w, https://substackcdn.com/image/fetch/$s_!CEqj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc88d3625-9e7e-4981-9bb9-b9efcbfc472b_845x272.png 848w, https://substackcdn.com/image/fetch/$s_!CEqj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc88d3625-9e7e-4981-9bb9-b9efcbfc472b_845x272.png 1272w, https://substackcdn.com/image/fetch/$s_!CEqj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc88d3625-9e7e-4981-9bb9-b9efcbfc472b_845x272.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CEqj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc88d3625-9e7e-4981-9bb9-b9efcbfc472b_845x272.png" width="518" height="166.74082840236687" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c88d3625-9e7e-4981-9bb9-b9efcbfc472b_845x272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:272,&quot;width&quot;:845,&quot;resizeWidth&quot;:518,&quot;bytes&quot;:49535,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/166332320?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc88d3625-9e7e-4981-9bb9-b9efcbfc472b_845x272.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CEqj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc88d3625-9e7e-4981-9bb9-b9efcbfc472b_845x272.png 424w, https://substackcdn.com/image/fetch/$s_!CEqj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc88d3625-9e7e-4981-9bb9-b9efcbfc472b_845x272.png 848w, https://substackcdn.com/image/fetch/$s_!CEqj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc88d3625-9e7e-4981-9bb9-b9efcbfc472b_845x272.png 1272w, https://substackcdn.com/image/fetch/$s_!CEqj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc88d3625-9e7e-4981-9bb9-b9efcbfc472b_845x272.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We created three key baselines (higher F1 scores are better):</p><ul><li><p>The performance of a linear TF-IDF model to see how much predictive power was in the text itself. 
</p></li><li><p>The performance of human analyst consensus ratings.</p></li><li><p>The performance of the base LLM: the pre-trained Qwen3-1.7B.</p></li></ul><p><strong>C. The Training Method</strong></p><p>Our approach wasn't just to throw the data at the model. We used a sequence of fine-tuning and reinforcement learning stages.</p><ul><li><p><strong>Supervised Fine-Tuning (SFT):</strong> To train our model to think like a financial analyst, we used a &#8220;frontier&#8221; model, Gemini 2.5 Pro, to create synthetic &#8220;reasoning traces&#8221;: step-by-step explanations that show <em>why</em> a prediction makes sense, not just <em>what</em> it is. We generated these traces using two methods: rejection sampling (keeping only the traces that led to the correct answer) and a technique inspired by &#8220;Self-Taught Reasoner&#8221; (STaR), where we gave the model the correct answer and asked it to reason backward. We then fine-tuned our base Qwen3 model on these reasoning traces along with the answers.</p></li><li><p><strong>Reinforcement Learning with Verifiable Rewards (RLVR):</strong> This was the final and most critical stage. We used Group Relative Policy Optimization (GRPO), an algorithm designed to improve reasoning. We rewarded the model with a simple binary score (1.0 for a correct label, 0 for an incorrect one) and a smaller reward for getting the output format right.</p></li></ul><p>For computational efficiency, we used Low-Rank Adaptation (LoRA) for all training stages and vLLM to speed up inference.</p><div><hr></div><h3>3. The Findings</h3><p>So, how did we do? 
The results were both encouraging and humbling.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0cPU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffc1a42-2b53-4c8d-b0fb-1f917a9d6cd9_814x505.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0cPU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffc1a42-2b53-4c8d-b0fb-1f917a9d6cd9_814x505.png 424w, https://substackcdn.com/image/fetch/$s_!0cPU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffc1a42-2b53-4c8d-b0fb-1f917a9d6cd9_814x505.png 848w, https://substackcdn.com/image/fetch/$s_!0cPU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffc1a42-2b53-4c8d-b0fb-1f917a9d6cd9_814x505.png 1272w, https://substackcdn.com/image/fetch/$s_!0cPU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffc1a42-2b53-4c8d-b0fb-1f917a9d6cd9_814x505.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0cPU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffc1a42-2b53-4c8d-b0fb-1f917a9d6cd9_814x505.png" width="434" height="269.25061425061426" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ffc1a42-2b53-4c8d-b0fb-1f917a9d6cd9_814x505.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:505,&quot;width&quot;:814,&quot;resizeWidth&quot;:434,&quot;bytes&quot;:116925,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/166332320?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffc1a42-2b53-4c8d-b0fb-1f917a9d6cd9_814x505.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0cPU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffc1a42-2b53-4c8d-b0fb-1f917a9d6cd9_814x505.png 424w, https://substackcdn.com/image/fetch/$s_!0cPU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffc1a42-2b53-4c8d-b0fb-1f917a9d6cd9_814x505.png 848w, https://substackcdn.com/image/fetch/$s_!0cPU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffc1a42-2b53-4c8d-b0fb-1f917a9d6cd9_814x505.png 1272w, https://substackcdn.com/image/fetch/$s_!0cPU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ffc1a42-2b53-4c8d-b0fb-1f917a9d6cd9_814x505.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The most sobering finding was that our <strong>simple linear TF-IDF baseline, trained on the full 90,000-sample training set, outperformed every other model</strong> with a macro F1 score of 0.2597. This result provided a valuable insight: even without sophisticated modeling, the information in earnings call transcripts holds real predictive power.</p><p>However, the story of our LLM pipeline is one of consistent, measurable improvement:</p><ul><li><p>The <strong>base Qwen3 1.7B model</strong> performed poorly on its own (F1 score: 0.0897) and exhibited a strong "optimism bias," as shown in the confusion matrix (a) of Figure 4 below.
</p></li><li><p><strong>Human analyst consensus</strong> ratings, when mapped to our labels, scored an F1 of 0.1743&#8212;better than the raw LLM, but still below the TF-IDF model.</p></li><li><p>Our <strong>LLM pipeline, even when trained on a tiny fraction of the data, surpassed the human benchmark</strong>. The final model, which went through both SFT on traces and RLVR optimization, achieved an F1 score of 0.2084.</p><p></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Brkb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f488a63-cce8-4c2a-aa5a-61f298c6cb20_1492x926.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Brkb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f488a63-cce8-4c2a-aa5a-61f298c6cb20_1492x926.png 424w, https://substackcdn.com/image/fetch/$s_!Brkb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f488a63-cce8-4c2a-aa5a-61f298c6cb20_1492x926.png 848w, https://substackcdn.com/image/fetch/$s_!Brkb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f488a63-cce8-4c2a-aa5a-61f298c6cb20_1492x926.png 1272w, https://substackcdn.com/image/fetch/$s_!Brkb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f488a63-cce8-4c2a-aa5a-61f298c6cb20_1492x926.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Brkb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f488a63-cce8-4c2a-aa5a-61f298c6cb20_1492x926.png" width="1456" height="904" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f488a63-cce8-4c2a-aa5a-61f298c6cb20_1492x926.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:904,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238989,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/166332320?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f488a63-cce8-4c2a-aa5a-61f298c6cb20_1492x926.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Brkb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f488a63-cce8-4c2a-aa5a-61f298c6cb20_1492x926.png 424w, https://substackcdn.com/image/fetch/$s_!Brkb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f488a63-cce8-4c2a-aa5a-61f298c6cb20_1492x926.png 848w, https://substackcdn.com/image/fetch/$s_!Brkb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f488a63-cce8-4c2a-aa5a-61f298c6cb20_1492x926.png 1272w, https://substackcdn.com/image/fetch/$s_!Brkb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f488a63-cce8-4c2a-aa5a-61f298c6cb20_1492x926.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While we didn't beat the TF-IDF baseline, we successfully demonstrated that SFT on synthetic data and then refinement with RLVR improved the model's performance over the base LLM. We were held back by compute constraints; our entire pipeline took about 20 hours to run on a single NVIDIA H100, and we were only able to fine-tune on 414 synthetic examples and train the RLVR stage on 1,000 samples. Given that the full training set has 90,000 samples, there's a clear opportunity to scale.</p><h3>4. 
What More Could We Have Done?</h3><p>Every research project leaves unexplored avenues, and ours is no exception.</p><ul><li><p><strong>Integrating Tools and External Data Sources:</strong> A human analyst doesn't just read transcripts; they use calculators and consult financial data from other sources. A major next step would be to give the model "tools" that let it perform quantitative calculations and access external data, making its analysis much more robust.</p></li><li><p><strong>Richer Reward Mechanisms:</strong> We used a simple binary reward for the final answer. Future work could employ an "LLM-as-a-judge" to provide more nuanced, process-based rewards that evaluate the quality of the reasoning itself, not just the outcome.</p></li><li><p><strong>More Rigorous Ablation Studies:</strong> We acknowledge the need for more rigorous ablation studies to precisely measure the impact of each component of our pipeline, including, perhaps, tweaking the prompt to nudge the model to be more skeptical when judging the transcripts. We also need to look for and eliminate data leakage, as our 20-year dataset may contain information the base LLM saw in its original pre-training.</p></li><li><p><strong>Scaling:</strong> Another obvious path forward is to apply this pipeline to larger foundation models and train it on our full dataset.</p></li></ul><h3>5. Lessons from the Project</h3><p>Beyond the technical findings, a project like this teaches one a lot about the practical realities of AI research.</p><p>First, <strong>planning is everything.</strong> A multi-stage pipeline with dependencies on data processing, baseline generation, supervised fine-tuning, and reinforcement learning requires a clear and structured plan from the outset.</p><p>Second, <strong>expect fast-evolving frameworks/libraries and incomplete or missing documentation</strong>.
Frontier reasoning models are still very new, and getting the prompt formatting right and tuning the hyperparameters can be quite tricky. Documentation on the libraries and their hyperparameters is still very sparse; one should be prepared to adjust one&#8217;s plans as more information is learned during implementation.</p><p>Finally, the project reinforced the sheer <strong>time-consuming nature of model training and fine-tuning.</strong> A single training and evaluation run took nearly a full day. One experiment we attempted had an estimated run time of 75 hours. This isn't a field for the impatient or the undisciplined; one needs to design experiments carefully because compute time is a precious resource.</p><p>Despite the challenges, we've established a promising foundation. Our findings show that LLMs, when equipped with structured reasoning from reinforcement learning, hold significant potential in financial forecasting. With more compute and refined techniques, this approach could become a powerful complement to traditional financial analysis.</p><div><hr></div><h3>6. Appendix</h3><h4>A. Example Prompt:</h4><pre><code>&lt;|im_start|&gt;user 
You are an expert institutional equity analyst. 

Given the following text, predict the stock&#8217;s relative performance to stocks in the same sector over the next month. 

You may rate a stock STRONG BUY, BUY, HOLD, SELL, or STRONG SELL. 

Think carefully. You must end your output with one of: STRONG BUY, BUY, SELL, STRONG SELL, or HOLD. 

Input: 
"""Operator: Ladies and gentlemen, thank you for standing by and welcome to the Third Quarter Oil-Dri Corporation of America Earnings Conference Call. At this time, all participants are in a listen-only mode. After the speakers presentation, there will be a question-and-answer session. [Operator Instructions] I would now like to hand the call over to Dan Jaffee, President and Chief Executive Officer. Please go ahead.  

Dan Jaffee: Thank you. Welcome everybody to the Oil-Dri third quarter investor
teleconference joining me both physically and virtually<strong>...</strong>

<strong>&lt;removed rest of the transcript of nearly six pages for brevity&gt;</strong>

<strong>...
</strong>Dan Jaffee: Great. And I hope you also spotted the share repurchase which again was a way of trying to deliver value back to our shareholders and we like we saw we are opportunistic so that we had a chance to buyback shares with higher dividend and our cash is earning 0.2% so and we have plenty of opportunity that we can deploy that cash on but we are going to still stay opportunistic on the share repurchase program so hopefully spotted that in the queue as well thank you guys its been a half an hour and we look forward to talking to you it will be our end of the year that will be our fourth quarter and fiscal year end I would tell you on one of our major metrics which is we look at pre tax pre bonus income and then we divvy up that income between the shareholders and then the teammates. We already, through nine months have made more than we have ever made and any fiscal year we have ever had so as I joke we are playing with a house of money at this point but we are having a record year it feels really good it feels really good that all the investments we have made in the last two and half years on people and infrastructure are why this is happening and its just it feels good it is predictable and we are up in front of our businesses and we just we have always taken a long term approach and the seeds we planted years ago are sprouting now and seeds we plant today will start in a few years so thank you for the long-time holders and we are happy to reward you with I think will results 17 year in a row of dividend increases which is fantastic. So thanks everybody. We will talk to you again in a quarter. Operator: Ladies and gentlemen, this concludes todays conference call. Thank you for participating. You may now disconnect. Everyone have a great day."""

Answer: 
&lt;|im_end|&gt; 
&lt;|im_start|&gt;assistant </code></pre><h4>B. Example Response with Thinking Trace:</h4><pre><code><code>&lt;think&gt;
Okay, let&#8217;s tackle this question. The user wants to predict the stock&#8217;s relative performance to stocks in the same sector over the next month, rating it as STRONG BUY, BUY, HOLD, SELL, or STRONG SELL. 

First, I need to understand the sector. The company mentioned is Oil-Dri Corporation of America, which is involved in cat litter, animal health products, and other related areas. The sector for this would be the consumer products or retail sector, specifically focusing on pet care and cat litter.

Looking at the earnings call, the company reported strong financial results, with a 11% increase in retail and wholesale product sales, 8% increase in net sales, and a 25% increase in animal health products. They also mentioned improvements in gross profit, lower freight and natural gas costs, and a strong income from operations. The CEO emphasized the team&#8217;s resilience during the pandemic and the company&#8217;s ability to meet demand. 

Now, the question is about relative performance compared to sector peers. The key factors here are the company&#8217;s financial performance, market trends, and sector-specific factors. The company is in the consumer goods sector, specifically pet care. 

Considering the sector&#8217;s general performance, the consumer products sector has been affected by the pandemic, with a shift towards e-commerce and online sales. However, the company&#8217;s performance in retail and wholesale products was strong, with a 11% increase. The animal health segment showed growth, particularly in the Americas, with a 25% increase in sales. 

The sector as a whole might be experiencing some challenges, like supply chain issues and market shifts, but the company&#8217;s strong financials and focus on value-driven products (like cat litter) could give them an edge. The CEO mentioned that the company is well-positioned to compete in the value segment, which is a key factor in the consumer products sector. 

However, there are risks. The company is in a sector that&#8217;s heavily influenced by consumer behavior and economic conditions. If the pandemic&#8217;s impact subsides, there might be competition from other brands, and the company&#8217;s ability to maintain pricing and quality could be tested. Also, the sector might face challenges with supply chain disruptions and the ongoing impact of the pandemic. 

Given the company&#8217;s strong financial results, improved gross margins, and focus on value, it&#8217;s possible that the stock could perform well relative to peers. However, the sector&#8217;s overall performance and market conditions could affect this. The user wants a relative performance prediction, so I need to consider if the stock is likely to outperform, underperform, or hold compared to others in the same sector. 

The answer should be based on the company&#8217;s strong performance, market position, and sector trends. Since the company is in a sector that&#8217;s adapting to digital trends and value strategies, and they have shown resilience, the relative performance might be positive. However, there&#8217;s a risk of sector-wide challenges, so the rating could be HOLD or BUY. But given the strong financial results and the company&#8217;s focus on value, a BUY might be appropriate. 
&lt;/think&gt;

BUY</code></code></pre>]]></content:encoded></item><item><title><![CDATA[Forget 'Vibe Coding': Why Karpathy's Talk Shows Fundamentals Still Matter]]></title><description><![CDATA[But the real story isn't just how we'll code, but what we're building towards. After decades of shifting to web and mobile, the next great business transformation is becoming agent-native.]]></description><link>https://www.modelanalysis.ai/p/forget-vibe-coding-why-karpathys</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/forget-vibe-coding-why-karpathys</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Fri, 20 Jun 2025 01:18:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3f601d2a-8649-49a3-b233-2aa9d3fc30f6_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What a <a href="https://www.youtube.com/watch?v=LCEmiRjPEtQ">great talk</a> by <strong>Andrej Karpathy</strong> - he's got to be one of the best communicators of this era of AI and LLMs! His talk is very high in information density - and there's something for everyone including the non-technical audience. Here are a few of my key takeaways and observations. 
Let me know what you think!&#8291;<br>&#8291;<br>&#120783;. &#119826;&#119848;&#119839;&#119853;&#119856;&#119834;&#119851;&#119838; &#120783;.&#120782;/&#120784;.&#120782;/&#120785;.&#120782;&#8291;<br>- Software 1.0 is the traditional programming paradigm that those of us in the industry are familiar with. Almost all the traditional software (e.g., operating systems, business applications, video games) that exists today is created in this paradigm. &#8291;<br>- Software 2.0 uses neural nets and lots of data to solve problems that are very hard or impossible to solve using Software 1.0 like autonomous driving, natural language processing and speech recognition. LLMs, of course, are the best known examples of Software 2.0.&#8291;<br>- Software 3.0 builds on top of LLMs (software 2.0) and writes code with natural language prompts from users.&#8291;<br>&#8291;<br>&#120784;. &#119827;&#119841;&#119838; &#119839;&#119854;&#119847;&#119837;&#119834;&#119846;&#119838;&#119847;&#119853;&#119834;&#119845; &#119853;&#119851;&#119834;&#119842;&#119853;&#119852; &#119848;&#119839; &#119834; &#119840;&#119851;&#119838;&#119834;&#119853; &#119837;&#119838;&#119855;&#119838;&#119845;&#119848;&#119849;&#119838;&#119851; &#119834;&#119851;&#119838;&#119847;'&#119853; &#119836;&#119841;&#119834;&#119847;&#119840;&#119842;&#119847;&#119840;.&#8291;<br>- Vibe coding gets all the memes but the maximum value will be generated by developers who are well versed in all the three paradigms (or at least in 1.0 and 3.0).&#8291;<br>- Developers that have a clear and accurate mental model of what is needed and effectively communicate that to the LLM will be more productive than those that do not.&#8291;<br>- Given the current limitations of LLMs (like context window length and unreliability), Developers need to know how to implement the changes in manageable chunks and how to debug and fix issues when things eventually go wrong.&#8291;<br>- Of course, the ability to design and architect 
software with all the x'ties (maintainability, supportability, scalability, etc.) remains as valuable as ever.&#8291;<br>&#8291;<br>&#120785;. &#119809;&#119854;&#119842;&#119845;&#119837;&#119842;&#119847;&#119840; &#119853;&#119848;&#119848;&#119845;&#119852; &#119839;&#119848;&#119851; &#119853;&#119841;&#119838; "&#119843;&#119834;&#119840;&#119840;&#119838;&#119837; &#119842;&#119847;&#119853;&#119838;&#119845;&#119845;&#119842;&#119840;&#119838;&#119847;&#119836;&#119838;" &#119848;&#119839; &#119819;&#119819;&#119820;&#119852;&#8291;<br>- AGI and ASI are getting all the attention with the promise and threat they pose to the world as we know it&#8291;<br>- But significant effort and innovation will be going into developing tools and applications with UI/UX capabilities that allow users to work with the "jagged intelligence" of LLMs with embedded verifiers and fine grained control on AI autonomy. &#8291;<br>&#8291;<br>&#120786;. &#119809;&#119838;&#119836;&#119848;&#119846;&#119842;&#119847;&#119840; &#119808;&#119840;&#119838;&#119847;&#119853;-&#119821;&#119834;&#119853;&#119842;&#119855;&#119838; &#119842;&#119852; &#119853;&#119841;&#119838; &#119847;&#119838;&#119857;&#119853; &#119840;&#119851;&#119838;&#119834;&#119853; &#119835;&#119854;&#119852;&#119842;&#119847;&#119838;&#119852;&#119852; &#119853;&#119851;&#119834;&#119847;&#119852;&#119839;&#119848;&#119851;&#119846;&#119834;&#119853;&#119842;&#119848;&#119847;&#8291;<br>- It took businesses over a decade to build a mature web presence, and another decade to go mobile-native.&#8291;<br>- I think the next major transformation for businesses is to become agent-native: both for exposing their services to external agents and for embedding agentic solutions in their internal processes.&#8291;<br>- IMO, this transformation is a marathon that will likely take the next decade to fully realize.&#8291;</p><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://www.modelanalysis.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading ModelAnalysis.ai! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Evolution of Meta's Llama LLMs]]></title><description><![CDATA[Charting the Architectural Progression of its Open Weight Models]]></description><link>https://www.modelanalysis.ai/p/the-evolution-of-metas-llama-llms</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/the-evolution-of-metas-llama-llms</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Thu, 08 May 2025 02:31:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Jgeu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3c6d62-fef6-4e97-9504-f4f3871ba3df_1650x994.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>&#171;All of this information presented in this article already out there, but seeing it all together might allow one to easily appreciate the evolution of the Llama family of LLMs.&#187;</em></p><p>Meta's Llama family of large language models has rapidly evolved from a research project into a major force in the AI landscape with its open weights license. 
While it&#8217;s no longer the open weight LLM leader (that spot is firmly in the hands of DeepSeek and Qwen&#8217;s models), there&#8217;s little doubt that Meta is working hard to reclaim its spot at the front.</p><p>Using key data points presented in three tables, this post provides a quick overview of how Llama&#8217;s architecture evolved over the past three years, focusing on its release timeline, the scaling and design choices of its technical architecture, and how those choices stack up against its major competitors.</p><div><hr></div><h3>Table 1: Release Timeline</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jgeu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3c6d62-fef6-4e97-9504-f4f3871ba3df_1650x994.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jgeu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3c6d62-fef6-4e97-9504-f4f3871ba3df_1650x994.png 
424w, https://substackcdn.com/image/fetch/$s_!Jgeu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3c6d62-fef6-4e97-9504-f4f3871ba3df_1650x994.png 848w, https://substackcdn.com/image/fetch/$s_!Jgeu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3c6d62-fef6-4e97-9504-f4f3871ba3df_1650x994.png 1272w, https://substackcdn.com/image/fetch/$s_!Jgeu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3c6d62-fef6-4e97-9504-f4f3871ba3df_1650x994.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jgeu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3c6d62-fef6-4e97-9504-f4f3871ba3df_1650x994.png" width="1456" height="877" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b3c6d62-fef6-4e97-9504-f4f3871ba3df_1650x994.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:877,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:230556,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/161817410?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3c6d62-fef6-4e97-9504-f4f3871ba3df_1650x994.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Jgeu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3c6d62-fef6-4e97-9504-f4f3871ba3df_1650x994.png 424w, https://substackcdn.com/image/fetch/$s_!Jgeu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3c6d62-fef6-4e97-9504-f4f3871ba3df_1650x994.png 848w, https://substackcdn.com/image/fetch/$s_!Jgeu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3c6d62-fef6-4e97-9504-f4f3871ba3df_1650x994.png 1272w, https://substackcdn.com/image/fetch/$s_!Jgeu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3c6d62-fef6-4e97-9504-f4f3871ba3df_1650x994.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Highlights from Table 1</h4><p><strong>Rapid Iteration and Increasing Specialization</strong>: The Llama 3 generation saw accelerated point releases (3.1, 3.2, 3.3) within months in 2024, indicating a shift towards faster, more diverse model rollouts supporting coding, vision, edge deployments etc.</p><p><strong>Shift to Openness with Open Weights</strong>: Llama 1 started as a restricted research release, but Llama 2 marked a pivot towards broader commercial use, a strategy continued with subsequent versions, albeit with some restrictive licensing conditions.</p><p><strong>Architectural Leaps</strong>: Llama 4 introduced fundamental changes like Mixture of Experts (MoE) and native multimodality, distinguishing it significantly from the Llama 1-3 generations.</p><div><hr></div><h3>Table 2: Detailed Technical Specifications Comparison</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IFze!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77db4fed-1bd5-4144-9ae7-fe67d31b9440_1756x1164.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IFze!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77db4fed-1bd5-4144-9ae7-fe67d31b9440_1756x1164.png 424w, 
https://substackcdn.com/image/fetch/$s_!IFze!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77db4fed-1bd5-4144-9ae7-fe67d31b9440_1756x1164.png 848w, https://substackcdn.com/image/fetch/$s_!IFze!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77db4fed-1bd5-4144-9ae7-fe67d31b9440_1756x1164.png 1272w, https://substackcdn.com/image/fetch/$s_!IFze!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77db4fed-1bd5-4144-9ae7-fe67d31b9440_1756x1164.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IFze!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77db4fed-1bd5-4144-9ae7-fe67d31b9440_1756x1164.png" width="1456" height="965" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77db4fed-1bd5-4144-9ae7-fe67d31b9440_1756x1164.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:965,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:279513,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/161817410?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77db4fed-1bd5-4144-9ae7-fe67d31b9440_1756x1164.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!IFze!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77db4fed-1bd5-4144-9ae7-fe67d31b9440_1756x1164.png 424w, https://substackcdn.com/image/fetch/$s_!IFze!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77db4fed-1bd5-4144-9ae7-fe67d31b9440_1756x1164.png 848w, https://substackcdn.com/image/fetch/$s_!IFze!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77db4fed-1bd5-4144-9ae7-fe67d31b9440_1756x1164.png 1272w, https://substackcdn.com/image/fetch/$s_!IFze!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77db4fed-1bd5-4144-9ae7-fe67d31b9440_1756x1164.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>*Note: NS = Not Specified in provided sources. </p><h4>Highlights from Table 2</h4><ul><li><p><strong>Exponential Growth in Context Length</strong>: Context window size saw dramatic increases, from 2K in Llama 1 to 128K in Llama 3.1, and finally to 1M/10M in Llama 4. </p></li><li><p><strong>Massive Training Data Scaling</strong>: The volume of training data grew exponentially as well, from ~1.4T tokens for Llama 1 to 15T+ for Llama 3, and up to ~40T for Llama 4 Scout.</p></li><li><p><strong>Architectural Refinements</strong>: Key changes include the adoption of RMSNorm and SwiGLU from Llama 1 onwards; the introduction of Grouped Query Attention (GQA) in Llama 2 (70B) and its standardization across Llama 3 models; and the shift to a much larger 128K+ vocabulary tokenizer from Llama 3 onwards.</p></li><li><p><strong>Shift to Mixture of Experts (MoE)</strong>: Llama 4 introduced MoE, decoupling active parameters (used per token, e.g., 17B) from total parameters (overall knowledge, e.g., 109B), aiming for greater efficiency at scale.</p></li><li><p><strong>Multimodality Integration</strong>: While Llama 3.2 added vision capabilities, Llama 4 implemented native multimodality from the ground up.</p></li></ul><div><hr></div><h3>Table 3: Comparative Technical Specifications (as of April 2025)</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w_IN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb05a677-a26d-4311-9986-ec047113685e_1600x890.png"
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w_IN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb05a677-a26d-4311-9986-ec047113685e_1600x890.png 424w, https://substackcdn.com/image/fetch/$s_!w_IN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb05a677-a26d-4311-9986-ec047113685e_1600x890.png 848w, https://substackcdn.com/image/fetch/$s_!w_IN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb05a677-a26d-4311-9986-ec047113685e_1600x890.png 1272w, https://substackcdn.com/image/fetch/$s_!w_IN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb05a677-a26d-4311-9986-ec047113685e_1600x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w_IN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb05a677-a26d-4311-9986-ec047113685e_1600x890.png" width="1456" height="810" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb05a677-a26d-4311-9986-ec047113685e_1600x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:810,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:247244,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/161817410?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb05a677-a26d-4311-9986-ec047113685e_1600x890.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w_IN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb05a677-a26d-4311-9986-ec047113685e_1600x890.png 424w, https://substackcdn.com/image/fetch/$s_!w_IN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb05a677-a26d-4311-9986-ec047113685e_1600x890.png 848w, https://substackcdn.com/image/fetch/$s_!w_IN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb05a677-a26d-4311-9986-ec047113685e_1600x890.png 1272w, https://substackcdn.com/image/fetch/$s_!w_IN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb05a677-a26d-4311-9986-ec047113685e_1600x890.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Highlights from Table 3</h4><p><strong>Architectural Diversity</strong>: Llama 4 and DeepSeek V3 utilize MoE architectures, while Qwen 2.5 and Mistral Large 2/Nemo stick to dense designs, showcasing different strategies for scaling. MoE models (Llama 4, DeepSeek V3) have fewer active parameters than dense models (Qwen 2.5, Mistral Large 2) of similar capability, aiming for greater inference efficiency, but differ in their active/total parameter ratios.</p><p><strong>Context Length</strong>: Llama 4 Scout (10M) and Maverick (1M) offer significantly larger context windows than the 128K standard supported by Qwen 2.5, DeepSeek V3, and Mistral Large 2/Nemo.</p><p><strong>Tokenizer Variations</strong>: Vocabulary sizes and tokenizer types vary, with Llama 4, DeepSeek V3, and Mistral Nemo using larger vocabularies (~128K-131K) compared to Mistral Large 2 (32K). 
Qwen 2.5 has the largest vocab (~151K).</p><p><strong>Licensing Fragmentation</strong>: Licensing terms differ significantly, from the permissive Apache 2.0 (Mistral Nemo) and MIT/Commercial (DeepSeek V3) to custom licenses with MAU restrictions (Llama 4, Qwen 2.5) and research-focused licenses (Mistral Large 2 weights).</p><p><em>&#171;That&#8217;s all folks! Hope you got something out of this one :)&#187;</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.modelanalysis.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading ModelAnalysis.ai! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The End of the App Store?]]></title><description><![CDATA[Why You Might Only Need One AI Super App]]></description><link>https://www.modelanalysis.ai/p/the-end-of-the-app-store</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/the-end-of-the-app-store</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Fri, 11 Apr 2025 22:50:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/688324a3-dc0e-4aed-b9bf-e8b75e735620_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>&#171;Series Note: This article is part of a series of articles I&#8217;m writing on MCP. 
See my blog modelanalysis.ai for the complete collection.&#187;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KVjE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00c8695-633f-46b1-a73f-31a44b93dccb_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KVjE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00c8695-633f-46b1-a73f-31a44b93dccb_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!KVjE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00c8695-633f-46b1-a73f-31a44b93dccb_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!KVjE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00c8695-633f-46b1-a73f-31a44b93dccb_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!KVjE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00c8695-633f-46b1-a73f-31a44b93dccb_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KVjE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00c8695-633f-46b1-a73f-31a44b93dccb_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f00c8695-633f-46b1-a73f-31a44b93dccb_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1712996,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/159798312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00c8695-633f-46b1-a73f-31a44b93dccb_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KVjE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00c8695-633f-46b1-a73f-31a44b93dccb_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!KVjE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00c8695-633f-46b1-a73f-31a44b93dccb_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!KVjE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00c8695-633f-46b1-a73f-31a44b93dccb_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!KVjE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00c8695-633f-46b1-a73f-31a44b93dccb_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1><strong>Introduction</strong></h1><p>Will our future be dominated by a single AI-driven &#8220;super app&#8221; that handles everything&#8212;from managing finances to entertainment to lifelong learning? It&#8217;s starting to look that way. In this article, I propose a scenario where our computing experience coalesces into one adaptive, AI-powered interface: a true all-in-one super app. I&#8217;ll explore how this shift might unfold and what such a transformative future could look like.</p><h1><strong>The Vision: From Many Apps to One</strong></h1><p>Today, we juggle multiple apps to get things done&#8212;a few for shopping, another few for entertainment, yet another few for finances&#8212;all with unique logins, interfaces, and data silos. 
This fragmentation has become the norm, but it&#8217;s also a major source of friction.</p><p>But what if we could change the script? What if, instead of launching a dozen different apps, you could simply talk to one intelligent interface &#8211; an AI-powered "Super App" capable of handling almost any digital task?</p><p>For everyday users, this means unprecedented convenience and empowerment: the technology adapts to you, not the other way around. For service providers, it means rethinking how to offer services: instead of standalone apps, you might create modules that &#8220;plug into&#8221; a bigger AI ecosystem, with standard protocols like MCP ensuring compatibility.</p><p>The AI dynamically understands your natural language requests and invokes the necessary services behind the scenes. It's less about bundling existing apps and more about transcending the app paradigm altogether, moving towards a goal-oriented interaction model. The focus shifts from <em>which app to open</em> to <em>what you want to achieve</em>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.modelanalysis.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading ModelAnalysis.ai! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1><strong>Peeking Inside the AI Super App</strong></h1><p>So, what might interaction with this AI Super App actually feel like? 
The key is blending conversational interactions with dynamic user interfaces that adapt to the task at hand.</p><p>Here&#8217;s how I envision the app interface:</p><ul><li><p><strong>The Conversation Stream:</strong> This is your primary form of input, where you can talk or type. In the future, the Super App might even interpret visual cues like body language via video.</p></li><li><p><strong>The Adaptive Workspace:</strong> This central area morphs to fit your task. Drafting an email? You&#8217;ll see rich text editing tools. Planning a trip? Expect maps, calendars, and flight options. Working on finances? Interactive charts and spreadsheets. Early versions of this idea already exist in tools like Claude Desktop and ChatGPT&#8217;s Canvas&#8212;but we&#8217;re just getting started.</p></li><li><p><strong>Persistent, Accessible Memory:</strong> The AI learns your preferences (e.g., airlines, dietary needs, meeting times) and remembers relevant experiences (e.g., photos, past events, trips). This data&#8212;securely stored using a mix of local and cloud solutions&#8212;makes the AI feel like your personal assistant.</p></li><li><p><strong>Privacy and Safety By Design:</strong> With so much data centralized, privacy can&#8217;t be an afterthought. The Super App should share only the minimum necessary information with external services, acting as a secure gatekeeper rather than a leak point.</p></li></ul><h1><strong>MCP: A Key Enabler of the Super App Vision</strong></h1><p>So, how does an AI Super App seamlessly connect to and orchestrate potentially hundreds or even thousands of different tools and services? This is where the Model Context Protocol (MCP) comes in &#8211; paving the way for an ecosystem that is both extensible and secure:</p><ul><li><p><strong>Unified Access to Diverse Tools:</strong> MCP simplifies development by providing a single, uniform interface to tools and services. 
Developers can more easily "wrap" their existing service or tool in an MCP server, making it instantly accessible to any AI application that speaks MCP.</p></li><li><p><strong>Agentic Workflows with Guardrails:</strong> MCP is designed with agentic AI in mind &#8211; AIs that can take actions autonomously while allowing for guardrails across the ecosystem. The user, typically through the host application, controls which tools and services the AI can access. The tools and services themselves can enforce permissions, checking if the user (or the AI acting on their behalf) is authorized to perform the requested action.</p></li><li><p><strong>Future-Proofing for AI Model Evolution:</strong> MCP decouples service integration from the specific AI model. You can upgrade the &#8220;brain&#8221; of your Super App without having to rewire every tool.</p></li></ul><h1><strong>Beyond MCP: A Road Map for What&#8217;s Next</strong></h1><p>While MCP represents a major step forward, for AI agents to become more sophisticated and autonomous in increasingly complex domains, the demands on the underlying protocols will inevitably grow. I foresee several key areas where MCP (or its successors) will need to evolve:</p><ul><li><p><strong>Standardized Discovery &amp; Context:</strong> We need something akin to a universal directory listing and discovery protocol to enable AI Super Apps to reliably discover available tools and services.</p></li><li><p><strong>More powerful and dynamic UI:</strong> We will need to figure out how to create powerful yet simple-to-use, task-specific interactive UI components within the single workspace the Super App provides. Today&#8217;s tools such as Claude Desktop and ChatGPT are early steps in that direction.</p></li><li><p><strong>Multi-AI/Agent Collaboration:</strong> Complex tasks may require specialized AIs working together. 
We&#8217;ll need cross-agent protocols for context-sharing, task handoffs, and coordinated actions&#8212;perhaps through MCP Gateways. (Just as I write this, Google has released the open-source Agent Development Kit.)</p></li><li><p><strong>Scalability &amp; Multi-Tenancy:</strong> For MCP to thrive in SaaS environments, the protocol and supporting infrastructure will need robust multi-tenancy support.</p></li><li><p><strong>Advanced Security, Authentication &amp; Audit:</strong> We need more features such as standardized authentication methods between clients and servers, finer-grained permissions (especially for regulated industries like finance and healthcare), and robust, protocol-level audit trails.</p></li></ul><h1><strong>Reality Check: Navigating the Hurdles Ahead</strong></h1><p>As compelling as this vision of an all-powerful AI Super App is, significant hurdles still stand in the way. Here are some of the key challenges I see:</p><ul><li><p><strong>LLM Reliability:</strong> For a Super App to manage critical tasks, models will need improvements in fact-checking and grounding, potentially curated knowledge bases, and robust error handling and human oversight mechanisms.</p></li><li><p><strong>The Fragmented Tool Ecosystem:</strong> Standardization is essential&#8212;but hard. Many companies may be reluctant to open up or migrate legacy systems.</p></li><li><p><strong>Sophisticated Inter-App Interactions:</strong> Supporting rich, interactive capabilities within a standard API is a major architectural challenge: think of collaborative document editing, intricate image manipulation, or managing complex project workflows.</p></li><li><p><strong>User Trust: </strong>Asking users to trust one AI with everything is a big ask. 
Transparency and control are crucial to overcoming the &#8220;creepiness&#8221; factor.</p></li><li><p><strong>Navigating the Regulatory Maze:</strong> Global Super Apps must navigate data protection laws like GDPR and tackle hard questions about liability when things go wrong.</p></li></ul><h1><strong>So, Who Wins and Loses in This Scenario?</strong></h1><p>The emergence of AI-orchestrated Super Apps will rewire the economics of the entire software and services industry. The current app economy thrives on standalone applications vying for our attention, screen time, subscription dollars, and data. What happens when a single, intelligent interface becomes the primary gateway through which users access most, if not all, digital services?</p><p>I anticipate several major transformations in business models:</p><ul><li><p><strong>New Revenue Sharing and Monetization Models:</strong> We will see new revenue-sharing models emerge, analogous to current app store commissions with revenue distributed downstream based on the specific value add. Businesses might pay Super App platforms to have their services prioritized or featured.</p></li><li><p><strong>Market for Specialized AI Plugins &amp; Agents:</strong> Expert agents with domain knowledge in fields like medicine, law, or engineering could emerge&#8212;selling their expertise as plug-ins the Super App can call.</p></li><li><p><strong>Intensified Platform Competition and Market Concentration:</strong> Whoever owns the dominant Super App platform could gain enormous market power&#8212;far beyond what today&#8217;s app stores, search engines, or even OS vendors have. Expect a turf war between Big Tech (Google, Microsoft, Apple, etc.) 
and upstarts like OpenAI or Anthropic, along with inevitable antitrust debates.</p></li></ul><h1><strong>But what about the rest of us, the mainstream users?</strong></h1><p>For users, the biggest danger may be <strong>&#8220;learned helplessness.&#8221;</strong> If the AI handles everything, will we forget how to do things ourselves? A well-designed Super App, therefore, should aim to empower, not deskill. That means:</p><ul><li><p><strong>Transparency and Explainability:</strong> The AI should surface key decision points to the user rather than making all choices silently. When asked, it should be able to explain <em>why</em> it chose a particular tool, approach, or piece of information.</p></li><li><p><strong>User Control Points:</strong> Users must always have the ability to "grab the controls" and direct the process. This means being able to see which tools or APIs the AI is using, modify the plan, or switch to a fully manual mode if desired.</p></li><li><p><strong>Adaptive Interfaces:</strong> The system should cater to different user skill levels. Beginners get guidance; experts get control. The system should scale with user confidence and needs.</p></li></ul><h1><strong>Conclusion</strong></h1><p>The AI-powered Super App isn't just a futuristic fantasy. Increasingly powerful and agentic AI models are demonstrating capabilities that seemed like science fiction only a few years ago. We're witnessing a fundamental paradigm shift towards a human-computer interaction model where we state our goals and an AI figures out how to achieve them.</p><p>That said, the AI Super App is but one facet of the impact AI will have on our society. 
This future sounds exhilarating but also scary given the profound consequences it will have on every aspect of how individuals, modern economies, and societies function.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.modelanalysis.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading ModelAnalysis.ai! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Llama 4 Is a Context Monster—Just Don’t Ask It to Think (Yet)]]></title><description><![CDATA[Larger context, native support for multimodality, but no reasoning versions yet]]></description><link>https://www.modelanalysis.ai/p/llama-4-is-a-context-monsterjust</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/llama-4-is-a-context-monsterjust</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Sun, 06 Apr 2025 14:41:27 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6fddb90b-b640-4478-9b27-0ded564d5022_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TbaK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b85194-24d6-4fe6-a77b-931699468fd7_1536x1024.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TbaK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b85194-24d6-4fe6-a77b-931699468fd7_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!TbaK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b85194-24d6-4fe6-a77b-931699468fd7_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!TbaK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b85194-24d6-4fe6-a77b-931699468fd7_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!TbaK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b85194-24d6-4fe6-a77b-931699468fd7_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TbaK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b85194-24d6-4fe6-a77b-931699468fd7_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5b85194-24d6-4fe6-a77b-931699468fd7_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3555936,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/160707245?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b85194-24d6-4fe6-a77b-931699468fd7_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TbaK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b85194-24d6-4fe6-a77b-931699468fd7_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!TbaK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b85194-24d6-4fe6-a77b-931699468fd7_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!TbaK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b85194-24d6-4fe6-a77b-931699468fd7_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!TbaK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b85194-24d6-4fe6-a77b-931699468fd7_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>tl;dr</strong> Meta's Llama 4 represents an evolutionary (not revolutionary) step forward in LLM development. Its move to a mixture of experts architecture and more native support for multimodal capabilities follow a well-trodden path. A context size ranging from 1M to 10M is probably the item that will grab the most attention, along with the fact that a reasoning model hasn't been released yet. 
The observed diminishing returns from scaling pre-training continue to highlight the need for new approaches to achieve substantial performance gains.</p><p>Here are some early observations on what it means for users:</p><p><strong>1. Expanded Context Windows</strong></p><p>Llama 4 models boast context windows ranging from 1 million to 10 million tokens, an 8x to 80x increase over previous models. This expansion, combined with its Needle in Haystack performance, allows entire codebases or large sets of documents to be processed in a single pass, reducing the need for complex retrieval-augmented generation (RAG) systems which have to deal with the chunking problem.</p><p><strong>2. Multimodal Capabilities</strong></p><p>The models support multimodal processing more natively with an early-fusion approach, enabling the integration of text, images, and audio. This positions Llama 4 competitively against other models, enabling applications that require simultaneous processing of diverse formats, though actual performance still remains to be seen.</p><p><strong>3. 
Diminishing Returns in Base Model Scaling</strong></p><p>Llama 4 confirms the trend already seen with other SOTA models: achieving substantial performance gains solely through scaling data and compute resources doesn&#8217;t seem possible anymore. Despite a massive increase in parameters and training data, it demonstrated only modest improvements over its predecessors, confirming the need to find innovations beyond traditional scaling methods (RL-based test-time training seems to be the one almost everyone&#8217;s betting on). Interestingly, this generation&#8217;s largest model, Behemoth, which hasn&#8217;t been released for general use yet, continues to be trained and forms the basis for the smaller Maverick and Scout models. </p><p><strong>4. Absence of a Dedicated Reasoning Model</strong></p><p>While Llama 4 introduces newer/better models like Scout and Maverick, it does not include a dedicated reasoning model. But I would expect Meta to release a reasoning model (just like everyone else!) later this year. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.modelanalysis.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading ModelAnalysis.ai! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[MCP Demystified: Bridging a Gap in Official Documentation]]></title><description><![CDATA[A simple and clear insight into how Claude Desktop actually uses MCP &#8212; with diagrams and an explanation you won&#8217;t find in the official docs.]]></description><link>https://www.modelanalysis.ai/p/mcp-demystified-bridging-a-gap-in</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/mcp-demystified-bridging-a-gap-in</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Fri, 04 Apr 2025 18:39:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!H_Rl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a609016-327e-4378-979d-726217adfb11_2304x1256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>&#171;<strong>Terminology Note</strong>: Throughout this analysis, when I use the term "LLM" (Large Language Model), I&#8217;m referring to the class of foundation models that include not only text generation capabilities but also reasoning, planning, and multimodal capabilities (processing and generating images, audio, video, etc.). This expanded definition reflects the ongoing evolution of these models beyond pure language processing.</em></p><p><em><strong>Series Note</strong>: This article is part of a series of articles I&#8217;m writing on MCP. 
See my blog for the complete collection.&#187;</em></p><p>Anthropic&#8217;s Model Context Protocol (MCP) has gained significant traction over the past few months, but the official documentation provided on its <strong>modelcontextprotocol.io</strong> site omits the role of the LLM or conflates it with that of Claude Desktop (along with some other inconsistent terminology). This ambiguity can cause confusion, especially for new or casual readers trying to understand how MCP actually works &#8212; and can lead developers and architects to waste time muddling through with a hazy or incorrect mental model of MCP-based applications.</p><p>My goal in this article is to walk you through how things work under the hood, so that by the end, you&#8217;ll have a clearer and more accurate understanding of how MCP-based apps like Claude Desktop are designed. </p><p>(And if I&#8217;ve missed or misrepresented anything, I&#8217;d love to hear about it &#8212; I&#8217;ll keep this updated as things evolve.)</p><h1>A Quick Overview of Model Context Protocol</h1><p>Developed by Anthropic, MCP is an open protocol that is built from the ground up to help develop agentic applications powered by LLMs. 
Using traditional, well-known techniques and best practices of software architecture, MCP enables seamless integration between LLM applications and external data sources and tools.</p><p>There are four major components of MCP:</p><ol><li><p><strong>MCP servers</strong> provide access to data sources (e.g., portfolio data, market data), tools (e.g., Filesystem, Google Maps, weather), and prompts (e.g., reusable prompt templates and workflows). </p></li><li><p><strong>MCP clients</strong> reside within an MCP host process, each connecting to an MCP server.</p></li><li><p><strong>MCP hosts</strong> run one or more MCP clients that connect with the MCP servers. MCP hosts act as both the technical orchestrators and the centralized dispatch system that enables LLMs to seamlessly interact with external data sources and tools while maintaining appropriate security boundaries.</p></li><li><p><strong>LLMs</strong> (Large Language Models) provide the core &#8220;intelligence&#8221; for the system.</p><p>They receive prompts from the MCP hosts and generate context-aware responses.</p></li></ol><h1>An Illustrative Example of the Claude Desktop Application</h1><p>To illustrate how a typical MCP application can work, I&#8217;ll show below how Claude Desktop utilizes these MCP components to provide its functions. Claude Desktop is the application that sits on the user&#8217;s computer and allows them to chat with one of Anthropic&#8217;s LLMs. 
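</p><p>To make the &#8220;MCP servers registered with a host&#8221; idea concrete: Claude Desktop registers servers through a JSON configuration file (claude_desktop_config.json). The sketch below is illustrative &#8212; the local path and API-key placeholder are hypothetical, though the two package names are real published MCP servers:</p>

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/me/Documents"]
    },
    "google-maps": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-google-maps"],
      "env": { "GOOGLE_MAPS_API_KEY": "<your-key>" }
    }
  }
}
```

<p>On launch, the host typically starts each listed server as a local subprocess and instantiates one MCP client per entry.</p><p>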
It maintains their chat session context, chat history, and integrates with MCP based tools to enable the LLM&#8217;s agentic capabilities.</p><h2><strong>Initialization</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H_Rl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a609016-327e-4378-979d-726217adfb11_2304x1256.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H_Rl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a609016-327e-4378-979d-726217adfb11_2304x1256.png 424w, https://substackcdn.com/image/fetch/$s_!H_Rl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a609016-327e-4378-979d-726217adfb11_2304x1256.png 848w, https://substackcdn.com/image/fetch/$s_!H_Rl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a609016-327e-4378-979d-726217adfb11_2304x1256.png 1272w, https://substackcdn.com/image/fetch/$s_!H_Rl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a609016-327e-4378-979d-726217adfb11_2304x1256.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H_Rl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a609016-327e-4378-979d-726217adfb11_2304x1256.png" width="1456" height="794" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a609016-327e-4378-979d-726217adfb11_2304x1256.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:237210,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/159688779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a609016-327e-4378-979d-726217adfb11_2304x1256.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H_Rl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a609016-327e-4378-979d-726217adfb11_2304x1256.png 424w, https://substackcdn.com/image/fetch/$s_!H_Rl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a609016-327e-4378-979d-726217adfb11_2304x1256.png 848w, https://substackcdn.com/image/fetch/$s_!H_Rl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a609016-327e-4378-979d-726217adfb11_2304x1256.png 1272w, https://substackcdn.com/image/fetch/$s_!H_Rl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a609016-327e-4378-979d-726217adfb11_2304x1256.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p>The user launches the Claude Desktop application (the MCP Host Process). </p></li><li><p>Claude Desktop loads its configuration files either from local storage or from a remote config, which include details about each MCP server (e.g. endpoints, authentication tokens, usage parameters, etc.).</p></li><li><p>For every MCP Server in the config (Google Maps, Airbnb, Brave Search, Filesystem, or any other registered tool), Claude Desktop</p><ol><li><p>loads the server&#8217;s metadata (e.g. what capabilities it offers, how to call it, what parameters are needed, etc.).</p></li><li><p>instantiates a small &#8220;MCP Client&#8221; inside the Claude Desktop process for each server.</p></li></ol></li><li><p>Claude Desktop then sends the information about each tool to one of Anthropic&#8217;s LLMs, say Claude 3.7 Sonnet. 
The LLM is essentially told, &#8220;Here are the available MCP servers (tools). Here&#8217;s how you can call them, which parameters they expect, and what they return.&#8221;</p></li></ol><h2>User Interaction with Claude Desktop</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0cYn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3164534-9277-476d-8447-35b0aeb78d04_2246x1232.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0cYn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3164534-9277-476d-8447-35b0aeb78d04_2246x1232.png 424w, https://substackcdn.com/image/fetch/$s_!0cYn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3164534-9277-476d-8447-35b0aeb78d04_2246x1232.png 848w, https://substackcdn.com/image/fetch/$s_!0cYn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3164534-9277-476d-8447-35b0aeb78d04_2246x1232.png 1272w, https://substackcdn.com/image/fetch/$s_!0cYn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3164534-9277-476d-8447-35b0aeb78d04_2246x1232.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0cYn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3164534-9277-476d-8447-35b0aeb78d04_2246x1232.png" width="1456" height="799" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3164534-9277-476d-8447-35b0aeb78d04_2246x1232.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:799,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:240978,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/159688779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3164534-9277-476d-8447-35b0aeb78d04_2246x1232.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0cYn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3164534-9277-476d-8447-35b0aeb78d04_2246x1232.png 424w, https://substackcdn.com/image/fetch/$s_!0cYn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3164534-9277-476d-8447-35b0aeb78d04_2246x1232.png 848w, https://substackcdn.com/image/fetch/$s_!0cYn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3164534-9277-476d-8447-35b0aeb78d04_2246x1232.png 1272w, https://substackcdn.com/image/fetch/$s_!0cYn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3164534-9277-476d-8447-35b0aeb78d04_2246x1232.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p>The user enters a prompt or message. Claude Desktop passes the prompt to the LLM.</p></li><li><p>The LLM analyzes the user&#8217;s message in the context of the metadata about the available MCP servers. The LLM checks if it can answer the question on its own or if it needs to invoke one or more of these servers (for tools/resources/prompts). </p></li><li><p>If the LLM doesn&#8217;t need external information, it sends its response to Claude Desktop, which then displays the final answer on the user interface.</p></li><li><p>If the LLM does need information or to perform an action (e.g. 
searching, looking up a route, checking Airbnb listings, reading a local file, etc.):</p><ol><li><p>The LLM sends a response to Claude Desktop specifying which MCP server to invoke and with which parameters.</p></li><li><p>Claude Desktop (the &#8220;Host Process&#8221;) routes that request via the correct MCP client to the correct MCP Server.</p></li><li><p>The MCP Server (e.g. Google Maps) processes the request and returns the result to Claude Desktop.</p></li><li><p>Claude Desktop passes the result to the LLM.</p></li><li><p>The LLM uses this information to form an appropriate final answer to the user, and sends it back to Claude Desktop which displays it on the user interface.</p></li></ol></li></ol><h1>A Couple More Key Takeaways</h1><p>A very important part of this architecture is the separation of concerns that enables the implementation of guardrails and security controls. For example, Claude Desktop, which runs with the user&#8217;s privileges, can get explicit approvals from the user before invoking a server. Servers, on the other hand, can validate that users have appropriate privileges to make their requests. In addition, the host app/clients can validate the results returned by the LLMs and/or servers for additional business validation (not implemented by Claude Desktop at the moment). </p><p>At a high-level, I also find it useful to think of the LLMs as &#8216;business/user orchestrators&#8217; and the hosts as &#8216;technical/system orchestrators.&#8217; </p><h1>Conclusion</h1><p>For developers and architects building LLM-powered applications, understanding the true mechanics of MCP is crucial. 
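</p><p>The interaction loop described above can be sketched as a small program. This is an illustrative toy, not the real MCP SDK: the class and method names are invented stand-ins, and a real host exchanges the protocol&#8217;s JSON-RPC messages with servers. Only the control flow mirrors the steps listed earlier.</p>

```python
# Toy sketch of the host-side loop (all names are hypothetical, not the MCP SDK).
# The host asks the LLM for a response; if the LLM requests a server, the host
# routes the call through the matching MCP client and feeds the result back.

class FakeLLM:
    """Stands in for the LLM: answers directly or requests a tool call."""
    def respond(self, message):
        if "weather" in message:
            return {"tool_call": {"server": "weather", "tool": "forecast",
                                  "params": {"city": "Paris"}}}
        return {"answer": f"You said: {message}"}

    def respond_with_tool_result(self, result):
        # Forms the final answer from the server's result.
        return {"answer": f"The forecast is {result}."}

class FakeWeatherClient:
    """Stands in for an MCP client connected to a weather server."""
    def invoke(self, tool, params):
        return "sunny"  # a real client would send a JSON-RPC request here

def handle_user_message(llm, clients, message):
    response = llm.respond(message)            # LLM analyzes the prompt
    while "tool_call" in response:             # LLM asked to invoke a server
        call = response["tool_call"]
        client = clients[call["server"]]       # host routes via the MCP client
        result = client.invoke(call["tool"], call["params"])
        response = llm.respond_with_tool_result(result)  # result back to LLM
    return response["answer"]                  # final answer for the UI

clients = {"weather": FakeWeatherClient()}
print(handle_user_message(FakeLLM(), clients, "weather in Paris?"))
# prints: The forecast is sunny.
```

<p>Note how the host never decides <em>what</em> to do &#8212; it only routes; the LLM chooses whether and which server to call.</p><p>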
Rather than viewing it as a black box, recognizing the distinct roles and interactions of each component &#8212; the host, the client, the server, and the LLM &#8212; enables clearer implementations, tighter security, and more powerful user experiences.</p><p>As MCP evolves and gains broader adoption, its architectural model may well become a foundational pattern for building the next generation of agentic applications.<br><br><br></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.modelanalysis.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading ModelAnalysis.ai! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why the Future of Agentic AI Won’t Be Chained: The Case for LLM-Driven Agents]]></title><description><![CDATA[Predicting the shift from developer-defined chains to LLM-driven orchestration.]]></description><link>https://www.modelanalysis.ai/p/why-the-future-of-agentic-ai-wont</link><guid isPermaLink="false">https://www.modelanalysis.ai/p/why-the-future-of-agentic-ai-wont</guid><dc:creator><![CDATA[Ram  Komarraju]]></dc:creator><pubDate>Thu, 03 Apr 2025 08:09:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0fe685cb-9278-4717-85b6-60a5024e7a3a_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>&#171;<strong>Terminology Note</strong>: Throughout this analysis, 
when I use the term "LLM" (Large Language Model), I&#8217;m referring to the class of foundation models that include not only text generation capabilities but also reasoning, planning, and multimodal capabilities (processing and generating images, audio, video, etc.). This expanded definition reflects the ongoing evolution of these models beyond pure language processing.</em></p><p><em><strong>Series Note</strong>: This article is part of a series of articles I&#8217;m writing on MCP. See my blog for the complete collection.&#187;</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.modelanalysis.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading ModelAnalysis.ai! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h1><strong>Introduction</strong></h1><p>Model Context Protocol (MCP) has been deservedly receiving significant attention lately. Anthropic calls MCP the "USB-C port for AI applications" in the sense that it provides a standardized way to connect AI models to different data sources and tools. However, another aspect of MCP remains underappreciated &#8211; and this is just as important as standardization. MCP positions the LLM at the center of agentic workflows, enabling a major paradigm shift. 
This paradigm shift is necessary to unlock the true potential of LLM-driven agentic workflows.</p><p>Meanwhile, many developers approach MCP after having worked with more traditional frameworks that put developers in complete control, viewing it as just another "USB-C port for AI applications." However, while both approaches aim to use LLM capabilities for orchestration, they differ significantly in philosophy, architecture, and implementation. Traditional frameworks favor a prescriptive approach, while MCP takes a descriptive approach, letting LLMs drive the workflows. Understanding how MCP compares to established frameworks like LangChain and its workflow extension LangGraph is key to realizing its full potential. (Note that MCP has many other key architectural features such as a clean separation of concerns along with an open standards-based protocol, but covering them isn't the purpose of this article.)</p><p>In this article, I will compare the two approaches and make predictions as to which approach is likely to dominate in the future. These predictions may raise a few eyebrows, but I welcome your views on what I might have missed or misunderstood.</p><h1><strong>The Two Paradigms: Prescriptive vs. Descriptive</strong></h1><p>MCP lends itself naturally to a descriptive paradigm where capabilities are described without dictating a rigid path. The LLM's intelligence drives the workflow in real time given the current context and user input. In contrast, frameworks such as LangChain/LangGraph have been used primarily for prescriptive workflows where the developer defines the path. Under this prescriptive paradigm, the LLM has limited autonomy, following a fixed structure designed in code or a graph.</p><h2><strong>1. "Prescriptive, Developer-Driven" Workflows</strong></h2><ol><li><p>Developers define the flow of tasks (chains or agents) in code.
Each step (or node in a chain) is explicitly laid out&#8212;what input it takes, which tool it calls, what output it returns. This means the workflow is prescribed by the developers. They decide exactly which tools are available, in what order, and under which conditions they are invoked.</p></li><li><p>The framework manages control flow using conditionals, loops, and branching logic written by the developers. While these &#8220;agents&#8221; do allow some LLM-based autonomy, the agent's environment (available tools and instructions) remains largely predetermined in code.</p></li><li><p>Together, these design choices reflect a developer-driven, prescriptive workflow: the path for the LLM is structured in advance through code, and each stage or choice point is orchestrated primarily by the developer's logic. The LLM&#8217;s role is typically limited to responding to an exacting prompt further constrained by a narrow context supplied by the orchestrating framework (e.g., LangChain).</p></li></ol><h2><strong>2. "Descriptive, Protocol-Centric, LLM-Driven" Workflows</strong></h2><ol><li><p>A protocol enables tools, data sources, and prompts to describe their capabilities. An LLM is told what capabilities are at its disposal and uses them appropriately without needing a pre-coded workflow. This architecture is descriptive: each capability "announces" what it can do, and the LLM decides how (or whether) to use it.</p></li><li><p>The LLM is effectively "in charge." It reads descriptions of available endpoints, interprets user requests, and decides which capability to call next&#8212;no fixed sequence is imposed by the developer. Control flow emerges from the LLM's reasoning, rather than being prescribed in a Python script or graph.</p></li><li><p>New capability endpoints can appear at any time, and as long as they adhere to the MCP protocol, the LLM can discover and use them.
This approach leverages LLM intelligence to figure out the best sequence or combination of capabilities for any given user query, making the workflow LLM-driven and adaptive.</p></li></ol><h2><strong>A comparison of Strengths and Limitations of the two paradigms:</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K1x2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4190545e-643d-4581-9edf-dd9050f00369_1108x1058.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K1x2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4190545e-643d-4581-9edf-dd9050f00369_1108x1058.png 424w, https://substackcdn.com/image/fetch/$s_!K1x2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4190545e-643d-4581-9edf-dd9050f00369_1108x1058.png 848w, https://substackcdn.com/image/fetch/$s_!K1x2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4190545e-643d-4581-9edf-dd9050f00369_1108x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!K1x2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4190545e-643d-4581-9edf-dd9050f00369_1108x1058.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K1x2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4190545e-643d-4581-9edf-dd9050f00369_1108x1058.png" width="1108" height="1058" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4190545e-643d-4581-9edf-dd9050f00369_1108x1058.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1058,&quot;width&quot;:1108,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245431,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/160391597?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4190545e-643d-4581-9edf-dd9050f00369_1108x1058.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K1x2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4190545e-643d-4581-9edf-dd9050f00369_1108x1058.png 424w, https://substackcdn.com/image/fetch/$s_!K1x2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4190545e-643d-4581-9edf-dd9050f00369_1108x1058.png 848w, https://substackcdn.com/image/fetch/$s_!K1x2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4190545e-643d-4581-9edf-dd9050f00369_1108x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!K1x2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4190545e-643d-4581-9edf-dd9050f00369_1108x1058.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MyfB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a925b7b-b77a-4212-a5e4-c56dd9dfb1b0_1094x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MyfB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a925b7b-b77a-4212-a5e4-c56dd9dfb1b0_1094x890.png 424w, 
https://substackcdn.com/image/fetch/$s_!MyfB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a925b7b-b77a-4212-a5e4-c56dd9dfb1b0_1094x890.png 848w, https://substackcdn.com/image/fetch/$s_!MyfB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a925b7b-b77a-4212-a5e4-c56dd9dfb1b0_1094x890.png 1272w, https://substackcdn.com/image/fetch/$s_!MyfB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a925b7b-b77a-4212-a5e4-c56dd9dfb1b0_1094x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MyfB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a925b7b-b77a-4212-a5e4-c56dd9dfb1b0_1094x890.png" width="1094" height="890" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a925b7b-b77a-4212-a5e4-c56dd9dfb1b0_1094x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:890,&quot;width&quot;:1094,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:219182,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.modelanalysis.ai/i/160391597?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a925b7b-b77a-4212-a5e4-c56dd9dfb1b0_1094x890.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!MyfB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a925b7b-b77a-4212-a5e4-c56dd9dfb1b0_1094x890.png 424w, https://substackcdn.com/image/fetch/$s_!MyfB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a925b7b-b77a-4212-a5e4-c56dd9dfb1b0_1094x890.png 848w, https://substackcdn.com/image/fetch/$s_!MyfB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a925b7b-b77a-4212-a5e4-c56dd9dfb1b0_1094x890.png 1272w, https://substackcdn.com/image/fetch/$s_!MyfB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a925b7b-b77a-4212-a5e4-c56dd9dfb1b0_1094x890.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1><strong>Long-Term Strategic Prediction: Why the Descriptive Paradigm Will Likely Prevail</strong></h1><p>I predict that LLM-driven protocols like MCP that allow for descriptive approaches will ultimately emerge as the dominant paradigm for agentic orchestration in the medium to long term, though significant effort will be needed to mature the implementation patterns for guardrails, workflow templating, and testing techniques.</p><h2><strong>The Case for the Descriptive Paradigm as the Long-Term Winner</strong></h2><p><strong>1. Alignment with AI's Evolution Toward Greater Agency: </strong>As LLMs continue to advance in reasoning capabilities, the fundamental limitation will increasingly be the rigidity of the systems around them rather than the models themselves. Descriptive paradigms that empower the LLM to make orchestration decisions align more naturally with the trajectory of AI development toward greater agency and reasoning.</p><p><strong>2. Scalability Across the Capability Explosion: </strong>If the past few months are any indicator, the coming years will see an explosion in the number and variety of specialized AI capabilities, tools, and services that support MCP natively. As the number of capabilities grows linearly, the number of possible workflows grows combinatorially. Explicitly programming all these workflows becomes untenable in a prescriptive approach.</p><p><strong>3. Developer Experience and Productivity: </strong>While frameworks like LangChain offer familiar paradigms for developers, the long-term productivity advantages favor MCP-like approaches.
Clear separation between capability implementation (e.g., tools, resources, and prompts) and orchestration logic creates better modularity. This allows developers to focus on building and improving individual capabilities rather than maintaining complex workflow graphs. New capabilities can be added or enhanced incrementally without redesigning entire workflows.</p><p><strong>4. Adaptability to User Needs: </strong>A long-term advantage of LLM-driven descriptive approaches is their ability to adapt to the infinite variability of user needs. Users can express their goals in natural language rather than conforming to predefined paths. The system can dynamically adapt based on the specific context of each interaction. And the LLM can incorporate user preferences expressed conversationally without requiring explicit personalization logic.</p><h2><strong>Counter-Arguments and Limitations</strong></h2><p>Despite these advantages, there are legitimate concerns about LLM-driven descriptive approaches that must be addressed:</p><p><strong>1. Control and Reliability Concerns: </strong>The primary argument against LLM-driven approaches is the risk of unpredictable behavior when the LLM makes orchestration decisions. For regulated domains like finance or healthcare, organizations need guaranteed execution of critical compliance steps. It's difficult to test all possible paths an LLM might take through a set of capabilities. Handling errors gracefully is more complex when the workflow isn't explicitly defined.</p><p><strong>2. Performance Considerations: </strong>Component-based approaches can potentially optimize performance in ways that dynamic orchestration cannot. Developers can create optimized sequences that minimize latency and resource usage. Explicit workflow definitions make it easier to identify opportunities for parallel execution.</p><p><strong>3.
Evolution of Prescriptive Frameworks: </strong>As both ecosystems evolve, we're likely to see convergence in some areas. LangChain/LangGraph can support increasing use of LLMs to dynamically select between predefined subgraphs or to modify graph structure at runtime. I&#8217;m sure there&#8217;ll be efforts to allow these frameworks to be used with MCP.</p><h2><strong>Long-Term Architecture and Transition Timelines</strong></h2><p>The most likely long-term outcome is an architecture where:</p><ol><li><p>the core architecture remains protocol-centric, with clear separation of responsibilities and standardized interfaces for capability discovery and invocation.</p></li><li><p>developers can define recommended workflow templates that guide but don't constrain the LLM.</p></li><li><p>instead of rigid workflows validated with traditional testing, the system provides sophisticated guardrails that ensure critical requirements are met while allowing flexibility within those constraints.</p></li><li><p>organizations can dial the level of LLM autonomy based on domain requirements, starting with more constrained workflows and gradually enabling more dynamic orchestration as trust builds.</p></li></ol><p>The transition will likely follow this pattern:</p><ol><li><p><strong>Short-term (1-2 years)</strong>: Prescriptive frameworks like LangChain/LangGraph will continue to dominate due to developer familiarity and concerns around the reliability of LLMs, especially for domains with strict, well-defined processes and high regulatory scrutiny.
However, firms will likely try descriptive, MCP-like solutions for use cases where the flexibility and power of a descriptive framework can be naturally paired with a human in the loop.</p></li><li><p><strong>Medium-term (2-5 years)</strong>: As the industry&#8217;s experience grows in adopting MCP-like architectures (e.g., how to build workflow templates and guardrails effectively), and LLMs become more reliable and capable, the descriptive paradigm will dominate, with prescriptive approaches used selectively for special cases.</p></li><li><p><strong>Long-term (5+ years)</strong>: Will AGI eventually emerge and obviate the need for any of these discussions?</p></li></ol><h1><strong>Conclusion</strong></h1><p>Both MCP and LangChain/LangGraph represent significant advancements in LLM-powered applications, but with fundamentally different philosophies. LangChain/LangGraph excels in scenarios where workflows need to be precisely defined and controlled, while MCP enables more adaptive, emergent workflows driven by the LLM's understanding of user needs and available capabilities.</p><p>The prescriptive paradigm "works around AI" while the descriptive paradigm "works with AI." The history of technology shows that flexible, adaptable architectures that take advantage of the growing sophistication of underlying technologies (LLMs in this case) tend to outlast more rigid, prescriptive approaches in the long run.</p>]]></content:encoded></item></channel></rss>