Can LLMs Be Used for Financial Forecasting?
How We Used Fine-Tuning with Synthetic Reasoning Traces and RLVR/GRPO to Train a Financial Reasoning Model
Note: This article provides a high-level overview of our project. For more details, see the full project report.
1. Introduction
As part of the CS224R (Deep Reinforcement Learning) course that I recently completed, I partnered with Jonathan Larkin and Tamika Bassman on a project that dove headfirst into one of the most common yet difficult problems in finance: predicting stock performance.
More specifically, we explored whether state-of-the-art deep reinforcement learning could help a large language model (LLM) reason more effectively about future stock performance. Central to our approach is the use of reinforcement learning with verifiable rewards (RLVR) following supervised fine-tuning (SFT). We ran a baseline on Qwen3-1.7B, then successively improved on it by (a) fine-tuning the model on synthetic reasoning traces, and (b) RLVR training using the Group Relative Policy Optimization (GRPO) algorithm with binary correctness and format-correctness rewards. Our goal was to measure the performance of this approach against the base model, a traditional linear model, and the predictions of human analysts.
2. The Architecture: A Pipeline for Financial Reasoning
We designed a multi-stage pipeline aimed at systematically building financial reasoning ability into a base LLM.
A. Data Preparation
We started with a dataset of over 120,000 earnings call transcripts from 3,000 U.S. stocks, spanning 20 years. To create our target labels ("STRONG BUY" to "STRONG SELL"), we calculated each stock's 1-month future return and then percentile-ranked it against its sector peers. This crucial step helped isolate company-specific performance by filtering out broader market or sector-wide trends.
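To make the labeling step concrete, here is a minimal sketch in pandas. The column names and the quintile cutoffs are illustrative assumptions, not our exact schema:

```python
import pandas as pd

LABELS = ["STRONG SELL", "SELL", "HOLD", "BUY", "STRONG BUY"]

def make_labels(df: pd.DataFrame) -> pd.DataFrame:
    # Percentile-rank each stock's 1-month forward return against
    # sector peers as of the same earnings-call date.
    df["pct_rank"] = df.groupby(["date", "sector"])["fwd_return_1m"].rank(pct=True)
    # Map percentile ranks to five buckets (quintiles are an assumption here).
    df["label"] = pd.cut(
        df["pct_rank"],
        bins=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
        labels=LABELS,
        include_lowest=True,
    )
    return df
```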
B. Establishing the Baselines
We created three key baselines (higher F1 scores are better):
The performance of a linear TF-IDF model, to see how much predictive power was in the text itself (a minimal sketch of such a baseline follows this list).
The performance of human analyst consensus ratings.
The performance of the base LLM, the pre-trained Qwen3-1.7B.
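For reference, a linear TF-IDF baseline of this kind takes only a few lines of scikit-learn. This is a generic sketch, assuming `train_texts`/`train_labels` hold the transcripts and labels; the hyperparameters are placeholders, not the ones we used:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Linear classifier over TF-IDF features of the transcript text.
model = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Macro F1 averages per-class F1 equally, so the rare STRONG BUY /
# STRONG SELL classes count as much as the common HOLD class.
preds = model.predict(test_texts)
print(f1_score(test_labels, preds, average="macro"))
```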
C. The Training Method
Our approach wasn't just to throw the data at the model. We used a sequence of fine-tuning and reinforcement learning stages.
Supervised Fine-Tuning (SFT): To train our model to think like a financial analyst, we used a “frontier” model, Gemini Pro 2.5, to create synthetic “reasoning traces” — step-by-step explanations that show why a prediction makes sense, not just what it is. We generated these traces using two methods: rejection sampling (keeping only the traces that led to the correct answer) and a technique inspired by "Self-Taught Reasoner" (STaR), where we gave the model the correct answer and asked it to reason backward. We then fine-tuned our base Qwen3 model on these reasoning traces along with the answers.
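A simplified sketch of the rejection-sampling loop: `generate_trace` and `extract_answer` are hypothetical helpers standing in for the frontier-model call and the answer parser:

```python
def collect_sft_traces(examples, n_samples=4):
    """Keep only reasoning traces whose final answer matches the label."""
    kept = []
    for ex in examples:
        for _ in range(n_samples):
            # generate_trace: hypothetical call to the frontier model
            trace = generate_trace(ex["prompt"])
            # extract_answer: hypothetical parser for the final rating
            if extract_answer(trace) == ex["label"]:
                kept.append({"prompt": ex["prompt"], "completion": trace})
                break  # one verified trace per example is enough
    return kept
```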
Reinforcement Learning with Verifiable Rewards (RLVR): This was the final and most critical stage. We used Group Relative Policy Optimization (GRPO), an algorithm designed to strengthen reasoning in LLMs. We rewarded the model with a simple binary score (1.0 for a correct label, 0 for an incorrect one), plus a smaller reward for getting the output format right.
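In spirit, the reward reduces to something like the function below; the 0.1 format bonus is an illustrative weight, not our exact value:

```python
import re

def reward(completion: str, gold_label: str) -> float:
    # Format reward: the output must end with exactly one valid rating.
    m = re.search(r"(STRONG BUY|STRONG SELL|BUY|SELL|HOLD)\s*$", completion)
    r = 0.1 if m else 0.0
    # Correctness reward: binary score against the ground-truth label.
    if m and m.group(1) == gold_label:
        r += 1.0
    return r
```

GRPO then compares each sampled completion's reward against the mean of its group to form advantages, which is what makes such sparse, verifiable rewards usable without training a separate value model.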
For computational efficiency, we used Low-Rank Adaptation (LoRA) for all training stages and vLLM to speed up inference.
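As a rough illustration, attaching LoRA adapters with the peft library looks like the following; the rank and target modules are common defaults, not necessarily our settings:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
lora_cfg = LoraConfig(
    r=16,                 # low-rank dimension (illustrative)
    lora_alpha=32,        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```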
3. The Findings
So, how did we do? The results were both encouraging and humbling.
The most sobering finding was that our simple linear TF-IDF baseline, trained on the full 90,000-sample training set, outperformed every other model with a macro F1 score of 0.2597. This result provided a valuable insight: even without sophisticated modeling, the information in earnings call transcripts holds real predictive power.
However, the story of our LLM pipeline is one of consistent, measurable improvement:
The base Qwen3 1.7B model performed poorly on its own (F1 score: 0.0897) and exhibited a strong "optimism bias," as indicated in confusion matrix (a) of Figure 4.
Human analyst consensus ratings, when mapped to our labels, scored an F1 of 0.1743—better than the raw LLM, but still below the TF-IDF model.
Our LLM pipeline, even when trained on a tiny fraction of the data, surpassed the human benchmark. The final model, which went through both SFT on traces and RLVR optimization, achieved an F1 score of 0.2084.
While we didn't beat the TF-IDF baseline, we successfully demonstrated that SFT on synthetic data and then refinement with RLVR improved the model's performance over the base LLM. We were held back by compute constraints; our entire pipeline took about 20 hours to run on a single NVIDIA H100, and we were only able to fine-tune on 414 synthetic examples and train the RLVR stage on 1,000 samples. Given that the full training set has 90,000 samples, there's a clear opportunity to scale.
4. What More Could We Have Done?
Every research project leaves unexplored avenues, and ours is no exception.
Integrating Tools and External Data Sources: A human analyst doesn't just read transcripts; they use calculators and refer to financial data from other sources. A major next step would be to give the model "tools" that let it perform quantitative calculations and access external data, making its analysis much more robust.
Richer Reward Mechanisms: We used a simple binary reward for the final answer. Future work could employ an "LLM-as-a-judge" to provide more nuanced, process-based rewards that evaluate the quality of the reasoning itself, not just the outcome.
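As a sketch of what that might look like, a judge model could grade each reasoning trace on a rubric and blend that grade into the reward. Everything below, including the rubric prompt and the blend weights, is hypothetical:

```python
JUDGE_PROMPT = """Rate the following analyst reasoning from 0 to 10 for
evidence use, logical consistency, and calibration. Reply with a number only.

Reasoning:
{trace}"""

def process_reward(trace: str, outcome_reward: float, judge) -> float:
    # `judge` is a hypothetical callable that queries a strong LLM.
    score = float(judge(JUDGE_PROMPT.format(trace=trace)))
    # Blend outcome correctness with reasoning quality (weights illustrative).
    return 0.7 * outcome_reward + 0.3 * (score / 10.0)
```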
More Rigorous Ablation Studies: We acknowledge the need for more rigorous ablation studies to precisely measure the impact of each component of our pipeline, including, perhaps, tweaking the prompt to nudge the model toward more skepticism when judging the transcripts. We also need to look for and eliminate data leakage, as our 20-year dataset may contain information the base LLM saw during its original pre-training.
Scaling: Another obvious path forward is to apply this pipeline to larger foundation models and train it on our full dataset.
5. Lessons from the Project
Beyond the technical findings, a project like this teaches one a lot about the practical realities of AI research.
First, planning is everything. A multi-stage pipeline with dependencies on data processing, baseline generation, supervised fine-tuning, and reinforcement learning requires a clear and structured plan from the outset.
Second, expect fast-evolving frameworks and libraries with incomplete or missing documentation. Frontier reasoning models are still very new, and getting the prompt formatting right and tuning the hyperparameters can be quite tricky. Documentation on these libraries and their hyperparameters is still sparse; one should be prepared to adjust one's plans as more information is learned during implementation.
Finally, the project reinforced the sheer time-consuming nature of model training and fine-tuning. A single training and evaluation run took nearly a full day. One experiment we attempted had an estimated run time of 75 hours. This isn't a field for the impatient or the undisciplined; one needs to design experiments carefully because compute time is a precious resource.
Despite the challenges, we've established a promising foundation. Our findings show that LLMs, when equipped with structured reasoning from reinforcement learning, hold significant potential in financial forecasting. With more compute and refined techniques, this approach could become a powerful complement to traditional financial analysis.
6. Appendix
A. Example Prompt:
<|im_start|>user
You are an expert institutional equity analyst.
Given the following text, predict the stock’s relative performance to stocks in the same sector over the next month.
You may rate a stock STRONG BUY, BUY, HOLD, SELL, or STRONG SELL.
Think carefully. You must end your output with one of: STRONG BUY, BUY, SELL, STRONG SELL, or HOLD.
Input:
"""Operator: Ladies and gentlemen, thank you for standing by and welcome to the Third Quarter Oil-Dri Corporation of America Earnings Conference Call. At this time, all participants are in a listen-only mode. After the speakers presentation, there will be a question-and-answer session. [Operator Instructions] I would now like to hand the call over to Dan Jaffee, President and Chief Executive Officer. Please go ahead.
Dan Jaffee: Thank you. Welcome everybody to the Oil-Dri third quarter investor
teleconference joining me both physically and virtually...
<removed rest of the transcript of nearly six pages for brevity>
...
Dan Jaffee: Great. And I hope you also spotted the share repurchase which again was a way of trying to deliver value back to our shareholders and we like we saw we are opportunistic so that we had a chance to buyback shares with higher dividend and our cash is earning 0.2% so and we have plenty of opportunity that we can deploy that cash on but we are going to still stay opportunistic on the share repurchase program so hopefully spotted that in the queue as well thank you guys its been a half an hour and we look forward to talking to you it will be our end of the year that will be our fourth quarter and fiscal year end I would tell you on one of our major metrics which is we look at pre tax pre bonus income and then we divvy up that income between the shareholders and then the teammates. We already, through nine months have made more than we have ever made and any fiscal year we have ever had so as I joke we are playing with a house of money at this point but we are having a record year it feels really good it feels really good that all the investments we have made in the last two and half years on people and infrastructure are why this is happening and its just it feels good it is predictable and we are up in front of our businesses and we just we have always taken a long term approach and the seeds we planted years ago are sprouting now and seeds we plant today will start in a few years so thank you for the long-time holders and we are happy to reward you with I think will results 17 year in a row of dividend increases which is fantastic. So thanks everybody. We will talk to you again in a quarter. Operator: Ladies and gentlemen, this concludes todays conference call. Thank you for participating. You may now disconnect. Everyone have a great day."""
Answer:
<|im_end|>
<|im_start|>assistant
B. Example Response with Thinking Trace:
<think>
Okay, let’s tackle this question. The user wants to predict the stock’s relative performance to stocks in the same sector over the next month, rating it as STRONG BUY, BUY, HOLD, SELL, or STRONG SELL.
First, I need to understand the sector. The company mentioned is Oil-Dri Corporation of America, which is involved in cat litter, animal health products, and other related areas. The sector for this would be the consumer products or retail sector, specifically focusing on pet care and cat litter.
Looking at the earnings call, the company reported strong financial results, with a 11% increase in retail and wholesale product sales, 8% increase in net sales, and a 25% increase in animal health products. They also mentioned improvements in gross profit, lower freight and natural gas costs, and a strong income from operations. The CEO emphasized the team’s resilience during the pandemic and the company’s ability to meet demand.
Now, the question is about relative performance compared to sector peers. The key factors here are the company’s financial performance, market trends, and sector-specific factors. The company is in the consumer goods sector, specifically pet care.
Considering the sector’s general performance, the consumer products sector has been affected by the pandemic, with a shift towards e-commerce and online sales. However, the company’s performance in retail and wholesale products was strong, with a 11% increase. The animal health segment showed growth, particularly in the Americas, with a 25% increase in sales.
The sector as a whole might be experiencing some challenges, like supply chain issues and market shifts, but the company’s strong financials and focus on value-driven products (like cat litter) could give them an edge. The CEO mentioned that the company is well-positioned to compete in the value segment, which is a key factor in the consumer products sector.
However, there are risks. The company is in a sector that’s heavily influenced by consumer behavior and economic conditions. If the pandemic’s impact subsides, there might be competition from other brands, and the company’s ability to maintain pricing and quality could be tested. Also, the sector might face challenges with supply chain disruptions and the ongoing impact of the pandemic.
Given the company’s strong financial results, improved gross margins, and focus on value, it’s possible that the stock could perform well relative to peers. However, the sector’s overall performance and market conditions could affect this. The user wants a relative performance prediction, so I need to consider if the stock is likely to outperform, underperform, or hold compared to others in the same sector.
The answer should be based on the company’s strong performance, market position, and sector trends. Since the company is in a sector that’s adapting to digital trends and value strategies, and they have shown resilience, the relative performance might be positive. However, there’s a risk of sector-wide challenges, so the rating could be HOLD or BUY. But given the strong financial results and the company’s focus on value, a BUY might be appropriate.
</think>
BUY