Equity Analysts are Feeling the Heat of Gen AI: Machines beating Wall St
Three studies, and their implications for investors, on LLM capabilities in stock picking, earnings prediction, and stock price forecasting
For over a century, investors have relied on human analysts - the men and women of Wall Street who pore over balance sheets, earnings calls, and price charts - to guide decisions. But a new challenger has entered the arena. Generative AI models, once thought of mainly as language tools, are now showing startling competence in tasks once considered the exclusive domain of seasoned analysts. Studies from the University of Chicago, J.P. Morgan, and other leading institutions suggest that these models can not only match Wall Street’s best but in the majority of cases outperform them.
This shift is more than a curiosity. It raises a profound question for investors: if an AI can analyze earnings, forecast stock performance, and generate ratings with less bias and more consistency, what does that mean for the future of investing?
Let’s dive into the independent evidence from 3 key studies on typical investment tasks.
Large Language Models as Investment Analysts
Recent research has tested large language models (LLMs) such as GPT-4 on tasks that sit at the heart of equity analysis: predicting earnings, issuing stock ratings, and forecasting prices.
The common theme across these studies is clear: AI, when prompted and structured correctly, performs at least as well as human analysts. In some contexts, it delivers superior results by stripping away the behavioral biases and inconsistencies that often color human judgment.
Predicting Earnings from Financial Statements
Earnings surprises (when companies beat or miss expectations) are among the strongest drivers of stock returns. A 2024 study by Kim and colleagues from the University of Chicago put GPT-4 to the test, providing it with anonymized financial statement data for more than 15,000 firms across 150,000 firm-year observations.
Two approaches were tested:
Zero-shot prompts – The model was asked to predict earnings direction with no additional reasoning guidance.
Chain-of-Thought (CoT) prompts – The model was guided through a structured reasoning process, much like a human analyst might work step by step.
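The difference between the two approaches can be sketched as prompt templates. The study's actual prompt wording is not reproduced here; the text below is a hypothetical illustration of how a zero-shot prompt differs from a Chain-of-Thought prompt:

```python
def build_prompt(financials: str, chain_of_thought: bool = False) -> str:
    """Build a zero-shot or Chain-of-Thought prompt for earnings-direction
    prediction. Wording is illustrative, not the study's actual prompt."""
    base = (
        "You are a financial analyst. Based on the anonymized financial "
        "statements below, will earnings increase or decrease next year?\n\n"
        f"{financials}\n\n"
    )
    if chain_of_thought:
        # CoT guidance mirrors an analyst's workflow:
        # trends first, then ratios, then interpretation, then a call.
        return base + (
            "Think step by step: (1) identify notable trends in the statements, "
            "(2) compute key ratios such as operating margin and asset turnover, "
            "(3) interpret what they imply for profitability, and "
            "(4) state your prediction: increase or decrease."
        )
    # Zero-shot: ask for the answer directly, with no reasoning guidance.
    return base + "Answer 'increase' or 'decrease' only."
```

The only change between the two conditions is the added reasoning scaffold; the input data is identical, which is what lets the study attribute the accuracy gain to structured prompting.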
Results
Zero-shot accuracy roughly matched analysts at ~52%.
CoT prompting raised accuracy to ~60%, outperforming analysts and matching advanced neural networks.
When considering only high-confidence forecasts, accuracy climbed further to ~62%.
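The confidence-filtering step can be illustrated with a small helper. The (prediction, confidence) representation and the threshold value are assumptions for illustration, not the study's implementation:

```python
def directional_accuracy(forecasts, outcomes, min_confidence=0.0):
    """Share of correct direction calls, optionally restricted to
    forecasts at or above a confidence threshold.

    forecasts: list of (predicted_direction, confidence) pairs
    outcomes:  list of realized directions, same order
    """
    kept = [(p, o) for (p, c), o in zip(forecasts, outcomes)
            if c >= min_confidence]
    if not kept:
        return None  # nothing passed the filter
    return sum(p == o for p, o in kept) / len(kept)
```

Restricting the sample to high-confidence calls trades coverage for accuracy, which is why the filtered figure (~62%) sits above the overall CoT figure (~60%).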
What’s remarkable is not only the accuracy but also the model’s interpretability. GPT-4 identified ratios such as operating margin, asset turnover, and current ratio as central to its reasoning—similar to what a skilled analyst would emphasize.
Key Takeaways
Structured prompts unlock the full analytical value of LLMs.
AI shines where humans are most prone to bias or disagreement.
Analysts still retain an edge in contexts requiring “soft” qualitative insight—such as small, volatile, or unusual firms.
The collaborative “human + machine” model consistently produces the best results.
Perhaps most striking, when forecasts were translated into a trading strategy, the GPT-4 CoT model generated an annualized alpha of 10% with a Sharpe ratio of 3.36, outperforming both human analysts and other machine learning models.
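For readers less familiar with the metric, the Sharpe ratio divides mean excess return by its volatility, annualized by the square root of the number of periods per year. A minimal sketch (the study's exact portfolio methodology is not reproduced here):

```python
import statistics

def annualized_sharpe(period_returns, periods_per_year=12):
    """Annualized Sharpe ratio from per-period (e.g. monthly) excess returns:
    mean / stdev, scaled by sqrt(periods per year)."""
    mean = statistics.mean(period_returns)
    vol = statistics.stdev(period_returns)
    return (mean / vol) * periods_per_year ** 0.5
```

A Sharpe ratio above 3, as reported for the CoT strategy, is exceptionally high by the standards of published long-short equity strategies, which is part of why this result drew attention.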
Generating Buy, Hold, and Sell Ratings
Another familiar task for analysts is the production of stock ratings. A 2024 J.P. Morgan study by Papasotiriou et al. tested whether GPT-4 could generate these ratings more effectively than humans.
The model was fed different combinations of inputs: pricing data, company fundamentals, news summaries, and sentiment scores. It was then asked to issue ratings on a five-point scale, from Strong Sell to Strong Buy, across horizons of 1 to 18 months.
Results
Every LLM variant outperformed Wall Street analysts. Even the “vanilla” version, using only basic pricing data, beat the consensus ratings from over 100 brokerage firms.
The best results came from combining fundamentals with sentiment, yielding the lowest mean absolute error across 3-, 6-, and 12-month horizons.
Analyst ratings had the highest error of all groups tested—suggesting persistent human biases.
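Mean absolute error over a rating scale is straightforward to compute; the integer encoding below (Strong Sell = 1 through Strong Buy = 5) is an assumption for illustration, not necessarily the study's scheme:

```python
def mean_absolute_error(predicted, realized):
    """MAE between predicted ratings and ex-post 'ideal' ratings,
    both encoded on a 1-5 integer scale (Strong Sell=1 ... Strong Buy=5)."""
    assert len(predicted) == len(realized)
    return sum(abs(p - r) for p, r in zip(predicted, realized)) / len(predicted)
```

A lower MAE means the issued ratings sat closer, on average, to the rating that subsequent returns would have justified; this is the yardstick on which the LLM variants beat the analyst consensus.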
The implication is stark: with nothing more than publicly available data and careful prompting, AI models can generate stock recommendations that are, on average, more accurate than those of highly paid professionals.
Forecasting Stock Prices
Long before ChatGPT, researchers explored AI’s potential in forecasting. In 2021, Cao and colleagues built a bespoke “AI Analyst” trained on firm, industry, macroeconomic, and sentiment data from 1996 to 2016.
Results
The AI Analyst outperformed 53.7% of human analysts.
A long–short portfolio based on AI vs. human forecast differences delivered 10% annualized alpha.
Combining AI forecasts with analyst consensus boosted performance further, outperforming 57.3% of analysts.
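One simple way such a disagreement-based long–short portfolio could be formed is to rank stocks by the gap between the AI forecast and the analyst consensus, going long the largest positive gaps and short the largest negative ones. The study's actual construction details are not reproduced here; the 20% cutoff and ticker names below are illustrative:

```python
def long_short_from_disagreement(ai_forecasts, analyst_forecasts, frac=0.2):
    """Return (long leg, short leg) ranked by AI-minus-analyst forecast gap.

    ai_forecasts, analyst_forecasts: dicts mapping ticker -> forecast value
    frac: fraction of the universe held in each leg (illustrative default)
    """
    gap = {t: ai_forecasts[t] - analyst_forecasts[t] for t in ai_forecasts}
    ranked = sorted(gap, key=gap.get, reverse=True)  # most AI-optimistic first
    n = max(1, int(len(ranked) * frac))
    return ranked[:n], ranked[-n:]
```

The intuition: where the AI is more optimistic than the consensus, analysts may be systematically too pessimistic, and vice versa, so the spread between the two legs captures the value of the AI's differing view.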
Importantly, the study highlighted circumstances where humans excel: analyzing small-cap firms, navigating uncertain downturns, and interpreting intangible factors. Conversely, AI thrived in large, data-rich settings and over longer horizons.
Themes Emerging from the Research
Across these studies, several consistent themes emerge:
LLMs outperform humans in structured, data-rich tasks.
When the problem is well-defined, such as forecasting earnings from financial statements, AI not only keeps pace but often surpasses human accuracy.
Bias reduction is a key advantage.
Human analysts are prone to herd behavior, overconfidence, and optimism bias. AI models, when properly anonymized and structured, avoid these pitfalls.
Interpretability is improving.
Unlike traditional "black box" models, LLMs provide narrative reasoning, allowing investors to understand how predictions were formed.
The "human + machine" partnership is most effective.
AI models excel at crunching numbers and identifying patterns. Humans bring context, intuition, and qualitative judgment. Together, they produce superior outcomes.
Performance improves with model generation.
GPT-4 consistently outperforms GPT-3.5, while Gemini and other models are catching up. As these systems evolve, accuracy and interpretability will only get stronger.
Conclusion: The Future of Stock Picking
The evidence is no longer anecdotal. Generative AI models are proving themselves capable of rivaling, and in many contexts outperforming, human analysts in the core tasks of investment analysis. From predicting earnings to issuing stock ratings to forecasting returns, AI has crossed the threshold from experimental novelty to practical edge.
But this is not the end of human investing. If anything, these findings reinforce a powerful truth: the future belongs to investors who combine the strengths of both man and machine. Analysts bring context, nuance, and the ability to read between the lines. AI brings consistency, scale, and freedom from bias. Together, they represent a new era of disciplined, evidence-based investing.
As generative AI continues to advance, investors who integrate these tools into their process will not only keep up—they may well pull ahead. The gatekeepers of insight are no longer confined to Wall Street. The toolkit is now open to all.