Sounding a WARNING on an Emerging Risk in Using Gemini for Investment Research
Google's Gemini has Developed a Major Information Retrieval Problem - This Matters for You as an Investor
By reading this article, readers acknowledge the terms of our full legal disclaimer. The information provided herein is for educational and general informational purposes only and does not constitute professional financial or investment advice nor a recommendation or solicitation to trade in any security mentioned.
When Gemini 3.0 launched in 2025, I wrote about how it was a game changer for investment research. Its speed, native access to live information via Google Search, instruction adherence and highly structured output presentation, in my opinion, pushed it ahead of ChatGPT in investment workflows. However, in 2026 it appears that Gemini 3.1 has scored a game-defining own goal. Recent iterations of the model appear to have dialed back the Google Search functionality while obscuring this fact, and this is creating major problems with hallucinated responses that many investors are still simply unaware of. I break it down here, identify the problems, and suggest actions investors need to take to ensure reliable output from Gemini.
If you know others using Gemini for investment research, share this post with them: in my recent discussions with a broad range of investment professionals, not a single one was aware of this problem. Investors are relying on outputs from certain investment-related workflows that may be unfit for purpose.
What’s Broken at Gemini 3.1?
Over the past several months, I have noted, alongside a growing number of Gemini users, that the model, especially in Thinking and Pro modes, often fails to perform live web retrieval even when explicitly instructed to do so as a requirement of the prompt. Instead, it answers from its internal training data, often presenting stale information as if it were current, inventing citations, or giving confident but unsupported summaries of what it claims are recent events but which actually date to before mid-2024 (its training cut-off). Even when prompted to audit its own response, confirm citations, state whether a live web search was conducted and specify the number of sites visited, Gemini can hallucinate the audit itself, telling the user it may have visited up to 15 sites when in fact it visited none. This is a distinct change from Gemini 3's early days, when native Google Search functionality was one of its unique advantages, and it is at odds with most people's understanding of Gemini's link to Google Search.
Users are not imagining the problem. Across Reddit, Google's own community forums, and other public discussions, people have described Gemini ignoring instructions to browse, hallucinating current events, failing to follow retrieval instructions, and sometimes producing answers that look polished but are detached from live sources. Some reports describe Gemini insisting that outdated information is correct even when confronted with newer evidence. Others describe missing source panels, disappearing related links, or cases where the model appears to answer confidently without any genuine retrieval behind the response. I have personally experienced all of these behaviors in recent months.
This problem matters far more in finance than in ordinary consumer use, because investment research is built on the difference between what was true last quarter and what is true now. Google's own documentation makes clear that live web retrieval is a distinct grounding capability, but it now notes that retrieval is not something the base chat experience guarantees on every turn, even when specified. Exploring this further, I tested whether Google Search grounding could be enforced as a tool on a query via the Gemini API, which provides more granular tool-calling controls. However, I found that Gemini decides for itself whether search would improve its response, and in many cases it chooses incorrectly, even with explicit instructions to perform a live web search and use only information from a specified recent period.
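For readers who want to test this themselves, here is a minimal sketch of enabling the Google Search grounding tool via the google-genai Python SDK. The model id and prompt are illustrative only, and note that attaching the tool is a request, not a guarantee the model will call it:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Explicitly enable the Google Search grounding tool. Even with the tool
# attached, the model still decides on each turn whether to invoke it.
config = types.GenerateContentConfig(
    tools=[types.Tool(google_search=types.GoogleSearch())],
)

response = client.models.generate_content(
    model="gemini-3.1-pro",  # illustrative model id, not an official one
    contents=(
        "Summarize all analyst price target changes for NVDA from the "
        "last 30 days ONLY. Perform a live web search. Ignore anything "
        "older than 30 days."
    ),
    config=config,
)
print(response.text)
```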
An example: a workflow requesting a live web search to gather, summarize and explore the implications of analyst reports, recommendations and price target changes for a stock over the past month repeatedly hallucinated its response even though Google Search grounding was enabled in the API. The model simply ignored the instruction that up-to-date data from the last 30 days was required and that anything prior was to be discarded. The entire response was fabricated. When interrogated, Gemini admitted the hallucinations but continued the practice across multiple requests. Eventually, after a couple of attempts, I was able to force the model to undertake a web search, but it was limited in scope and the response was fairly useless because it was incomplete.
This experience was repeated across a number of different request types, such as recent news distillation, even though (1) web search was specifically instructed, and (2) the tool was enabled. Specifying the reason behind the need for live data, and the negative implications of using stale information, to influence the model's reward function improved the frequency of web retrieval but did not completely remove the issue. Using Fast mode instead of Thinking and Pro modes did unlock web search functionality, but not on every request, resulting in a lack of trust in outputs.
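One practical safeguard when using the API is to inspect the grounding metadata the SDK attaches to each candidate before trusting an answer. A minimal sketch, assuming the google-genai SDK and the response object from the example above (field names may vary across SDK versions):

```python
def search_was_performed(response) -> bool:
    """Heuristic check: did the model actually invoke Google Search,
    or did it answer from internal memory? When no retrieval happened,
    the grounding metadata fields are typically empty or None."""
    candidate = response.candidates[0]
    gm = getattr(candidate, "grounding_metadata", None)
    if gm is None:
        return False
    queries = gm.web_search_queries or []
    chunks = gm.grounding_chunks or []
    return bool(queries) and bool(chunks)

if not search_was_performed(response):
    print("WARNING: no evidence of live retrieval - treat output as stale.")
```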
The finance implications of this problem are obvious once you think about the kinds of tasks investors use AI for - particularly when they don’t know this is happening. If you ask Gemini for “current sell-side views on hyperscaler capex,” or “what happened in the last two weeks after earnings,” and you pick the wrong mode, a response generated from stale model memory can do real damage. It can cite numbers from an old 10-Q as if they are current, discuss last year’s consensus as if it still stands, miss a guidance revision, or summarize a thesis that the market has already invalidated.
In investing, stale data is not just inaccurate. It can be structurally misleading, because it creates the illusion that you are acting on a live information edge when in fact you are acting on obsolete context. Google explicitly says grounding with Google Search connects Gemini to real-time web content and helps it provide more accurate answers, with verifiable sources beyond its knowledge cut-off. The problem is that Gemini now only does this reliably in one mode, and that mode is rate-limited. I'll take you through that below.
Why is this happening?
There are two intertwined reasons for this problem. The first is technical, and the second is Google’s current strategy.
Competition among the LLMs is intense. Users are not sticky (I use four models currently and switch regularly) and there is a fixation on benchmark scores with every model iteration. It has already been reported in news sources that LLM developers are optimizing for benchmark supremacy, often over user experience. While not linked to Gemini, it has even been reported that some models may have a "benchmark" mode: when detecting benchmark-related tests, a model can optimize its responses to achieve the highest score. I likened this in a Substack note to VW-gate for LLMs, when VW was found to have its cars enter a special low-powered "eco" mode upon detecting regulatory emissions tests.
The relevance of this is reasoning scores and capability. Google's own research has found that external web search retrieval provides fresh facts and citations but can disrupt deep "chain-of-thought" logic, particularly when the retrieved snippets are noisy. Consequently, they have leaned into their Thinking and Pro modes recently to emphasize "thinking / reasoning" capability and inference-time compute over the external information retrieval that was previously a defining strength of Gemini compared with ChatGPT. These modes are now optimized for problem solving (and achieving high benchmark reasoning scores), NOT for balanced live research. It is also likely that in the agentic era we are entering, Gemini needed to be more reasoning-capable to win enterprise mandates, and therefore needed to sacrifice its web search edge given Google's own research.
The second reason is purely my conjecture based on what we are seeing recently in Alphabet’s results, coupled with the coincidental timing of this problem appearing and being noticed by users after “AI mode” was launched and ramped in Google Search. A feature of Alphabet’s recent results has been the re-acceleration of Google Search ad revenues. This trend has been greeted warmly by the market and was one key reason for the multiple re-rating Alphabet enjoyed in 2025 after the AI search disruption fears that emerged in 2024.
Alphabet knows it needs its search business to grow to dispel AI disruption fears, but providing unfettered native search functionality within Gemini would likely limit the monetization of search, given Google is not following the ChatGPT path of ads inside model responses. So what is a logical response? Build an "AI mode" within the Google Search engine, which CAN be monetized, while focusing most modes in Gemini on reasoning and problem solving, not live web retrieval. This strategy is a change from the Gemini 2.5 and early 3 days and may be designed to better segment user behavior and maximize revenue growth. For most common queries, users end up going back to Google Search for live web retrieval (increasing ad revenues) rather than using the Gemini app. On the other side, Gemini's reasoning capability and benchmark scores presumably improve with the new model optimization for "thinking" over retrieval, which can win more enterprise mandates. Live information is really a unique requirement of investment workflows, not of most other request types that form the bulk of LLM query volumes. So in effect, Gemini has chosen a strategy that has significantly weakened its usability for everyday investment research.
The trouble is that most investors still don't recognize this is happening with Gemini 3.1, and as I outlined, the result, from my observations and usage, is an increase in hallucinated responses, with Gemini even telling users it undertook a live search when in fact it did not. The simple fact is that if you are using Gemini in its normal modes to research stocks right now, the data it presents to you as current, and that it uses to synthesize its responses, MAY be totally out of date or simply made up.
How can you tell if Gemini has hallucinated your response?
There’s no perfect way other than verifying each data point it presents yourself; however, there are a few key signposts.
(1) No sources or citations in the response. This is an almost certain indicator that the response was synthesized solely from internal memory. Remember that Gemini’s training cut-off was back in mid-2024 - nearly 2 YEARS AGO!
(2) Sources or citations without embedded links. When I suspected a hallucinated response to my Emerging Market Narrative workflow due to a lack of attached citations, I asked Gemini to provide the list of sources used. It synthesized a citation list with no links. Further interrogation resulted in the model admitting it had made them up. In other words, you can request citations within your prompt, but that does not enforce live information retrieval.
(3) Sources or citations with old dates. On one request, I got the model in Thinking mode to provide linked citations for a workflow that required a live web search of news from the last four weeks. I received a response that was suspect due to its referencing of old events, and on clicking each citation I found that every single one predated mid-2024 (rather than being fresh from the last month as requested). In other words, the model had once again ignored instructions and provided information only from internal memory that existed prior to the model's launch. Checks (2) and (3) can be partly automated; see the sketch below.
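As a quick illustration, here is a minimal Python sketch (assuming the third-party requests library) that flags dead links, which often indicate fabricated citations, and sources whose Last-Modified header predates your freshness window. Many sites omit that header, so treat a clean result as necessary rather than sufficient:

```python
import datetime

import requests  # third-party: pip install requests


def flag_suspect_citations(urls, max_age_days=30):
    """Flag cited URLs that are unreachable/dead (possibly fabricated)
    or whose Last-Modified header falls outside the freshness window."""
    cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(
        days=max_age_days
    )
    flagged = []
    for url in urls:
        try:
            resp = requests.head(url, timeout=10, allow_redirects=True)
        except requests.RequestException:
            flagged.append((url, "unreachable - possibly fabricated"))
            continue
        if resp.status_code >= 400:
            flagged.append((url, f"HTTP {resp.status_code} - dead link"))
            continue
        last_modified = resp.headers.get("Last-Modified")
        if last_modified:
            modified = datetime.datetime.strptime(
                last_modified, "%a, %d %b %Y %H:%M:%S %Z"
            ).replace(tzinfo=datetime.timezone.utc)
            if modified < cutoff:
                flagged.append((url, f"stale: last modified {last_modified}"))
    return flagged
```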
What’s the Solution for Gemini Users?
So what should investors do with the current generation of Gemini? The first principle is simple: do not treat ordinary Gemini chat modes as reliably web-retrieval-enforced. Treat them as reasoning engines that may or may not call search. This is still useful: Gemini performs very well with user-attached source documents thanks to its multi-modal capabilities, and this remains the most reliable way of avoiding hallucinations and getting useful output. In other words, Bring Your Own Data. Many Inferential Investor workflows are designed this way to constrain sources, avoid web search and thus create reliably informative output.
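For API users, the same BYOD principle can be applied by uploading your own documents via the Files API. A minimal sketch, again assuming the google-genai Python SDK; the model id and file name are illustrative:

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Upload a source document you obtained yourself (e.g. the latest 10-Q),
# so the model synthesizes from known-fresh material rather than from
# its internal memory.
filing = client.files.upload(file="latest_10q.pdf")  # hypothetical file

response = client.models.generate_content(
    model="gemini-3.1-pro",  # illustrative model id
    contents=[
        filing,
        "Using ONLY the attached filing, summarize the quarter-on-quarter "
        "changes in segment revenue and gross margin. Cite page numbers.",
    ],
)
print(response.text)
```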
But if web search is required for a task then there’s only one way. If your task is inherently time-sensitive, such as current macro data, current product pricing, recent management commentary, news, or anything involving “latest,” and you cannot provide your own sources because the information is widely dispersed, then you need to use the one remaining mode that explicitly includes live search.
Google now positions its Deep Research setting as the mode designed for “in-depth and real-time research,” with Google Search included by default. My testing appears to confirm that this is now the only reliable way to ensure Gemini undertakes a live web search. The trouble is that Deep Research is slow, outputs very long reports and typically ignores user output-formatting requirements. It is also not clear how pairing Deep Research with Thinking or Pro modes actually works, as their reward functions now seem at odds. What wins: web search for fresh data, or logical reasoning that gets distracted by web search?
Users tied into Gemini can adopt a split strategy. First, use Deep Research to gather the dispersed data you need. Then attach the Deep Research report to a new task query in Thinking or Pro mode to process and analyze the data separately. This is time consuming but can work. Of course, you can also just switch to Claude Sonnet 4.6 or Opus 4.6, but you will find yourself rate-limited extremely quickly on a Pro plan: one of my workflows can use 50% of the six-hour rate limit in a single go! ChatGPT, although I detest its response formatting, seems to undertake limited web searches more readily (though in the past it was never as extensive as Gemini).
Conclusion
Be Very Careful How You Use Gemini for Investment Research Right Now
This article helps explain the frustration many users are reporting with Gemini at the moment. A lot of investors still assume that if they ask Gemini to “search the web first” or “use live sources only,” the model will reliably do so as it did in the past. But both recent experience and the model’s documentation now suggest that ordinary prompting does not force retrieval, and that in most cases the model makes its own decision on the need for fresh data. That’s a real problem in the investment research domain. The model will not tell you when it has ignored your web search instructions or tool call; it has even shown itself willing to hide that fact. One would think that if this is Google’s new strategy for Gemini, it would train the model to inform the user when web search was requested but not undertaken in the chosen mode. Unfortunately, it has not chosen to do this.
Gemini’s tool-use behavior is mediated by system-level routing and heuristics, and my experience suggests these settings have shifted. Thinking and Pro modes in particular are presented by Google as higher-reasoning modes, and they now seem to actively avoid acting as grounded search agents. In other words, many users are trying to use reasoning modes (Thinking and Pro particularly) as if they were search-enforced, because they “sound more capable” and because Gemini does not explicitly guide users on the limitations of each mode within the response environment. The result is unpredictable and unreliable responses to broader, information-gathering investment research queries.
Why does this matter so much in investment research? Because AI mistakes in finance are rarely random. They cluster around exactly the details that drive security prices: dates, deltas, revisions, guidance, management commentary, valuation context, and market reaction. A hallucinated answer can be dangerous in at least three ways. First, it can invent a fact, such as a supposed target price change or a nonexistent data point on backlog, margins, or demand. Second, it can serve stale facts as current, which is arguably worse because the output sounds plausible and passes a quick sniff test. Third, it can mislead users into thinking the answer is sourced when the model only appended generic or years outdated links. In each case, the investor is at risk of importing false confidence into a process that depends on precision and recency.
The solution: Bring Your Own Data (BYOD), or if you can't, use Deep Research mode. Be very aware that Fast, Thinking and Pro modes in the normal web app user interface can now be unreliable for broad information-gathering exercises.
I believe this problem is important enough that it should be shared with anyone you know who uses Gemini within the investment domain. As always: know your model, BYOD wherever possible, and recognize that models can change behavior as they are updated. It's always best to have two model options so you are not overly reliant on any one solution when problems like this arise.
Andy West
The Inferential Investor




