Gemini 3.5: The New Top Dog AI Model Family for Investment Research?
plus the most important finance development out of Google I/O 2026
If you had to pick one (and only one) AI model family to support your investment research, how would you choose? You would want the model that is frontier-level smart, broadly capable across tasks (research, data calling, financial calculations, modelling), fast, cost-effective and backed by a company that is investing large amounts in continual improvements while lowering costs for you over time. I don’t think that’s particularly controversial.
The challenge in answering this question comes when that description could be applied to multiple model families that keep leapfrogging each other with each successive model release. Every investor or asset manager could justify their choice, whether it was OpenAI, Anthropic or Gemini. That has been the situation for some time, but it’s possible the decision framework changed this week with a couple of big announcements by Google at its I/O conference. They are:
The release of Gemini 3.5 Flash (I address this first)
The release of Gemini Spark (addressed at the end but very important for your ability to set up agentic research yourself)
Surely GPT 5.5 or Opus 4.7 trounces a Flash model!
Not any more. At least according to my favorite Financial Research benchmark test, Vals AI Finance Agent v2. This is an enterprise-grade agentic testing suite for a range of real world investment research activities. It moves away from just having a model search through static “text and table” inputs (like an annual report) and answer questions. Instead it evaluates the model as an active, autonomous investment research assistant across multi-step tasks that require agents and tool calling - exactly how we ideally want AI to operate for us as we navigate ever changing financial markets. We want AI to be a force multiplier, rather than a simple search layer with calculator attached that still requires manual hand-holding.
What Finance Agent v2.0 tests for: Tool orchestration, iterative planning and reasoned financial conclusions. The AI is given an open mandate and tools (including live SEC EDGAR access, web search tools, price feeds and the ability to retrieve HTML pages and the data within them) and tasked with complex research objectives (e.g., building a competitor comparison table or conducting a risk review). It explicitly tests whether the agent can handle multi-step workflows, navigate API data feeds, self-correct when a search fails, and manage error accumulation across a 5-to-10-step execution path.
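To get a feel for why error accumulation across a 5-to-10-step path is so punishing, here is a minimal sketch. It assumes every step succeeds independently with the same probability, which is a simplification for illustration and not how Vals AI actually scores the benchmark. The 95% per-step figure is hypothetical.

```python
def chain_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumption (illustrative only): steps are independent and equally reliable.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


# Hypothetical 95% per-step reliability over the path lengths the test covers.
for steps in (1, 5, 10):
    rate = chain_success_rate(0.95, steps)
    print(f"{steps:>2} steps at 95% per step: {rate:.1%} end-to-end")
```

Under these assumptions, even a model that gets 19 out of 20 individual steps right completes only around 60% of 10-step tasks without a mistake.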
Here are the results from Vals.ai:
Some important points to note:
Gemini Flash 3.5 beat the next nearest model, GPT 5.5, by 6 percentage points, which is a material margin.
Here we’re comparing the latest Gemini Flash model with the flagship reasoning models from OpenAI and Anthropic (GPT 5.5 and Opus 4.7). Gemini Pro 3.5 doesn’t even get released until next month. It will be fascinating to see where that comes out.
Finance Agent v2.0 is a punishing test. It penalizes the models for any mistake made along the way of multi-step workflows. This is seen in the raw scores which are all below 60%. On the old v1.1 test, models could score >80% regularly.
See the Cost/Test and Latency scores. Gemini Flash is not only more accurate but also 40% cheaper and faster under these tests. At API list prices, Gemini Flash 3.5 costs $1.50/$9 per 1M input/output tokens versus $5/$30 per 1M input/output tokens for GPT 5.5, with Opus 4.7 priced similarly to GPT 5.5. That means for a lot of simpler workflows, Gemini 3.5 Flash will be up to 70% cheaper!
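The 70% figure follows directly from the list prices quoted above. Here is a minimal sketch of the arithmetic. The prices come from this post; the token counts are a hypothetical workload, and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """API list price in USD per 1M tokens."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Cost in USD of one job with the given token usage."""
        return (
            input_tokens * self.input_per_million
            + output_tokens * self.output_per_million
        ) / 1_000_000


# Prices as quoted in the post.
FLASH_35 = Pricing(input_per_million=1.50, output_per_million=9.00)
GPT_55 = Pricing(input_per_million=5.00, output_per_million=30.00)


def saving(cheap: Pricing, expensive: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Fractional saving from running the same token usage on the cheaper model."""
    return 1 - cheap.cost(input_tokens, output_tokens) / expensive.cost(input_tokens, output_tokens)


# Hypothetical workload: 200k input tokens and 20k output tokens.
pct = saving(FLASH_35, GPT_55, input_tokens=200_000, output_tokens=20_000)
print(f"Saving at identical token usage: {pct:.0%}")
```

Because both Flash prices are 30% of the GPT 5.5 prices, the saving is 70% whatever the input/output mix, provided both models use the same number of tokens. That proviso is why "up to" matters: a model that consumes more tokens to finish a task gives back part of the saving.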
Who Leads Different Categories of Investment Tasks?
Surprisingly, Flash 3.5 has a fairly consistent lead based on accuracy across activities. Across earnings analysis, quantitative analysis, market analysis, financial adjustments and normalizations, comparable company analysis, M&A precedent transaction analysis and financial modelling, Flash 3.5 has moved to the top of the leader board. Claude Sonnet 4.6 wins out in analysis of qualitative information and disclosures to financial statements.
How Capable is it Across Real World Finance Tasks?
The testing involves very specific tasks that analysts and investors need to do routinely. Here are the example tasks from Vals AI to build some confidence in the results:
Extract enterprise values and EV/EBITDA multiples for recent industrial distribution acquisitions and rank the transactions by pre-synergy multiple.
Rank major U.S. banks by excess CET1 ratio relative to their regulatory minimums.
Measure Sun Communities’ stock reaction after the announced sale of Safe Harbor Marinas, then relate the move to the company’s stated use of proceeds.
Compare Home Depot and Lowe’s FY2024 inventory efficiency and calculate the difference in days inventory outstanding.
Reconcile Honeywell’s GAAP operating income to segment profit across annual releases and identify newly introduced adjustment categories.
Compare Rapid7’s Q3 2025 actuals against prior revenue, non-GAAP operating income, and ARR guidance.
Track Boeing’s segment reporting and 787 cost-recovery disclosures across FY2022-FY2024 10-K filings.
Assess whether Ralph Lauren could justify a distressed acquisition of Capri under stated synergy, margin, and valuation assumptions.
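To make one of these concrete, the inventory efficiency task boils down to a days inventory outstanding (DIO) calculation. Here is a minimal sketch using the common definition of inventory divided by cost of goods sold, times 365. The benchmark may use a variant (for example average rather than year-end inventory), and the figures below are hypothetical placeholders, not Home Depot’s or Lowe’s actual numbers.

```python
def days_inventory_outstanding(inventory: float, cogs: float, days: int = 365) -> float:
    """DIO = inventory / cost of goods sold * days in the period."""
    if cogs <= 0:
        raise ValueError("cogs must be positive")
    return inventory / cogs * days


# Hypothetical figures in $ millions, for illustration only.
retailer_a = days_inventory_outstanding(inventory=20_000, cogs=100_000)
retailer_b = days_inventory_outstanding(inventory=18_000, cogs=60_000)

print(f"Retailer A DIO: {retailer_a:.1f} days")
print(f"Retailer B DIO: {retailer_b:.1f} days")
print(f"Difference:     {retailer_b - retailer_a:.1f} days")
```

The calculation itself is trivial. What the benchmark tests is whether the agent can find the right inventory and cost of goods sold figures in each company’s filings before it gets to this step.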
Where does Gemini 3.5 Flash fall back?
It’s in the qualitative analysis mainly. The top 3 models here are typically higher reasoning models, including Sonnet 4.6, GPT 5.5 and Opus 4.7. I suspect that the release of Gemini 3.5 Pro will put the Gemini model family back toward the top here. Qualitative analysis in this test is the summarization and comparison of fundamental filing sections: business model, risk factors, MD&A, and standard disclosures across companies.
Example:
Compare Walmart, Costco, and Target’s capital allocation priorities across capex, dividends, share repurchases, and debt management.
With this task, the model is incredibly fast but less accurate, which isn’t surprising. Such a task requires greater comparative reasoning capabilities, which in turn requires a model to take more time “thinking”. For that you’d typically prefer a higher reasoning setting or model variant (i.e. Pro, Opus or GPT 5.5).
Hallucination risk remains a problem when you don’t provide the source materials for the model to answer the task. The most common area is when you rely on it to web search widely dispersed data. I asked Flash 3.5 to web search up-to-date benchmark scores across 7 finance benchmarks for a list of 7 AI models and it hallucinated the entire table of scores. Why they haven’t been able to sort this out, I don’t know. But the rule remains that using web search with AI for investment research is highly problematic whichever model you use. You can provide the model with the exact web address of the page with the data you need, but then that’s kind of defeating the purpose of “search”, isn’t it?
What are We Witnessing Here in these Leader Board Results?
High-level performance at high speed and low cost is the result of Google’s vertically integrated full stack AI strategy: its own data centers, its own cloud and its own inference-specific chips (TPUs). I wrote about this some time ago here:
We’re also specifically seeing the result of new techniques in Reinforcement Learning and Supervised Fine-Tuning that Google has reportedly been working on with these models to tailor them for agentic execution tasks. This suits the investment research domain (as seen in the leap in benchmark results) because of the need for multiple tool calls to answer complex queries: call an MCP data connection, call a financial calculator tool to calculate metrics, search the web for an economic data point etc, all within a single multi-step process. Flash 3.5 can make 20 tool calls within a single task without losing its way as it may have done in the past. Also note that the 3.5 model series hasn’t been pre-trained on new data per se. Its model training cut-off is still back in Jan 2025.
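For readers who haven’t seen what a multi-step tool-calling process looks like, here is a minimal sketch of the mechanics. Everything in it is hypothetical: the tool names, the stubbed data and the fixed plan are made up for illustration, and no real Gemini, MCP or data vendor API is used. In a real agent the model decides the next tool call from the results so far; here the plan is hard-coded so the example runs on its own. The 20-call budget mirrors the figure mentioned above.

```python
from typing import Any, Callable, Dict, List, Tuple


def fetch_fundamentals(ticker: str) -> Dict[str, Any]:
    """Stand-in for an MCP data connection. Returns made-up numbers ($ millions)."""
    return {"ticker": ticker, "enterprise_value": 1_200.0, "ebitda": 100.0}


def ev_to_ebitda(enterprise_value: float, ebitda: float) -> float:
    """Stand-in for a financial calculator tool."""
    return enterprise_value / ebitda


# Registry the agent dispatches against (hypothetical tool names).
TOOLS: Dict[str, Callable[..., Any]] = {
    "fetch_fundamentals": fetch_fundamentals,
    "ev_to_ebitda": ev_to_ebitda,
}

# Each step names a tool and builds its arguments from the results so far.
Step = Tuple[str, Callable[[List[Any]], Dict[str, Any]]]


def run_task(plan: List[Step], max_tool_calls: int = 20) -> List[Any]:
    """Execute the steps in order, refusing to exceed the tool-call budget."""
    if len(plan) > max_tool_calls:
        raise RuntimeError("plan exceeds the tool-call budget")
    results: List[Any] = []
    for tool_name, build_args in plan:
        tool = TOOLS[tool_name]  # an unknown tool name raises KeyError
        results.append(tool(**build_args(results)))
    return results


plan: List[Step] = [
    ("fetch_fundamentals", lambda results: {"ticker": "XYZ"}),
    ("ev_to_ebitda", lambda results: {
        "enterprise_value": results[0]["enterprise_value"],
        "ebitda": results[0]["ebitda"],
    }),
]

results = run_task(plan)
print(f"EV/EBITDA for {results[0]['ticker']}: {results[-1]:.1f}x")
```

Each step depends on the output of an earlier one, which is why a single bad call early in the chain can spoil everything that follows.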
Gemini Spark: Fully Cloud Based Agents Running for you 24/7
That last point about the RL/SFT techniques employed to train the model for multi-step agentic execution dovetails into perhaps the biggest announcement at I/O, and one that received a massive round of applause from the audience, indicating that it’s been hotly anticipated.
Gemini Spark’s launch marks a significant shift from conversational, prompt-based chatbots that only act when you ask them something to proactive, persistent background assistants that take action on your behalf even when your devices are closed or asleep. It is Google's always-on, 24/7 personal AI agent designed to handle complex, multi-step tasks autonomously in the background and, importantly, it runs on Gemini Flash 3.5 to make it highly capable. Crucially, it can also be set up by individuals with no coding required. Google’s previous agentic products like Antigravity required you to basically have a computer science degree. Spark utilizes the power of Antigravity, but makes it easy to use for anyone.
Why is this a step forward? Right now, the best approach to set up your own research agent, if you are not someone with your own IT department, is to do it in Claude Cowork. This allows you to house prompts, give it access to files to ground the research and easily schedule a repeating task. I recently wrote a post giving readers my full stack set up for a technical market timing agent that runs each morning on Cowork. It’s great, but the trouble is Cowork feels a lot like we’ve stepped back a decade to before the cloud age.
You use the Claude desktop app and have to set up project folders on your drive. With some tasks you even have to go into a terminal environment and start typing DOS commands. If your OneDrive connection is constantly syncing, it often stuffs up routines. It works, but it feels like a transitional solution. It’s essentially a UI slapped on top of Claude Code, the developer tool. Because it runs on your PC, the PC has to be on, with the desktop app open, for it to do anything. What if you are traveling, or simply don’t want your laptop on 24/7? We spent a decade moving everything to the cloud only to move it back to the PC? It’s fairly clear that Cowork is not the end solution here.
Gemini Spark instead allows you to set up everything in the Google Cloud, and this has multiple advantages:
24/7: Your research agents run their tasks throughout the day / night, whether you are at work, home, traveling or asleep.
Can integrate with MCPs and data APIs: the research agents can access data all the time as both your data sources and their own environment are in the cloud. Perfect for monitoring live prices and news across day and night trading sessions.
Native integration with Google Workspace: Gmail (including under your own email domain), Docs, Sheets and Drive can all be made accessible to the agents for task sourcing. You can have it summarize, extract EPS revisions and recommendation changes and synthesize the key themes from those 200+ broker research emails you might wake up to each morning, while you rest.
Security: This separates the agents from any network or PC where you have sensitive data, such as client PII. You wouldn’t allow OpenClaw access to your network and even Cowork warns you if you set the system to be able to act autonomously without asking. This separation removes this issue, opening up full agentic AI capabilities to every investor, including boutique firms where compliance burdens are already huge.
Communicate with your agents via email. Spark reportedly allows you to chat with your agents via email whenever you want to coordinate their activity and ask questions of their outputs. In a way they become your employees.
By moving an agentic orchestration model to the cloud and backing it with Flash 3.5 and a trusted operator that places a high value on security and privacy, Google has opened up a world of possibilities for semi-autonomous investment research. You’re still going to need a Gemini Ultra subscription costing you $1200-$2400 USD a year, plus data APIs that cost anywhere from the same amount again to 20x that, depending on whether you just need some news and closing prices or the whole shebang of fundamentals, estimates, sentiment, investor relations materials etc. But at least the infrastructure is now there (from next month that is) to create it yourself, backed by a highly capable model suited to the investment domain.
Keep grinding.
Andy West
The Inferential Investor








