Ingest & Pre-process Dense Financial Source Documents
Pre-process 10-Ks, 10-Qs, semi-annual and annual reports, and prospectuses to ensure correct data treatment in analysis tasks.
Last updated: 17 October 2025
Objective:
Extract, tag and verify financial data from dense source documents before undertaking analysis tasks, to improve analysis outcomes and minimize hallucinations. This prompt improves the ability of AI models to interpret long, complex and data-dense documents such as annual reports, particularly where the analysis prompt itself calls for complex, multi-step calculations and tasks that in many cases exceed model capabilities.
Explanation:
In writing the Inferential Investor, we have undertaken hundreds of hours of prompt engineering for investment analysis tasks and repeatedly run into the same constraints. The most common of these is Generative AI models’ limited ability to handle long, complex financial source documents attached to complex investment analysis workflows.
While models advertise large context windows of 200k to 2m tokens, which should be sufficient for the prompts, source documents and output, in reality there are other constraints. Problems such as PDF document structures, variable markdown formats, changing currencies and table units, inconsistencies in financial note presentations and the sheer density of both numerical and text data have been shown to break the models (and we have experienced this personally) when the required prompt is itself large and multi-step. The familiar result is the model hanging with the revolving circle of death.
The problem lies in RAG (retrieval-augmented generation) processes, where the model is simultaneously trying to retrieve data needles within a source document haystack while performing complex multi-step, multi-data-point computations within an autoregressive model structure (i.e. it can’t inherently look ahead in the prompt and prepare for the entire task before commencing). RAG is improving all the time, but when the model hits a brick wall in retrieving specific data it requires and cannot think of an alternative path, it often freezes.
The answer lies in separating the data retrieval processes from the analysis tasks. This prompt, designed to be run first for complex documents, focuses solely on breaking down the financial accounts into manageable pieces (“chunks”) and ingesting them into the context window. Once the data has been extracted from the PDF and the remainder of the document mapped with links back to endpoints where further data can be found, research has shown that hallucinations are hugely reduced and the models can then apply all their processing power to computations, reasoning and analysis rather than retrieval - creating more rigorous, reliable and insightful output.
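For readers who want to picture what "chunking" means mechanically, here is a minimal Python sketch. It is not the prompt itself and not how any particular model ingests a document internally. It assumes the filing text has already been extracted from the PDF into a plain string with paragraphs separated by blank lines; the function name chunk_paragraphs and the max_chars limit are hypothetical choices made for illustration.

```python
def chunk_paragraphs(text: str, max_chars: int = 4000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    Assumption: paragraphs are separated by blank lines. A single paragraph
    longer than max_chars is kept whole as its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue  # skip empty fragments left by repeated blank lines
        # +2 accounts for the blank line used when paragraphs are rejoined
        added = len(para) + (2 if current else 0)
        if current and size + added > max_chars:
            # the current chunk is full: close it and start a new one
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries rather than at a fixed character offset keeps each sentence and table row intact, which matters when a single split number or unit label would change the meaning of a figure.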
As always, be aware that models can make mistakes. At each step, examine the response and challenge information or conclusions that appear erroneous before proceeding to any subsequent steps. If in doubt, use a second model with the same prompt to verify the information and generate challenge questions and answers (the Chain-of-Verification, or CoVe, process) to correct interpretations of data.
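If two models have been run on the same document, one simple way to decide where to focus challenge questions is to compare the figures each one extracted. The Python sketch below is only an illustration of that comparison step, not a full verification process. It assumes each model's output has been copied by hand into a dictionary of line item to value; the function name find_disagreements, the tolerance and the figures in the usage example are all hypothetical placeholders, not data from any filing.

```python
import math


def find_disagreements(
    model_a: dict[str, float],
    model_b: dict[str, float],
    rel_tol: float = 0.005,
) -> dict[str, tuple]:
    """Return line items where the two extractions differ or one is missing.

    rel_tol is a relative tolerance (0.005 = 0.5%) that absorbs rounding
    differences; it is an assumed default, not a recommended threshold.
    """
    issues: dict[str, tuple] = {}
    for item in sorted(set(model_a) | set(model_b)):
        a, b = model_a.get(item), model_b.get(item)
        if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol):
            issues[item] = (a, b)
    return issues


# Placeholder values for illustration only
first = {"revenue": 1000.0, "net_income": 120.0, "capex": 50.0}
second = {"revenue": 1001.0, "net_income": 210.0}
flagged = find_disagreements(first, second)
```

In this example, revenue passes because the two values are within 0.5% of each other, while net_income (a likely digit transposition) and capex (missing from the second extraction) are flagged for a manual check against the source document.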
Link to blog post explanation:
Hallucinations happen: Here’s How Investors can Keep AI Honest.
Preferred Model(s):
ChatGPT-5+ and Gemini 2.5+
Important Execution Notes:
Attach source document(s) to the prompt (e.g. 10-K, Annual Report, Prospectus, 10-Q). It is best to name the files with ticker, period and type for easy model citations (e.g. AMZN FY25 10K.pdf).
Insert the Company Ticker, Exchange and Name where indicated (Inputs section).
If the model hangs when pre-processing multiple attached documents due to combined size, then separate them out and run one document at a time within the same chat/query window.
The objective is to extract and bring all the necessary financial data out of the document into the context window itself. From there, the model more easily handles complex multi-step analysis tasks and computations.
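For anyone preparing many filings, the ticker, period and type naming convention suggested above can be applied consistently with a small helper. This is a minimal Python sketch under stated assumptions: the helper names build_filename and parse_filename are hypothetical, the pattern assumes tickers made of capital letters, digits and dots, and it assumes the period and type contain no spaces.

```python
import re

# Assumed shape: "<TICKER> <PERIOD> <TYPE>.pdf", e.g. "AMZN FY25 10K.pdf"
FILENAME_PATTERN = re.compile(
    r"^(?P<ticker>[A-Z0-9.]+) (?P<period>\S+) (?P<doc_type>\S+)\.pdf$"
)


def build_filename(ticker: str, period: str, doc_type: str) -> str:
    """Build a file name such as 'AMZN FY25 10K.pdf' from its three parts."""
    doc = doc_type.replace("-", "").upper()  # "10-K" becomes "10K"
    return f"{ticker.upper()} {period.upper()} {doc}.pdf"


def parse_filename(name: str) -> dict[str, str]:
    """Split a conforming file name back into ticker, period and doc_type."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"File name does not follow the convention: {name!r}")
    return match.groupdict()
```

For example, build_filename("amzn", "fy25", "10-K") gives "AMZN FY25 10K.pdf", and parse_filename on that name returns the three parts again, which makes it easy to check a folder of documents before attaching them one at a time.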
Sample Output:
Copy/Paste Prompt Set:
Important note: Subscribers can use this prompt set for their own analysis. However, the prompt is copyrighted by The Inferential Investor, paywalled, and must not be shared without permission.



