From documents to intelligence — built your way.
An open-source, API-first knowledge extraction engine that converts any financial document format into queryable knowledge graphs.
Built for data engineers, quants, and fintech teams. MarketFoundry bridges the gap between source documents (virtual data rooms, regulatory filings, internal reports) and the RAG systems, dashboards, and decision tools that need to reason over them.
Watch a Document Become a Graph
Scroll through a sample financial document, then explore the extracted knowledge graph with zoom controls and selectable nodes.
Traditional analysts spend hours manually reading through filings, transcripts, and reports to piece together what's driving revenue, risk, and growth. It's slow, it doesn't scale, and it keeps valuable context out of the tools that actually make decisions.
The obvious shortcut is to paste documents into a large language model (LLM) and ask it to extract the key facts. But LLMs are generative by nature: they interpolate, paraphrase, and sometimes confabulate details that sound plausible but cannot be traced back to the source. There is no guarantee that a relationship an LLM describes actually appears in the document, no structured output you can query, no audit trail, and no way to verify what was grounded in text versus inferred from pretraining. For financial workflows where a misattributed figure or fabricated causal link carries real consequences, that is not an acceptable tradeoff.
The firms building the next generation of financial intelligence know that unstructured text is the last great unlocked data layer. The context buried in earnings calls, filings, and internal reports holds the ground truth that numbers alone cannot explain. The next leap in Natural Language Processing (NLP) is grounding language models in structured, verifiable, domain-specific knowledge so they stop hallucinating and start reasoning. MarketFoundry is part of that infrastructure layer: turning the narrative behind financial decisions into something machines can actually work with.
MarketFoundry classifies each document, sections it intelligently, and routes it to specialized schemas that extract entities, relationships, and, where present, causal structures. The result is a knowledge graph (a connected map of facts and relationships across your entire document set) that your existing tools can actually work with.
Same RAG system, sharper retrieval. Same dashboard, richer context. Built open-source so you can integrate it, extend it, and make it your own.
Turn Unstructured Documents Into a Structured Knowledge Graph
MarketFoundry uses a document-aware pipeline that intelligently processes each financial document type to extract the most valuable causal relationships.
Upload PDF, DOCX, HTML, JSON, or TXT files. MarketFoundry handles any financial document format: earnings calls, SEC filings, press releases, research reports, and news articles.
An ensemble classifier model automatically detects document type and routes it to the appropriate extraction schema. No manual configuration needed.
LLMs extract topics dynamically and auto-generate YAML templates tailored to each section, preserving the context that generic text chunking would lose.
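As an illustration, an auto-generated template for a revenue-discussion section might look like the fragment below. The field names here are hypothetical, chosen to show the idea rather than the pipeline's exact schema:

```yaml
# Illustrative auto-generated section template (field names are hypothetical)
section: "Revenue Discussion"
topics:
  - revenue_growth
  - segment_performance
schema:
  subject_types: [Company, BusinessSegment]
  predicates: [reported_revenue, grew_by, driven_by]
  object_types: [MonetaryAmount, Percentage, Driver]
require_citation: true   # every extracted triple must cite its source span
```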
Each section is matched to a specialized extraction schema based on its topic, ensuring the right relationships get captured for the right content.
OneKE (a schema-guided knowledge extraction framework) paired with Qwen2.5-7B extracts structured (subject, predicate, object) triples, each with a source citation so every insight is traceable back to its document.
All triples are stored in Neo4j, creating a unified graph across your entire document corpus. Query it directly, plug it into your RAG system, or pipe it into your analytics stack.
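As a sketch of how extracted triples can be loaded into the graph, the snippet below turns one (subject, predicate, object) triple into a parameterized Cypher `MERGE` statement. The node label `Entity` and the `name`/`source` properties are illustrative assumptions, not MarketFoundry's exact graph model:

```python
def triple_to_cypher(subject, predicate, obj, source):
    """Build a parameterized Cypher MERGE for one extracted triple.

    The :Entity label and name/source properties are illustrative,
    not MarketFoundry's actual graph schema.
    """
    # Relationship types cannot be parameterized in Cypher, so the
    # predicate is sanitized and interpolated into the query string.
    rel = "".join(c if c.isalnum() else "_" for c in predicate).upper()
    query = (
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        f"MERGE (s)-[r:{rel}]->(o) "
        "SET r.source = $source"
    )
    params = {"subject": subject, "object": obj, "source": source}
    return query, params

query, params = triple_to_cypher(
    "Capital One", "acquired", "Discover", "8-K, Item 1.01"
)
```

With the official `neo4j` Python driver, the pair would be executed as `session.run(query, **params)` against your own instance.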
MarketFoundry was evaluated across two independent experiments: the stacked ensemble classifier and the YAML dynamic schema generation pipeline. Each is reported separately below.
Our ensemble classifier was evaluated on a curated corpus of 71 labeled financial documents spanning earnings call transcripts, SEC filings, research papers, press releases, and news articles, covering publications from 2008 to 2025, ranging from 2 to 290 pages. The system achieved 78.9% overall classification accuracy.
A Macro F1 of 0.802 reflects strong average performance across all five document classes weighted equally, meaning gains on high-frequency classes like SEC filings do not mask weaker recall on underrepresented ones like press releases.
| Document Type | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| Earnings Call | 0.923 | 0.923 | 0.923 | 13 |
| Research Paper | 0.875 | 1.000 | 0.933 | 7 |
| SEC Filing | 1.000 | 0.947 | 0.973 | 19 |
| News Article | 0.500 | 0.917 | 0.647 | 12 |
| Press Release | 0.800 | 0.400 | 0.533 | 20 |
⚠️ Press release recall (40%) is a known limitation: press releases are drafted to mimic journalistic style, causing their embeddings to overlap with news articles in vector space. Future versions will add keyword-based override rules (e.g., wire service datelines) to resolve this.
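The Macro F1 above is simply the unweighted mean of the five per-class F1 scores in the table, which is easy to verify:

```python
# Per-class F1 scores from the table above
f1 = {
    "Earnings Call": 0.923,
    "Research Paper": 0.933,
    "SEC Filing": 0.973,
    "News Article": 0.647,
    "Press Release": 0.533,
}

# Macro averaging weights every class equally, regardless of support
macro_f1 = sum(f1.values()) / len(f1)
print(round(macro_f1, 3))  # 0.802
```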
TF-IDF logistic regression excels at detecting document-type-specific vocabulary patterns, such as the dense regulatory language of SEC filings or the Q&A structure of earnings calls. However, it is sensitive to surface-level noise: scanned PDFs often carry OCR artifacts, inconsistent spacing, and garbled characters that distort term frequency signals. The k-nearest neighbors model, which scores each document by its similarity to reference examples of each class, is more robust to this noise because it encodes meaning rather than exact tokens. Instead of directly comparing raw probabilities and cosine similarity scores (which are not on the same scale), their outputs are passed into a trained stacker that produces the final multiclass prediction. If the document remains ambiguous between a press release and a news article, a specialized binary reranker refines only that boundary using a small set of hand-engineered signals. This two-stage design isolates the hardest class pair without disrupting the general classifier.
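A minimal sketch of this stacking design with scikit-learn is shown below on a toy corpus. For simplicity both branches use TF-IDF features here, whereas the real kNN branch runs on semantic embeddings, and the binary press-release/news reranker is a separate second stage not shown:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the labeled training set (4 docs per class).
docs = [
    "Good afternoon and welcome to the Q3 earnings call, our CFO will review results.",
    "Operator: we will now open the line for analyst questions on quarterly revenue.",
    "Thank you operator, revenue grew twelve percent this quarter on card volume.",
    "Our next question comes from the line of an analyst covering the bank.",
    "FOR IMMEDIATE RELEASE: Acme Corp today announced a definitive merger agreement.",
    "NEW YORK (Business Wire) Acme Corp today launches its new product line.",
    "Acme Corp today announced the appointment of a new chief executive officer.",
    "FOR IMMEDIATE RELEASE: Acme Corp declares a quarterly dividend of ten cents.",
]
labels = ["earnings_call"] * 4 + ["press_release"] * 4

# Each base model emits class probabilities; a trained logistic-regression
# stacker combines them into the final multiclass prediction.
stack = StackingClassifier(
    estimators=[
        ("tfidf_lr", make_pipeline(TfidfVectorizer(),
                                   LogisticRegression(max_iter=1000))),
        ("knn", make_pipeline(TfidfVectorizer(),
                              KNeighborsClassifier(n_neighbors=3))),
    ],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
    cv=2,  # small fold count for the toy corpus
)
stack.fit(docs, labels)
pred = stack.predict(["The operator opened the line for questions after CFO remarks."])
```

Passing probabilities (rather than hard labels) into the stacker is what sidesteps the scale mismatch between raw logistic-regression probabilities and cosine-similarity scores.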
Keeps the model focused on factual relationships directly stated in the source text.
Generates document-specific extraction plans from each document's topic map.
We benchmarked MarketFoundry against three state-of-the-art models — Gemini 3 Flash, GPT 5.3, and Sonnet 4.6 — using the same task: extract structured knowledge triples from a Capital One SEC Form 8-K. All models received the same system prompt designed for our pipeline.
| Model / Pipeline | Total Triples | Volume vs. MarketFoundry | Unique Relation Types |
|---|---|---|---|
| MarketFoundry (Qwen3-4B) | 68 | 100% | 18 |
| Sonnet 4.6 | 32 | 47% | 6 |
| GPT 5.3 | 22 | 32% | 11 |
| Gemini 3 Flash | 13 | 19% | 8 |
Frontier models tend to skim for a narrative summary. MarketFoundry's structured pipeline extracts the full data layer of the document, including sections like Forward-Looking Statements and Risk Factors that general-purpose models largely ignored, yielding over 20 additional triples from those sections alone.
MarketFoundry was the only system to capture document-internal structure. For example, linking Exhibit 2.1 as a named entity to the Agreement and Plan of Merger. No frontier model identified or mapped this relationship.
These results were achieved with a 4B parameter model. Swapping in a larger backend would likely push extraction quality further, particularly for complex financial nuances and niche administrative identifiers.
* Benchmark conducted on the Capital One SEC Form 8-K using identical system prompts across all models. Triple counts reflect valid structured outputs only.
Extraction output is not fully normalized: one run may emit revenue_growth while another produces revenue_increase or revenue_acceleration for an identical fact. These are treated as separate edge types in the graph, so nodes that should be connected are not, reducing cross-document comparability and limiting the graph's ability to aggregate patterns without a normalization layer. Entity naming has the same issue: the pipeline can create Q3 2024 revenue and Q3 revenue as separate nodes when they refer to the same fact.
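One mitigation is a canonicalization pass over predicates before graph insertion. The sketch below uses a hand-maintained synonym map, which is an assumption for illustration, not part of the current pipeline; a real normalization layer might use embedding similarity or an ontology instead:

```python
# Hypothetical synonym map; not part of the current MarketFoundry pipeline.
CANONICAL = {
    "revenue_increase": "revenue_growth",
    "revenue_acceleration": "revenue_growth",
}

def normalize_predicate(predicate: str) -> str:
    """Map near-duplicate relation names onto one canonical edge type."""
    key = predicate.strip().lower().replace(" ", "_")
    return CANONICAL.get(key, key)

print(normalize_predicate("Revenue Increase"))  # revenue_growth
```

Running this before the Neo4j write step would merge the three revenue relations above into a single edge type, restoring cross-document comparability.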
See how MarketFoundry transforms real financial documents into structured knowledge graphs.
Developers can integrate the knowledge extraction pipeline directly through our REST API. The endpoint accepts a document, automatically extracts knowledge graph triples, and stores them in a Neo4j database. Click the link below to view the API documentation and get started:
curl -X POST "https://marija-vukic--market-foundry-api-fastapi-app.modal.run/process" \
-F "file=@/path/to/your/document.pdf" \
-F "neo4j_uri=neo4j+s://xxxx.databases.neo4j.io" \
-F "neo4j_username=neo4j" \
-F "neo4j_password=yourpassword"
See MarketFoundry in action: from document upload to knowledge graph query. Two paths, same pipeline.
Clone the repo, configure your environment, and run the full pipeline on your own documents using Conda or Docker.
Send documents directly to the MarketFoundry API and receive structured knowledge graph triples in return; no local setup required.
Running the pipeline locally allows you to inspect intermediate outputs, modify prompts or models, and experiment with different extraction configurations while keeping all documents on your own machine.
The API provides a centralized deployment where you can submit documents and receive structured extraction results without installing dependencies or managing the full pipeline locally. It also supports more scalable processing for larger workloads or multi-user scenarios.
# Clone the repository
git clone https://github.com/jessicabat/market-foundry
cd market-foundry
We welcome you to come see us present at Session One of our HDSI Capstone Showcase on March 13 at the Price Center East Ballroom, UC San Diego.