From documents to intelligence — built your way.

An open-source, API-first knowledge extraction engine that converts any financial document format into queryable knowledge graphs.

Built for data engineers, quants, and fintech teams, MarketFoundry bridges the gap between documents in virtual data rooms (VDRs), regulatory filings, and internal reports, and the RAG systems, dashboards, and decision tools that need to reason over them.

Watch a Document Become a Graph

From Source Document to Queryable Graph

Scroll through a sample financial document, then explore the extracted knowledge graph with zoom controls and selectable nodes.

Input Document
ADBE Q3FY25 Earnings Script and Slides.pdf
Open PDF in new tab ↗
What MarketFoundry reads: headers, sections, financial entities, causal language, product references, and numerical claims grounded in the source text.
Output Knowledge Graph
Node legend: Company · Metric · Product · Theme

Example node: Adobe (Company). The central entity extracted from the earnings call; other nodes represent products, metrics, strategic themes, and financial outcomes linked back to the document.

Why It Matters

The Challenge

Traditional analysts spend hours manually reading through filings, transcripts, and reports to piece together what's driving revenue, risk, and growth. It's slow, it doesn't scale, and it keeps valuable context out of the tools that actually make decisions.

The obvious shortcut is to paste documents into a large language model (LLM) and ask it to extract the key facts. But LLMs are generative by nature: they interpolate, paraphrase, and sometimes confabulate details that sound plausible but cannot be traced back to the source. There is no guarantee that a relationship an LLM describes actually appears in the document, no structured output you can query, no audit trail, and no way to verify what was grounded in text versus inferred from pretraining. For financial workflows where a misattributed figure or fabricated causal link carries real consequences, that is not an acceptable tradeoff.

The firms building the next generation of financial intelligence know that unstructured text is the last great unlocked data layer. The context buried in earnings calls, filings, and internal reports holds the ground truth that numbers alone cannot explain. The next leap in Natural Language Processing (NLP) is grounding language models in structured, verifiable, domain-specific knowledge so they stop hallucinating and start reasoning. MarketFoundry is part of that infrastructure layer: turning the narrative behind financial decisions into something machines can actually work with.

Document-aware extraction, built for modern financial pipelines

MarketFoundry classifies each document, sections it intelligently, and routes it to specialized schemas that extract entities, relationships, and causal structures, where present. The result is a knowledge graph (a connected map of facts and relationships across your entire document set) that your existing tools can actually work with.

Same RAG system, sharper retrieval. Same dashboard, richer context. Built open-source so you can integrate it, extend it, and make it your own.

Turn Unstructured Documents Into a Structured Knowledge Graph

System Scope

What MarketFoundry Does

  • Document-type classification (5 types)
  • Dynamic topic sectioning & YAML schema generation
  • Schema-guided triple extraction via OneKE (a triple is a structured fact: Subject, Predicate, Object; for example, Interest Rate Hike triggered Market Volatility)
  • Neo4j knowledge graph construction (a graph database where entities become nodes and relationships become queryable edges)
  • Multi-format ingestion (PDF, DOCX, HTML, JSON, TXT)
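A triple like the example above can be sketched as a small Python structure. This is an illustrative sketch: the field names and the validity check are assumptions for explanation, not the exact OneKE output format.

```python
# Illustrative triple structure; field names are assumptions, not the
# exact OneKE output format.
triple = {
    "subject": "Interest Rate Hike",
    "predicate": "triggered",
    "object": "Market Volatility",
    "source": "document.pdf",  # citation back to the source document
}

def is_valid_triple(t: dict) -> bool:
    """A triple is usable for graph ingestion when subject, predicate,
    and object are all non-empty strings."""
    return all(isinstance(t.get(k), str) and t[k]
               for k in ("subject", "predicate", "object"))

print(is_valid_triple(triple))  # True
```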

What MarketFoundry Doesn't Do

  • Query UI or chat interface
  • Real-time or streaming data feeds
  • Post-hoc ontology normalization
  • Cross-document co-reference resolution at scale

How It Works

MarketFoundry uses a document-aware pipeline that intelligently processes each financial document type to extract the most valuable causal relationships.

1

Document Ingestion

Upload PDF, DOCX, HTML, JSON, or TXT files. MarketFoundry handles the major financial document types: earnings calls, SEC filings, press releases, research reports, and news articles.

2

Document Classification

An ensemble classifier model automatically detects document type and routes it to the appropriate extraction schema. No manual configuration needed.

3

Intelligent Sectioning

LLMs extract topics dynamically and auto-generate YAML templates tailored to each section, preserving the context that generic text chunking would lose.
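An auto-generated template for one section might look like the sketch below. The topic, entity types, and relation types shown are illustrative assumptions for an earnings-call revenue section, not the exact format the pipeline emits.

```yaml
# Illustrative auto-generated schema for a "revenue performance" section.
section_topic: revenue_performance
entity_types:
  - Company
  - Metric
  - Product
relation_types:
  - reports
  - operates_product
  - drives_growth
```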

4

Schema Routing

Each section is matched to a specialized extraction schema based on its topic, ensuring the right relationships get captured for the right content.

5

Extraction

OneKE (a schema-guided knowledge extraction framework) paired with Qwen2.5-7B extracts structured triples (subject, predicate, object) with source citations, so every insight is traceable back to its document.

6

Knowledge Graph

All triples are stored in Neo4j, creating a unified graph across your entire document corpus. Query it directly, plug it into your RAG system, or pipe it into your analytics stack.
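A sketch of how a triple could become a Neo4j MERGE statement. The `Entity` label and `name` property are assumptions about the graph schema, not MarketFoundry's actual data model, and production code should pass values as Cypher parameters rather than interpolating strings.

```python
def triple_to_cypher(subject: str, predicate: str, obj: str) -> str:
    """Build a Cypher MERGE statement for one triple (illustrative schema).
    Cypher relationship types cannot contain spaces, so the predicate is
    normalized to upper snake case. In real code, use query parameters
    instead of string interpolation to avoid injection."""
    rel = predicate.strip().upper().replace(" ", "_")
    return (
        f'MERGE (s:Entity {{name: "{subject}"}}) '
        f'MERGE (o:Entity {{name: "{obj}"}}) '
        f"MERGE (s)-[:{rel}]->(o)"
    )

stmt = triple_to_cypher("Interest Rate Hike", "triggered", "Market Volatility")
print(stmt)
```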

Results

MarketFoundry was evaluated across two independent experiments: the stacked ensemble classifier and the YAML dynamic schema generation pipeline. Each is reported separately below.

Stacked Classifier Performance

Our ensemble classifier was evaluated on a curated corpus of 71 labeled financial documents spanning earnings call transcripts, SEC filings, research papers, press releases, and news articles, published between 2008 and 2025 and ranging from 2 to 290 pages. The system achieved 78.9% overall classification accuracy.

78.9% Overall Classification Accuracy
0.802 Macro F1 Score

A Macro F1 of 0.802 reflects strong average performance across all five document classes weighted equally, meaning gains on high-frequency classes like SEC filings do not mask weaker recall on underrepresented ones like press releases.

Classifier Performance by Document Type

Document Type  | Precision | Recall | F1 Score | Support
Earnings Call  | 0.923     | 0.923  | 0.923    | 13
Research Paper | 0.875     | 1.000  | 0.933    | 7
SEC Filing     | 1.000     | 0.947  | 0.973    | 19
News Article   | 0.500     | 0.917  | 0.647    | 12
Press Release  | 0.800     | 0.400  | 0.533    | 20
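The Macro F1 of 0.802 is the unweighted mean of the five per-class F1 scores reported above; a quick arithmetic check:

```python
# Per-class F1 scores from the classifier evaluation.
f1_scores = {
    "Earnings Call": 0.923,
    "Research Paper": 0.933,
    "SEC Filing": 0.973,
    "News Article": 0.647,
    "Press Release": 0.533,
}

# Macro F1 treats every class equally, regardless of support,
# so weak classes cannot hide behind strong high-frequency ones.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(round(macro_f1, 3))  # 0.802
```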

⚠️ Press release recall (40%) is a known limitation: press releases are drafted to mimic journalistic style, causing their embeddings to overlap with news articles in vector space. Future versions will add keyword-based override rules (e.g., wire service datelines) to resolve this.
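The keyword-override idea could be sketched as a simple rule. The wire-service patterns below are illustrative examples, not a curated or tested list; a production rule set would need tuning against real corpora.

```python
import re

# Hypothetical wire-service dateline markers (illustrative, not exhaustive).
WIRE_PATTERNS = [
    r"\(BUSINESS WIRE\)",
    r"\(GLOBE NEWSWIRE\)",
    r"PR Newswire",
    r"/PRNewswire/",
]

def looks_like_press_release(text: str) -> bool:
    """Override the embedding-based classifier when a wire-service
    dateline strongly signals a press release. Datelines appear near
    the top of the document, so only the head is scanned."""
    head = text[:500]
    return any(re.search(p, head, re.IGNORECASE) for p in WIRE_PATTERNS)

print(looks_like_press_release(
    "SAN JOSE, Calif.--(BUSINESS WIRE)--Adobe today reported record revenue..."
))  # True
```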

Why a Stacked Approach

TF-IDF logistic regression excels at detecting document-type-specific vocabulary patterns, such as the dense regulatory language of SEC filings or the Q&A structure of earnings calls. However, it is sensitive to surface-level noise: scanned PDFs often carry OCR artifacts, inconsistent spacing, and garbled characters that distort term frequency signals. The k-nearest neighbors model, which scores each document by its similarity to reference examples of each class, is more robust to this noise because it encodes meaning rather than exact tokens. Instead of directly comparing raw probabilities and cosine similarity scores (which are not on the same scale), their outputs are passed into a trained stacker that produces the final multiclass prediction. If the document remains ambiguous between a press release and a news article, a specialized binary reranker refines only that boundary using a small set of hand-engineered signals. This two-stage design isolates the hardest class pair without disrupting the general classifier.
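A minimal sketch of the stacking step, assuming the two base models expose per-class probabilities and cosine similarities (pure Python; the scores below are made up for illustration, and the real pipeline uses trained scikit-learn-style models):

```python
CLASSES = ["earnings_call", "research_paper", "sec_filing",
           "news_article", "press_release"]

def stack_features(lr_probs: list, knn_sims: list) -> list:
    """Concatenate TF-IDF logistic-regression probabilities with k-NN
    cosine similarities into one meta-feature vector. The trained stacker
    consumes this vector, so the two signals never have to share a scale."""
    assert len(lr_probs) == len(knn_sims) == len(CLASSES)
    return list(lr_probs) + list(knn_sims)

# Illustrative scores for a document ambiguous between
# press release and news article.
lr_probs = [0.02, 0.01, 0.05, 0.48, 0.44]
knn_sims = [0.31, 0.28, 0.35, 0.72, 0.70]
features = stack_features(lr_probs, knn_sims)
print(len(features))  # 10
```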

Extraction Pipeline and Knowledge Graph

964 Unique Entities
423 Distinct Relationship Types
112 Documents Processed (YAML Pipeline)

How Extraction is Guided

OneKE Extraction Prompt

Keeps the model focused on factual relationships directly stated in the source text.

  • Extracts concrete financial relationships only: results, ownership, products, partnerships, regulatory events
  • No inferred causation, sentiment, or analytical conclusions
  • Subjects and objects must be real-world entities: companies, metrics, products
  • Output enforced as strict JSON for direct graph ingestion

YAML Generator Prompt

Generates document-specific extraction plans from each document's topic map.

  • Defines entity types and relation types tailored to that document's content
  • Entity types must be stable real-world categories, no abstract concepts
  • Number of entity and relation types is capped to reduce fragmentation
  • Multiple plans generated per document for broad topical coverage

Extraction Performance vs. Frontier Models

We benchmarked MarketFoundry against three state-of-the-art models — Gemini 3 Flash, GPT 5.3, and Sonnet 4.6 — using the same task: extract structured knowledge triples from a Capital One SEC Form 8-K. All models received the same system prompt designed for our pipeline.

5.2× More Triples than Gemini 3 Flash
2.1× More Triples than Sonnet 4.6
More Unique Relation Types than Sonnet 4.6
Model / Pipeline         | Total Triples | Volume vs. MarketFoundry | Unique Relation Types
MarketFoundry (Qwen3-4B) | 68            | 100%                     | 18
Sonnet 4.6               | 32            | 47%                      | 6
GPT 5.3                  | 22            | 32%                      | 11
Gemini 3 Flash           | 13            | 19%                      | 8

High-Density Extraction

Frontier models tend to skim for a narrative summary. MarketFoundry's structured pipeline extracts the full data layer of the document, including sections like Forward-Looking Statements and Risk Factors that general-purpose models largely ignored, yielding over 20 additional triples from those sections alone.

Structural Awareness

MarketFoundry was the only system to capture document-internal structure, for example linking Exhibit 2.1 as a named entity to the Agreement and Plan of Merger. No frontier model identified or mapped this relationship.

Lightweight Architecture

These results were achieved with a 4B parameter model. Swapping in a larger backend would likely push extraction quality further, particularly for complex financial nuances and niche administrative identifiers.

* Benchmark conducted on the Capital One SEC Form 8-K using identical system prompts across all models. Triple counts reflect valid structured outputs only.

What We Built vs. What We Used

Original Work (Built by Our Team)

  • Stacked ensemble document classifier (TF-IDF logistic regression + KNN)
  • Document-type-aware pipeline routing logic
  • Dynamic topic extraction and multi-perspective YAML schema generation
  • Engineering robustness layer: JSON validation, fallback schemas, edge-case handling
  • Incremental result persistence and reproducibility tooling

Third-Party Tools (Integrated and Extended)

  • OneKE, base schema-guided extraction framework (Luo et al., 2025)
  • Qwen2.5-7B and Meta-Llama, open-source LLM inference
  • Neo4j, graph database and storage layer
  • SentenceTransformers, semantic embeddings for the k-NN classifier
  • Modal, serverless GPU infrastructure for API hosting

Limitations

Open-schema design produced 423 distinct relationship types across 15 documents. Because schemas are generated dynamically per document, the same underlying relationship can appear under slightly different names across files: one document may produce revenue_growth while another produces revenue_increase or revenue_acceleration for an identical fact. These are treated as separate edge types in the graph, so nodes that should be connected are not, reducing cross-document comparability and limiting the graph's ability to aggregate patterns without a normalization layer.
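A normalization layer of the kind described in Next Steps might collapse near-duplicate predicates with a canonical mapping. The sketch below is hypothetical: a real layer would more likely cluster predicate embeddings than hand-maintain a lookup table.

```python
# Hypothetical canonical map for near-duplicate relation types.
CANONICAL = {
    "revenue_growth": "revenue_growth",
    "revenue_increase": "revenue_growth",
    "revenue_acceleration": "revenue_growth",
}

def normalize_predicate(predicate: str) -> str:
    """Map predicate variants onto one canonical edge type;
    pass unknown predicates through unchanged."""
    return CANONICAL.get(predicate, predicate)

print(normalize_predicate("revenue_increase"))  # revenue_growth
print(normalize_predicate("operates_product"))  # operates_product
```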
Dynamic YAML generation succeeded for 47.3% of documents. Long regulatory filings and earnings call transcripts caused context-window saturation, leading the LLM to time out or return malformed JSON. The pipeline catches this and falls back to a pre-built general-purpose config for that document type, so no document is dropped, but the extraction schema is less tailored than what dynamic generation would have produced.
In our tests, topic extraction uses Qwen2.5-1.5B-Instruct and triple extraction uses Qwen3-4B, both selected to fit within our consumer-grade memory constraints. This limited our ability to benchmark larger alternatives. Upgrading to models such as Qwen2.5-72B, Llama 3.3-70B, or GPT-4o would likely improve coreference resolution, relation naming consistency, and extraction depth on complex long-form documents.
Schema-guided prompting substantially reduces hallucinated outputs, but cannot eliminate them entirely. Three failure modes were observed in practice:

  • Fabricated triples: the model occasionally extracts a relationship that sounds financially plausible but has no grounding in the source text, particularly in dense filings with forward-looking language.
  • Entity boundary errors: the model splits or merges entities incorrectly, for example treating Q3 2024 revenue and Q3 revenue as separate nodes when they refer to the same fact.
  • Predicate inversion: subject and object roles occasionally get swapped, producing a triple with the correct entities but the wrong causal direction.
MarketFoundry is fully open-source and free to run locally. Hosted API access is limited by infrastructure costs: we run GPU inference on Modal.com using Nvidia A10 GPUs at $1.10 per hour. We would love to expand API availability and throughput as the project grows, but current usage is constrained by budget.

Next Steps

1. Ontology Normalization: Build a post-processing consolidation layer to reduce schema fragmentation and improve cross-document comparability of relationship types.
2. Two-Stage Extraction: Adopt schema caching strategies to improve YAML generation stability on long regulatory filings and earnings call transcripts.
3. Stronger Schema Enforcement: Add a validation layer to prevent the Reflection Agent from renaming relation labels outside the defined schema, reducing hallucination-driven inconsistencies.
4. Larger Model Evaluation: Test extraction quality on larger hosted models to assess gains in coreference resolution, relation naming consistency, and coverage on complex documents.
5. Expanded API Availability: Scale hosted GPU capacity on Modal.com beyond the current Nvidia A10 tier to increase API throughput and reduce latency for end users.

Sample Outputs

See how MarketFoundry transforms real financial documents into structured knowledge graphs.

⚠️ Heads up: MarketFoundry uses LLMs for extraction. While we do our best to constrain outputs through schema-guided prompting, hallucinations can still occur. Treat extracted triples as a starting point for analysis, not as a substitute for source verification.

Private Equity Acquisition Data Room

Private Equity Acquisition Data Room Knowledge Graph

Connections Extracted Across Documents:

  • Revenue Concentration Risk → disclosed_in → Customer Contract Footnote (Doc 47)
  • Pending Litigation → impacts_valuation_of → EBITDA Projection (Doc 12)
  • Key-Man Clause → constrains → Post-Acquisition Integration Plan

PayPal Earnings Call

PayPal Earnings Call Knowledge Graph

Extracted Triples:

  • PayPal Holdings, Inc. → offers → agentic commerce services
  • PayPal Holdings, Inc. → operates_product → Buy Now, Pay Later
  • Buy Now, Pay Later → achieves → NPS of 80

Doom Loop News Article

Doom Loop Article Knowledge Graph

Causal Relationships:

  • Trade wars → exacerbates → Financial panic
  • Financial panic → triggers → Inflation
  • Inflation → increases → Global inequality

Stock Portfolio Optimization

Stock Portfolio Optimization Research Paper Knowledge Graph

Cross-Document Insights:

  • Portfolio → optimizes_for → maximizing return on investment
  • MVO-based Neural Network → achieves → dominating performance over other models
  • Dataset → consists_of → daily information from Yahoo Finance

API Access

Developers can integrate the knowledge extraction pipeline directly through our REST API. The endpoint accepts a document and automatically extracts knowledge graph triples and stores them in a Neo4j database. Click the link below to view the API documentation and get started:

⚠️ Note on document size: The API is optimized for documents up to ~50 pages. Very large documents (100+ pages) may exceed GPU memory limits and fail to process. If you need to process large documents, consider splitting them into sections before submitting, or deploy your own instance on Modal with a larger GPU (e.g. A100 80GB).

Example Request

curl -X POST "https://marija-vukic--market-foundry-api-fastapi-app.modal.run/process" \
-F "file=@/path/to/your/document.pdf" \
-F "neo4j_uri=neo4j+s://xxxx.databases.neo4j.io" \
-F "neo4j_username=neo4j" \
-F "neo4j_password=yourpassword"
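The same request can be assembled in Python with the third-party requests library. The form fields mirror the curl example; the Neo4j credentials are placeholders, the file bytes are a stand-in for a real document, and the network call is left commented out.

```python
# Requires the third-party `requests` package: pip install requests
API_URL = "https://marija-vukic--market-foundry-api-fastapi-app.modal.run/process"

# Form fields mirroring the curl example; replace the placeholders
# with your own Neo4j credentials.
data = {
    "neo4j_uri": "neo4j+s://xxxx.databases.neo4j.io",
    "neo4j_username": "neo4j",
    "neo4j_password": "yourpassword",
}

# Multipart file payload: (filename, file bytes, content type).
# In real use, read your document's bytes from disk.
files = {"file": ("document.pdf", b"", "application/pdf")}

# import requests
# response = requests.post(API_URL, data=data, files=files)
# response.raise_for_status()
# print(response.json())
```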

API Parameters & Functionality

Parameters

  • file – Document to process (PDF, DOCX, HTML, JSON)
  • neo4j_uri – URI for the Neo4j database
  • neo4j_username – Neo4j database username
  • neo4j_password – Neo4j database password

What the API Does

  • Classifies the document type automatically
  • Extracts entities, relationships, and causal structures
  • Constructs a knowledge graph with triples
  • Stores the graph in your Neo4j database

Demo

See MarketFoundry in action: from document upload to knowledge graph query. Two paths, same pipeline.

Set Up Locally

Clone the repo, configure your environment, and run the full pipeline on your own documents using Conda or Docker.

Use Our API

Send documents directly to the MarketFoundry API and receive structured knowledge graph triples in return, no local setup required.

Which Should You Use?

Run It Locally

Open Source

Running the pipeline locally allows you to inspect intermediate outputs, modify prompts or models, and experiment with different extraction configurations while keeping all documents on your own machine.

  • Working with sensitive or confidential documents that cannot leave your environment
  • Full control over model selection, schema configuration, and graph storage
  • Integrating MarketFoundry into an existing internal pipeline
  • You have GPU access and want to avoid per-request API costs
  • You're a researcher or developer extending the system for a new domain

Use the API

Hosted

The API provides a centralized deployment where you can submit documents and receive structured extraction results without installing dependencies or managing the full pipeline locally. It also supports more scalable processing for larger workloads or multi-user scenarios.

  • Easier integration with external applications and data pipelines
  • Prototyping a connection to a dashboard, RAG system, or data platform
  • No local GPU access or prefer not to manage model dependencies
  • Processing larger document volumes without local infrastructure
  • Note: hosted availability is subject to GPU budget constraints on Modal.com

Quickstart

# Clone the repository
git clone https://github.com/jessicabat/market-foundry
cd market-foundry

View Full Quickstart Guide

We welcome you to come see us present at Session One of our HDSI Capstone Showcase on March 13 at the Price Center East Ballroom, UC San Diego.

About Us

Matthew Wong headshot

Matthew Wong

Jessica Batbayar headshot

Jessica Batbayar

Marija Vukic headshot

Marija Vukic