From documents to intelligence — built your way.

An open-source, API-first knowledge extraction engine that converts any financial document format into queryable knowledge graphs.

Built for data engineers, quants, and fintech teams, MarketFoundry bridges the gap between documents in virtual data rooms (VDRs), regulatory filings, and internal reports, and the RAG systems, dashboards, and decision tools that need to reason over them.

Watch a Document Become a Graph

From Source Document to Queryable Graph

Scroll through a sample financial document, then explore the extracted knowledge graph with zoom controls and selectable nodes.

Input Document
ADBE Q3FY25 Earnings Script and Slides.pdf
Open PDF in new tab ↗
What MarketFoundry reads: headers, sections, financial entities, causal language, product references, and numerical claims grounded in the source text.
Output Knowledge Graph
Node legend: Company · Metric · Product · Theme

Example node: Adobe (Company). The central entity extracted from the earnings call; other nodes represent products, metrics, strategic themes, and financial outcomes linked back to the document.

Why It Matters

The Challenge

Traditional analysts spend hours manually reading through filings, transcripts, and reports to piece together what's driving revenue, risk, and growth. It's slow, it doesn't scale, and it keeps valuable context out of the tools that actually make decisions.

The obvious shortcut is to paste documents into a large language model (LLM) and ask it to extract the key facts. But LLMs are generative by nature: they interpolate, paraphrase, and sometimes confabulate details that sound plausible but cannot be traced back to the source. There is no guarantee that a relationship an LLM describes actually appears in the document, no structured output you can query, no audit trail, and no way to verify what was grounded in text versus inferred from pretraining. For financial workflows where a misattributed figure or fabricated causal link carries real consequences, that is not an acceptable tradeoff.

The firms building the next generation of financial intelligence know that unstructured text is the last great unlocked data layer. The context buried in earnings calls, filings, and internal reports holds the ground truth that numbers alone cannot explain. The next leap in Natural Language Processing (NLP) is grounding language models in structured, verifiable, domain-specific knowledge so they stop hallucinating and start reasoning. MarketFoundry is part of that infrastructure layer: turning the narrative behind financial decisions into something machines can actually work with.

Document-aware extraction, built for modern financial pipelines

MarketFoundry classifies each document, sections it intelligently, and routes it to specialized schemas that extract entities, relationships, and causal structures, where present. The result is a knowledge graph (a connected map of facts and relationships across your entire document set) that your existing tools can actually work with.

Same RAG system, sharper retrieval. Same dashboard, richer context. Built open-source so you can integrate it, extend it, and make it your own.

Turn Unstructured Documents Into a Structured Knowledge Graph

System Scope

What MarketFoundry Does

  • Document-type classification (5 types)
  • Dynamic topic sectioning & YAML schema generation
  • Schema-guided triple extraction via OneKE (a triple is a structured fact: Subject, Predicate, Object; for example, Interest Rate Hike triggered Market Volatility)
  • Neo4j knowledge graph construction (a graph database where entities become nodes and relationships become queryable edges)
  • Multi-format ingestion (PDF, DOCX, HTML, JSON, TXT)
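A triple like the example above can be sketched as a small Python structure. This is an illustrative sketch: the field names and the validity check are assumptions for explanation, not the exact OneKE output format.

```python
# Illustrative triple structure; field names are assumptions, not the
# exact OneKE output format.
triple = {
    "subject": "Interest Rate Hike",
    "predicate": "triggered",
    "object": "Market Volatility",
    "source": "document.pdf",  # citation back to the source document
}

def is_valid_triple(t: dict) -> bool:
    """A triple is usable for graph ingestion when subject, predicate,
    and object are all non-empty strings."""
    return all(isinstance(t.get(k), str) and t[k]
               for k in ("subject", "predicate", "object"))

print(is_valid_triple(triple))  # True
```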

What MarketFoundry Doesn't Do

  • Query UI or chat interface
  • Real-time or streaming data feeds
  • Post-hoc ontology normalization
  • Cross-document co-reference resolution at scale

How It Works

MarketFoundry uses a document-aware pipeline that intelligently processes each financial document type to extract the most valuable causal relationships.

1

Document Ingestion

Upload PDF, DOCX, HTML, JSON, or TXT files. MarketFoundry handles the major financial document types: earnings calls, SEC filings, press releases, research reports, and news articles.

2

Document Classification

An ensemble classifier model automatically detects document type and routes it to the appropriate extraction schema. No manual configuration needed.

3

Intelligent Sectioning

LLMs extract topics dynamically and auto-generate YAML templates tailored to each section, preserving the context that generic text chunking would lose.
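An auto-generated template for one section might look like the sketch below. The topic, entity types, and relation types shown are illustrative assumptions for an earnings-call revenue section, not the exact format the pipeline emits.

```yaml
# Illustrative auto-generated schema for a "revenue performance" section.
section_topic: revenue_performance
entity_types:
  - Company
  - Metric
  - Product
relation_types:
  - reports
  - operates_product
  - drives_growth
```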

4

Schema Routing

Each section is matched to a specialized extraction schema based on its topic, ensuring the right relationships get captured for the right content.

5

Extraction

OneKE (a schema-guided knowledge extraction framework) paired with Qwen2.5-7B extracts structured triples (subject, predicate, object) with source citations, so every insight is traceable back to its document.

6

Knowledge Graph

All triples are stored in Neo4j, creating a unified graph across your entire document corpus. Query it directly, plug it into your RAG system, or pipe it into your analytics stack.
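A sketch of how a triple could become a Neo4j MERGE statement. The `Entity` label and `name` property are assumptions about the graph schema, not MarketFoundry's actual data model, and production code should pass values as Cypher parameters rather than interpolating strings.

```python
def triple_to_cypher(subject: str, predicate: str, obj: str) -> str:
    """Build a Cypher MERGE statement for one triple (illustrative schema).
    Cypher relationship types cannot contain spaces, so the predicate is
    normalized to upper snake case. In real code, use query parameters
    instead of string interpolation to avoid injection."""
    rel = predicate.strip().upper().replace(" ", "_")
    return (
        f'MERGE (s:Entity {{name: "{subject}"}}) '
        f'MERGE (o:Entity {{name: "{obj}"}}) '
        f"MERGE (s)-[:{rel}]->(o)"
    )

stmt = triple_to_cypher("Interest Rate Hike", "triggered", "Market Volatility")
print(stmt)
```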

Results

MarketFoundry was evaluated across two independent experiments: the stacked ensemble classifier and the YAML dynamic schema generation pipeline. Each is reported separately below.

Stacked Classifier Performance

Our ensemble classifier was evaluated on a curated corpus of 71 labeled financial documents spanning earnings call transcripts, SEC filings, research papers, press releases, and news articles, published between 2008 and 2025 and ranging from 2 to 290 pages. The system achieved 78.9% overall classification accuracy.

78.9% Overall Classification Accuracy
0.802 Macro F1 Score

A Macro F1 of 0.802 reflects strong average performance across all five document classes weighted equally, meaning gains on high-frequency classes like SEC filings do not mask weaker recall on underrepresented ones like press releases.

Classifier Performance by Document Type

Document Type  | Precision | Recall | F1 Score | Support
Earnings Call  | 0.923     | 0.923  | 0.923    | 13
Research Paper | 0.875     | 1.000  | 0.933    | 7
SEC Filing     | 1.000     | 0.947  | 0.973    | 19
News Article   | 0.500     | 0.917  | 0.647    | 12
Press Release  | 0.800     | 0.400  | 0.533    | 20
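The Macro F1 of 0.802 is the unweighted mean of the five per-class F1 scores reported above; a quick arithmetic check:

```python
# Per-class F1 scores from the classifier evaluation.
f1_scores = {
    "Earnings Call": 0.923,
    "Research Paper": 0.933,
    "SEC Filing": 0.973,
    "News Article": 0.647,
    "Press Release": 0.533,
}

# Macro F1 treats every class equally, regardless of support,
# so weak classes cannot hide behind strong high-frequency ones.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(round(macro_f1, 3))  # 0.802
```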

⚠️ Press release recall (40%) is a known limitation: press releases are drafted to mimic journalistic style, causing their embeddings to overlap with news articles in vector space. Future versions will add keyword-based override rules (e.g., wire service datelines) to resolve this.
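The keyword-override idea could be sketched as a simple rule. The wire-service patterns below are illustrative examples, not a curated or tested list; a production rule set would need tuning against real corpora.

```python
import re

# Hypothetical wire-service dateline markers (illustrative, not exhaustive).
WIRE_PATTERNS = [
    r"\(BUSINESS WIRE\)",
    r"\(GLOBE NEWSWIRE\)",
    r"PR Newswire",
    r"/PRNewswire/",
]

def looks_like_press_release(text: str) -> bool:
    """Override the embedding-based classifier when a wire-service
    dateline strongly signals a press release. Datelines appear near
    the top of the document, so only the head is scanned."""
    head = text[:500]
    return any(re.search(p, head, re.IGNORECASE) for p in WIRE_PATTERNS)

print(looks_like_press_release(
    "SAN JOSE, Calif.--(BUSINESS WIRE)--Adobe today reported record revenue..."
))  # True
```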

Why a Stacked Approach

TF-IDF logistic regression excels at detecting document-type-specific vocabulary patterns, such as the dense regulatory language of SEC filings or the Q&A structure of earnings calls. However, it is sensitive to surface-level noise: scanned PDFs often carry OCR artifacts, inconsistent spacing, and garbled characters that distort term frequency signals. The k-nearest neighbors model, which scores each document by its similarity to reference examples of each class, is more robust to this noise because it encodes meaning rather than exact tokens. Instead of directly comparing raw probabilities and cosine similarity scores (which are not on the same scale), their outputs are passed into a trained stacker that produces the final multiclass prediction. If the document remains ambiguous between a press release and a news article, a specialized binary reranker refines only that boundary using a small set of hand-engineered signals. This two-stage design isolates the hardest class pair without disrupting the general classifier.
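A minimal sketch of the stacking step, assuming the two base models expose per-class probabilities and cosine similarities (pure Python; the scores below are made up for illustration, and the real pipeline uses trained scikit-learn-style models):

```python
CLASSES = ["earnings_call", "research_paper", "sec_filing",
           "news_article", "press_release"]

def stack_features(lr_probs: list, knn_sims: list) -> list:
    """Concatenate TF-IDF logistic-regression probabilities with k-NN
    cosine similarities into one meta-feature vector. The trained stacker
    consumes this vector, so the two signals never have to share a scale."""
    assert len(lr_probs) == len(knn_sims) == len(CLASSES)
    return list(lr_probs) + list(knn_sims)

# Illustrative scores for a document ambiguous between
# press release and news article.
lr_probs = [0.02, 0.01, 0.05, 0.48, 0.44]
knn_sims = [0.31, 0.28, 0.35, 0.72, 0.70]
features = stack_features(lr_probs, knn_sims)
print(len(features))  # 10
```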

Extraction Pipeline and Knowledge Graph

964 Unique Entities
423 Distinct Relationship Types
112 Documents Processed (YAML Pipeline)

How Extraction is Guided

OneKE Extraction Prompt

Keeps the model focused on factual relationships directly stated in the source text.

  • Extracts concrete financial relationships only: results, ownership, products, partnerships, regulatory events
  • No inferred causation, sentiment, or analytical conclusions
  • Subjects and objects must be real-world entities: companies, metrics, products
  • Output enforced as strict JSON for direct graph ingestion

YAML Generator Prompt

Generates document-specific extraction plans from each document's topic map.

  • Defines entity types and relation types tailored to that document's content
  • Entity types must be stable real-world categories, no abstract concepts
  • Number of entity and relation types is capped to reduce fragmentation
  • Multiple plans generated per document for broad topical coverage

Extraction Performance vs. Frontier Models

We benchmarked MarketFoundry against three state-of-the-art models — Gemini 3 Flash, GPT 5.3, and Sonnet 4.6 — using the same task: extract structured knowledge triples from a Capital One SEC Form 8-K. All models received the same system prompt designed for our pipeline.

5.2× More Triples than Gemini 3 Flash
2.1× More Triples than Sonnet 4.6
More Unique Relation Types than Sonnet 4.6
Model / Pipeline         | Total Triples | Volume vs. MarketFoundry | Unique Relation Types
MarketFoundry (Qwen3-4B) | 68            | 100%                     | 18
Sonnet 4.6               | 32            | 47%                      | 6
GPT 5.3                  | 22            | 32%                      | 11
Gemini 3 Flash           | 13            | 19%                      | 8

High-Density Extraction

Frontier models tend to skim for a narrative summary. MarketFoundry's structured pipeline extracts the full data layer of the document, including sections like Forward-Looking Statements and Risk Factors that general-purpose models largely ignored, yielding over 20 additional triples from those sections alone.

Structural Awareness

MarketFoundry was the only system to capture document-internal structure, for example linking Exhibit 2.1 as a named entity to the Agreement and Plan of Merger. No frontier model identified or mapped this relationship.

Lightweight Architecture

These results were achieved with a 4B parameter model. Swapping in a larger backend would likely push extraction quality further, particularly for complex financial nuances and niche administrative identifiers.

* Benchmark conducted on the Capital One SEC Form 8-K using identical system prompts across all models. Triple counts reflect valid structured outputs only.

What We Built vs. What We Used

Original Work (Built by Our Team)

  • Stacked ensemble document classifier (TF-IDF logistic regression + KNN)
  • Document-type-aware pipeline routing logic
  • Dynamic topic extraction and multi-perspective YAML schema generation
  • Engineering robustness layer: JSON validation, fallback schemas, edge-case handling
  • Incremental result persistence and reproducibility tooling

Third-Party Tools (Integrated and Extended)

  • OneKE, base schema-guided extraction framework (Luo et al., 2025)
  • Qwen2.5-7B and Meta-Llama, open-source LLM inference
  • Neo4j, graph database and storage layer
  • SentenceTransformers, semantic embeddings for the k-NN classifier
  • Modal, serverless GPU infrastructure for API hosting

Limitations

Open-schema design produced 423 distinct relationship types across 15 documents. Because schemas are generated dynamically per document, the same underlying relationship can appear under slightly different names across files: one document may produce revenue_growth while another produces revenue_increase or revenue_acceleration for an identical fact. These are treated as separate edge types in the graph, so nodes that should be connected are not, reducing cross-document comparability and limiting the graph's ability to aggregate patterns without a normalization layer.
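A normalization layer of the kind described in Next Steps might collapse near-duplicate predicates with a canonical mapping. The sketch below is hypothetical: a real layer would more likely cluster predicate embeddings than hand-maintain a lookup table.

```python
# Hypothetical canonical map for near-duplicate relation types.
CANONICAL = {
    "revenue_growth": "revenue_growth",
    "revenue_increase": "revenue_growth",
    "revenue_acceleration": "revenue_growth",
}

def normalize_predicate(predicate: str) -> str:
    """Map predicate variants onto one canonical edge type;
    pass unknown predicates through unchanged."""
    return CANONICAL.get(predicate, predicate)

print(normalize_predicate("revenue_increase"))  # revenue_growth
print(normalize_predicate("operates_product"))  # operates_product
```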
Dynamic YAML generation succeeded for 47.3% of documents. Long regulatory filings and earnings call transcripts caused context-window saturation, leading the LLM to time out or return malformed JSON. The pipeline catches this and falls back to a pre-built general-purpose config for that document type, so no document is dropped, but the extraction schema is less tailored than what dynamic generation would have produced.
In our tests, topic extraction uses Qwen2.5-1.5B-Instruct and triple extraction uses Qwen3-4B, both selected to fit within our consumer-grade memory constraints. This limited our ability to benchmark larger alternatives. Upgrading to models such as Qwen2.5-72B, Llama 3.3-70B, or GPT-4o would likely improve coreference resolution, relation naming consistency, and extraction depth on complex long-form documents.
Schema-guided prompting substantially reduces hallucinated outputs, but cannot eliminate them entirely. Three failure modes were observed in practice:

  • Fabricated triples: the model occasionally extracts a relationship that sounds financially plausible but has no grounding in the source text, particularly in dense filings with forward-looking language.
  • Entity boundary errors: the model splits or merges entities incorrectly, for example treating Q3 2024 revenue and Q3 revenue as separate nodes when they refer to the same fact.
  • Predicate inversion: subject and object roles occasionally get swapped, producing a triple with the correct entities but the wrong causal direction.
MarketFoundry is fully open-source and free to run locally. Hosted API access is limited by infrastructure costs: we run GPU inference on Modal.com using Nvidia A10 GPUs at $1.10 per hour. We would love to expand API availability and throughput as the project grows, but current usage is constrained by budget.

Next Steps

1. Ontology Normalization: Build a post-processing consolidation layer to reduce schema fragmentation and improve cross-document comparability of relationship types.
2. Two-Stage Extraction: Adopt schema caching strategies to improve YAML generation stability on long regulatory filings and earnings call transcripts.
3. Stronger Schema Enforcement: Add a validation layer to prevent the Reflection Agent from renaming relation labels outside the defined schema, reducing hallucination-driven inconsistencies.
4. Larger Model Evaluation: Test extraction quality on larger hosted models to assess gains in coreference resolution, relation naming consistency, and coverage on complex documents.
5. Expanded API Availability: Scale hosted GPU capacity on Modal.com beyond the current Nvidia A10 tier to increase API throughput and reduce latency for end users.

Sample Outputs

See how MarketFoundry transforms real financial documents into structured knowledge graphs.

⚠️ Heads up: MarketFoundry uses LLMs for extraction. While we do our best to constrain outputs through schema-guided prompting, hallucinations can still occur. Treat extracted triples as a starting point for analysis, not as a substitute for source verification.

Private Equity Acquisition Data Room

Private Equity Acquisition Data Room Knowledge Graph

Connections Extracted Across Documents:

  • Revenue Concentration Risk → disclosed_in → Customer Contract Footnote (Doc 47)
  • Pending Litigation → impacts_valuation_of → EBITDA Projection (Doc 12)
  • Key-Man Clause → constrains → Post-Acquisition Integration Plan

PayPal Earnings Call

PayPal Earnings Call Knowledge Graph

Extracted Triples:

  • PayPal Holdings, Inc. → offers → agentic commerce services
  • PayPal Holdings, Inc. → operates_product → Buy Now, Pay Later
  • Buy Now, Pay Later → achieves → NPS of 80

Doom Loop News Article

Doom Loop Article Knowledge Graph

Causal Relationships:

  • Trade wars → exacerbates → Financial panic
  • Financial panic → triggers → Inflation
  • Inflation → increases → Global inequality

Stock Portfolio Optimization

Stock Portfolio Optimization Research Paper Knowledge Graph

Cross-Document Insights:

  • Portfolio → optimizes_for → maximizing return on investment
  • MVO-based Neural Network → achieves → dominating performance over other models
  • Dataset → consists_of → daily information from Yahoo Finance

API Access

Developers can integrate the knowledge extraction pipeline directly through our REST API. The endpoint accepts a document and automatically extracts knowledge graph triples and stores them in a Neo4j database. Click the link below to view the API documentation and get started:

⚠️ Note on document size: The API is optimized for documents up to ~50 pages. Very large documents (100+ pages) may exceed GPU memory limits and fail to process. If you need to process large documents, consider splitting them into sections before submitting, or deploy your own instance on Modal with a larger GPU (e.g. A100 80GB).

Example Request

curl -X POST "https://marija-vukic--market-foundry-api-fastapi-app.modal.run/process" \
-F "file=@/path/to/your/document.pdf" \
-F "neo4j_uri=neo4j+s://xxxx.databases.neo4j.io" \
-F "neo4j_username=neo4j" \
-F "neo4j_password=yourpassword"
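The same request can be assembled in Python with the third-party requests library. The form fields mirror the curl example; the Neo4j credentials are placeholders, the file bytes are a stand-in for a real document, and the network call is left commented out.

```python
# Requires the third-party `requests` package: pip install requests
API_URL = "https://marija-vukic--market-foundry-api-fastapi-app.modal.run/process"

# Form fields mirroring the curl example; replace the placeholders
# with your own Neo4j credentials.
data = {
    "neo4j_uri": "neo4j+s://xxxx.databases.neo4j.io",
    "neo4j_username": "neo4j",
    "neo4j_password": "yourpassword",
}

# Multipart file payload: (filename, file bytes, content type).
# In real use, read your document's bytes from disk.
files = {"file": ("document.pdf", b"", "application/pdf")}

# import requests
# response = requests.post(API_URL, data=data, files=files)
# response.raise_for_status()
# print(response.json())
```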

API Parameters & Functionality

Parameters

  • file – Document to process (PDF, DOCX, HTML, JSON)
  • neo4j_uri – URI for the Neo4j database
  • neo4j_username – Neo4j database username
  • neo4j_password – Neo4j database password

What the API Does

  • Classifies the document type automatically
  • Extracts entities, relationships, and causal structures
  • Constructs a knowledge graph with triples
  • Stores the graph in your Neo4j database

Demo

See MarketFoundry in action: from document upload to knowledge graph query. Two paths, same pipeline.

Set Up Locally

Clone the repo, configure your environment, and run the full pipeline on your own documents using Conda or Docker.

Use Our API

Send documents directly to the MarketFoundry API and receive structured knowledge graph triples in return, no local setup required.

Which Should You Use?

Run It Locally

Open Source

Running the pipeline locally allows you to inspect intermediate outputs, modify prompts or models, and experiment with different extraction configurations while keeping all documents on your own machine.

  • Working with sensitive or confidential documents that cannot leave your environment
  • Full control over model selection, schema configuration, and graph storage
  • Integrating MarketFoundry into an existing internal pipeline
  • You have GPU access and want to avoid per-request API costs
  • You're a researcher or developer extending the system for a new domain

Use the API

Hosted

The API provides a centralized deployment where you can submit documents and receive structured extraction results without installing dependencies or managing the full pipeline locally. It also supports more scalable processing for larger workloads or multi-user scenarios.

  • Easier integration with external applications and data pipelines
  • Prototyping a connection to a dashboard, RAG system, or data platform
  • No local GPU access or prefer not to manage model dependencies
  • Processing larger document volumes without local infrastructure
  • Note: hosted availability is subject to GPU budget constraints on Modal.com

Quickstart

# Clone the repository
git clone https://github.com/jessicabat/market-foundry
cd market-foundry

View Full Quickstart Guide

We welcome you to come see us present at Session One of our HDSI Capstone Showcase on March 13 at the Price Center East Ballroom, UC San Diego.

About Us

Matthew Wong headshot

Matthew Wong

Jessica Batbayar headshot

Jessica Batbayar

Marija Vukic headshot

Marija Vukic