TrustTrace is a provenance tracking system that helps AI companies identify the origins of their training data. By fingerprinting content, detecting similarities with known sources, and recording line
AI companies face a growing crisis: they don't know where their training data comes from. Recent lawsuits from NYT, artists, and content creators highlight a critical gap in the AI supply chain. Companies cannot answer basic questions:
"Does our dataset contain copyrighted content?"
"Where did this data originally come from?"
"Are we compliant with EU AI Act requirements?"
This creates massive legal liability ($200M+ in recent lawsuits) and blocks enterprise adoption of AI technology.
TrustTrace creates an immutable provenance layer for AI training data through a four-step process:
Text content is converted to unique signatures using MinHash and sentence-transformers, creating cryptographic fingerprints that are robust to paraphrasing and minor edits.
CrewAI-powered agents compare query fingerprints against a database of 102+ known sources (NYT, Wikipedia, Reddit), using Jaccard similarity to detect content origins.
System returns similarity scores, license types (COPYRIGHT, CC-BY-SA, NONE), and risk levels (LOW/MEDIUM/HIGH/CRITICAL) to help companies understand legal exposure.
All lineage findings are immutably stored on Mantle L2 blockchain at contract 0xefA667dB730A3aFbaE3Dbbe71bdf2268F5A627E1, creating an auditable trail for compliance and dispute resolution.
┌─────────────────────────────────────────────────────────┐
│ FRONTEND (Next.js + TypeScript) │
│ Query & Lineage Viewer │
└────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ BACKEND (FastAPI + Python) │
│ ┌─────────────────────────────────────────────────────┐│
│ │ CrewAI Orchestrator ││
│ │ → Tracer Agent (similarity search) ││
│ │ → Registry Agent (blockchain writes) ││
│ └─────────────────────────────────────────────────────┘│
└─────────────┬──────────────────┬────────────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ SQLite DB │ │ Mantle L2 │
│ 102 fingerprints│ │ (Sepolia) │
└─────────────────┘ └─────────────────┘
Layer | Technology | Purpose |
|---|---|---|
Fingerprinting | MinHash, sentence-transformers | Content similarity detection |
Agents | CrewAI | Orchestration & automation |
Backend | FastAPI | REST API server |
Database | SQLite | Fingerprint storage |
Blockchain | Mantle L2, Web3 | On-chain provenance |
Frontend | Next.js, Tailwind CSS | User interface |
Query: User pastes text into the web interface
Analysis: CrewAI agents fingerprint and compare against known sources
Results: System detects matches (e.g., 87% similarity to NYT article with HIGH copyright risk)
Verification: Full lineage tree displayed with on-chain proof link to Mantle Explorer
Input:
"The New York Times reported today on the ongoing developments in the technology sector, highlighting key innovations and market trends."
Output:
{
"matches": [
{
"source": "nyt-article-00042",
"similarity": 0.91,
"license": "COPYRIGHT",
"risk": "HIGH"
}
],
"risk_assessment": "HIGH",
"on_chain_proof": "0xcb3d0be2..."
}
Input: 1dc950094c6b6b36e7b93e5527ee5bf7c19e66d98d96e9cdac8d045a811be40f
Pay-per-query API for enterprises training AI models:
Pre-deployment compliance: Check datasets before training
Continuous monitoring: Scan data pipelines for copyright risks
Audit support: Generate lineage reports for regulators and legal teams
Low gas fees: Cost-effective on-chain recording for high-volume data pipelines
High throughput: Handles thousands of lineage records per second
Modular architecture: Scalable from testnet to mainnet production deployments
EVM compatibility: Seamless integration with existing Web3 tooling
Contract: ProvenanceRegistry.sol deployed on Mantle Sepolia
Contract Address: 0xefA667dB730A3aFbaE3Dbbe71bdf2268F5A627E1
Explorer: https://sepolia.mantlescan.xyz/address/0xefA667dB730A3aFbaE3Dbbe71bdf2268F5A627E1
Test Data: 102 pre-seeded fingerprints from NYT, Wikipedia, Reddit
Status: ✅ Fully functional MVP
TrustTrace enables the responsible AI ecosystem by:
Reducing legal risk: Identify copyright issues before deployment
Ensuring compliance: Meet EU AI Act data documentation requirements
Building trust: Provide transparency for AI model consumers
Enabling licensing: Fair attribution and compensation for content creators