TrustTrace
TrustTrace is a provenance tracking system that helps AI companies identify the origins of their training data. By fingerprinting content, detecting similarities with known sources, and recording line
The Problem
AI companies face a growing crisis: they don't know where their training data comes from. Recent lawsuits from NYT, artists, and content creators highlight a critical gap in the AI supply chain. Companies cannot answer basic questions:
"Does our dataset contain copyrighted content?"
"Where did this data originally come from?"
"Are we compliant with EU AI Act requirements?"
This creates massive legal liability ($200M+ in recent lawsuits) and blocks enterprise adoption of AI technology.
The Solution
TrustTrace creates an immutable provenance layer for AI training data through a four-step process:
1. Fingerprint
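Text content is converted into compact signatures using MinHash and sentence-transformers embeddings. These are similarity-preserving fingerprints rather than cryptographic hashes: near-duplicate texts produce near-identical signatures, which makes them robust to minor edits and, through the embeddings, to paraphrasing.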
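A minimal sketch of the MinHash half of this step, assuming word 3-gram shingles and 64 seeded hash functions. These parameters and the helper names `shingles` and `minhash_signature` are illustrative and are not taken from the TrustTrace codebase. The sentence-transformers embedding is left out so the example runs on the standard library alone.

```python
import hashlib
import re

NUM_PERM = 64      # assumed signature length (hypothetical parameter)
SHINGLE_SIZE = 3   # assumed word-shingle width (hypothetical parameter)


def shingles(text: str, k: int = SHINGLE_SIZE) -> set:
    """Split text into lowercase word k-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short inputs become a single shingle; empty inputs yield no shingles.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed: int, token: str) -> int:
    """64-bit hash of a token; the seed plays the role of one 'permutation'."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text: str, num_perm: int = NUM_PERM) -> list:
    """Return the MinHash signature: the minimum hash per seed over all shingles."""
    tokens = shingles(text)
    if not tokens:
        return [0] * num_perm
    return [min(_seeded_hash(seed, t) for t in tokens) for seed in range(num_perm)]
```

Two texts that share most of their shingles agree in most signature positions, and the fraction of agreeing positions estimates their Jaccard similarity.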
2. Trace
CrewAI-powered agents compare query fingerprints against a database of 102 pre-seeded fingerprints from known sources (NYT, Wikipedia, Reddit), using Jaccard similarity to detect content origins.
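The comparison in this step could look roughly like the following, which computes exact Jaccard similarity over word-shingle sets and returns the closest known source. This is a sketch under stated assumptions: the function names, the shingle width, and the source ids in the usage note are hypothetical, and the CrewAI agent layer is omitted.

```python
import re
from typing import Dict, FrozenSet, Optional, Tuple


def word_shingles(text: str, k: int = 3) -> FrozenSet[str]:
    """Lowercase word k-grams of the text (k=3 is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    return frozenset(
        " ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))
    )


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """|A ∩ B| / |A ∪ B|, defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(query: str, sources: Dict[str, str]) -> Tuple[Optional[str], float]:
    """Return (source_id, similarity) for the most similar source, or (None, 0.0)."""
    query_set = word_shingles(query)
    best_id, best_score = None, 0.0
    for source_id, text in sources.items():
        score = jaccard(query_set, word_shingles(text))
        if score > best_score:
            best_id, best_score = source_id, score
    return best_id, best_score
```

For example, with `sources = {"src-a": "the cat sat on the mat today", "src-b": "stock markets fell sharply on monday"}`, the call `best_match("the cat sat on the mat", sources)` returns `("src-a", 0.8)`: the query shares four of the five distinct shingles in the union.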
3. Assess
The system returns similarity scores, license types (COPYRIGHT, CC-BY-SA, NONE), and risk levels (LOW/MEDIUM/HIGH/CRITICAL) to help companies understand their legal exposure.
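One way the assessment could combine similarity and license type into a risk level is sketched below. The source names only the categories, so every threshold here is an assumption, as is treating NONE as "no known restriction". The function name `assess_risk` is hypothetical.

```python
from enum import Enum


class Risk(str, Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"
    CRITICAL = "CRITICAL"


def assess_risk(similarity: float, license_type: str) -> Risk:
    """Map a similarity score in [0, 1] and a license type to a risk level.

    All cut-off values are illustrative assumptions, not TrustTrace's rules.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")

    if license_type == "COPYRIGHT":
        if similarity >= 0.95:
            return Risk.CRITICAL
        if similarity >= 0.80:
            return Risk.HIGH
        if similarity >= 0.50:
            return Risk.MEDIUM
        return Risk.LOW
    if license_type == "CC-BY-SA":
        # Share-alike content is reusable with attribution, so cap at MEDIUM.
        return Risk.MEDIUM if similarity >= 0.80 else Risk.LOW
    if license_type == "NONE":
        # Assumed to mean no known restriction on the matched source.
        return Risk.LOW
    raise ValueError(f"unknown license type: {license_type!r}")
```

With these assumed cut-offs, a 0.91 match against a COPYRIGHT source evaluates to `Risk.HIGH`, in line with the sample output on this page.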
4. Record
All lineage findings are recorded on the Mantle L2 blockchain (currently the Mantle Sepolia testnet) at contract 0xefA667dB730A3aFbaE3Dbbe71bdf2268F5A627E1, creating an immutable, auditable trail for compliance and dispute resolution.
Architecture
┌─────────────────────────────────────────────────────────┐
│             FRONTEND (Next.js + TypeScript)             │
│                 Query & Lineage Viewer                  │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────┐
│               BACKEND (FastAPI + Python)                │
│  ┌─────────────────────────────────────────────────────┐│
│  │  CrewAI Orchestrator                                ││
│  │    → Tracer Agent (similarity search)               ││
│  │    → Registry Agent (blockchain writes)             ││
│  └─────────────────────────────────────────────────────┘│
└─────────────┬──────────────────┬────────────────────────┘
              │                  │
              ▼                  ▼
    ┌─────────────────┐ ┌─────────────────┐
    │    SQLite DB    │ │    Mantle L2    │
    │ 102 fingerprints│ │    (Sepolia)    │
    └─────────────────┘ └─────────────────┘
Tech Stack
| Layer | Technology | Purpose |
|---|---|---|
| Fingerprinting | MinHash, sentence-transformers | Content similarity detection |
| Agents | CrewAI | Orchestration & automation |
| Backend | FastAPI | REST API server |
| Database | SQLite | Fingerprint storage |
| Blockchain | Mantle L2, Web3 | On-chain provenance |
| Frontend | Next.js, Tailwind CSS | User interface |
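To make the Database row concrete, here is a sketch of how fingerprints might be stored in SQLite. The table name, column names, and JSON encoding of the signature are assumptions for illustration; the actual TrustTrace schema is not described in this write-up.

```python
import json
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical schema: one row per known source.
SCHEMA = """
CREATE TABLE IF NOT EXISTS fingerprints (
    source_id TEXT PRIMARY KEY,
    license   TEXT NOT NULL,
    signature TEXT NOT NULL  -- JSON-encoded list of MinHash integers
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the fingerprint store and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_fingerprint(conn: sqlite3.Connection, source_id: str,
                     license_type: str, signature: List[int]) -> None:
    """Insert or overwrite the fingerprint for one source."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO fingerprints VALUES (?, ?, ?)",
            (source_id, license_type, json.dumps(signature)),
        )


def load_fingerprints(conn: sqlite3.Connection) -> Dict[str, Tuple[str, List[int]]]:
    """Return {source_id: (license, signature)} for every stored source."""
    rows = conn.execute("SELECT source_id, license, signature FROM fingerprints")
    return {sid: (lic, json.loads(sig)) for sid, lic, sig in rows}
```

Storing the signature as JSON text keeps the sketch simple. A production store would more likely use a binary column or a dedicated similarity index.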
Demo Experience
Query: User pastes text into the web interface
Analysis: CrewAI agents fingerprint and compare against known sources
Results: System detects matches (e.g., 87% similarity to NYT article with HIGH copyright risk)
Verification: Full lineage tree displayed with on-chain proof link to Mantle Explorer
Sample Query
Input:
"The New York Times reported today on the ongoing developments in the technology sector, highlighting key innovations and market trends."
Output:
{
  "matches": [
    {
      "source": "nyt-article-00042",
      "similarity": 0.91,
      "license": "COPYRIGHT",
      "risk": "HIGH"
    }
  ],
  "risk_assessment": "HIGH",
  "on_chain_proof": "0xcb3d0be2..."
}
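A client consuming this response might parse it into typed objects along the following lines. The field names mirror the sample output above. The class and function names are hypothetical, and the sketch assumes every match carries exactly these four fields.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Match:
    source: str
    similarity: float
    license: str
    risk: str


@dataclass(frozen=True)
class TraceResult:
    matches: List[Match]
    risk_assessment: str
    on_chain_proof: str


def parse_trace_result(payload: str) -> TraceResult:
    """Parse a JSON response shaped like the sample output into a TraceResult."""
    data = json.loads(payload)
    matches = [
        Match(
            source=m["source"],
            similarity=float(m["similarity"]),
            license=m["license"],
            risk=m["risk"],
        )
        for m in data["matches"]
    ]
    return TraceResult(
        matches=matches,
        risk_assessment=data["risk_assessment"],
        on_chain_proof=data["on_chain_proof"],
    )
```

A missing key raises `KeyError`, which for a compliance tool is arguably preferable to silently defaulting a risk field.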
Sample Lineage Data Hash
Input: 1dc950094c6b6b36e7b93e5527ee5bf7c19e66d98d96e9cdac8d045a811be40f
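The sample above is 64 hexadecimal characters, which matches the length of any 256-bit digest. The write-up does not say which hash function or serialization produces it, so the sketch below simply assumes SHA-256 over a canonical JSON encoding of a lineage record. The record fields and the function name `lineage_hash` are hypothetical.

```python
import hashlib
import json


def lineage_hash(record: dict) -> str:
    """Return a 64-character hex digest of a lineage record.

    Assumption: SHA-256 over canonical JSON (sorted keys, compact separators).
    Canonical encoding makes the digest independent of key order.
    """
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical record built from the fields of the sample query output.
example_record = {
    "source": "nyt-article-00042",
    "similarity": 0.91,
    "license": "COPYRIGHT",
    "risk": "HIGH",
}
example_digest = lineage_hash(example_record)
```

Only the digest needs to go on-chain. The full record can stay off-chain and be re-hashed later to prove it has not changed.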
Business Model
Pay-per-query API for enterprises training AI models:
Pre-deployment compliance: Check datasets before training
Continuous monitoring: Scan data pipelines for copyright risks
Audit support: Generate lineage reports for regulators and legal teams
Why Mantle L2
Low gas fees: Cost-effective on-chain recording for high-volume data pipelines
High throughput: Handles thousands of lineage records per second
Modular architecture: Scalable from testnet to mainnet production deployments
EVM compatibility: Seamless integration with existing Web3 tooling
Deployment Status
Contract: ProvenanceRegistry.sol deployed on Mantle Sepolia
Contract Address: 0xefA667dB730A3aFbaE3Dbbe71bdf2268F5A627E1
Explorer: https://sepolia.mantlescan.xyz/address/0xefA667dB730A3aFbaE3Dbbe71bdf2268F5A627E1
Test Data: 102 pre-seeded fingerprints from NYT, Wikipedia, Reddit
Status: ✅ Fully functional MVP
Impact
TrustTrace enables the responsible AI ecosystem by:
Reducing legal risk: Identify copyright issues before deployment
Ensuring compliance: Meet EU AI Act data documentation requirements
Building trust: Provide transparency for AI model consumers
Enabling licensing: Fair attribution and compensation for content creators