
DataForge

The synthetic data market is around $1.5B today and growing at ~30% CAGR. Every AI lab, enterprise ML team, and research group needs DataForge.


Tech Stack

React
Next.js
Solidity

Description

🧠 DATAFORGE: A Synthetic Data Subnet on Bittensor

Overview

DataForge is a proposed subnet in the Bittensor ecosystem dedicated exclusively to high-quality, domain-specific synthetic data generation.

While compute and inference subnets already exist within Bittensor, there is currently no subnet whose primary output is structured, benchmark-validated synthetic training data, arguably the scarcest and most valuable resource in modern AI systems.

DataForge fills this gap.

Our thesis is simple:

The future bottleneck of AI is not compute; it is high-quality data.

Synthetic data that measurably improves downstream model performance represents one of the purest and most objective forms of Proof of Intelligence.


1. Subnet Design Proposal


1.1 Incentive & Mechanism Design

Core Mechanism

Miners generate synthetic datasets.
Validators objectively score them via downstream model performance.

The fundamental loop:

  1. Task issued (e.g., domain-specific dataset generation)

  2. Miner submits structured synthetic dataset

  3. Validator trains a small reference model on submitted data

  4. Model is evaluated on a hidden benchmark set

  5. Score = measurable performance improvement

  6. Emissions distributed proportionally to contribution quality

This makes model improvement the proof itself.
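As a minimal sketch, the loop above can be simulated end to end. Everything here is illustrative: `evaluate` stands in for the hidden benchmark, `score_submission` compresses training plus evaluation into a simulated quality bump, and none of these names belong to an existing Bittensor API.

```python
# Illustrative simulation of the scoring loop; no real training happens here.

def evaluate(model_quality: float) -> float:
    """Stand-in for evaluation on the hidden benchmark: a score in [0, 1]."""
    return max(0.0, min(1.0, model_quality))

def score_submission(baseline: float, dataset_quality: float) -> float:
    """Steps 3-5: 'train' on the submitted dataset (simulated as a quality
    bump), evaluate, and return the delta against the baseline."""
    trained = evaluate(baseline + dataset_quality)
    return trained - baseline

def distribute(deltas: dict[str, float], pool: float) -> dict[str, float]:
    """Step 6: emissions proportional to positive benchmark lift."""
    positive = {m: max(d, 0.0) for m, d in deltas.items()}
    total = sum(positive.values())
    if total == 0.0:
        return {m: 0.0 for m in deltas}
    return {m: pool * d / total for m, d in positive.items()}

baseline = 0.60                                   # step 4 reference point
qualities = {"miner_a": 0.10, "miner_b": 0.05, "miner_c": -0.02}
deltas = {m: score_submission(baseline, q) for m, q in qualities.items()}
rewards = distribute(deltas, pool=100.0)          # miner_c earns nothing
```

A dataset that degrades the benchmark (miner_c) earns exactly zero, which is the "model improvement is the proof" property expressed in code.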


1.2 Emission & Reward Logic

Emission Allocation

  • 70% → Miners (performance-weighted)

  • 25% → Validators (scoring accuracy & participation consistency)

  • 5% → Protocol treasury (future research, benchmark curation)

Miner Reward Formula

Let:

  • B = baseline benchmark performance

  • M = model performance after training on miner dataset

  • Δ = M - B

Reward ∝ normalized Δ across all miners in the epoch.

Miners are paid strictly for measurable intelligence contribution.
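A sketch of the epoch payout under these definitions, combining the 70% miner allocation from 1.2 with the normalized Δ rule. Clipping negative deltas to zero before normalizing is an illustrative choice, not a stated protocol parameter.

```python
# Epoch payout sketch: emission split per section 1.2, rewards ∝ normalized Δ.
EMISSION_SPLIT = {"miners": 0.70, "validators": 0.25, "treasury": 0.05}

def miner_rewards(deltas: list[float], epoch_emission: float) -> list[float]:
    """Pay each miner from the 70% pool in proportion to Δ = M - B,
    clipped at zero so datasets that degrade the benchmark earn nothing."""
    miner_pool = epoch_emission * EMISSION_SPLIT["miners"]
    clipped = [max(d, 0.0) for d in deltas]
    total = sum(clipped)
    if total == 0.0:
        return [0.0] * len(deltas)
    return [miner_pool * d / total for d in clipped]

# Example epoch: B = 0.60, three miners reach M = 0.72, 0.66, 0.58
deltas = [0.12, 0.06, -0.02]
payouts = miner_rewards(deltas, epoch_emission=1000.0)  # 700-token pool, split 2:1:0
```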

Validator Reward Formula

Validators are rewarded based on:

  • Scoring consistency with peer validators

  • Proper evaluation execution

  • Timely participation

Validators with statistically deviant scoring are slashed.
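One hedged way to operationalize "statistically deviant" is a median absolute deviation (MAD) test against the peer consensus on the same submission. The threshold `k` here is an illustrative choice, not a fixed protocol parameter.

```python
import statistics

def deviant_validators(scores: dict[str, float], k: float = 3.0) -> set[str]:
    """Flag validators whose score deviates from the peer median by more
    than k * MAD (median absolute deviation); candidates for slashing."""
    values = list(scores.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0.0:
        # All peers agree exactly; any disagreement at all is deviant.
        return {v for v, s in scores.items() if s != med}
    return {v for v, s in scores.items() if abs(s - med) > k * mad}

scores = {"val_1": 0.71, "val_2": 0.70, "val_3": 0.72, "val_4": 0.95}
print(deviant_validators(scores))  # {'val_4'}
```

MAD-based flagging is robust in exactly the way this design needs: a colluding minority inflating scores cannot drag the consensus toward itself.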


1.3 Incentive Alignment

For Miners

  • Incentivized to generate data that is:

    • High-quality

    • Diverse

    • Domain-accurate

    • Non-duplicative

  • Low-quality data yields zero benchmark lift → zero reward

For Validators

  • Incentivized to:

    • Accurately measure performance

    • Avoid collusion

    • Maintain benchmark integrity

  • Misaligned scoring leads to slashable stake


1.4 Anti-Adversarial Mechanisms

  • Hidden evaluation sets (rotated periodically)

  • Duplicate detection across submissions

  • Randomized validator cross-evaluation

  • Stake-weighted consensus scoring

  • Slashing for outlier or malicious validators

Garbage data cannot pass because:

  • If it doesn’t improve the model → it doesn’t get paid.

  • If it copies public datasets → benchmark overfitting detection flags it.
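Duplicate detection across submissions can be sketched by hashing a normalized form of each sample, so trivially re-spaced or re-cased copies still collide. This is a minimal illustration, not the protocol's actual detector; near-duplicate paraphrase detection would need something stronger (e.g. embedding similarity).

```python
import hashlib
import json

def fingerprint(sample: dict) -> str:
    """Canonicalize a sample (sorted keys, lowercased, whitespace-collapsed)
    and hash it, so cosmetic edits to a copied sample still collide."""
    canonical = json.dumps(
        {k: " ".join(str(v).lower().split()) for k, v in sample.items()},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

def deduplicate(dataset: list[dict]) -> tuple[list[dict], int]:
    """Drop duplicate samples; return survivors and the number removed
    (which can feed into the miner's score as a penalty)."""
    seen: set[str] = set()
    unique = []
    for sample in dataset:
        h = fingerprint(sample)
        if h not in seen:
            seen.add(h)
            unique.append(sample)
    return unique, len(dataset) - len(unique)

clean, dropped = deduplicate([
    {"input": "What is tort law?", "output": "A civil wrong."},
    {"input": "  what is TORT law? ", "output": "A civil wrong."},
])
# dropped == 1: the re-spaced, re-cased copy is caught
```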


1.5 Proof of Intelligence

Synthetic data generation that improves downstream models requires:

  • Semantic reasoning

  • Domain expertise

  • Distribution modeling

  • Error correction

  • Diversity control

Since rewards are tied to measurable performance lift, this subnet constitutes a genuine:

Proof of Intelligence via Model Improvement

This is one of the cleanest PoI implementations possible in Bittensor.


1.6 High-Level Algorithm

Task Assignment

  • Validator posts structured generation task:

    • Domain

    • Schema

    • Size requirement

    • Constraints

Miner Submission

Input:

{
  "task_id": "...",
  "domain": "legal_summarization",
  "schema": {...},
  "num_samples": 500
}

Output:

{
  "dataset": [
    {"input": "...", "output": "..."},
    ...
  ]
}
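Before any training happens, a validator can cheaply reject structurally invalid submissions. A sketch of that pre-check, using the `task_id`/`num_samples` fields and the `input`/`output` sample shape from the JSON above; the function itself and its return convention are illustrative.

```python
def validate_submission(task: dict, submission: dict) -> list[str]:
    """Pre-training structural checks: schema compliance and size adherence.
    Returns a list of problems; an empty list means the dataset proceeds
    to benchmark evaluation."""
    problems = []
    dataset = submission.get("dataset", [])
    if len(dataset) != task["num_samples"]:
        problems.append(
            f"expected {task['num_samples']} samples, got {len(dataset)}")
    required = {"input", "output"}      # per the sample schema sketched above
    for i, sample in enumerate(dataset):
        missing = required - sample.keys()
        if missing:
            problems.append(f"sample {i} missing fields: {sorted(missing)}")
        elif not all(str(sample[f]).strip() for f in required):
            problems.append(f"sample {i} has empty fields")
    return problems
```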

Validation Flow

  1. Train reference model on dataset

  2. Evaluate on hidden benchmark set

  3. Compute Δ performance

  4. Normalize across miners

  5. Distribute rewards


2. Miner Design

Miner Tasks

  • Instruction-following dataset generation

  • Domain-specific Q&A creation

  • Multi-turn dialogue synthesis

  • Edge-case generation

  • Structured reasoning examples

Input → Output Format

Input:
Task specification JSON

Output:
Structured dataset in defined schema

Performance Dimensions

  • Quality (benchmark lift)

  • Diversity

  • Schema compliance

  • Dataset size adherence

  • Submission latency (secondary)

Primary metric = performance improvement.


3. Validator Design

Scoring Methodology

Validators:

  1. Train standardized lightweight reference model

  2. Use fixed hyperparameters

  3. Evaluate on secret test set

  4. Publish score

Final score = median of validator scores.
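The median choice matters: unlike a mean, it is unmoved by a single dishonest score. A one-function sketch using Python's standard library:

```python
import statistics

def final_score(validator_scores: list[float]) -> float:
    """Median aggregation: one outlier validator cannot shift the final
    score, whereas a mean would be dragged toward it."""
    return statistics.median(validator_scores)

honest = [0.71, 0.70, 0.72]
attacked = honest + [0.0]        # one validator reports zero
print(final_score(attacked))     # stays near the honest consensus
```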

Evaluation Cadence

  • Epoch-based (e.g., every 6–12 hours)

  • Benchmark sets rotated weekly

  • Domains rotated per cycle

Validator Incentive Alignment

  • Stake required

  • Outlier detection

  • Slashing for dishonest scoring

  • Rewards tied to scoring accuracy


4. Business Logic & Market Rationale

The Problem

AI systems are running out of high-quality, domain-specific training data.

Web-scraped data:

  • Is saturated

  • Is legally constrained

  • Lacks domain precision

Synthetic data is the solution — but centralized providers dominate.

Market Opportunity

  • Synthetic data market ≈ $1.5B

  • ~30% CAGR

  • Growing demand from:

    • AI labs

    • Enterprise ML teams

    • Research institutions

Demand is structural and accelerating.

Competing Solutions

Within Bittensor:

  • Compute subnets

  • Inference subnets

  • Model-serving subnets

None focus on data as the primary output.

Outside Bittensor:

  • Centralized synthetic data startups

  • Enterprise internal data teams

DataForge differentiates via:

  • Decentralized intelligence

  • Market-based quality filtering

  • Objective performance benchmarking

Why This Fits Bittensor

Bittensor rewards measurable intelligence contributions.

There is no cleaner measurable signal than:

“Did this data improve model performance?”

This subnet transforms data generation into a verifiable intelligence market.

Long-Term Sustainability

Revenue streams:

  • Dataset licensing to AI labs

  • API access to top-ranked synthetic data pools

  • Enterprise custom task bounties

  • Research partnerships

Emission incentives bootstrap supply.
Enterprise demand sustains it long-term.


5. Go-To-Market Strategy

Initial Target Users

  • Open-source AI teams

  • Fine-tuning communities

  • Domain AI startups (legal, medical, finance)

  • Academic labs

Anchor Use Cases

  • Legal summarization datasets

  • Financial compliance QA datasets

  • Healthcare triage dialogue data

  • Enterprise customer support simulation

Distribution Channels

  • AI research communities

  • Open-source LLM ecosystem

  • ML engineering forums

  • Strategic partnerships

Incentives for Early Participation

For Miners

  • Early emission bonuses

  • Founder multiplier epochs

For Validators

  • Increased early validator rewards

  • Benchmark governance participation

For Users

  • Free initial dataset access

  • Early partner pricing

Progress During Hackathon

Currently in the ideation phase, before proceeding to the building phase.

Fundraising Status

Looking to start fundraising within the Bittensor ecosystem.

Team Leader
Hhazardkrypto
Project Link
Sector
AI, DeFi, Other