
VoiceNet

VoiceNet: decentralized audio & video transcription on Bittensor. GPU miners compete on accuracy, validators score WER, TAO rewards flow to the best. 70% cheaper than Whisper API.

Video

Tech Stack

Python
Node
Web3

Description

🎙️ VoiceNet — Decentralized Audio & Video Transcription on Bittensor

Turning idle GPUs into a self-optimizing transcription marketplace — 70% cheaper than centralized APIs, with genuine proof-of-intelligence.


🧩 The Problem

Audio and video transcription is a foundational AI service powering meeting tools, podcasts, accessibility software, legal archives, and more. Yet it remains expensive, centralized, and opaque:

  • OpenAI Whisper API charges $0.006/min — $3,600/month at 10,000 hours

  • AssemblyAI & Rev.ai charge up to $0.09/min for premium quality

  • All audio flows through a single centralized server, creating privacy risks and single points of failure

  • Millions of idle GPUs worldwide could be transcribing right now — but there's no coordination mechanism


💡 The Solution: VoiceNet Subnet

VoiceNet is a Bittensor subnet that creates a decentralized marketplace for high-quality transcription. GPU miners compete to produce the most accurate and fastest transcripts. Validators objectively score quality using Word Error Rate (WER). TAO emissions flow proportionally to performance.

No central server. No lock-in. Just proof-of-intelligence — at scale.


🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                     BITTENSOR NETWORK                       │
│                                                             │
│   ┌──────────────┐             ┌──────────────────────┐     │
│   │  VALIDATORS  │──── audio ─▶│    MINERS (ASR)      │     │
│   │              │◀── result ──│                      │     │
│   │  • Dispatch  │             │  • Whisper large-v3  │     │
│   │  • Score WER │             │  • wav2vec2 / custom │     │
│   │  • Set wts   │             │  • RTX 3090/4090+    │     │
│   └──────┬───────┘             └──────────────────────┘     │
│          │ weights                                          │
│          ▼                                                  │
│   ┌──────────────┐             ┌──────────────────────┐     │
│   │YUMA CONSENSUS│──── emit ──▶│    TAO EMISSIONS     │     │
│   │  (metagraph) │             │  (miners by rank)    │     │
│   └──────────────┘             └──────────────────────┘     │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │  VOICENET GATEWAY API  (external developers)        │   │
│   │  Python SDK  •  Node.js SDK  •  REST endpoint       │   │
│   └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

⚙️ Mechanism Design

Evaluation Flow (runs every ~60 seconds)

| Step | Action |
|------|--------|
| 1 | Validator selects 10–50 audio segments from a curated evaluation pool |
| 2 | Segments are broadcast to all online miners via the Bittensor dendrite |
| 3 | Miners run ASR inference and return transcript + word timestamps + confidence scores |
| 4 | Validator computes WER against ground truth and measures latency |
| 5 | Scores are smoothed with an EMA; weights are submitted to the metagraph |
| 6 | Yuma Consensus emits TAO to miners proportional to rank |

Scoring Formula

Score  = (1 − WER) × 0.8  +  LatencyScore × 0.2

Weight = 0.7 × Score(t)  +  0.3 × EMA(t−1)

  • WER (80%) — Word Error Rate against held-out ground truth. Lower is better.

  • Latency (20%) — Real-time factor. Responding in under 0.5× audio duration scores full points.

  • EMA smoothing — Prevents reward gaming from single lucky rounds.
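The scoring formula above can be sketched in Python. This is an illustrative implementation, not the subnet's actual validator code: `wer()` is a minimal word-level edit distance (production validators would use a library such as jiwer), and the linear latency decay between 0.5× and 2× real-time factor is an assumption consistent with the timeout and latency rules described here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1] / max(len(ref), 1)

def latency_score(response_time: float, audio_duration: float) -> float:
    """Full points at real-time factor <= 0.5, decaying linearly to 0 at 2x."""
    rtf = response_time / audio_duration
    if rtf <= 0.5:
        return 1.0
    return max(0.0, (2.0 - rtf) / 1.5)

def miner_weight(reference: str, hypothesis: str, response_time: float,
                 audio_duration: float, ema_prev: float) -> float:
    """Score = (1 - WER) * 0.8 + LatencyScore * 0.2, then EMA-blended."""
    score = (1 - wer(reference, hypothesis)) * 0.8 \
            + latency_score(response_time, audio_duration) * 0.2
    return 0.7 * score + 0.3 * ema_prev
```

The EMA blend means a single lucky round moves a miner's weight by at most 70% of the score gap, so sustained accuracy is what actually earns emissions.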


⛏️ Miner Design

Task: Transcribe audio segments with word-level timestamps.

| Field | Specification |
|-------|---------------|
| Input | Base64-encoded audio (MP3, WAV, M4A, FLAC) |
| Output | Transcript text + `[{start, end, word, confidence}]` + metadata |
| Timeout | Max 2× audio duration; real-time factor ≤ 0.5 earns full latency score |
| Hardware | GPU recommended (RTX 3090/4090, A100, H100) |
| Reference model | Whisper large-v3 (open source, runs on a consumer GPU) |

Bonus multipliers:

  • 🌍 Multilingual — 1.2× for correctly transcribing non-English audio

  • 🗣️ Speaker diarization — 1.1× for accurate speaker labeling
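A hypothetical sketch of the miner's request/response contract described in the table above. The field names (`audio_b64`, `transcript`, `words`, `metadata`, `nonce`) follow the spec as written here, but the actual subnet synapse definition may differ, and the ASR call is stubbed out to show only the output shape.

```python
import base64

def handle_request(payload: dict) -> dict:
    """Decode the audio and return a transcript in the spec's output format."""
    audio_bytes = base64.b64decode(payload["audio_b64"])
    # A real miner would run ASR inference here (e.g. Whisper large-v3);
    # this stub returns fixed words purely to illustrate the contract.
    words = [
        {"start": 0.00, "end": 0.42, "word": "hello", "confidence": 0.98},
        {"start": 0.42, "end": 0.80, "word": "world", "confidence": 0.97},
    ]
    return {
        "transcript": " ".join(w["word"] for w in words),
        "words": words,
        "metadata": {
            "model": "whisper-large-v3",
            "audio_bytes": len(audio_bytes),
            "nonce": payload.get("nonce"),  # echoed back for replay prevention
        },
    }
```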


✅ Validator Design

Evaluation pool sources:

  • Public benchmarks: LibriSpeech, CommonVoice, TED-LIUM, VoxPopuli

  • Synthetically generated audio (TTS + noise injection)

  • Real-world API submissions (with consent)

Anti-gaming mechanisms:

Defense

How it works

Hidden ground truth

Reference transcripts never revealed to miners

Replay prevention

Timestamp hash embedded per request; miners must echo it back

Hallucination detection

Phoneme alignment check catches invented words

Speed plausibility

Responses faster than theoretical minimum are discounted

Pool rotation

25% of evaluation pool replaced weekly

Validator staking

Min. 1,000 TAO stake required to participate
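The replay-prevention defense can be sketched as a keyed timestamp hash: the validator derives a nonce from the request time with a key miners never see, and accepts a response only if the exact nonce is echoed back within a freshness window. The key handling and field names here are illustrative assumptions, not the subnet's published scheme.

```python
import hashlib
import hmac
import time

def make_nonce(validator_key: bytes, timestamp: float) -> str:
    """Keyed hash of the request timestamp; miners cannot forge it."""
    msg = repr(timestamp).encode()
    return hmac.new(validator_key, msg, hashlib.sha256).hexdigest()

def response_is_fresh(validator_key: bytes, timestamp: float,
                      echoed_nonce: str, max_age_s: float = 120.0) -> bool:
    """Accept only recent responses that echo the exact per-request nonce."""
    if time.time() - timestamp > max_age_s:
        return False  # stale request: likely a replayed cached answer
    expected = make_nonce(validator_key, timestamp)
    return hmac.compare_digest(expected, echoed_nonce)
```

Because the nonce is bound to a single request, a miner that caches and replays an old transcript fails the echo check even if the audio segment repeats.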


📊 Market Opportunity

| Metric | Value |
|--------|-------|
| Global transcription market (2030) | $18B+ |
| Market CAGR | 35% |
| VoiceNet target pricing | $0.001–$0.003/min |
| vs. OpenAI Whisper API | 70%+ cheaper |
| Target WER | < 5% |
| Language support target | 50+ languages |


🚀 Go-To-Market

Early adopters:

  • 🎙️ Podcasters & content creators (replacing Rev/Descript)

  • 💼 Enterprise meeting intelligence tools (Fireflies-style)

  • ⚕️ Healthcare & legal transcription (high-accuracy niche)

  • 🌍 Global localization companies (high-volume, price-sensitive)

  • 👩‍💻 Indie developers (previously priced out by centralized APIs)

Distribution:

  • Drop-in Python & Node.js SDKs replacing Whisper API

  • Launch on HackerNews, r/LocalLLaMA, IndieHackers

  • Native Bittensor ecosystem integration
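As a sketch of what "drop-in" could mean in practice, the snippet below builds a gateway request payload from raw audio bytes. The endpoint URL and field names are placeholders invented for illustration; no public VoiceNet API is specified in this document.

```python
import base64

# Placeholder endpoint; a real SDK would ship its own configured base URL.
GATEWAY_URL = "https://api.voicenet.example/v1/transcribe"

def build_request(audio_bytes: bytes, language: str = "en") -> dict:
    """Package audio as a base64 JSON payload for a hypothetical gateway."""
    return {
        "url": GATEWAY_URL,
        "json": {
            "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
            "language": language,
            "word_timestamps": True,
        },
    }
```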


🗺️ Roadmap

Phase 1 — Testnet (Mar 2026)
  ✦ Deploy functional subnet on Bittensor testnet
  ✦ 50+ concurrent miners, <6% WER, real-time performance
  ✦ Full scoring & anti-gaming mechanism validated

Phase 2 — Mainnet (Q2 2026)
  ✦ 100+ miners onboarded
  ✦ VoiceNet Gateway API live (Python + Node.js SDKs)
  ✦ 1M+ minutes/month API volume target

Phase 3 — Expansion (Q3 2026)
  ✦ Speaker diarization scoring
  ✦ 50+ language evaluation pools
  ✦ 3+ commercial partnerships

Phase 4 — Ecosystem (Q4 2026+)
  ✦ Open evaluation pool contributions
  ✦ Bittensor storage subnet integration
  ✦ DAO governance for subnet parameters

🧠 Why This Is Genuine Proof-of-Intelligence

Accurate transcription of real-world audio — with noise, accents, overlapping speech, and domain-specific vocabulary — requires sophisticated learned acoustic models. WER cannot be gamed without running actual ASR inference. A miner that copies, hallucinates, or caches results is detected and penalized automatically.

Every TAO earned on VoiceNet represents real computational intelligence applied to a real-world task.


🛠️ Tech Stack

| Layer | Technology |
|-------|------------|
| Subnet framework | Bittensor SDK (Python) |
| Reference miner model | OpenAI Whisper large-v3 |
| Gateway API | Node.js + REST |
| Evaluation scoring | Python (jiwer, phoneme-align) |
| Validator pool | LibriSpeech + CommonVoice + synthetic |


📁 Project Documents

The full proposal and pitch deck are available at the link below:

📂 VoiceNet — Proposal & Pitch Deck (Google Drive)

Includes:

  • 📄 Subnet Design Proposal (DOCX) — full mechanism design, miner/validator specs, business logic & GTM

  • 📊 Pitch Deck (PPTX) — 10-slide investor-ready presentation

Team Leader
AAl Hadad

Industry
InfraAI