VoiceNet: decentralized audio & video transcription on Bittensor. GPU miners compete on accuracy, validators score WER, TAO rewards flow to the best. 70% cheaper than Whisper API.
Turning idle GPUs into a self-optimizing transcription marketplace — 70% cheaper than centralized APIs, with genuine proof-of-intelligence.
Audio and video transcription is a foundational AI service powering meeting tools, podcasts, accessibility software, legal archives, and more. Yet it remains expensive, centralized, and opaque:
OpenAI Whisper API charges $0.006/min — $3,600/month at 10,000 hours
AssemblyAI & Rev.ai charge up to $0.09/min for premium quality
All audio flows through a single centralized server, creating privacy risks and single points of failure
Millions of idle GPUs worldwide could be transcribing right now — but there's no coordination mechanism
VoiceNet is a Bittensor subnet that creates a decentralized marketplace for high-quality transcription. GPU miners compete to produce the most accurate and fastest transcripts. Validators objectively score quality using Word Error Rate (WER). TAO emissions flow proportionally to performance.
No central server. No lock-in. Just proof-of-intelligence — at scale.
┌─────────────────────────────────────────────────────────────┐
│ BITTENSOR NETWORK │
│ │
│ ┌──────────────┐ ┌──────────────────────┐ │
│ │ VALIDATORS │──── audio ─▶│ MINERS (ASR) │ │
│ │ │◀── result ──│ │ │
│ │ • Dispatch │ │ • Whisper large-v3 │ │
│ │ • Score WER │ │ • wav2vec2 / custom │ │
│ │ • Set wts │ │ • RTX 3090/4090+ │ │
│ └──────┬───────┘ └──────────────────────┘ │
│ │ weights │
│ ▼ │
│ ┌──────────────┐ ┌──────────────────────┐ │
│ │YUMA CONSENSUS│──── emit ──▶│ TAO EMISSIONS │ │
│ │ (metagraph) │ │ (miners by rank) │ │
│ └──────────────┘ └──────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ VOICENET GATEWAY API (external developers) │ │
│ │ Python SDK • Node.js SDK • REST endpoint │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
| Step | Action |
|---|---|
| 1 | Validator selects 10–50 audio segments from curated evaluation pool |
| 2 | Segments broadcast to all online miners via Bittensor dendrite |
| 3 | Miners run ASR inference → return transcript + word timestamps + confidence scores |
| 4 | Validator computes WER against ground truth + measures latency |
| 5 | Score smoothed with EMA, weights submitted to metagraph |
| 6 | Yuma Consensus emits TAO to miners proportional to rank |
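The six steps above can be sketched as a single validator round. This is illustrative only: `query_miners`, `score_fn`, and the segment-count bound stand in for the real dendrite broadcast and scoring pipeline, and are not the subnet's actual API.

```python
import random

def run_round(pool, miners, ground_truth, ema, query_miners, score_fn):
    """One validator round: sample, dispatch, score, smooth, return weights."""
    segments = random.sample(pool, k=min(50, len(pool)))   # step 1: sample segments
    weights = {}
    for m in miners:                                       # step 2: broadcast (stubbed)
        responses = query_miners(m, segments)              # step 3: miner transcripts
        round_score = sum(score_fn(ground_truth[s], r)     # step 4: score vs ground truth
                          for s, r in zip(segments, responses)) / len(segments)
        # Step 5: Weight = 0.7 x Score(t) + 0.3 x EMA(t-1); seed EMA on first sight.
        weights[m] = 0.7 * round_score + 0.3 * ema.get(m, round_score)
        ema[m] = weights[m]
    return weights                                         # step 6: submit to metagraph
```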
Score = (1 − WER) × 0.8 + LatencyScore × 0.2
Weight = 0.7 × Score(t) + 0.3 × EMA(t−1)
WER (80%) — Word Error Rate against held-out ground truth. Lower is better.
Latency (20%) — Real-time factor. Responding in under 0.5× audio duration scores full points.
EMA smoothing — Prevents reward gaming from single lucky rounds.
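As a sketch, the scoring rule above fits in a few lines of Python. The `wer` helper is a plain word-level edit distance (the subnet's stack names `jiwer` for this in production); the linear decay between the 0.5× and 2× real-time factors in `latency_score` is an assumed shape, since only the 0.5× full-score threshold is specified.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev_row[j] + 1,            # deletion
                           cur[j - 1] + 1,             # insertion
                           prev_row[j - 1] + (r != h)))  # substitution
        prev_row = cur
    return prev_row[-1] / max(len(ref), 1)

def latency_score(response_time: float, audio_duration: float) -> float:
    """Full points at RTF <= 0.5; assumed linear decay to zero at the 2x timeout."""
    rtf = response_time / audio_duration
    if rtf <= 0.5:
        return 1.0
    if rtf >= 2.0:
        return 0.0
    return (2.0 - rtf) / 1.5

def score(w: float, lat: float) -> float:
    return (1 - w) * 0.8 + lat * 0.2        # Score = (1 - WER) x 0.8 + Latency x 0.2

def weight(score_t: float, ema_prev: float) -> float:
    return 0.7 * score_t + 0.3 * ema_prev   # Weight = 0.7 x Score(t) + 0.3 x EMA(t-1)
```

For example, `wer("the quick brown fox", "the quick brown box")` is one substitution over four reference words, i.e. 0.25.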
Task: Transcribe audio segments with word-level timestamps.
| Field | Specification |
|---|---|
| Input | Base64-encoded audio (MP3, WAV, M4A, FLAC) |
| Output | Transcript text + word-level timestamps + confidence scores |
| Timeout | Max 2× audio duration (hard cutoff); full latency score at real-time factor ≤ 0.5 |
| Hardware | GPU recommended (RTX 3090/4090, A100, H100) |
| Reference model | Whisper large-v3 (open-source, runs on consumer GPUs) |
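A minimal miner handler for this task shape might look like the following. The field names (`audio_b64`, `words`, `nonce`) are illustrative, not the subnet's wire format, and `run_asr_stub` stands in for real GPU inference (e.g. Whisper large-v3 with word timestamps enabled).

```python
import base64

def run_asr_stub(audio: bytes):
    """Placeholder for ASR inference; returns word-level timestamps + confidences."""
    return [
        {"word": "hello", "start": 0.00, "end": 0.42, "confidence": 0.97},
        {"word": "world", "start": 0.45, "end": 0.90, "confidence": 0.95},
    ]

def handle_request(payload: dict) -> dict:
    """Decode a base64 audio payload and build the transcript response."""
    audio_bytes = base64.b64decode(payload["audio_b64"])
    words = run_asr_stub(audio_bytes)        # real miners run Whisper/wav2vec2 here
    return {
        "transcript": " ".join(w["word"] for w in words),
        "words": words,                      # word-level timestamps + confidences
        "nonce": payload.get("nonce"),       # echoed back for replay prevention
    }
```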
Bonus multipliers:
🌍 Multilingual — 1.2× for correctly transcribing non-English audio
🗣️ Speaker diarization — 1.1× for accurate speaker labeling
Evaluation pool sources:
Public benchmarks: LibriSpeech, CommonVoice, TED-LIUM, VoxPopuli
Synthetically generated audio (TTS + noise injection)
Real-world API submissions (with consent)
Anti-gaming mechanisms:
| Defense | How it works |
|---|---|
| Hidden ground truth | Reference transcripts never revealed to miners |
| Replay prevention | Timestamp hash embedded per request; miners must echo it back |
| Hallucination detection | Phoneme alignment check catches invented words |
| Speed plausibility | Responses faster than theoretical minimum are discounted |
| Pool rotation | 25% of evaluation pool replaced weekly |
| Validator staking | Min. 1,000 TAO stake required to participate |
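Two of these defenses are simple enough to sketch directly: the per-request timestamp hash (here an HMAC, keyed per validator) that a cached or replayed response cannot reproduce, and the speed-plausibility floor. The key and the 0.02 minimum real-time factor are assumed values for illustration.

```python
import hashlib
import hmac
import time

VALIDATOR_KEY = b"per-validator secret"  # assumption: each validator holds its own key

def make_nonce(segment_id: str, now=None) -> str:
    """Replay prevention: bind each request to a fresh timestamp hash."""
    ts = str(now if now is not None else time.time())
    msg = f"{segment_id}:{ts}".encode()
    return hmac.new(VALIDATOR_KEY, msg, hashlib.sha256).hexdigest()

def nonce_ok(request_nonce: str, response_nonce: str) -> bool:
    """A replayed or cached response carries a stale nonce and fails this check."""
    return hmac.compare_digest(request_nonce, response_nonce)

MIN_RTF = 0.02  # assumed floor: no real ASR pipeline answers this fast

def speed_plausible(response_time: float, audio_duration: float) -> bool:
    """Discount responses faster than the theoretical minimum real-time factor."""
    return response_time / audio_duration >= MIN_RTF
```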
| Metric | Value |
|---|---|
| Global transcription market (2030) | $18B+ |
| Market CAGR | 35% |
| VoiceNet target pricing | $0.001–$0.003/min |
| vs. OpenAI Whisper API | 70%+ cheaper |
| Target WER | < 5% |
| Language support target | 50+ languages |
Early adopters:
🎙️ Podcasters & content creators (replacing Rev/Descript)
💼 Enterprise meeting intelligence tools (Fireflies-style)
⚕️ Healthcare & legal transcription (high-accuracy niche)
🌍 Global localization companies (high-volume, price-sensitive)
👩‍💻 Indie developers (previously priced out by centralized APIs)
Distribution:
Drop-in Python & Node.js SDKs replacing Whisper API
Launch on HackerNews, r/LocalLLaMA, IndieHackers
Native Bittensor ecosystem integration
Phase 1 — Testnet (Mar 2026)
✦ Deploy functional subnet on Bittensor testnet
✦ 50+ concurrent miners, <6% WER, real-time performance
✦ Full scoring & anti-gaming mechanism validated
Phase 2 — Mainnet (Q2 2026)
✦ 100+ miners onboarded
✦ VoiceNet Gateway API live (Python + Node.js SDKs)
✦ 1M+ minutes/month API volume target
Phase 3 — Expansion (Q3 2026)
✦ Speaker diarization scoring
✦ 50+ language evaluation pools
✦ 3+ commercial partnerships
Phase 4 — Ecosystem (Q4 2026+)
✦ Open evaluation pool contributions
✦ Bittensor storage subnet integration
✦ DAO governance for subnet parameters
Accurate transcription of real-world audio — with noise, accents, overlapping speech, and domain-specific vocabulary — requires sophisticated learned acoustic models. WER cannot be gamed without running actual ASR inference. A miner that copies, hallucinates, or caches results is detected and penalized automatically.
Every TAO earned on VoiceNet represents real computational intelligence applied to a real-world task.
| Layer | Technology |
|---|---|
| Subnet framework | Bittensor SDK (Python) |
| Reference miner model | OpenAI Whisper large-v3 |
| Gateway API | Node.js + REST |
| Evaluation scoring | Python (jiwer, phoneme-align) |
| Validator pool | LibriSpeech + CommonVoice + synthetic |
The full proposal and pitch deck are available at the link below:
📂 VoiceNet — Proposal & Pitch Deck (Google Drive)
Includes:
📄 Subnet Design Proposal (DOCX) — full mechanism design, miner/validator specs, business logic & GTM
📊 Pitch Deck (PPTX) — 10-slide investor-ready presentation