
VoiceNet

VoiceNet: decentralized audio & video transcription on Bittensor. GPU miners compete on accuracy, validators score WER, TAO rewards flow to the best. 70% cheaper than Whisper API.

Video

Tech Stack

Python
Node
Web3

Description

🎙️ VoiceNet — Decentralized Audio & Video Transcription on Bittensor

Turning idle GPUs into a self-optimizing transcription marketplace — 70% cheaper than centralized APIs, with genuine proof-of-intelligence.


🧩 The Problem

Audio and video transcription is a foundational AI service powering meeting tools, podcasts, accessibility software, legal archives, and more. Yet it remains expensive, centralized, and opaque:

  • OpenAI Whisper API charges $0.006/min — $3,600/month at 10,000 hours

  • AssemblyAI & Rev.ai charge up to $0.09/min for premium quality

  • All audio flows through a single centralized server, creating privacy risks and single points of failure

  • Millions of idle GPUs worldwide could be transcribing right now — but there's no coordination mechanism


💡 The Solution: VoiceNet Subnet

VoiceNet is a Bittensor subnet that creates a decentralized marketplace for high-quality transcription. GPU miners compete to produce the most accurate and fastest transcripts. Validators objectively score quality using Word Error Rate (WER). TAO emissions flow proportionally to performance.

No central server. No lock-in. Just proof-of-intelligence — at scale.


🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                     BITTENSOR NETWORK                       │
│                                                             │
│   ┌──────────────┐             ┌──────────────────────┐     │
│   │  VALIDATORS  │──── audio ─▶│    MINERS (ASR)      │     │
│   │              │◀── result ──│                      │     │
│   │  • Dispatch  │             │  • Whisper large-v3  │     │
│   │  • Score WER │             │  • wav2vec2 / custom │     │
│   │  • Set wts   │             │  • RTX 3090/4090+    │     │
│   └──────┬───────┘             └──────────────────────┘     │
│          │ weights                                          │
│          ▼                                                  │
│   ┌──────────────┐             ┌──────────────────────┐     │
│   │YUMA CONSENSUS│──── emit ──▶│    TAO EMISSIONS     │     │
│   │  (metagraph) │             │  (miners by rank)    │     │
│   └──────────────┘             └──────────────────────┘     │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │  VOICENET GATEWAY API  (external developers)        │   │
│   │  Python SDK  •  Node.js SDK  •  REST endpoint       │   │
│   └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

⚙️ Mechanism Design

Evaluation Flow (runs every ~60 seconds)

| Step | Action |
|------|--------|
| 1 | Validator selects 10–50 audio segments from a curated evaluation pool |
| 2 | Segments are broadcast to all online miners via the Bittensor dendrite |
| 3 | Miners run ASR inference and return transcript + word timestamps + confidence scores |
| 4 | Validator computes WER against ground truth and measures latency |
| 5 | Scores are smoothed with an EMA; weights are submitted to the metagraph |
| 6 | Yuma Consensus emits TAO to miners proportional to rank |

Scoring Formula

Score  = (1 − WER) × 0.8  +  LatencyScore × 0.2

Weight = 0.7 × Score(t)  +  0.3 × EMA(t−1)

  • WER (80%) — Word Error Rate against held-out ground truth. Lower is better.

  • Latency (20%) — Real-time factor. Responding in under 0.5× audio duration scores full points.

  • EMA smoothing — Prevents reward gaming from single lucky rounds.
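The scoring formula above can be sketched in Python. This is an illustrative implementation, not the subnet's actual validator code: `wer()` is a minimal word-level edit distance (production validators would use a library such as jiwer), and the linear latency decay between 0.5× and 2× real-time factor is an assumption consistent with the timeout and latency rules described here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1] / max(len(ref), 1)

def latency_score(response_time: float, audio_duration: float) -> float:
    """Full points at real-time factor <= 0.5, decaying linearly to 0 at 2x."""
    rtf = response_time / audio_duration
    if rtf <= 0.5:
        return 1.0
    return max(0.0, (2.0 - rtf) / 1.5)

def miner_weight(reference: str, hypothesis: str, response_time: float,
                 audio_duration: float, ema_prev: float) -> float:
    """Score = (1 - WER) * 0.8 + LatencyScore * 0.2, then EMA-blended."""
    score = (1 - wer(reference, hypothesis)) * 0.8 \
            + latency_score(response_time, audio_duration) * 0.2
    return 0.7 * score + 0.3 * ema_prev
```

The EMA blend means a single lucky round moves a miner's weight by at most 70% of the score gap, so sustained accuracy is what actually earns emissions.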


⛏️ Miner Design

Task: Transcribe audio segments with word-level timestamps.

| Field | Specification |
|-------|---------------|
| Input | Base64-encoded audio (MP3, WAV, M4A, FLAC) |
| Output | Transcript text + `[{start, end, word, confidence}]` + metadata |
| Timeout | Max 2× audio duration; real-time factor ≤ 0.5 earns full latency score |
| Hardware | GPU recommended (RTX 3090/4090, A100, H100) |
| Reference model | Whisper large-v3 (open source, runs on a consumer GPU) |

Bonus multipliers:

  • 🌍 Multilingual — 1.2× for correctly transcribing non-English audio

  • 🗣️ Speaker diarization — 1.1× for accurate speaker labeling
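A hypothetical sketch of the miner's request/response contract described in the table above. The field names (`audio_b64`, `transcript`, `words`, `metadata`, `nonce`) follow the spec as written here, but the actual subnet synapse definition may differ, and the ASR call is stubbed out to show only the output shape.

```python
import base64

def handle_request(payload: dict) -> dict:
    """Decode the audio and return a transcript in the spec's output format."""
    audio_bytes = base64.b64decode(payload["audio_b64"])
    # A real miner would run ASR inference here (e.g. Whisper large-v3);
    # this stub returns fixed words purely to illustrate the contract.
    words = [
        {"start": 0.00, "end": 0.42, "word": "hello", "confidence": 0.98},
        {"start": 0.42, "end": 0.80, "word": "world", "confidence": 0.97},
    ]
    return {
        "transcript": " ".join(w["word"] for w in words),
        "words": words,
        "metadata": {
            "model": "whisper-large-v3",
            "audio_bytes": len(audio_bytes),
            "nonce": payload.get("nonce"),  # echoed back for replay prevention
        },
    }
```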


✅ Validator Design

Evaluation pool sources:

  • Public benchmarks: LibriSpeech, CommonVoice, TED-LIUM, VoxPopuli

  • Synthetically generated audio (TTS + noise injection)

  • Real-world API submissions (with consent)

Anti-gaming mechanisms:

Defense

How it works

Hidden ground truth

Reference transcripts never revealed to miners

Replay prevention

Timestamp hash embedded per request; miners must echo it back

Hallucination detection

Phoneme alignment check catches invented words

Speed plausibility

Responses faster than theoretical minimum are discounted

Pool rotation

25% of evaluation pool replaced weekly

Validator staking

Min. 1,000 TAO stake required to participate
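The replay-prevention defense can be sketched as a keyed timestamp hash: the validator derives a nonce from the request time with a key miners never see, and accepts a response only if the exact nonce is echoed back within a freshness window. The key handling and field names here are illustrative assumptions, not the subnet's published scheme.

```python
import hashlib
import hmac
import time

def make_nonce(validator_key: bytes, timestamp: float) -> str:
    """Keyed hash of the request timestamp; miners cannot forge it."""
    msg = repr(timestamp).encode()
    return hmac.new(validator_key, msg, hashlib.sha256).hexdigest()

def response_is_fresh(validator_key: bytes, timestamp: float,
                      echoed_nonce: str, max_age_s: float = 120.0) -> bool:
    """Accept only recent responses that echo the exact per-request nonce."""
    if time.time() - timestamp > max_age_s:
        return False  # stale request: likely a replayed cached answer
    expected = make_nonce(validator_key, timestamp)
    return hmac.compare_digest(expected, echoed_nonce)
```

Because the nonce is bound to a single request, a miner that caches and replays an old transcript fails the echo check even if the audio segment repeats.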


📊 Market Opportunity

| Metric | Value |
|--------|-------|
| Global transcription market (2030) | $18B+ |
| Market CAGR | 35% |
| VoiceNet target pricing | $0.001–$0.003/min |
| vs. OpenAI Whisper API | 70%+ cheaper |
| Target WER | < 5% |
| Language support target | 50+ languages |


🚀 Go-To-Market

Early adopters:

  • 🎙️ Podcasters & content creators (replacing Rev/Descript)

  • 💼 Enterprise meeting intelligence tools (Fireflies-style)

  • ⚕️ Healthcare & legal transcription (high-accuracy niche)

  • 🌍 Global localization companies (high-volume, price-sensitive)

  • 👩‍💻 Indie developers (previously priced out by centralized APIs)

Distribution:

  • Drop-in Python & Node.js SDKs replacing Whisper API

  • Launch on HackerNews, r/LocalLLaMA, IndieHackers

  • Native Bittensor ecosystem integration
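As a sketch of what "drop-in" could mean in practice, the snippet below builds a gateway request payload from raw audio bytes. The endpoint URL and field names are placeholders invented for illustration; no public VoiceNet API is specified in this document.

```python
import base64

# Placeholder endpoint; a real SDK would ship its own configured base URL.
GATEWAY_URL = "https://api.voicenet.example/v1/transcribe"

def build_request(audio_bytes: bytes, language: str = "en") -> dict:
    """Package audio as a base64 JSON payload for a hypothetical gateway."""
    return {
        "url": GATEWAY_URL,
        "json": {
            "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
            "language": language,
            "word_timestamps": True,
        },
    }
```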


🗺️ Roadmap

Phase 1 — Testnet (Mar 2026)
  ✦ Deploy functional subnet on Bittensor testnet
  ✦ 50+ concurrent miners, <6% WER, real-time performance
  ✦ Full scoring & anti-gaming mechanism validated

Phase 2 — Mainnet (Q2 2026)
  ✦ 100+ miners onboarded
  ✦ VoiceNet Gateway API live (Python + Node.js SDKs)
  ✦ 1M+ minutes/month API volume target

Phase 3 — Expansion (Q3 2026)
  ✦ Speaker diarization scoring
  ✦ 50+ language evaluation pools
  ✦ 3+ commercial partnerships

Phase 4 — Ecosystem (Q4 2026+)
  ✦ Open evaluation pool contributions
  ✦ Bittensor storage subnet integration
  ✦ DAO governance for subnet parameters

🧠 Why This Is Genuine Proof-of-Intelligence

Accurate transcription of real-world audio — with noise, accents, overlapping speech, and domain-specific vocabulary — requires sophisticated learned acoustic models. WER cannot be gamed without running actual ASR inference. A miner that copies, hallucinates, or caches results is detected and penalized automatically.

Every TAO earned on VoiceNet represents real computational intelligence applied to a real-world task.


🛠️ Tech Stack

| Layer | Technology |
|-------|------------|
| Subnet framework | Bittensor SDK (Python) |
| Reference miner model | OpenAI Whisper large-v3 |
| Gateway API | Node.js + REST |
| Evaluation scoring | Python (jiwer, phoneme-align) |
| Validator pool | LibriSpeech + CommonVoice + synthetic |


📁 Project Documents

The full proposal and pitch deck are available at the link below:

📂 VoiceNet — Proposal & Pitch Deck (Google Drive)

Includes:

  • 📄 Subnet Design Proposal (DOCX) — full mechanism design, miner/validator specs, business logic & GTM

  • 📊 Pitch Deck (PPTX) — 10-slide investor-ready presentation

Team Leader
AAl Hadad

Industry
InfraAI