Voice to text using private AI
In this article we are going to use the Whisper model to build a private speech recognition and speech translation system. For a quick test, the easiest option is Jupyter Lab.
Installation
Python version used
3.11.8
To run the model, first set up a dedicated Python environment and register it as a Jupyter kernel
pyenv local 3.11.8
python --version
which python
cd ~/Dev/speech-to-text/
python -m venv ./whisper-venv
source ./whisper-venv/bin/activate
pip install ipykernel
python -m ipykernel install --user --name=whisper-projet --display-name "Python (whisper-projet)"
Create a requirements.txt file containing:
# Core PyTorch stack (already compatible with CUDA 13 on this machine)
torch==2.11.0
torchaudio==2.11.0
torchvision==0.26.0
# Transformers Whisper
transformers==4.44.2
accelerate>=0.33.0
# Audio / preprocessing (Whisper depends heavily on these)
datasets[audio]==2.21.0
soundfile>=0.12.1
librosa>=0.10.2
ffmpeg-python>=0.2.0
pydub==0.25.1
# Utilities
numpy>=1.26.0
tqdm>=4.66.0
sentencepiece>=0.2.0
# Optional (useful for performance / audio stability)
scipy>=1.11.0
Then install the dependencies:
pip install -r requirements.txt
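Before loading the model, it can be worth a quick sanity check that the pinned packages actually import in the new kernel. A small throwaway helper (the function name is mine, not part of any library):

```python
# Sanity check: confirm the key packages from requirements.txt import,
# and report their versions (None means the package is missing).
def check_env():
    report = {}
    for pkg in ("torch", "transformers", "soundfile", "librosa"):
        try:
            mod = __import__(pkg)
            report[pkg] = getattr(mod, "__version__", "unknown")
        except ImportError:
            report[pkg] = None
    return report

print(check_env())
```

If any entry comes back None, re-run pip install inside the activated virtualenv.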
Test
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
audio = "./recording.mp3"
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    chunk_length_s=30,   # split into 30 s segments
    stride_length_s=5,   # overlap to avoid cuts at segment boundaries
    return_timestamps=True,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,       # reuse the device detected above (CUDA or CPU)
)
result = pipe(audio)
print(result["text"])
Output:
[...]
Thank you. Have a nice day. Bye-bye.
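The same pipeline also covers the speech translation part promised above: passing task="translate" through generate_kwargs makes Whisper output English regardless of the source language. A minimal sketch reusing the pipe object built earlier (the call itself is left commented out since it needs the loaded model):

```python
# Sketch: translate speech to English instead of transcribing it.
# Assumes `pipe` was built as in the snippet above.
def translate_to_english(pipe, audio_path):
    # Whisper's "translate" task always produces English text,
    # whatever language is spoken in the recording.
    return pipe(audio_path, generate_kwargs={"task": "translate"})

# result = translate_to_english(pipe, "./recording.mp3")
# print(result["text"])
```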
To take it further, you can add wake-word detection with the openwakeword library (pip install openwakeword).
Basic example:
import openwakeword
from openwakeword.model import Model
import pyaudio, numpy as np
openwakeword.utils.download_models()  # download the pretrained models
model = Model(wakeword_models=["hey_bob"])  # "hey_bob" must be a downloaded or custom model
pa = pyaudio.PyAudio()
stream = pa.open(rate=16000, channels=1,
                 format=pyaudio.paInt16, input=True,
                 frames_per_buffer=1280)
print("Waiting...")
while True:
    audio = np.frombuffer(stream.read(1280), dtype=np.int16)
    prediction = model.predict(audio)
    for key, score in prediction.items():
        if score > 0.5:
            print(f"Wake word '{key}' detected! (score: {score:.2f})")
            # → Trigger Whisper now
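One detail when wiring the two together: the microphone stream delivers raw int16 samples, while the Whisper pipeline expects float32 audio in [-1, 1] with its sampling rate. A sketch of the conversion (buffer handling is simplified; the function name is mine):

```python
import numpy as np

def int16_to_whisper_input(frames, sampling_rate=16000):
    # Whisper pipelines accept {"array": float32 in [-1, 1], "sampling_rate": hz}.
    audio = np.frombuffer(b"".join(frames), dtype=np.int16).astype(np.float32) / 32768.0
    return {"array": audio, "sampling_rate": sampling_rate}

# Example with one silent 80 ms frame (1280 samples at 16 kHz):
chunk = np.zeros(1280, dtype=np.int16).tobytes()
print(int16_to_whisper_input([chunk])["array"].shape)  # (1280,)
```

The resulting dict can be passed straight to pipe() in place of a file path.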
Other
Convert m4a file to mp3
from pydub import AudioSegment
audio = AudioSegment.from_file("Locmaria.m4a", format="m4a")
audio.export("locmaria.mp3", format="mp3")
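To convert a whole folder at once, the same two pydub calls can be looped over every .m4a file. A sketch (to_mp3_name is a hypothetical helper that just derives the target filename; the pydub import is kept inside the function so the helper works without ffmpeg installed):

```python
from pathlib import Path

def to_mp3_name(path):
    # "Locmaria.m4a" -> "locmaria.mp3": lowercase name, .mp3 extension.
    return Path(path).with_suffix(".mp3").name.lower()

def convert_folder(folder):
    # pydub needs ffmpeg on the PATH to decode m4a files.
    from pydub import AudioSegment
    for m4a in Path(folder).glob("*.m4a"):
        AudioSegment.from_file(m4a, format="m4a").export(
            Path(folder) / to_mp3_name(m4a), format="mp3"
        )

print(to_mp3_name("Locmaria.m4a"))  # locmaria.mp3
```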
Models location
ls -al ~/.cache/huggingface/hub/
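Large Whisper checkpoints add up quickly in that cache. A small helper to measure how much disk they use (hypothetical function name; returns 0.0 if the cache directory does not exist):

```python
from pathlib import Path

def cache_size_mb(cache="~/.cache/huggingface/hub"):
    # Sum the sizes of all files under the Hugging Face hub cache, in MB.
    root = Path(cache).expanduser()
    if not root.exists():
        return 0.0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file()) / 1e6

print(f"{cache_size_mb():.1f} MB")
```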