speech-recognition
Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark.
EmoBox is a multilingual multi-corpus speech emotion recognition (SER) toolkit designed to streamline research in this field. It is accompanied by a curated benchmark for both intra-corpus and cross-corpus evaluation settings.
AI-powered voice to text.
Write 3x faster, without lifting a finger.
Whispering is an open-source speech-to-text application. Press a keyboard shortcut, speak, and your words will transcribe, transform, then copy and paste at the cursor.
Kyutai's Speech-To-Text and Text-To-Speech models based on the Delayed Streams Modeling framework.
Kyutai STT is a streaming speech-to-text model architecture, providing an unmatched trade-off between latency and accuracy, perfect for interactive applications. Its support for batching allows for processing hundreds of concurrent conversations on a single GPU.
🌟 OpenVoiceOS is an open-source platform for smart speakers and other voice-centric devices.
OpenVoiceOS is a community-driven, open-source voice AI platform for creating custom voice-controlled interfaces across devices with NLP, a customizable UI, and a focus on privacy and security.
Speak into any text field.
A free, open source, and extensible speech-to-text application that works completely offline.
Handy is a cross-platform desktop application built with Tauri (Rust + React/TypeScript) that provides simple, privacy-focused speech transcription. Press a shortcut, speak, and have your words appear in any text field—all without sending your voice to the cloud.
Hertz-dev is an open-source, first-of-its-kind base model for full-duplex conversational audio.
Llama3.1 learns to Listen. Local real-time voice AI (Formerly llama3-s).
🍓 Ichigo is an open, ongoing research experiment to extend a text-based LLM to have native "listening" ability. Think of it as an open-data, open-weight, on-device Siri.
Neon Core extends Mycroft core with more modular code, extended multi-user support, and more.
Neon AI is an open source voice assistant.
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
Accurate AI Transcriptions in Minutes.
A web service that transcribes video and audio content using AI.
Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
parakeet-tdt-0.6b-v2 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction.
say is always on, recording and transcribing your voice 24/7. Whenever inspiration strikes, just say it.
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
canary-1b-flash supports automatic speech-to-text recognition (ASR) in four languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English with or without punctuation and capitalization (PnC).
Audio & Video Transcription | Speech-to-text. Smarter subtitling and transcription. We combine artificial and human intelligence to bring you accurate and fast transcripts, captions, and translated subtitles with ease.