This is how to generate perfect transcripts for audio podcasts for free
Does anyone know if it can output SRT format?
Yes, for the original whisper:
I don't know the language options but this site is really good for transcribing my Spanish material.
It's free, just upload video or audio to it. No account required. No trial limitations as of yet. I just had it transcribe a 1 hour 11 minute audio file. Transcribes really well but I haven't compared it to Whisper. I'm not in the mood to go through a Whisper installation process yet so this is a good quick option for anyone.
Good news! OpenAI announced their 'Whisper API':
Whisper, the speech-to-text model we open-sourced in September 2022, has received immense praise from the developer community but can also be hard to run. We’ve now made the large-v2 model available through our API, which gives convenient on-demand access priced at $0.006 / minute. In addition, our highly-optimized serving stack ensures faster performance compared to other services.
Whisper API is available through our transcriptions (transcribes in source language) or translations (transcribes into English) endpoints, and accepts a variety of formats (m4a, mp3, mp4, mpeg, mpga, wav, webm).
Example using curl:
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F model="whisper-1" \
  -F file="@/path/to/file/openai.mp3"
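For a rough sense of what that price means in practice, here is a minimal Python sketch. It only assumes the $0.006 per minute figure quoted in the announcement; the function name is made up for illustration, and the actual billing granularity may differ.

```python
PRICE_PER_MINUTE_USD = 0.006  # figure quoted in the announcement above


def estimate_cost_usd(hours: int = 0, minutes: int = 0, seconds: int = 0) -> float:
    """Estimate the API cost for one audio file of the given length."""
    total_minutes = hours * 60 + minutes + seconds / 60
    return round(total_minutes * PRICE_PER_MINUTE_USD, 4)


if __name__ == "__main__":
    # The 1 hour 11 minute file mentioned earlier in the thread.
    print(f"${estimate_cost_usd(hours=1, minutes=11):.2f}")
```

So a file of that length would come to roughly 43 cents, if the quoted rate still applies.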
I had trouble getting Whisper to work on my Linux machine, and the Windows computer I use does not have a compatible GPU, so it was very slow. I found this description of how to run Whisper in a free virtual machine through Google Colab and it has worked extremely well. It's very easy to set up and run.
This is fantastic. I am now free (the freesubtitles site is currently overrun by 90+ requests, so basically unusable). Now I can run it myself :)
If you want to do this on a Mac, MacWhisper is an excellent implementation that is really easy to install (you just download it like any other app). There is a paid version, but I found that I got nearly perfect transcriptions on the most accurate free model. Highly recommended.
Hello. Thank you from me, too. I installed Python today on my Windows 11 box, and used the "base" multilingual model to produce some Chinese and Russian text files from audio files. I started with Chinese mp3s from voachinese.com (Voice of America) and Russian mp3s from archive.org. It works! I can tolerate some errors in the output text. Maybe I'll be able to use LingQ to read Старик Хоттабыч, a famous children's book. I have a hardback copy of Старик Хоттабыч so errors in the computer text file would be easy to overcome.
I open the Python output file this way:
f = open("c:\\somewhere\\" + args.output_file, "w", encoding="utf-8")
My Russian output has punctuation, but my Chinese output does not. Seeing that I had several thousand Chinese characters in my output and no new lines, I threw in a call to Python's textwrap.wrap() and that makes things nicer. I get maybe 80-85% correct characters in Chinese with the "base" model and the voachinese.com sound files. I'll experiment with the "medium" model, which is eight times slower, some day soon. I don't own hardware that can hold the "large" model in VRAM.
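In case it helps anyone, here is a minimal sketch of the textwrap idea. The function name and the sample sentence are made up for illustration. Note that the width counts characters, not screen columns, so Chinese lines will look about twice as wide as Latin text of the same width.

```python
import textwrap


def wrap_unbroken_text(text: str, width: int = 40) -> str:
    """Break a long run of characters into lines of at most `width` characters."""
    # Text without spaces is one long "word" to textwrap; with the default
    # break_long_words=True it is simply cut every `width` characters.
    return "\n".join(textwrap.wrap(text, width=width))


if __name__ == "__main__":
    sample = "今天天气很好我们一起去公园散步" * 5  # stand-in for transcript text
    print(wrap_unbroken_text(sample, width=20))
```

Since the cut is purely by character count, words can be split across lines, which is fine for reading but worth knowing before importing the text elsewhere.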
All of these items (LingQ, YouTube, Whisper, FFmpeg, etc.) are part of a giant toolbox and it is fun to see what can be done.
Thanks for this. I had a bunch of shows that I had no subtitles for and this worked brilliantly. I used ffmpeg to rip the audio track into a WAV file:
ffmpeg -i episode.mp4 episode.wav
Then I used whisper to convert them into SRTs. It took about 20 minutes an episode with my GPU and the results were impressive.
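For anyone curious what an SRT file actually contains, here is a small Python sketch that turns a list of segments into SRT text. It assumes each segment is a dict with "start" and "end" in seconds plus a "text" field, which is how I understand the Whisper Python package reports its segments; the function names are my own. The command-line tool writes SRT files itself, so this is only to show the format.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm (the SRT timestamp style)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Build SRT text from segments with 'start', 'end' and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        # Each cue: running number, time range, then the subtitle text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
        {"start": 2.5, "end": 5.0, "text": " This is a test."},
    ]
    print(to_srt(demo))
```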
For anyone finding this thread that doesn't know what Python or C++ are, but wants to give this a try on their own, this video gives step-by-step instructions for how to install and use Whisper.
Super interesting -- does it run on macOS?
The original whisper works but is rather slow on macOS, because the underlying PyTorch doesn't take advantage of MPS or the Accelerate framework. I assume the situation will improve in the future; see this for more information: https://github.com/pytorch/pytorch/issues/77764#issue-1240333853
I myself run whisper.cpp (https://github.com/ggerganov/whisper.cpp) on a Mac and have no problems. Compilation instructions are in the readme. Happy transcribing!
My current experience with it in Chinese is mixed. Yes, >90% is correct, but the mistakes [totally wrong character] are somewhat annoying as you have to either ignore them or manually correct them. With more difficult texts and new words I am sometimes at a loss what was actually meant. I would have thought ChatGPT/whisper would proof-read its transcript for meaning (!?). It should detect those wrong characters as those non-words do not make any sense....
Sorry to hear that. Among the supported languages Chinese is just at the threshold of being usable and the results certainly cannot be compared to what can be achieved in English, Spanish or even German.
When you say 90%, that is actually a great outcome. Even OpenAI gives us a number of around 85% afaik, and that is on pristine test data. As you say, the vast majority of mistakes are substitutions, but insertions or omissions would be even worse. I myself find the accuracy to be fine for my purposes, although I don't use LingQ for reading transcripts anymore, because, indeed, the inaccuracies combined with LingQ's way of handling Chinese create a bit of a mess. When I did read transcripts here, I used to disable highlighting to reduce distractions and be able to focus on the text.
I never tried objectively difficult content (just stuff I find difficult to understand) but one suggestion would be to try the large model, if you haven't already, it is really slow and memory intensive but definitely more accurate.
Regarding proof-reading, I don't think we should ascribe anything resembling intelligence to Whisper. It just makes predictions: basically, like a weatherman, it is sometimes right and sometimes wrong. The audio is split into 30-second chunks that are processed one after the other in sequence, so Whisper couldn't go back and correct itself, even if it had any understanding of language. It doesn't really know what word "makes sense" in a given context. Certainly, my knowledge of machine learning is cursory at best, but I don't think I'm mischaracterizing its capabilities here.
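To picture the chunking, here is a simplified Python sketch that cuts a recording into consecutive 30-second windows. This is only an illustration of the idea with a made-up function name, not Whisper's actual code; as far as I understand, the real implementation moves its window according to the timestamps it predicts rather than in fixed steps.

```python
def fixed_windows(duration_s: float, window_s: float = 30.0):
    """Return (start, end) pairs covering the audio in consecutive windows."""
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        start = end  # the next window begins where this one stopped
    return windows


if __name__ == "__main__":
    # A 75-second clip: two full windows and one shorter remainder.
    for start, end in fixed_windows(75):
        print(f"{start:>5.1f}s - {end:>5.1f}s")
```

Each window is handled on its own and in order, which is why an error in an early chunk is never revisited later.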
That being said, I have seen a few instances of emergent capabilities in regards to translations, which is rather fascinating.
Great reply as always :)
Yeah, I am not sure if it is 90% or 85%, I just randomly estimated that number.
I noticed it especially when I read transcripts from medical podcasts. The medical terms are the terms most likely to be wrong. This is a bummer because the whole point is to learn those ;) I will try to use the large model...
You said you "don't use LingQ for reading transcripts anymore". Is this just for transcripts or in general? If so, are you advanced enough to no longer need LingQ or have you found a better solution?
I currently use the Pleco app (available on iOS / Android) to read transcripts. Some advantages are the built-in dictionaries (not user-provided definitions), which are offline as well, so I can read without needing Wi-Fi. But the interface is rather clunky. It currently works for me, although I'll continue to experiment. Another alternative would be to just open a text file in a web browser and use a pop-up dictionary. LingQ often creates a mess and gets in the way, and my word count is already inflated, so I only import the occasional YouTube video and books (languages other than Chinese or Japanese are obviously not affected).
Regarding Whisper, the creators are silent on the source of their training data, but it is reasonably certain that it consisted mainly of YouTube videos with subtitles. I suspect that there isn't much medical content on there, especially in Chinese. That would mean that the model is just not very familiar with specialist terms.
Here are some random ideas (that I haven't tested):
- create an initial prompt containing some of the special terms, this might help push the model in the right direction (not supported by freesubtitles.ai)
- fine tune your own whisper model: https://huggingface.co/blog/fine-tune-whisper
- try another fine tuned model: https://huggingface.co/models?sort=downloads&search=whisper
- try a professional service by Big Tech, e.g. AWS transcribe, they allow you to provide custom vocabulary
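To illustrate the first idea, here is an untested Python sketch that joins a handful of specialist terms into a short prompt string. The example terms, the function name and the 200-character cap are all my own assumptions (the real limit is counted in tokens and I don't know the exact number). The commented lines show how I believe the prompt would be handed to the Whisper Python package through its initial_prompt parameter.

```python
def build_initial_prompt(terms, max_chars: int = 200) -> str:
    """Join specialist terms into one short prompt string.

    max_chars is an arbitrary cap chosen for this sketch; terms that
    would push the prompt past it are dropped.
    """
    prompt = ""
    for term in terms:
        candidate = term if not prompt else prompt + "、" + term
        if len(candidate) > max_chars:
            break
        prompt = candidate
    return prompt


if __name__ == "__main__":
    # Illustrative medical terms, not taken from any real podcast.
    prompt = build_initial_prompt(["心肌梗死", "高血压", "糖尿病"])
    print(prompt)
    # With the openai-whisper package installed, something like this should work:
    # import whisper
    # model = whisper.load_model("medium")
    # result = model.transcribe("podcast.mp3", language="zh", initial_prompt=prompt)
```

Whether this actually improves the recognition of rare terms is something you would have to check against a transcript you have already corrected by hand.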
Thanks. I read most of my Chinese on PC, so Pleco is not really an option and I also enjoy that I can mark strings of words as Lingqs. This is really more important for me than 1-2 characters. I do use a mouse-over pop-up dictionary on top of Lingq and I find it essential for the reasons you mentioned.
I have updated my post to include all of your findings. Thank you everyone
Hi Andrey, whisper is really great, but the original PyTorch implementation doesn't seem to be particularly optimized for use on CPU. You have mentioned your processor, but not your GPU - this is in fact the most important factor.
Because I don't have a suitable GPU, I use a re-implementation of the original whisper called 'whisper.cpp' https://github.com/ggerganov/whisper.cpp It is optimized for use on CPUs and handily outperforms the original on my system (Apple Mac mini M1). The medium model transcribes here in about 3x real time (i.e. 60m of audio take 20m) and the large one in about real time.
Maybe give that a try.
My laptop has only Intel HD Graphics, which is why I mentioned only the CPU.
Thank you for the link, I will use it from now on since it is faster and optimized :)
Thanks. I wish there was just a simple .exe file I could install this with... sigh!
I don't really know about the original whisper, but since we discussed whisper.cpp previously here is an idea on how this might work. Although it has to be said I don't know Windows and don't have access to such a machine currently (I'm winging it). So you will have to figure some things out on your own, but you can always ask Google or ChatGPT (e.g. "how do I run FFmpeg on Windows?").
You should be able to just download a Windows executable from the release page on GitHub (https://github.com/ggerganov/whisper.cpp/releases), look under Assets for something like: 'whisper-blas-bin-x64.zip'
Once unzipped you should see 'main.exe' which is the actual transcription software. Of course you need to have a model, the easiest way might be to download one from here: https://huggingface.co/datasets/ggerganov/whisper.cpp/tree/main
I mainly use the medium model for Chinese, which reaches good enough accuracy while running at moderate speed.
The (input) audio has to be in a compatible format as well, I use FFmpeg for the conversion, but there are certainly other ways. You can grab a current Windows binary from here: https://github.com/BtbN/FFmpeg-Builds/releases look for something like 'ffmpeg-master-latest-win64-gpl.zip' or LGPL, but nothing containing "shared" or "Linux".
You wouldn't really install this program like a regular Windows program. Instead you can drag-and-drop the exe file into a terminal window (command line, not PowerShell), type '-i', drag-and-drop the audio file and then add the following: '-ar 16000 -ac 1 -c:a pcm_s16le output.wav'
The end result should look something like this:
C:\Users\Path\to\ffmpeg.exe -i C:\Users\Path\to\input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
Now everything should be in place to run whisper; again, by either dropping the 'main.exe' into the terminal window or by typing the path directly. You will have to do the same with the model file.
Here is how it looks for me:
./main -m models/ggml-medium.bin -f /Users/bamboo/test.wav -l zh -osrt
./main would become something like: C:\Users\Path\to\main.exe
Similarly for the model file: C:\Users\Path\to\ggml-medium.bin
And the audio file.
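If typing the two commands by hand gets tedious, the steps above could be wrapped in a small Python script. This is an untested sketch: the flags are the ones used in this post, while the function names and the Windows paths are placeholders you would have to replace with your own.

```python
import subprocess


def ffmpeg_command(ffmpeg_exe, source, wav_out):
    """Command that converts the input to 16 kHz mono 16-bit PCM WAV."""
    return [ffmpeg_exe, "-i", source,
            "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", wav_out]


def whisper_cpp_command(main_exe, model, wav, language):
    """Command that runs whisper.cpp on the WAV file and writes an SRT file."""
    return [main_exe, "-m", model, "-f", wav, "-l", language, "-osrt"]


def run_pipeline(ffmpeg_exe, main_exe, model, source, wav_out, language="zh"):
    """Run the conversion and then the transcription, stopping on any error."""
    subprocess.run(ffmpeg_command(ffmpeg_exe, source, wav_out), check=True)
    subprocess.run(whisper_cpp_command(main_exe, model, wav_out, language), check=True)


if __name__ == "__main__":
    # Placeholder paths; this only prints the commands and runs nothing.
    print(ffmpeg_command(r"C:\Users\Path\to\ffmpeg.exe",
                         r"C:\Users\Path\to\input.mp3", "output.wav"))
    print(whisper_cpp_command(r"C:\Users\Path\to\main.exe",
                              r"C:\Users\Path\to\ggml-medium.bin",
                              "output.wav", "zh"))
```

Passing the arguments as a list avoids quoting problems with spaces in Windows paths. Call run_pipeline with your real paths once the printed commands look right.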
I don't know if this makes any sense. If you are looking for a simpler way, I'm sure many people have already built applications around whisper, that are one-click installable. Maybe look around a bit.
Thanks so much. Stupid me is using Chatgpt all the time now and I never thought of asking it for help with the install...
Thank you for hinting that there is an optimized version of whisper
I've tried it out a couple of times and indeed it works great. If someone doesn't want to set it up themselves they can use the following link (which is using Whisper behind the scenes):
You do have to wait in a queue as a lot of people are using it, but it's not been too bad in my experience.
I'm the developer of freesubtitles.ai, glad you like it! The number of people in the queue has fluctuated quite a bit. Right now I'm just paying out of pocket for the servers to let people transcribe content for free. I'm planning to offer some paid features that should hopefully bring in enough revenue for me to scale the project further and decrease the queue time. The goal is to have people using the public queue wait no more than 30 minutes (or, hopefully, less). So far, though, people have already transcribed thousands of hours for free (the numbers on the site aren't accurate since I had to move to a new server), so thus far I would consider it a success!
About a year ago my colleagues and I tested some ways to transcribe text, and the results were really poor then. At my university we have hundreds of people waiting for good methods to transcribe various types of voice recordings, from noisy old field recordings to good-quality modern studio-recorded interviews. I got quite excited when I read about Whisper and I will run some tests soon. Just now I tested it with freesubtitles.ai - thanks for that, this is really brilliant!
It's cool to see how fast technology in this field moves forward. At the moment there are still quite a few errors even in very clear studio recordings with good pronunciation (like actors and speakers), but compared to the last few years you can really see the improvement.
Thank you very much for your generosity!
Hey no problem, Steve has been my language learning guru for a decade now, using his advice I was able to learn Spanish and German but in Serbian it was hard to find content with subtitles (which is my preferred way to get input for A2-C1) so this app was largely built for myself to be able to permit that, then once it was done I put it up so others could use it as well, glad it's come full circle and that people can use it to help them, wouldn't be here without Steve for sure!
Thanks a lot for providing this. It is really fantastic. There may be a bug though: today I tried to transcribe a Chinese interview podcast. The first minute was an intro in English and the rest was 40min or so of Mandarin. The transcription however was all in English. So I had to cut out that first minute of English in order to make it transcribe in Mandarin.
Did you put the language as Chinese? If it used Auto Detect it will only base it off the first 30 seconds.
Is your site down? I uploaded some audios but they are stuck in "processing" (100% uploaded)... (!?)