Best Way to Generate Subtitles?
To turn audio into text you need BlackHole and Microsoft Word. I hope you'll now share more of your sources (text and audio) on LingQ :D. These tools are shown here: https://www.youtube.com/watch?v=w0Y582Ihx58
Live Transcribe & Notification - The best app I have ever found.
Perfect for transcribing news or programs where the speech is clear and there is not much background noise. Also effective for movies but, of course, less so.
You can try Google Docs' voice typing feature (found under "Tools"). Set it to the language you're trying to capture. It captures via the mic, so you need relative silence around you. It probably works best with a very clear and distinct voice (i.e. a single speaker and little background noise).
I've not had a huge amount of luck with this during very limited experimentation. It usually works OK to start but then stops at some point; possibly I have some extensions that interfere, or my usual 100,000 open tabs are hogging resources. I generally don't have a use for it anyway, so I haven't tried it much.
This suggestion is from noxialisrex, who has said it has worked pretty well for him in the past.
BTW, after a little more playing around, it looks like the uBlock extension may have been affecting this. I turned it off and it appears to be picking up everything pretty nicely, even with fair amounts of background noise (like in news reports). The only bad thing is that there's no punctuation.
There are a lot of resources on writing; go online.
OpenAI released their open-source Whisper transcription tool, which does this very well.
Thank you. Where can I find this software/app? Does it work in Farsi?
Thanks for the hint! I wasn't aware of whisper, so I just tried it out.
The result is pretty good, not really worse than Amazon's, but I only tested one minute of audio, because the software is slow: one minute of audio took 20 minutes to process. That would come out to something like 30 hours for a standard 90-minute podcast. Unfortunately, this makes Whisper unusable for me. For the record, I ran this on a 2019 MacBook Pro 16 (Intel) and used the sample code from the website:
import whisper

model = whisper.load_model("medium")
result = model.transcribe("podcast.mp3")
Maybe someone else can try and share their numbers. I don't know a thing about machine learning, so I probably did something wrong. If there are any performance tips, feel free to share them. Thanks in advance!
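For what it's worth, the `result` dict from `model.transcribe` also carries per-segment timestamps ("segments" entries with "start", "end", and "text"), so you can write your own subtitle file from it. A rough sketch of an SRT writer; the sample segments below are made up and just stand in for a real result:

```python
def fmt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Render a list of {'start', 'end', 'text'} dicts as an SRT string."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{fmt_time(seg['start'])} --> {fmt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Made-up sample data in the same shape as result["segments"]:
segments = [
    {"start": 0.0, "end": 3.5, "text": " Hello and welcome to the podcast."},
    {"start": 3.5, "end": 7.2, "text": " Today we talk about transcription."},
]
print(to_srt(segments))
```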
Also worth noting is that whisper doesn't seem to be set up for long-form content: https://github.com/openai/whisper/discussions/29#discussioncomment-3726710
So it looks like I'll still have to feed "Big Tech", for now.
The instructions are here: https://github.com/openai/whisper#readme
and Persian is supported. Please feel free to share your results; I would be interested.
Yes, Whisper does take forever on CPU. However, if you run it on a GPU it is much faster. Using the small model I could process a 30-minute audio clip in approximately 12 minutes.
You can see the approximate times on their website, I believe, but on a GPU the large, most accurate model should approach real time, while the fastest model should be around 32x faster.
For French I found some errors using the small model, so I exclusively use the large model. I have been using it for TV episodes which are 21 minutes long. Each one takes approximately 18 minutes to process, and I simply queue up 10 or more of them before I go to sleep.
If you don’t have a strong GPU, you might consider using Google Colab. It’s a free resource using Google's hardware. I'm not too clear on the limitations, but for Whisper at least I tested it and didn't run into any over the course of about an hour. I imagine it would work for a 90-minute podcast once in a while. Make sure you switch the runtime environment to GPU.
As for long-form content, I did see some discussion about it, but I haven't personally encountered any problems, so I don't know much about that.
The only thing that annoys me a bit about Whisper is that the timestamps for the generated subtitles can be somewhat off and need manual adjustment. Of course, as noted on their website, some languages perform wildly differently than others.
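If the timestamp offset is fairly constant, shifting every timestamp in the .srt file is easy to script. A rough sketch in pure Python, no external tools; it assumes the standard HH:MM:SS,mmm SRT timestamp format:

```python
import re

# Matches SRT timestamps of the form HH:MM:SS,mmm
TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_srt(text, offset_ms):
    """Shift every timestamp in an SRT string by offset_ms milliseconds
    (negative values pull subtitles earlier, clamped at zero)."""
    def repl(m):
        h, mi, s, ms = map(int, m.groups())
        total = max(0, h * 3_600_000 + mi * 60_000 + s * 1_000 + ms + offset_ms)
        h, rem = divmod(total, 3_600_000)
        mi, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02d}:{mi:02d}:{s:02d},{ms:03d}"
    return TS.sub(repl, text)

srt = "1\n00:00:01,000 --> 00:00:03,250\nHello.\n"
print(shift_srt(srt, -500))  # pull subtitles 500 ms earlier
```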
I don't know if anyone else has been looking for a more performant alternative to the original Python implementation, but I found this gem: https://github.com/ggerganov/whisper.cpp
It's a reimplementation in C/C++, optimized for CPU using SIMD, threads, etc. Overall I'm very impressed with the results. Occasionally it fails or gets hung up on one sentence, which then repeats endlessly, but this only happened on 2 out of the 23 files I tried.
The accuracy is actually quite impressive. For Chinese it is on par with Amazon, I'd say. I have also tried Romanian, which probably hasn't had much training data; Google Cloud, for example, struggles quite a bit, while Whisper gets better results, although not close to the accuracy it achieves in Chinese. But I'm not looking for perfection anyway. Here are some non-scientific performance numbers. All tests were done with the medium model on a Mac mini M1 using 4 threads (adding more doesn't improve the results), as of commit b4a3875.
Input length in seconds / time to output in seconds
So, roughly 3x realtime, 3 hours of input take about 1 hour to process. The above results were with Chinese, but I couldn't see a difference with Romanian.
The software also supports vtt and srt subtitle output. Please note that LingQ will ignore the first subtitle line after importing; I already reported this bug. Just use txt for the time being.
I didn't try the Python implementation, but I decided to give this a try, and it seems to work well for Japanese. Granted, I only tested it on a couple of files, and the speaker was speaking very clearly, so take it for what it's worth.
What model did you try? I went for base, since that's what they suggest in the repo, but I'm not sure how it fares in terms of quality. As I understand it, with the bigger ones (medium, large) you trade computation for more accurate subtitles?
Thanks for the share btw, very interesting.
My tests were with the medium model. Maybe I'll experiment with other sizes later, but currently I'm satisfied with accuracy and speed, especially for Chinese. Romanian is already a bit flaky, not sure I could go lower.
The original paper (https://cdn.openai.com/papers/whisper.pdf) includes "word error rate" (WER) data starting on page 23; this can serve as an indicator of accuracy across the various languages and might help to determine which model to choose.
Interestingly Japanese seems to achieve significantly better results than Chinese, even though the latter had quite a bit more training data (see Figure 3 on page 7). Curious.
Btw, one big thing I forgot to mention: Whisper supports punctuation, which is not a given even for paid services.
guys im so bad at this how do you even begin to use the program !
Well, I only tried it on macOS, on Linux it's probably even easier. I don't really know, but Windows is probably a world of pain. Please note, all commands are provided in good conscience, but I assume no liability if you screw up your system :)
There are a couple of prerequisites:
- a c/c++ compiler
On macOS: if you have Xcode installed, you are good; otherwise, try running (can't quite remember...):
Check your compiler:
On Linux, everything may be installed already; on Ubuntu you might need:
sudo apt install g++
- git (version control, preinstalled)
- wget (for downloading the model file)
macOS: brew install wget
- FFmpeg (for converting the audio files)
brew install ffmpeg
Linux, e.g. Ubuntu:
sudo apt install ffmpeg
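If you want to sanity-check the prerequisites before building, a quick helper; Python shown just because it's everywhere, and `shutil.which` does the PATH lookup:

```python
import shutil

def missing_tools(tools):
    """Return the subset of tool names not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# The prerequisites from the list above:
print(missing_tools(["git", "wget", "ffmpeg", "g++"]))
```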
Clone and build the project, then run the bundled example:
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make
bash ./download-ggml-model.sh base.en
./main -m models/ggml-base.en.bin -f samples/jfk.wav
If you want to convert your own files, you need to prepare them first (change input.mp3 to your input file):
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
Maybe download a different model, e.g. medium:
bash ./download-ggml-model.sh medium
This is the command I used for my tests thus far (medium model, Chinese language, srt subtitle file output):
./main -m models/ggml-medium.bin -f /Users/bamboo/podcast.wav -l zh -osrt
You can choose the language code: https://github.com/ggerganov/whisper.cpp/blob/master/whisper.cpp#L31
And choose the output format:
-otxt -ovtt -osrt
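To queue up a batch overnight (as mentioned earlier in the thread), the ffmpeg conversion and the transcription can be wrapped in a small driver script. A sketch that only prints the command lines it would run; the `episodes` directory, model path, and language are placeholders, and you'd uncomment the `subprocess.run` calls to actually execute:

```python
import pathlib
import subprocess  # used only if you uncomment the run() calls below

MODEL = "models/ggml-medium.bin"  # placeholder paths
LANG = "zh"

def convert_cmd(mp3):
    """Build the ffmpeg command converting an mp3 to 16 kHz mono wav."""
    wav = mp3.with_suffix(".wav")
    return ["ffmpeg", "-i", str(mp3), "-ar", "16000", "-ac", "1",
            "-c:a", "pcm_s16le", str(wav)], wav

def transcribe_cmd(wav):
    """Build the whisper.cpp command producing an srt file."""
    return ["./main", "-m", MODEL, "-f", str(wav), "-l", LANG, "-osrt"]

for mp3 in sorted(pathlib.Path("episodes").glob("*.mp3")):
    ff, wav = convert_cmd(mp3)
    print(" ".join(ff))
    print(" ".join(transcribe_cmd(wav)))
    # subprocess.run(ff, check=True)
    # subprocess.run(transcribe_cmd(wav), check=True)
```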
thanks im gonna try it!!
Thank you. Does it support Farsi language and generate subtitles from YouTube movies? What is the maximum movie length?
Yes, it supports the Farsi language.
What is the maximum movie length?
I really don't know
i tried getting this to work on my mac to no avail :( not sure what i did wrong
OMG bro! I've been looking for something likes this for months! I'm so happy. I'm learning Bulgarian and finding transcriptions is like the hardest thing ever. Thank you! 😁
Google the show's title and include .srt / subtitles. You may get some good results. This works quite well for me and Japanese.
I've gotten meeting transcripts by recording the meeting, uploading the file to YouTube, using the auto-generate subtitles function, and then finding the "show transcript" button in the function tab. It's a bit clunky, but it's free.
yess i tried this but for a tv show and they flagged it and banned it :(
Sorry - I've never tried it with copyrighted material. Does this happen when you keep the files private?
Basically all of the "Big Tech" companies offer speech-to-text (STT) services.
They typically expect users to interact with an API, so a basic level of programming / scripting knowledge is helpful. But web interfaces exist, at least for AWS and Google. Still, the first step is always to upload the audio to the respective storage solution (e.g. S3). This involves a bit of clicking around but should be good enough for occasional use.
Personally, I have been using AWS Transcribe for Chinese podcasts, and I'm happy with the results. It sometimes struggles with names (esp. foreign ones) and when multiple people speak at the same time (it doesn't record anything then). In general, it helps to set the number-of-speakers option. AWS can output srt and/or vtt subtitle files; those can be imported into LingQ using the "import ebook" functionality, though this doesn't allow audio to be synced alongside.
In general, I would expect the quality of these services to be excellent, at least for the major languages. IIRC, Google's and Microsoft's Chinese transcriptions weren't any worse than Amazon's. Pricing should be comparable, but depends on region and currency. I only use Amazon because it is the most convenient for me.
Another service is https://www.iflyrec.com/. It is probably the most-used service in mainland China and is supposedly excellent, but be aware that creating an account, and especially payment, can be challenging.
Ahh ok i was looking for a free option haha cause there is 300 episodes and another 3 different series related to it. but ill poke around with it thanks!
I was trying to sign up for this service, but couldn't get past the initial verification step. I'm not getting a text on my US phone number, and they don't offer an email alternative. :(
You mean iflyrec? I think they only accept Chinese phone numbers, and even then the real problem will be the payment. I don't have an account there, but I doubt their service is somehow better than "Big Tech's". It's probably not worth the hassle. Personally, I have moved on to Whisper and create transcripts myself. But I always found Amazon's service to be adequate and reasonably priced as well. If you are interested, I could send you a sample transcription privately. They only require a credit card, IIRC.
heyyy how much did you end up spending on amazon? and how is the process?
I think I spent about 100€ on transcripts. The free tier gives you 60 minutes for free each month for one year; if you exceed that, you pay. The pricing depends on currency and region, see here: https://aws.amazon.com/transcribe/pricing/
The process is quite simple: create a storage bucket on S3 and upload your audio files, then add a transcription job. Everything can be done via the website, but it is available via the API as well. As for options, I found "speaker identification" helps with accuracy; just enter the number of speakers. Also, don't forget to select the subtitle formats. Finally, I've had the language selection reset on me, so when creating multiple jobs, make sure to select the language for each item.
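For anyone who'd rather script this than click through the console, the same job can be submitted with boto3 (the AWS Python SDK). A sketch that only assembles the request parameters; the bucket URI, job name, language, and speaker count are placeholders, and the commented-out line is the actual `start_transcription_job` call:

```python
# import boto3  # requires AWS credentials; uncomment to actually submit

def transcribe_job_params(name, s3_uri, language, speakers):
    """Assemble the request for AWS Transcribe's start_transcription_job:
    subtitle output in srt/vtt plus speaker identification."""
    return {
        "TranscriptionJobName": name,
        "LanguageCode": language,              # e.g. "zh-CN"
        "Media": {"MediaFileUri": s3_uri},
        "Subtitles": {"Formats": ["srt", "vtt"]},
        "Settings": {"ShowSpeakerLabels": True, "MaxSpeakerLabels": speakers},
    }

params = transcribe_job_params(
    "podcast-ep1", "s3://my-bucket/podcast-ep1.mp3", "zh-CN", 2
)
print(params)
# boto3.client("transcribe").start_transcription_job(**params)
```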
I tried HappyScribe quite a bit. In my opinion it is just way too expensive, and you're not even paying for a transcription that is 100 percent correct.
Yea, I used that one and it was nice and helped a bit, but it's not perfect. If it were maybe $10 a month with unlimited use I'd be interested, but yeah, too expensive.
maybe you could try this site.
I haven't tried it, but they give the option to convert 8 hours of video into subtitles and voice and also translate them. The service is provided in many languages. The example they show on the site is impressive. The price is 10 dollars per month.
I also saw this site.
I haven't tried it either, but they have fewer functions for the same price.
Sorry, I was wrong. Payment at https://maestrasuite.com/ is $10 per hour (there are more plans). It is much more expensive.