Best Way to Generate Subtitles?
pixla

To make text from audio, you need BlackHole and Microsoft Word. I hope you'll now share more of your sources (text and audio) on LingQ with these tools :D https://www.youtube.com/watch?v=w0Y582Ihx58
Cinderela

Live Transcribe & Notification - The best app I have ever found.
Perfect for transcribing news or programs where the speech is clear and there is not much background noise. Also effective for movies but, of course, less so.
ericb100

You can try the Google Docs voice typing feature (found under "Tools"). Set it to the language you're trying to capture. It captures over the mic, so you need relative silence around you. It probably works best for a very clear and distinct voice (i.e. a single speaker with little background noise).
I've not had a huge amount of luck with this during very limited experimentation. It usually works OK to start but then stops at some point. Possibly I have some extensions that interfere, or my usual 100,000 open tabs or something else is hogging resources. I generally don't have a use for it anyway, so I haven't tried it much.
This suggestion is from noxialisrex, who has said it's worked pretty well for him in the past.
ericb100

BTW, after a little more playing around, it looks like the uBlock extension may have been affecting this. I turned it off and it appears to be picking up everything pretty nicely, even with a fair amount of background noise (like in a news report). The only bad thing is that there's no punctuation.
Grieds

There are a lot of resources on writing, go online.
Bcpt

OpenAI released Whisper, their open-source transcription tool, which does this very well.
Cinderela

Thank you. Where can I find this software/app? Does it work in Farsi?
bamboozled

Thanks for the hint! I wasn't aware of Whisper, so I just tried it out.
The result is pretty good, not really worse than Amazon, but I only tested 1 minute of audio because this software is slow: one minute of audio took 20 minutes to process. At that rate a standard 90-minute podcast would take something like 30 hours, which unfortunately makes Whisper unusable for me. For the record, I ran this on a 2019 MacBook Pro 16 (Intel) and used the sample code from the website:
import whisper
model = whisper.load_model("medium")
result = model.transcribe("podcast.mp3")
print(result["text"])
Maybe someone else can try and share their numbers. I don't know a thing about machine learning, so I probably did something wrong. If there are any performance tips, feel free to share them. Thanks in advance!
Also worth noting is that whisper doesn't seem to be set up for long-form content: https://github.com/openai/whisper/discussions/29#discussioncomment-3726710
So it looks like I'll still have to feed "Big Tech", for now.
@Cinderela
The instructions are here: https://github.com/openai/whisper#readme
Persian is supported. Please feel free to share your results; I would be interested.
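For anyone who wants to try the Python package with Persian, here is a minimal sketch rather than an official recipe. It assumes the openai-whisper package and FFmpeg are installed; "podcast.mp3" is a placeholder file name, "fa" is Whisper's language code for Persian, and transcribe_file and main are helper names made up for this example.

```python
def transcribe_file(model, audio_path, language=None):
    """Transcribe one audio file and return the plain text.

    `model` is a loaded Whisper model (anything with a transcribe() method).
    `language` is a code such as "fa" for Persian; None lets Whisper detect it.
    """
    options = {}
    if language is not None:
        options["language"] = language
    result = model.transcribe(audio_path, **options)
    return result["text"].strip()


def main():
    # Imported here so the helper above can be used without Whisper installed.
    import whisper  # pip install -U openai-whisper

    # Smaller models ("tiny", "base", "small") are faster but less accurate.
    model = whisper.load_model("small")
    print(transcribe_file(model, "podcast.mp3", language="fa"))
```

Call main() from a script or a Python prompt once everything is installed. Passing the language explicitly skips auto-detection, which may help with short or noisy clips.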
Bcpt

Yes, Whisper actually takes forever on CPU. However, if you run it on a GPU it is faster. Using the small model, I could process a 30-minute audio clip in approximately 12 minutes.
I believe you can see the approximate relative speeds on their website. On a GPU the large, most accurate model should approach real time, while the fastest (tiny) model is listed at around 32x the speed of the large one.
For French I found some errors using the small model so I exclusively use the large model. I have been using it for tv episodes which are 21 minutes long. Each one takes approximately 18 minutes to process, and I just simply queue 10 or more of them up before I go to sleep.
If you don’t have a strong GPU, you may consider using Google Colab. It’s a free resource that runs on Google’s hardware. I’m not too clear on the limitations, but I tested it with Whisper and didn’t run into any over the course of about an hour. I imagine it would work for you for a 90-minute podcast once in a while. Make sure you switch the runtime environment to GPU.
As for long-form content, I did see some discussion about it, but I haven’t personally encountered any problems, so I don’t know much about that.
The only thing about Whisper that annoys me a bit is that the timestamps for the generated subtitles can be somewhat off and need manual adjustment. Of course, as noted on their website, some languages perform wildly differently from others.
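When the timestamps are off by a roughly constant amount, shifting the whole file is usually enough. Below is a minimal sketch of that idea, assuming standard SRT timing lines such as "00:01:02,345 --> 00:01:04,000"; shift_srt is a name made up for this example, and drift that grows over time would need more than a fixed offset.

```python
import re

# One SRT timestamp, e.g. 00:01:02,345
TIMESTAMP = re.compile(r"(\d{2,}):(\d{2}):(\d{2}),(\d{3})")


def shift_srt(srt_text, offset_seconds):
    """Return srt_text with every cue moved by offset_seconds.

    Positive offsets delay the subtitles, negative ones show them earlier.
    Times that would become negative are clamped to 00:00:00,000.
    """
    offset_ms = round(offset_seconds * 1000)

    def shift(match):
        hours, minutes, seconds, millis = (int(g) for g in match.groups())
        total = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
        total = max(total + offset_ms, 0)
        hours, rest = divmod(total, 3_600_000)
        minutes, rest = divmod(rest, 60_000)
        seconds, millis = divmod(rest, 1000)
        return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

    shifted = []
    for line in srt_text.splitlines(keepends=True):
        if "-->" in line:  # only touch timing lines, never the subtitle text
            line = TIMESTAMP.sub(shift, line)
        shifted.append(line)
    return "".join(shifted)


sample = "1\n00:00:01,000 --> 00:00:03,500\nHello\n"
print(shift_srt(sample, 1.25))
```

For a real file, read it with open(path, encoding="utf-8"), pass the text through shift_srt, and write the result to a new file so the original stays untouched.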
bamboozled

I don't know if anyone else has been looking for a more performant alternative to the original Python implementation, but I found this gem:
https://github.com/ggerganov/whisper.cpp
It's a reimplementation in C/C++, optimized for CPU using SIMD, threads, etc. Overall I'm very impressed with the results. Occasionally it fails or gets hung up on one sentence, which then repeats endlessly, but this only happened on 2 out of the 23 files I tried.
The accuracy is actually quite impressive. For Chinese I'd say it is on par with Amazon. I have also tried Romanian, which probably hasn't had much training data (Google Cloud, for example, struggles quite a bit with it); Whisper gets better results, although not close to the accuracy it achieves in Chinese. But I'm not looking for perfection anyway. Here are some non-scientific performance numbers. All tests were done with the medium model on a Mac mini M1 using 4 threads (adding more doesn't improve the results), as of commit b4a3875.
Input length (s)    Time to output (s)
1539                533
2389                697
1755                587
1810                531
1731                553
1733                540
1491                494
1976                659
2181                651
2282                733
Total:
18887               5978
So, roughly 3x real time: 3 hours of input take about 1 hour to process. The above results were with Chinese, but I couldn't see a difference with Romanian.
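The arithmetic behind that figure can be checked in a few lines. This is only a sketch using the numbers from the table above; estimate_minutes is a helper name made up here, and the result says nothing about other machines or models.

```python
# (input length in seconds, processing time in seconds), from the table above
runs = [
    (1539, 533), (2389, 697), (1755, 587), (1810, 531), (1731, 553),
    (1733, 540), (1491, 494), (1976, 659), (2181, 651), (2282, 733),
]

total_input = sum(length for length, _ in runs)
total_processing = sum(seconds for _, seconds in runs)
speed = total_input / total_processing  # how many times faster than real time


def estimate_minutes(audio_minutes, speed_factor):
    """Rough processing time for a recording of the given length."""
    return audio_minutes / speed_factor


print(total_input, total_processing)  # 18887 5978
print(f"{speed:.2f}x real time")      # 3.16x real time
print(f"{estimate_minutes(90, speed):.0f} min for a 90 min podcast")  # 28 min
```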
The software also supports VTT and SRT subtitle output. Please note that LingQ will ignore the first subtitle line after importing; I have already reported this bug. Just use txt for the time being.
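If you already have an .srt file and only want the text, the cue numbers and timing lines can be stripped with a short script. This is a minimal sketch assuming the usual SRT layout (number line, timing line, text lines, blank line); srt_to_text is a name made up for this example.

```python
import re

# Timing line such as: 00:00:01,000 --> 00:00:03,500
TIMING_LINE = re.compile(r"^\d{2,}:\d{2}:\d{2}[,.]\d{3}\s*-->")


def srt_to_text(srt_text):
    """Drop cue numbers and timing lines, keep one line of text per cue."""
    kept = []
    cleaned = srt_text.lstrip("\ufeff").strip()  # some files start with a BOM
    for block in re.split(r"\n\s*\n", cleaned):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines and lines[0].isdigit():          # cue number
            lines = lines[1:]
        if lines and TIMING_LINE.match(lines[0]):  # start --> end
            lines = lines[1:]
        if lines:
            kept.append(" ".join(lines))
    return "\n".join(kept)


example = (
    "1\n00:00:00,000 --> 00:00:02,000\nHello there\n\n"
    "2\n00:00:02,000 --> 00:00:04,000\nsecond cue\ncontinues\n"
)
print(srt_to_text(example))
```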
rafarafa

I didn't try the Python implementation, but I decided to give this a try and it seems to work well for Japanese. Granted, I only tested it on a couple of files, and the speaker was speaking very clearly, so take it for what it's worth.
What model did you try? I went for the base since it's what they suggested in the repo but I'm not sure how it fares with respect to quality. As I understand it with the bigger ones (medium, large) you trade computation for more accurate subtitles?
Thanks for the share btw, very interesting.
bamboozled

My tests were with the medium model. Maybe I'll experiment with other sizes later, but currently I'm satisfied with accuracy and speed, especially for Chinese. Romanian is already a bit flaky, not sure I could go lower.
The original paper (https://cdn.openai.com/papers/whisper.pdf) includes "word error rate" (WER) data, starting on page 23. This can serve as an indicator of the accuracy across the various languages and might help to determine which model to choose.
Interestingly, Japanese seems to achieve significantly better results than Chinese, even though the latter had quite a bit more training data (see Figure 3 on page 7). Curious.
Btw, one big thing I forgot to mention is that Whisper supports punctuation, which is not a given even for paid services.
alex1029

Guys, I'm so bad at this. How do you even begin to use the program?!
bamboozled

Well, I only tried it on macOS, on Linux it's probably even easier. I don't really know, but compiling on Windows is probably a world of pain.
[Edit: actually you can just download a Windows executable from the automated build system on Github - look under Actions -> Artifacts]
Please note, all commands are provided in good conscience, but I assume no liability if you screw up your system :)
Prerequisites:
- C/C++ compiler
- git (version control, preinstalled)
- FFmpeg (for converting the audio files)
macOS:
If you have Xcode installed, you are good; otherwise try running:
xcode-select --install
Check your compiler:
clang -v
(if you have brew.sh installed):
brew install ffmpeg
(MacPorts should work as well)
Linux:
Potentially everything is in place already, on Ubuntu you may need:
sudo apt install g++
and
sudo apt install ffmpeg
---
Cloning the repo and compiling:
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
make
bash ./download-ggml-model.sh base.en
(run the bundled example):
./main -m models/ggml-base.en.bin -f samples/jfk.wav
---
Pre-processing the input file:
(change input.mp3 to your input file):
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
Model download:
For example, the medium model:
bash ./download-ggml-model.sh medium
Example command:
This is the command I used for my tests thus far (medium model, Chinese language, srt subtitle file output):
./main -m models/ggml-medium.bin -f /Users/bamboo/podcast.wav -l zh -osrt
Choose the language:
-l followed by the language code (e.g. -l zh for Chinese, as in the command above)
Output format:
-otxt (plain text file)
-ovtt (VTT subtitle)
-osrt (SRT subtitle)
Beam Search:
If you struggle with Whisper getting stuck (endlessly repeating the same line), try the -bs parameter, e.g. -bs 5. But note that beam search is really slow; for me it reduces the speed with the medium model from 3x to 1x real time.
Other options:
The -pc option gives colored output, according to how confident the model is
-pp will print occasional progress messages, e.g. 5% done
---
Batch:
You can also batch convert:
for f in *.mp3; do ffmpeg -i "${f}" -ar 16000 -ac 1 -c:a pcm_s16le "${f%.*}.wav"; done
for f in *.wav; do ./main -m models/ggml-medium.bin -l zh -pc -pp -osrt -f "${f}"; done
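The same batch job can be written in Python if you prefer it over shell loops. This is only a sketch: it reuses the ffmpeg and ./main options from the commands above and assumes you run it from the whisper.cpp folder; the function names and the dry_run switch are made up for this example. Unlike the shell version, it only transcribes the WAV files that belong to an MP3 in the folder.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(source, target):
    """ffmpeg call that converts an audio file to 16 kHz mono 16-bit WAV."""
    return ["ffmpeg", "-i", str(source), "-ar", "16000", "-ac", "1",
            "-c:a", "pcm_s16le", str(target)]


def whisper_command(wav, model="models/ggml-medium.bin", language="zh"):
    """whisper.cpp call with SRT output, same flags as the example above."""
    return ["./main", "-m", model, "-l", language, "-osrt", "-f", str(wav)]


def batch_transcribe(folder, dry_run=False):
    """Convert and transcribe every .mp3 in folder; return the commands."""
    commands = []
    for mp3 in sorted(Path(folder).glob("*.mp3")):
        wav = mp3.with_suffix(".wav")  # replaces only the last extension
        if not wav.exists():           # skip files that were converted before
            commands.append(ffmpeg_command(mp3, wav))
        commands.append(whisper_command(wav))
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)  # stop on the first failure
    return commands
```

Calling batch_transcribe("podcasts", dry_run=True) just returns the list of commands, which is a cheap way to check the file names before starting a run that takes hours.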
alex1029

thanks im gonna try it!!
MohammedAltalib

I use this program. It's not perfect, but it's free.
pyTranscriber
https://github.com/raryelcostasouza/pyTranscriber
This explains how to use it
Cinderela

Thank you. Does it support Farsi language and generate subtitles from YouTube movies? What is the maximum movie length?
MohammedAltalib

Yes, it supports Farsi.
"What is the maximum movie length?"
I really don't know.
Cinderela

Thank you!
alex1029

i tried getting this to work on my mac to no avail :( not sure what i did wrong
rickvan_wichen

OMG bro! I've been looking for something like this for months! I'm so happy. I'm learning Bulgarian and finding transcriptions is like the hardest thing ever. Thank you! 😁
ericfromlingq

Google the show's title and include .srt / subtitles. You may get some good results. This works quite well for me with Japanese.
Alan_R

I've gotten meeting transcripts by recording the meeting, uploading the file to YouTube, using the auto-generate subtitles function, and then finding the "show transcript" button in the function tab. It's a bit clunky but it's free . . . .
alex1029

yess i tried this but for a tv show and they flagged it and banned it :(
Alan_R

Sorry - I've never tried it with copyrighted material. Does this happen when you keep the files private?
alex1029

yup!
bamboozled

Basically all of the "Big Tech" companies offer speech to text services (STT).
Google: https://cloud.google.com/speech-to-text/
Microsoft: https://azure.microsoft.com/en-us/products/cognitive-services/speech-to-text/
Facebook: wit.ai
Amazon: https://aws.amazon.com/transcribe/
They typically expect users to interact with an API, so a basic level of programming / scripting knowledge is helpful. But web interfaces exist, at least for AWS and Google. Still, the first step is always to upload the audio to the respective storage solution (e.g. S3). This involves a bit of clicking around but should be good enough for occasional use.
Personally, I have been using AWS Transcribe for Chinese podcasts, and I'm happy with the results. It sometimes struggles with names (esp. foreign ones) and when multiple people speak at the same time (it doesn't record anything). In general it helps to set the number of speakers option. AWS can output srt and/or vtt subtitle files, which can be imported into LingQ using the "import ebook" functionality; however, this doesn't allow audio to be synced alongside.
In general I would expect the quality of these services to be excellent, at least in the major languages. IIRC Google's and Microsoft's Chinese transcriptions weren't any worse than Amazon's. Pricing should be comparable, but depends on region and currency. I only use Amazon because it is the most convenient for me.
Another service is https://www.iflyrec.com/. It is probably the most used service in Mainland China and is supposedly excellent, but be aware that creating an account and especially payment can be challenging.
alex1029

Ahh ok i was looking for a free option haha cause there is 300 episodes and another 3 different series related to it. but ill poke around with it thanks!
michelleschilz

I was trying to sign up for this service, but couldn't get past the initial verification step. I'm not getting a text on my US phone number, and they don't offer an email alternative. :(
bamboozled

You mean iflyrec? I think they only accept Chinese phone numbers, and even then the real problem will be the payment. I don't have an account there, but I doubt that their service is somehow better than "Big Tech". It's probably not worth the hassle. Personally, I have moved on to Whisper and create transcripts myself. But I always found Amazon's service to be adequate and reasonably priced as well; Amazon only requires a credit card, IIRC. If you are interested, I could send you a sample transcription privately.
alex1029

heyyy how much did you end up spending on amazon? and how is the process?
bamboozled

I think I spent about 100€ on transcripts. In the free tier you get 60 minutes for free each month for one year; if you exceed that, you pay. The pricing depends on currency and region, see here: https://aws.amazon.com/transcribe/pricing/
The process is quite simple: create a storage bucket on S3 and upload your audio files, then add a transcription job. Everything can be done using the website, but is available via API as well. As for options, I found that "speaker identification" helps with accuracy; just enter the number of speakers. Also, don't forget to select the subtitle formats. And I've had the language selection reset on me, so when creating multiple jobs, make sure to select the language for each item.
https://docs.aws.amazon.com/transcribe/latest/dg/getting-started-console.html
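For the API route, the same options can be set from Python with boto3. This is a rough sketch, not something taken from the steps above: the job name, bucket, and file name are placeholders, the two function names are made up, and it assumes boto3 is installed and AWS credentials are configured. The parameter names are those of the StartTranscriptionJob call as I understand them, so check the linked documentation before relying on it.

```python
def transcription_job_request(job_name, media_uri, language_code, speakers=2):
    """Build the keyword arguments for a Transcribe job with subtitles."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,         # set it for every job, e.g. "zh-CN"
        "Media": {"MediaFileUri": media_uri},  # s3://bucket/key of the upload
        "Settings": {
            "ShowSpeakerLabels": True,         # "speaker identification"
            # Assumption: the API expects at least 2 speakers here.
            "MaxSpeakerLabels": max(2, speakers),
        },
        "Subtitles": {"Formats": ["srt", "vtt"]},
    }


def start_job(**request):
    """Send the request; imported lazily so the builder works without boto3."""
    import boto3  # pip install boto3

    client = boto3.client("transcribe")
    return client.start_transcription_job(**request)


request = transcription_job_request(
    "episode-01", "s3://my-bucket/episode-01.mp3", "zh-CN", speakers=2
)
print(request)
```

Once the printed request looks right, start_job(**request) submits it. Building the request separately makes it easy to loop over many episodes without forgetting the language on any of them.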
MarkE

I tried HappyScribe quite a bit. It is just way too expensive in my opinion, though, and you're not even paying for a transcription that is 100 percent correct.
alex1029

Yeah, I used that one and it was nice and helped a bit, but it's not perfect. If it was maybe like $10 a month and you could use it as much as you wanted, I'd be interested, but yeah, too expensive.
Cinderela

maybe you could try this site.
I haven't tried it, but they give the option to convert 8 hours of video into subtitles and voice and also translate them. The service is provided in many languages. The example they show on the site is impressive. The price is 10 dollars per month.
I also saw this site.
https://wearenova.ai/nova-tools/automatic-transcriptions/#translate-languages
I haven't tried it either, but they have fewer functions for the same price.
Cinderela

Sorry, I was wrong. Payment at https://maestrasuite.com/ is $10 per hour (there are more plans). It is much more expensive.
alex1029

yuppp