
Get Raw Subtitles for Any Video or Audio with Faster Whisper

Transcribe videos and generate subtitles with Faster Whisper: download YouTube/Bilibili content, produce precise SRT/TXT files leveraging GPU acceleration.

Introduction

After recording a video, how can you quickly add subtitles?

Or, after recording a meeting conversation or interview on an iPhone, how do you convert it into a readable text transcript?

When a video’s content is too complex or there’s a language barrier, how can you feed it to an AI for a structured summary, complete with causes and effects?

All the needs above can be met by using OpenAI’s open-source Whisper automatic speech recognition model, or its high-efficiency derivative project, Faster-Whisper. By deploying and running it locally, you can achieve one-click video/audio to text transcription, subtitle generation, and subsequent content analysis.

What is Whisper?

Whisper is an automatic speech recognition (ASR) system officially released and open-sourced by OpenAI in September 2022. As of 2025, Whisper has been available for nearly three years.

Main Features

  • Speech-to-Text: Automatically recognizes and converts speech content from recordings, videos, etc., into a text transcript.

  • Multi-language Support: Supports more than 57 languages, including Chinese, English, Japanese, and French. The underlying model is trained with weak supervision on data spanning 98 languages, but official support is limited to the 57 languages with a word error rate below 50%.

  • Automatic Language Recognition: The model can automatically determine the language being spoken without needing it to be specified beforehand.

  • Powerful Fault Tolerance: Maintains high recognition accuracy even with poor audio quality, heavy accents, or background noise.

Technical Features

  • Deep Learning Model: Utilizes a Transformer architecture and is trained with weak supervision on a large-scale, multi-language dataset of speech-to-text pairings.

  • Open-Source and Available: The source code and model weights are freely available on GitHub, making it convenient for researchers and developers to perform secondary development and customization.

  • Cross-Platform Support: Can be deployed on various operating systems, including Windows, macOS, and Linux.

  • Multiple Model Sizes: Offers various model configurations, from tiny up to large-v3 and turbo, allowing users to choose based on their resource and accuracy requirements.

Use Cases

  • Automatic Video Subtitle Generation: Quickly generate video transcripts and subtitle files (SRT/TXT).

  • Automated Meeting Minutes: Transcribe audio from online/offline meetings, reducing the time spent on manual organization.

  • Voice Search and Voice Control: Supports use cases like voice queries and command recognition.

  • Podcast and Interview Transcription: Quickly convert long-form audio content into text transcripts for easier content management and retrieval.

Other Versions

In addition to the original Whisper model, the community has developed several derivative projects to meet different performance and functional needs:

  1. faster-whisper: Based on the Whisper architecture but rewritten with CTranslate2 as the inference engine. It significantly improves execution speed and resource efficiency, making it suitable for applications requiring high throughput or real-time transcription.

  2. WhisperX: Builds on faster-whisper by adding advanced features like batch inference, forced alignment, speaker diarization, and voice activity detection (VAD). It is particularly useful for precise transcription and analysis of long audio files or recordings with multiple speakers.

  3. Whisper.cpp: A super-lightweight C/C++ implementation of the Whisper model. It can run locally offline on various edge devices (like laptops, mobile phones, IoT) with a fast startup time, making it ideal for scenarios with high demands for real-time performance and portability.

Previously, I used the official OpenAI Whisper for transcribing video subtitles. However, I later discovered that with the same hardware resources, faster-whisper not only achieves comparable or even better recognition accuracy but also significantly reduces computation time. Therefore, this tutorial will prioritize using faster-whisper. If higher alignment accuracy or multi-speaker analysis is needed in the future, evaluating WhisperX would be the next step.

Deploying faster-whisper Locally

Install Python

You need Python 3.9 or a newer version. Here, we’ll install it directly using WinGet:

winget install -e --id Python.Python.3.9

Check the version with python -V. If a version number is returned, the installation was successful:

PS C:\Users\wells> python -V
Python 3.9.13

Install faster-whisper

pip install faster-whisper
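
To confirm the package works before going any further, you can run a minimal sketch like the one below. The file name sample.mp3 is just a placeholder for any short local audio file, and the base model with int8 on CPU keeps the first run light:

# Minimal sanity check for the faster-whisper installation.
# "sample.mp3" is a placeholder; point it at any short audio file you have.
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("sample.mp3")

print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")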

Install CUDA Toolkit 12 and cuDNN (Optional)

⚠️ This step is only for systems equipped with an NVIDIA graphics card that supports CUDA.

Using a CUDA-supported GPU can dramatically accelerate the model’s processing workflow and significantly reduce the time required. Below is a summary of NVIDIA GPU models that support CUDA:

Category | Models
Data Center & Superchips | B200, B100, GH200, H200, H100, L40S, L40, L4, A100, A40, A30, A16, A10, A2, T4, V100
Workstation & Pro Cards | RTX PRO 6000, RTX PRO 5000, RTX PRO 4500, RTX PRO 4000, RTX 6000, RTX 5000, RTX 4500, RTX 4000, RTX 2000, RTX A6000, RTX A5500, RTX A5000, RTX A4500, RTX A4000, RTX A2000, Quadro RTX 8000/6000/5000/4000, Quadro GV100, Quadro T-Series
GeForce | RTX 50: 5090, 5080, 5070 Ti, 5070, 5060 Ti, 5060; RTX 40: 4090, 4080, 4070 SUPER/Ti, 4070, 4060 Ti, 4060, 4050; RTX 30: 3090 Ti, 3090, 3080 Ti, 3080, 3070 Ti, 3070, 3060 Ti, 3060, 3050 Ti, 3050; RTX 20: 2080 Ti/Super, 2080, 2070 Super, 2070, 2060 Super, 2060; GTX 16: 1660 Ti/Super, 1660, 1650 Super/Ti, 1650; TITAN: Titan RTX, TITAN V
Jetson SoC | Jetson AGX Orin, Jetson Orin NX, Jetson Orin Nano

Go to the official CUDA Toolkit 12.0 Downloads link and select the corresponding operating system and version.

In this case, I downloaded cuda_12.9.1_576.57_windows.exe and kept the default settings throughout the installation.

After installation is complete, you can check if CUDA has been installed in the system directory C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA.

CUDA Toolkit Download Page
Verify CUDA Installation in System Path

Next, go to the official cuDNN accelerated library download page to download a compatible cuDNN version.

After downloading, you will get a zip archive named cudnn-windows-x86_64-9.10.2.21_cuda12-archive.zip.

After unzipping, copy the bin, include, and lib folders into the already installed CUDA directory.

cuDNN Download Page
Copy bin, include, lib from zip to CUDA directory

The installation and configuration of CUDA and cuDNN are now complete.
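
Before moving on, it is worth checking that the GPU is actually visible to CTranslate2, the inference engine behind faster-whisper. The following is a small verification sketch (not part of the tutorial package):

# Verify that faster-whisper can run on the GPU after installing CUDA and cuDNN.
import ctranslate2
from faster_whisper import WhisperModel

print("CUDA devices visible to CTranslate2:", ctranslate2.get_cuda_device_count())

# Loading a model on cuda fails if the CUDA runtime is unavailable;
# cuDNN problems typically surface on the first transcription instead.
model = WhisperModel("tiny", device="cuda", compute_type="float16")
print("GPU model loaded successfully.")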

Using the Script to Transcribe Video/Audio to a Transcript

Script Download

Download the complete package: generate-srt.zip

After unzipping, you can place the entire folder in any directory to run it. The folder contains three files:

  1. generate-srt.bat (Main program)
  2. generate-srt.py
  3. parameters.txt (參數設定.txt)

The batch file relies on external tools to download videos: yt-dlp for YouTube and lux for Bilibili.

You can use WinGet to install yt-dlp:

1
winget install yt-dlp.yt-dlp

If you need to convert Bilibili videos, you will need lux. Please go to its official release page to download the lux.exe executable.

After downloading, unzip lux.exe into the generate-srt folder.

generate-srt Directory

generate-srt.bat

This batch file is responsible for the entire “Download → Transcribe → File Organization” automation. Key logic includes:

  1. Configuration Section (⚠️ Needs to be adjusted based on your computer on first use)

    • whisperModel: Specify the Faster-Whisper model size (tiny/base/small/medium/large-v3/turbo)
    • whisperDevice: Computation device (cpu/cuda)
    • whisperTask: Task mode (transcribe/translate)
    • computeType: Precision type (float16/int8_float16/…)
    • openEditor: The editor to open the subtitle file after transcription (none/notepad/code)
    • autoLoop: Whether to automatically loop to process multiple videos (true/false)
  2. Environment Check

    • Checks if the yt-dlp and lux downloaders are installed or exist in the current directory.
    • Checks if Python is available and if faster_whisper can be imported.
    • Checks if generate-srt.py is in the same directory.
  3. Uses yt-dlp or lux to download the video, then passes the video path to generate-srt.py to output a subtitle file.

  4. Finally, opens the subtitle file with a text editor.

generate-srt.py

This Python script uses the Faster-Whisper package to perform speech recognition. Its core parameters can be passed externally. You generally do not need to modify this file.
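
The actual generate-srt.py ships in the zip above; as a rough, simplified sketch of what its core flow looks like (using the same command-line flags shown later in this article), it reads the options, runs Faster-Whisper, and writes each segment out in SRT format:

# Rough, simplified sketch of the script's core flow; the real generate-srt.py is more complete.
import argparse
from faster_whisper import WhisperModel

def to_timestamp(seconds: float) -> str:
    # Format seconds as an SRT timestamp: HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

parser = argparse.ArgumentParser()
parser.add_argument("input")
parser.add_argument("--model_size", default="base")
parser.add_argument("--device", default="cpu")
parser.add_argument("--task", default="transcribe")
parser.add_argument("--compute_type", default="auto")
args = parser.parse_args()

model = WhisperModel(args.model_size, device=args.device, compute_type=args.compute_type)
segments, info = model.transcribe(args.input, task=args.task)

with open(args.input + ".srt", "w", encoding="utf-8") as srt:
    for i, seg in enumerate(segments, start=1):
        srt.write(f"{i}\n{to_timestamp(seg.start)} --> {to_timestamp(seg.end)}\n{seg.text.strip()}\n\n")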

parameters.txt (參數設定.txt)

This file explains all adjustable parameters, provides usage examples, and recommended settings. Here is a sample of its content:

# Parameter Configuration Guide

## whisperModel
Options: tiny, base, small, medium, large-v3

## openEditor
Options: none, notepad, code

## autoLoop
true: Automatically restart after conversion is complete
false: Run only once

## whisperDevice
Options: cpu, cuda
- To use GPU acceleration, you need an NVIDIA graphics card and must install:
  - CUDA 12
  - cuDNN 9 for CUDA 12
- Important: The latest version of faster-whisper relies on ctranslate2, which only supports CUDA 12.
- If you have an AMD GPU or no NVIDIA CUDA, please select cpu.

## whisperTask
Options: transcribe, translate
- transcribe: Directly convert speech content into text in the original language (i.e., ASR).
- translate: Translate the speech content into English text (i.e., speech translation).

## computeType
Options: auto, default, float16, int8_float16, int8, int8_float32, int8_bfloat16, int16, float32, bfloat16
- NVIDIA graphics card (Compute Capability >= 7.0): float16 or int8_float16 (fast and memory-efficient)
- CPU only (Intel/AMD): int8_float32 or int16 (memory-efficient, good speed), or auto
- AMD GPU: Recommend using auto or float32 (AMD support is limited, auto will select the fastest)
- When unsure, set to auto, and the system will automatically choose the best type
- Reference: https://github.com/OpenNMT/CTranslate2/blob/master/docs/quantization.md

⚠️ Before you start, you must modify the batch file according to your computer’s hardware. Apply the changes to the corresponding variable section in generate-srt.bat.

whisperModel=turbo offers a good balance between results and performance. For less powerful computers, you can change it to base or tiny.

whisperDevice=cpu. If you have already set up CUDA support, you can change this to cuda, which will significantly reduce the conversion time.

openEditor=none. If you want the subtitle file to open automatically after conversion, you can set this to notepad or code.
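
If you are unsure which computeType your machine supports, CTranslate2 can list the options directly; a quick sketch, run in the same Python environment where faster-whisper is installed:

# List the compute types supported on each device, to help choose computeType.
import ctranslate2

print("CPU:", ctranslate2.get_supported_compute_types("cpu"))
if ctranslate2.get_cuda_device_count() > 0:
    print("CUDA:", ctranslate2.get_supported_compute_types("cuda"))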

Usage Example

  1. First, modify the global variables in generate-srt.bat. If you have GPU support configured, change the device setting to SET "whisperDevice=cuda" to speed up the conversion.

  2. Double-click generate-srt.bat, and the system will automatically check if the execution environment meets the requirements.

    Run generate-srt.bat

  3. Enter a YouTube URL, for example, “Taijiang National Park – Conservation and Habitat of the Black-faced Spoonbill”. This video has embedded subtitles and does not provide subtitles in other languages.

    Downloading and Converting to Subtitles

  4. After the conversion is complete, you can find the generated .srt subtitle file in the output folder.

    Conversion Result

Transcript conversion result:

1
00:00:00,690 --> 00:00:05,969
台江國家公園是2009年的12月28號成立

2
00:00:05,969 --> 00:00:11,269
它是一個以濕地為主的一個國家公園

3
00:00:11,269 --> 00:00:13,710
有紀錄的鳥類將近300種

4
00:00:13,710 --> 00:00:16,710
當然最主要是水鳥

5
00:00:16,710 --> 00:00:20,789
當然包括雁鴨類或鷺鷥類

6
00:00:20,789 --> 00:00:22,269
當然還有黑面琵鷺

7
00:00:22,269 --> 00:00:26,429
我們全世界黑面琵鷺大概有3000多隻

8
00:00:26,429 --> 00:00:30,589
可是來台灣去年已經超過了2000隻

9
00:00:30,589 --> 00:00:36,149
當中的90%是在我們曾文溪口

10
00:00:36,149 --> 00:00:45,530
在我們來的第二年數量大概減少了四五百隻

11
00:00:45,530 --> 00:00:52,130
我們推斷的原因可能是它度冬期間的食物來源不足

12
00:00:52,130 --> 00:00:56,130
所以我們提倡淺坪式的魚塭

13
00:00:56,130 --> 00:00:57,350
為什麼要淺坪呢

14
00:00:57,350 --> 00:01:00,990
就是水深大概是一公尺以內

15
00:01:00,990 --> 00:01:04,549
陽光照射到魚塭裡面它會自然的

16
00:01:04,549 --> 00:01:06,549
產生藻類

17
00:01:06,549 --> 00:01:09,549
再適時的加以一些飼料

18
00:01:09,549 --> 00:01:11,549
虱目魚就可以長得很好

19
00:01:11,549 --> 00:01:16,549
所以虱目魚在一般四月份開始養殖

20
00:01:16,549 --> 00:01:18,549
一直到十月份收成

21
00:01:18,549 --> 00:01:22,549
以後還有一些小魚小蝦它要洩水

22
00:01:22,549 --> 00:01:27,549
這個地方就變成黑面琵鷺最喜歡的這個覓食場所

23
00:01:27,549 --> 00:01:30,549
也強調人用半年鳥用半年

24
00:01:30,549 --> 00:01:34,430
我們想到這樣子的方式去推動以後

25
00:01:34,430 --> 00:01:37,430
那也得到BirdLife International

26
00:01:37,430 --> 00:01:40,430
世界上最大的鳥類保育組織的認同

27
00:01:40,430 --> 00:01:44,430
來給我們頒授這個保育的成就獎

Independently Transcribing Local Files

If you don’t need the batch file to download a video and already have a local audio or video file, you can run the following command directly to generate subtitles:

python generate-srt.py [input.mp3] --model_size turbo --device cuda --task transcribe --compute_type float16
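
If you have several local files to process, one option is a small loop that invokes the same script for each file; this is only a sketch, and the media folder name and file extensions are placeholders:

# Sketch: run generate-srt.py over every audio/video file in a folder ("media" is a placeholder).
import subprocess
from pathlib import Path

for path in sorted(Path("media").iterdir()):
    if path.suffix.lower() in {".mp3", ".mp4", ".m4a", ".wav", ".mkv"}:
        subprocess.run(
            ["python", "generate-srt.py", str(path),
             "--model_size", "turbo", "--device", "cuda",
             "--task", "transcribe", "--compute_type", "float16"],
            check=True,
        )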

Conclusion

After converting audio/video content to text, you can submit these transcripts to a large language model (like ChatGPT or Gemini) for summarization, analysis, or key point extraction. This step helps in grasping the core information of the video in a short amount of time, effectively filtering out lengthy or off-topic content and significantly improving information retrieval efficiency.
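
As an illustration of that last step, here is a short sketch that sends a generated .srt file to an LLM for a summary. It uses the OpenAI Python SDK; the file path, model name, and prompt are placeholders, and an OPENAI_API_KEY environment variable is assumed:

# Sketch: summarize a generated transcript with an LLM (OpenAI SDK; names are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("output/video.srt", "r", encoding="utf-8") as f:
    transcript = f.read()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Summarize this transcript into key points, including causes and effects."},
        {"role": "user", "content": transcript},
    ],
)
print(response.choices[0].message.content)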

In my tests, I found that the base model’s output contained quite a few inaccuracies and tended to misspell proper nouns, but it was generally fine for analysis and summarization by GPT. As for turbo and large-v3, I personally feel they are quite similar, but turbo has the advantage of extremely fast conversion speeds. Therefore, for daily use, I primarily use turbo.

Finally, I’m sharing a comparison of the conversion results for an 18-minute video on two different computers—one using an NVIDIA GPU and another with only a CPU—for anyone who might find it useful.

Hardware Configuration | Model Version | GPU Used | Processing Time
NVIDIA RTX 3060 Ti | base | Yes | 01 min 16 sec
NVIDIA RTX 3060 Ti | large-v3-turbo | Yes | 01 min 42 sec
NVIDIA RTX 3060 Ti | large-v3 | Yes | 06 min 43 sec
AMD Ryzen 5 3500X | base | No | 03 min 17 sec
AMD Ryzen 5 3500X | large-v3-turbo | No | 11 min 29 sec

References

  1. openai/whisper
  2. Speech to text - OpenAI API
  3. Word error rate
  4. SYSTRAN/faster-whisper
  5. m-bain/whisperX: WhisperX