Description
I'd like three new optional fields added to the JSON video files to support subtitles/transcripts.
"original_machine_srt_subtitles": "a large string of srt-formatted subtitles from Whisper speech recognition",
"human_reviewed_srt_subtitles": "a large string of srt-formatted subtitles that have been corrected by a human",
"human_reviewer_notes_for_subtitles": "a large free form string of notes to future human volunteers correcting the machine subtitles"
All three are optional. The machine subtitles are kept as a reference for future human volunteers, and the human subtitles are produced by correcting the machine ones. The pyvideo website would show a plain-text form of the human subtitles with the srt timestamps removed, falling back to the machine subtitles when no human-reviewed version exists. Both the human and machine srt files would be available for download. The human reviewer notes would not be displayed on the pyvideo website.
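To make the fallback concrete, here is a rough sketch of how the site could pick which subtitles to show and strip the timestamps. The function names `srt_to_text` and `subtitles_for_display` are made up for illustration, and the parsing assumes plain, well-formed srt cues (counter line, timing line, then text); it is not a full srt parser.

```python
import re

# Matches an srt timing line such as "00:00:01,000 --> 00:00:04,000".
TIMING_LINE = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt: str) -> str:
    """Drop cue counters and timing lines, joining the spoken text with spaces."""
    pieces = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt.replace("\r\n", "\n").strip()):
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        # A cue is: optional counter line, timing line, then the text lines.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        if lines and TIMING_LINE.match(lines[0]):
            lines = lines[1:]
        pieces.extend(lines)
    return " ".join(pieces)


def subtitles_for_display(video: dict) -> str:
    """Prefer the human-reviewed subtitles; fall back to the machine ones."""
    srt = video.get("human_reviewed_srt_subtitles") or video.get(
        "original_machine_srt_subtitles"
    )
    return srt_to_text(srt) if srt else ""
```

Because all three fields are optional, a video with neither srt field simply yields an empty string here, and the page could omit the subtitles section in that case.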
The subtitles would be listed on the video page so that the speaker's words can be indexed by search engines. We could either generate this text automatically from the .srt content (in which case it would be one very big unreadable paragraph), or add a fourth field to contain it and have LLMs decide where to insert paragraph breaks. I'm not sure which is better. Keeping the text and srt formats separate could lead to inconsistencies between them, though we could write code to check for this.
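If we went with the separate text field, the consistency check could be fairly small. This is only a sketch under the assumption that the text field contains exactly the same words as the srt and differs only in whitespace and paragraph breaks. The names `srt_words` and `text_matches_srt` are hypothetical, and the fourth field itself has no agreed name yet.

```python
import re

# Matches an srt timing line such as "00:00:01,000 --> 00:00:04,000".
TIMING_LINE = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_words(srt: str) -> list[str]:
    """Return the spoken words in an srt string, in order."""
    lines = [line.strip() for line in srt.splitlines()]
    words = []
    for i, line in enumerate(lines):
        if not line or TIMING_LINE.match(line):
            continue
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        # A digits-only line directly before a timing line is a cue counter.
        if line.isdigit() and TIMING_LINE.match(next_line):
            continue
        words.extend(line.split())
    return words


def text_matches_srt(text: str, srt: str) -> bool:
    """True if the text has the same words as the srt, ignoring whitespace."""
    return text.split() == srt_words(srt)
```

A check like this could run in CI over the JSON files and flag any video where the paragraph-formatted text has drifted from the srt it was made from.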
Machine subtitles would only be added to the JSON file after a human reviewer has skimmed them for basic quality. (Accents and bad audio quality tend to drastically increase the error rate.) However, thorough corrections by a human reviewer wouldn't be necessary. I don't want perfect to be the enemy of good.
Alternatively, we could display the human reviewer notes on the website to inform visitors of the general quality level of the subtitles. Either way, I recommend we put a link next to all subtitles to a page that tells visitors how they can contribute corrections.