

Fast transcription of unstructured audio recordings

Roy et al., 2009

Document ID
10417787848602850561
Author
Roy B
Roy D
Publication year
2009

Snippet

We introduce a new method for human-machine collaborative speech transcription that is significantly faster than existing transcription methods. In this approach, automatic audio processing algorithms are used to robustly detect speech in audio recordings and split …
Continue reading at dspace.mit.edu (PDF)
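
The snippet describes the automatic front end of the method: speech detection algorithms find the spoken regions of a long, unstructured recording and split them into short pieces that human transcribers can then type back quickly. As a rough illustration of that detection-and-splitting step (a minimal sketch with assumed frame sizes, thresholds, and a 16-bit mono PCM WAV input, not the pipeline from the paper), an energy-based segmenter might look like this:

# Illustrative sketch only (not the authors' code): split a long 16-bit mono PCM WAV
# recording into speech segments by thresholding per-frame energy against an
# estimated noise floor and closing a segment after a sufficiently long pause.
import wave
import array

def find_speech_segments(path, frame_ms=30, energy_ratio=2.0, min_silence_ms=300):
    """Return (start_sec, end_sec) pairs where frame energy exceeds a noise-based threshold."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1, "expects 16-bit mono PCM"
        rate = wf.getframerate()
        samples_per_frame = int(rate * frame_ms / 1000)
        energies = []
        while True:
            chunk = wf.readframes(samples_per_frame)
            if not chunk:
                break
            pcm = array.array("h", chunk)
            energies.append(sum(s * s for s in pcm) / max(len(pcm), 1))  # mean-square energy

    if not energies:
        return []

    # Assume the quietest 10% of frames are background noise; speech must exceed that floor.
    noise_floor = sorted(energies)[len(energies) // 10]
    threshold = energy_ratio * (noise_floor + 1.0)

    segments, start, silent_frames = [], None, 0
    max_silent_frames = max(1, min_silence_ms // frame_ms)
    for i, energy in enumerate(energies):
        t = i * frame_ms / 1000.0
        if energy >= threshold:
            if start is None:
                start = t            # speech onset
            silent_frames = 0
        elif start is not None:
            silent_frames += 1
            if silent_frames >= max_silent_frames:
                segments.append((start, t))   # close the segment after a long enough pause
                start, silent_frames = None, 0
    if start is not None:
        segments.append((start, len(energies) * frame_ms / 1000.0))
    return segments

# Usage (hypothetical file name):
# for start, end in find_speech_segments("recording.wav"):
#     print(f"speech from {start:.2f}s to {end:.2f}s")

This sketch assumes a single noise floor for the whole file; detectors meant for the messy acoustics of unstructured recordings, as in the paper, would need to be considerably more robust.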

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L15/265 Speech recognisers specially adapted for particular applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/30781 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F17/30784 Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre
    • G06F17/30796 Information retrieval; Database structures therefor; File system structures therefor of video data using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/3074 Audio data retrieval
    • G06F17/30743 Audio data retrieval using features automatically derived from the audio content, e.g. descriptors, fingerprints, signatures, MEP-cepstral coefficients, musical score, tempo
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00

Similar Documents

Publication Title
Albanie et al. Emotion recognition in speech using cross-modal transfer in the wild
Roy et al. Fast transcription of unstructured audio recordings
CN105245917B System and method for multimedia speech subtitle generation
Morgan et al. The meeting project at ICSI
US10977299B2 Systems and methods for consolidating recorded content
Yeh et al. Segment-based emotion recognition from continuous Mandarin Chinese speech
JPWO2005069171A1 (en) Document association apparatus and document association method
Sun et al. Multi-modal sentiment analysis using deep canonical correlation analysis
CN101625862A (en) Method for detecting voice interval in automatic caption generating system
Mirkin et al. A recorded debating dataset
Latif et al. Controlling prosody in end-to-end TTS: A case study on contrastive focus generation
Petermann et al. Tackling the cocktail fork problem for separation and transcription of real-world soundtracks
Lebourdais et al. Overlaps and gender analysis in the context of broadcast media
Li et al. Multi-scale attention for audio question answering
Atmaja et al. Jointly predicting emotion, age, and country using pre-trained acoustic embedding
Solberg et al. A Large Norwegian Dataset for Weak Supervision ASR
Gilmartin et al. Capturing Chat: Annotation and Tools for Multiparty Casual Conversation.
Chen et al. Automatic emphatic information extraction from aligned acoustic data and its application on sentence compression
Kotsakis et al. Investigation of spoken-language detection and classification in broadcasted audio content
Leh et al. Speech Analytics in Research Based on Qualitative Interviews: Experiences from KA3
Bertillo et al. Enhancing Accessibility of Parliamentary Video Streams: AI-Based Automatic Indexing Using Verbatim Reports.
Demilie Melese et al. Speaker-based language identification for Ethio-Semitic languages using CRNN and hybrid features
Kachhoria et al. Minutes of Meeting Generation for Online Meetings Using NLP & ML Techniques
Waghmare et al. A Comparative Study of the Various Emotional Speech Databases
Jain et al. Detection of Sarcasm Through Tone Analysis on Video and Audio Files: A Comparative Study on AI Models Performance