US20130124984A1 - Method and Apparatus for Providing Script Data - Google Patents
Method and Apparatus for Providing Script Data
- Publication number
- US20130124984A1 (Application No. US 12/789,749)
- Authority
- US
- United States
- Prior art keywords
- script
- metadata
- clip
- words
- revised
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Links
- 238000013515 script Methods 0.000 title claims abstract description 978
- 238000000034 method Methods 0.000 title claims abstract description 238
- 238000004519 manufacturing process Methods 0.000 claims description 61
- 238000003860 storage Methods 0.000 claims description 20
- 230000000875 corresponding effect Effects 0.000 description 147
- 238000012545 processing Methods 0.000 description 88
- 230000009471 action Effects 0.000 description 61
- 239000011159 matrix material Substances 0.000 description 59
- 238000004458 analytical method Methods 0.000 description 33
- 230000008569 process Effects 0.000 description 31
- 230000015654 memory Effects 0.000 description 21
- 230000000694 effects Effects 0.000 description 20
- 238000004422 calculation algorithm Methods 0.000 description 18
- 238000012986 modification Methods 0.000 description 18
- 230000004048 modification Effects 0.000 description 18
- 238000010586 diagram Methods 0.000 description 16
- 238000013518 transcription Methods 0.000 description 16
- 230000035897 transcription Effects 0.000 description 16
- 230000000007 visual effect Effects 0.000 description 16
- 230000007704 transition Effects 0.000 description 14
- 238000003780 insertion Methods 0.000 description 9
- 230000037431 insertion Effects 0.000 description 9
- 238000004891 communication Methods 0.000 description 7
- 238000005516 engineering process Methods 0.000 description 7
- 238000000605 extraction Methods 0.000 description 7
- 230000002596 correlated effect Effects 0.000 description 6
- 238000003058 natural language processing Methods 0.000 description 6
- 238000005192 partition Methods 0.000 description 6
- 238000012552 review Methods 0.000 description 5
- 230000001360 synchronised effect Effects 0.000 description 5
- 230000003542 behavioural effect Effects 0.000 description 4
- 230000005540 biological transmission Effects 0.000 description 4
- 230000008859 change Effects 0.000 description 4
- 239000000284 extract Substances 0.000 description 4
- 230000006870 function Effects 0.000 description 4
- 230000001771 impaired effect Effects 0.000 description 4
- 238000013507 mapping Methods 0.000 description 4
- 239000003550 marker Substances 0.000 description 4
- 230000002093 peripheral effect Effects 0.000 description 4
- 238000007781 pre-processing Methods 0.000 description 4
- 239000000047 product Substances 0.000 description 4
- 238000013500 data storage Methods 0.000 description 3
- 238000001914 filtration Methods 0.000 description 3
- 230000008676 import Effects 0.000 description 3
- 230000003287 optical effect Effects 0.000 description 3
- 230000008901 benefit Effects 0.000 description 2
- 238000012937 correction Methods 0.000 description 2
- 238000009826 distribution Methods 0.000 description 2
- 239000000835 fiber Substances 0.000 description 2
- 238000002955 isolation Methods 0.000 description 2
- 238000009877 rendering Methods 0.000 description 2
- 230000003252 repetitive effect Effects 0.000 description 2
- 230000004044 response Effects 0.000 description 2
- 238000007619 statistical method Methods 0.000 description 2
- 238000012800 visualization Methods 0.000 description 2
- 230000001755 vocal effect Effects 0.000 description 2
- 208000032041 Hearing impaired Diseases 0.000 description 1
- 230000006978 adaptation Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 238000013499 data model Methods 0.000 description 1
- 238000013501 data transformation Methods 0.000 description 1
- 230000001934 delay Effects 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 230000000977 initiatory effect Effects 0.000 description 1
- 238000012804 iterative process Methods 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000007726 management method Methods 0.000 description 1
- 230000008520 organization Effects 0.000 description 1
- 230000002441 reversible effect Effects 0.000 description 1
- 238000005096 rolling process Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 238000000638 solvent extraction Methods 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 239000000126 substance Substances 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 239000013589 supplement Substances 0.000 description 1
- 238000012549 training Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 230000001131 transforming effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/102—Programmed access in sequence to addressed parts of tracks of operating record carriers
- G11B27/105—Programmed access in sequence to addressed parts of tracks of operating record carriers of operating discs
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N9/00—Details of colour television systems
- H04N9/44—Colour synchronisation
- H04N9/475—Colour synchronisation for mutually locking different synchronisation sources
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/19—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
- G11B27/28—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/41—Structure of client; Structure of client peripherals
- H04N21/414—Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance
- H04N21/4143—Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance embedded in a Personal Computer [PC]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
- H04N21/43072—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/435—Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440236—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/472—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
- H04N21/47205—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/84—Generation or processing of descriptive data, e.g. content descriptors
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8547—Content authoring involving timestamps for synchronizing content
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/44—Receiver circuitry for the reception of television signals according to analogue transmission standards
- H04N5/445—Receiver circuitry for the reception of television signals according to analogue transmission standards for displaying additional information
Definitions
- a script serves as a roadmap to when and how elements of a movie/video will be produced.
- scripts are a rich source of additional metadata and include numerous references to characters, people, places, and things.
- directors, editors, sound engineers, set designers, marketing, advertisers, and other production personnel are interested in knowing which people, places, and things occurred or will occur in certain scenes.
- This information is often present in the script but is not typically directly correlated to the corresponding video content (e.g., video and audio) because timing information is missing from the script. That is, elements of the script are not correlated with a time in which they appear in the corresponding video content.
- For example, for script elements such as spoken dialogue, production personnel may know from the script that a character speaks a certain line in a scene, yet they may not be able to readily determine the precise time in the working or final video when the particular line was spoken.
- a full script can include several thousand script elements or entities. If one were to try to find the actual point in time at which a particular event (e.g., when a line was spoken) occurred in a corresponding movie/video, the video content may have to be manually searched by a viewer to locate the event such that the corresponding timecode can be manually recorded. Thus, production personnel may not be able to easily search or index their scripts and video content.
- the actual recorded clip may vary from the script and, thus, the script and the actual recorded video and audio may not correlate well with one another. Typically, these changes are tracked manually, if at all. This can lead to increased difficulties in post-production operations, such as aligning the script with the recorded video and audio.
- the script text is said to be “aligned” with the recorded dialogue, and the resulting script may be referred to as an “aligned script.”
- Aligned scripts may be useful as production personnel often desire to search or index video/audio content based on the text provided in the script.
- production personnel may desire to generate closed caption text that is synchronized to actual spoken dialogue in video content.
- time aligning is a difficult task to automate.
- time-aligning textual scripts and metadata to actual video content is a tedious task that is accomplished by a manual process that can be expensive and time-consuming.
- a person may have to view and listen to video content and manually transcribe the corresponding audio to generate an index of what took place and when, or to generate closed captioning text that is synchronized to the video.
- To manually locate and record a timecode for even a small fraction of the dialogue words and script elements within a full-length movie often requires several hours of manual work, and doing this for the entire script might require several days or more. Searching may be even more difficult in view of differences between the script and what was actually recorded and how it was ordered during production. Similar difficulties may be encountered while creating video descriptions for the blind or visually impaired.
- a movie may be manually searched to identify gaps in dialogue for the insertion of video description narrations that describe visual elements (e.g., actions, settings) and a more complete description of what is taking place on screen.
- a method that includes providing script data that includes ordered script words indicative of dialogue and providing audio data corresponding to at least a portion of the dialogue.
- the audio data includes timecodes associated with dialogue.
- the method includes correlating the script data with the audio data, and generating time-aligned script data that includes time-aligned words indicative of dialogue spoken in the audio data and corresponding timecodes for time-aligned words.
- a computer-implemented method that includes providing video content data corresponding to script data that includes ordered script words indicative of dialogue.
- the video content data includes audio data and a transcript that includes transcript words corresponding to at least a portion of the dialogue and timecodes associated with the transcript words.
- the method also includes correlating the script data with the video content data, and generating time-aligned script data that includes time-aligned words indicative of words spoken in the video content and corresponding timecodes for time-aligned words.
- a method that includes receiving script metadata extracted from a script for a program, wherein the script metadata includes clip metadata associated with a particular portion of the program, associating the clip metadata with a clip corresponding to the particular portion of the program, receiving a request to revise the clip metadata, revising the clip metadata in accordance with the request to revise the clip metadata to generate revised clip metadata associated with the clip, and generating a revised script using the revised clip metadata.
- a non-transitory computer readable storage medium having program instructions stored thereon, wherein the program instructions are executable to cause a computer system to perform a method that includes receiving script metadata extracted from a script for a program, wherein the script metadata includes clip metadata associated with a particular portion of the program, associating the clip metadata with a clip corresponding to the particular portion of the program, receiving a request to revise the clip metadata, revising the clip metadata in accordance with the request to revise the clip metadata to generate revised clip metadata associated with the clip, and generating a revised script using the revised clip metadata.
- a computer system for receiving script metadata extracted from a script for a program, wherein the script metadata includes clip metadata associated with a particular portion of the program, associating the clip metadata with a clip corresponding to the particular portion of the program, receiving a request to revise the clip metadata, revising the clip metadata in accordance with the request to revise the clip metadata to generate revised clip metadata associated with the clip, and generating a revised script using the revised clip metadata.
- FIG. 1A is a block diagram that illustrates components and dataflow for document time-alignment in accordance with one or more embodiments of the present technique.
- FIG. 1B is text that illustrates exemplary script data in accordance with one or more embodiments of the present technique.
- FIG. 1C is text that illustrates exemplary transcript data in accordance with one or more embodiments of the present technique.
- FIG. 1D is text that illustrates exemplary time-aligned script data in accordance with one or more embodiments of the present technique.
- FIG. 2 is a block diagram that illustrates components and dataflow for script time-alignment in accordance with one or more embodiments of the present technique.
- FIG. 3 is a flowchart that illustrates a script time-alignment method in accordance with one or more embodiments of the present technique.
- FIG. 4 is a flowchart that illustrates a script synchronization method in accordance with one or more embodiments of the present technique.
- FIG. 5A is a depiction of an exemplary alignment matrix in accordance with one or more embodiments of the present technique.
- FIG. 5B is a depiction of an exemplary alignment sub-matrix in accordance with one or more embodiments of the present technique.
- FIG. 6 is a depiction of an exemplary graphical user interface sequence in accordance with one or more embodiments of the present technique.
- FIG. 7A is a depiction of multiple lines of text that include a script phrase, a transcript phrase and a corresponding representation of alignment in accordance with one or more embodiments of the present technique.
- FIG. 7B is a depiction of multiple lines of text that include a script phrase, a transcript phrase and a corresponding representation of alignment in accordance with one or more embodiments of the present technique.
- FIG. 7C is a depiction of a line of text and corresponding in/out ranges in accordance with one or more embodiments of the present technique.
- FIGS. 8A and 8B are block diagrams that illustrate components and dataflow of a script time-alignment technique in accordance with one or more embodiments of the present technique.
- FIG. 9A is a depiction of an exemplary script document in accordance with one or more embodiments of the present technique.
- FIG. 9B is a depiction of a portion of an exemplary video description script in accordance with one or more embodiments of the present technique.
- FIG. 9C is a flowchart that illustrates a method of generating a video description in accordance with one or more embodiments of the present technique.
- FIG. 10A is a block diagram that illustrates a script workflow in accordance with one or more embodiments of the present technique.
- FIG. 10B is a block diagram that illustrates components and dataflow for providing script data in accordance with one or more embodiments of the present technique.
- FIG. 10C is diagram depicting an illustrative display of a graphical user interface for viewing/revising script metadata in accordance with one or more embodiments of the present technique.
- FIG. 10D is a flowchart that illustrates a method of providing script data in accordance with one or more embodiments of the present technique.
- FIG. 10E is a block diagram that illustrates components and dataflow for processing a script in accordance with one or more embodiments of the present technique.
- FIG. 11 is a block diagram that illustrates an example computer system in accordance with one or more embodiments of the present technique.
- Speech-To-Text a process by which source audio containing dialogue or narrative is automatically transcribed to a textual representation of the dialogue or narrative.
- the source audio may also contain music, noise, and/or sound effects that generally contribute to lower transcription accuracy.
- STT transcript a document generated by an STT transcription engine containing the transcription of the dialogue or narrative of the audio source.
- Each word in the transcript may include an associated timecode which indicates precisely when the audio content associated with each word of the dialogue or narrative occurred.
- Timecodes are typically provided in hours, minutes, seconds and frames.
- Feature films are typically shot at 24 frames per second; thus, twelve frames is about ½ second in duration.
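As an aside, the timecode arithmetic described above is straightforward to implement. The following minimal sketch (illustrative only, not taken from the patent) converts between an HH:MM:SS:FF timecode and seconds, assuming 24 fps non-drop-frame timecode; the function names are assumptions.

```python
# Minimal sketch: convert an HH:MM:SS:FF timecode to seconds and back,
# assuming non-drop-frame timecode at a configurable frame rate (default 24 fps).

def timecode_to_seconds(tc: str, fps: int = 24) -> float:
    """Convert 'HH:MM:SS:FF' to seconds from the start of the program."""
    hours, minutes, seconds, frames = (int(part) for part in tc.split(":"))
    return hours * 3600 + minutes * 60 + seconds + frames / fps

def seconds_to_timecode(total: float, fps: int = 24) -> str:
    """Convert seconds to 'HH:MM:SS:FF'."""
    frames = int(round(total * fps))
    ff = frames % fps
    s = frames // fps
    return f"{s // 3600:02d}:{(s % 3600) // 60:02d}:{s % 60:02d}:{ff:02d}"

print(timecode_to_seconds("01:59:25:12"))  # 7165.5 (12 frames is 0.5 s at 24 fps)
print(seconds_to_timecode(7165.21))        # '01:59:25:05'
```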
- Script a document that outlines all of the visual, audio, behavioral, and spoken elements required to tell the story in a corresponding video or movie. Dramatic scripts are often referred to as a “screenplay”. Scripts may not include timecode data, such that they may not provide information about when an element of the script actually occurs within corresponding video content (e.g., a script may not provide a relative time within the video content that indicates precisely when the audio content associated with each word of the dialogue or narrative occurred).
- Shooting Script a version of a script that contains scene numbers, individual shots and other production notes that is used during production and recording of the program.
- Script dialogue/narrative the script lines to be spoken in a corresponding video or movie. Each script line may include text that includes one or more words.
- Script alignment a process by which a set of words of a dialogue or narrative in a script are matched to corresponding transcribed words of video content.
- Script alignment may include providing an output that is indicative of a relative time within the video content that words of dialogue or narrative contained in the script are spoken.
- Aligned Script a script that outlines all of the visual, audio, behavioral, and spoken elements required to tell the story in a corresponding video or movie and includes timecode data indicative of when elements of the script actually occur within corresponding video content (e.g., a time aligned script may include a relative time within the video content that indicates precisely when the audio content associated with each word of the dialogue or narrative occurred).
- Word n-gram a consecutive subsequence of N words from a given sequence. For example, (The, rain, in), (rain, in, Spain) and (in, Spain, falls) are valid 3-grams from the sentence, “The rain in Spain falls mainly on the plain.”
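To make the word n-gram definition above concrete, the short sketch below (illustrative, not from the patent) enumerates the consecutive N-word subsequences of a word list.

```python
# Minimal sketch of the word n-gram definition: enumerate consecutive
# N-word subsequences from an ordered word list.

def word_ngrams(words, n):
    """Return all consecutive n-word tuples from an ordered word list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The rain in Spain falls mainly on the plain".split()
print(word_ngrams(sentence, 3)[:3])
# [('The', 'rain', 'in'), ('rain', 'in', 'Spain'), ('in', 'Spain', 'falls')]
```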
- Alignment matrix a mathematical structure used to represent how the words from a script source will align with the transcribed words of a transcript (e.g., an STT transcript generated via a speech-to-text (STT) process).
- a vertical axis of the matrix may be formed of words in a script in the sequence/order in which they occur (e.g., ordered script words)
- a horizontal axis of the matrix may be formed of words in the transcript in the sequence/order in which they occur (e.g., ordered transcript words).
- Each matrix cell at the intersection of a corresponding row/column may indicate the accumulated number of word insert, update, or delete operations needed to match the sequence of ordered script words to the sequence of ordered transcript words up to that (row, col) entry.
- a path with the lowest score through the matrix is indicative of the best word alignment.
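The alignment matrix described above behaves like a standard word-level edit-distance (dynamic programming) matrix. The sketch below is an illustrative implementation under that assumption: cell (i, j) accumulates the insert/update/delete operations needed to align the first i script words with the first j transcript words, and the lowest-cost path through the matrix corresponds to the best word alignment.

```python
# Minimal sketch of an alignment matrix as word-level edit distance.
# Cell (i, j) holds the accumulated insert/update/delete operations needed to
# align the first i script words with the first j transcript words.

def alignment_matrix(script_words, transcript_words):
    rows, cols = len(script_words) + 1, len(transcript_words) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i                          # delete remaining script words
    for j in range(cols):
        m[0][j] = j                          # insert remaining transcript words
    for i in range(1, rows):
        for j in range(1, cols):
            same = script_words[i - 1].lower() == transcript_words[j - 1].lower()
            m[i][j] = min(m[i - 1][j] + 1,                       # delete script word
                          m[i][j - 1] + 1,                       # insert transcript word
                          m[i - 1][j - 1] + (0 if same else 1))  # match or update
    return m

m = alignment_matrix("the rain in Spain falls mainly".split(),
                     "the rain in spain fall mainly".split())
print(m[-1][-1])  # 1: a single update ('falls' vs 'fall') aligns the two sequences
```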
- NLP Natural Language Processing
- Program a visual and audio production that is recorded and played back to an audience, such as a movie, television show, documentary, etc.
- Edited Program (sequence or cut) a visual and audio production that is recorded and played back to an audience, e.g.: a movie, television show, documentary, etc.
- Dialogue the words spoken by actors or other on-screen talent during a program.
- Video Description an audio track in a program containing descriptions of the setting and action.
- the video description may be inserted into the natural pauses in dialogue or between critical sound elements.
- a video description often includes narration to fill in the story gaps for the blind or visually impaired by helping to describe visual elements and provide a more complete description of what's happening (e.g., visually) in the program.
- Describer a person who develops the description to be recorded by the voicer. In some cases, the describer is also the voicer.
- SAP Secondary Audio Program
- Digital Television broadcasting (DTV)—Analog broadcasting ceased in the U.S. in 2009 and was replaced by DTV.
- Script GUI a “what you see is what you get” (WYSIWYG) graphical representation of the written script.
- a Script GUI may provide a representation of the script in an industry standard format.
- a document includes at least a portion of a script document, such as a movie or speculative script (e.g., dramatic screenplay), that outlines visual, audio, behavioral, and spoken elements required to tell a story.
- video content includes video and/or audio data that corresponds to at least a portion of the script document.
- the audio data of the video content is transcribed into a textual format (e.g., spoken dialogue/narration is translated into words).
- the transcription is provided via a speech-to-text (STT) engine that automatically generates a transcript of words that correspond to the audio data of the video content.
- the transcript includes timing information that is indicative of a point in time within the video content that one or more words were actually spoken.
- the words of the transcript (“transcript words”) are aligned with corresponding words of the script (“script words”).
- aligning the transcript words with corresponding script words includes implementation of various processing techniques, such as matching sequences of words, assessing confidence/probabilities that the words identified are in fact correct, and substitution/replacement of script/transcript word with transcript/script words.
- the resulting output includes time-aligned script data.
- the script data includes a time-aligned script document including an accurate representation of each of the words actually spoken in the video content, and timing information that is indicative of when the words of the script were actually spoken within the video content (e.g., a timecode associated with each word of dialogue/narration).
- time-aligned data may include timecodes for other elements of the script, such as scene headings, action elements, character names, parentheticals, transitions, shot elements, and the like.
- two source inputs are provided: (1) a script (e.g., plain dialogue text or a Hollywood Spec. Script/Dramatic screenplay) and (2) an audio track dialogue (e.g., an audio track dialogue from video content corresponding to the script).
- a coarse-grain alignment of blocks of text is performed by first matching identical or near identical N-gram sequences of words to generate corresponding “hard alignment points”.
- the hard-alignment points may include matches between portions of the script and transcript (e.g., N-gram matches of a sequence of script words with a sequence of transcript words) which are used to partition an initial single alignment matrix (e.g., providing a correspondence of all ordered script words vs. all ordered transcript words) into a number of smaller sub-matrices (e.g., each providing a correspondence of script words that occur between the hard alignment points vs. transcript words that occur at or between the hard alignment points).
- additional word matches may be identified as "soft alignment points" within each sub-matrix block of text.
- the soft alignment points may define multiple non-overlapping interpolation intervals.
- unmatched words may be located between the matched words (e.g., between the hard alignment points and/or the soft alignment points).
- an interpolation (e.g., a linear or non-linear interpolation) may be performed to determine timecodes for each of the non-matched words (e.g., words that have not been assigned timecode information) occurring between the matched points.
- the timecode information for all words (e.g., matched and unmatched) may be merged with the words of the script and/or transcript documents to generate a time-aligned script document that includes all of the words spoken and their corresponding timecode information to indicate when each of the words was actually spoken within the video content.
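A minimal sketch of the interpolation step described above, assuming matched words (hard or soft alignment points) already carry transcript timecodes and unmatched words between two anchors receive linearly interpolated times based on their word positions. The data layout and names are illustrative, not the patent's exact scheme.

```python
# Minimal sketch: linearly interpolate timecodes for unmatched words that fall
# between two matched alignment points. Anchors map word index -> timecode (s),
# with the timecodes taken from the matched transcript words.

def interpolate_timecodes(words, anchors):
    """words: ordered script words; anchors: {word_index: timecode_in_seconds}."""
    timecodes = dict(anchors)
    indices = sorted(anchors)
    for lo, hi in zip(indices, indices[1:]):
        step = (anchors[hi] - anchors[lo]) / (hi - lo)
        for k in range(lo + 1, hi):                  # unmatched words in the interval
            timecodes[k] = anchors[lo] + (k - lo) * step
    return [(word, timecodes.get(i)) for i, word in enumerate(words)]

words = "well dad can you take out the trash".split()
print(interpolate_timecodes(words, {0: 7164.50, 1: 7165.21, 7: 7167.90}))
```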
- Such a technique may benefit from combining the accuracy of the script words and the timecodes of the transcript words.
- the techniques described herein may enable all textual elements (e.g., dialogue/narration) of a script (e.g., a Hollywood movie script or dramatic screenplay script) to be automatically time-aligned to the specific points in time within corresponding video content, to identify when specific dialogue, text, or actions within the script actually occur within the video content. This enables identifying and locating when dialogue and important semantic metadata provided in a script actually occur within corresponding production video content.
- time alignment may be applied to all elements of the script (e.g., scene headings, action elements, etc.) to enable a user to readily identify where various elements, not just dialogue words, occur within the script.
- the timecode information may also be used to identify gaps in dialogue for the insertion of video description content that includes narrations to fill in the story gaps for the blind or visually impaired, thereby helping to describe visual elements and provide a more complete description of what's happening (e.g., visually) in the program.
- the techniques described herein may be employed to automatically and accurately synchronize the written movie script (e.g., which may contain accurate text, but no time information) to a corresponding audio transcript (e.g., which contains accurate time information but may include very noisy or erroneous text).
- techniques may employ the transcript to identify actual words/phrases spoken that vary from the text of the script.
- the accuracy of the words in the script or transcript may, thus, be combined with accurate timing information in the transcript to provide an accurate time aligned script.
- the techniques described herein may demonstrate good tolerance to noisy transcripts or transcripts that have a large number of errors. By partitioning the alignment matrix into many smaller sub-matrices, the techniques described herein may also provide improved performance including increased processing speeds while maintaining significantly higher overall accuracy.
- FIG. 1A is a block diagram that illustrates system components and dataflow of a system for implementing time-alignment (system) 100 in accordance with one or more embodiments of the present technique.
- system 100 implements a synchronization module 102 to analyze a document 104 and corresponding video content 106 . Based on the analysis, system 100 generates time-aligned data (e.g., time aligned script document) 116 that associates various portions of document 104 with corresponding portions of video content 106 .
- Time aligned data 116 may provide the specific points in time within video content 106 that elements (e.g., specific dialogue, text, or actions) defined in document 104 actually occur.
- document 104 (e.g., a script) is provided to a document extractor 108 .
- Document extractor 108 may generate a corresponding document data 110 , such as a structured/tagged document.
- a structured/tagged document may include embedded script data that is provided to synchronization module 102 for processing.
- document 104 may include a script document, such as a movie script (e.g., a Hollywood script), a speculative script, a shooting script (e.g., a Hollywood shooting script), a closed caption (SRT) video transcript or the like.
- a movie script may include a document that outlines all of the visual, audio, behavioral, and spoken elements required to tell a story.
- a speculative (“spec”) script or screenplay may include a preliminary script used in both film and television industries.
- a spec script for film generally includes an original screenplay and may be a unique plot idea, an adaptation of a book, or a sequel to an existing movie.
- a “television” spec script is typically written for an existing show using characters and storylines that have already been established.
- a “pilot” spec script typically includes an original idea for a new show.
- a television spec script is typically 20-30 pages for a half hour of programming, 40-60 pages for a full hour of programming, or 80-120 pages for two hours of programming.
- Script 104 may include a full script including several thousand script elements or entities, for instance, or a partial script including only a portion of the full script, such as a few lines, a full scene, or several scenes.
- script 104 may include a portion of a script that corresponds to a clip provided as video content 106 . Since film production is a highly collaborative process, the director, cast, editors, and production crew may use various forms of the script to interpret the underlying story during the production process. Further, since numerous individuals are involved in the making of a film, it is generally desirable that a script conform to specific standards and conventions that all involved parties understand (e.g., it will use a specific format w.r.t. layout, margins, notation, and other production conventions).
- a script document is intended to structure all of the script elements used in a screenplay into a consistent layout.
- Scripts generally include script elements embedded in the script document. Script elements often include a title, author name(s), scene headings, action elements, character names, parentheticals, transitions, shot elements, dialogue/narrations, and the like.
- An exemplary portion of a script segment 130 is depicted in FIG. 1B .
- Script segment 130 includes a scene heading 130 a , action elements 130 b , character names 130 c , dialogues 130 d , and parentheticals 130 e.
- Document (script) extractor 108 may process script 104 to provide document (script) data 110 , such as a structured/tagged script document. Words contained in the document (script) data may be referred to as script words.
- a structured/tagged (script) document may include a sequential listing of the lines of the document in accordance with their order in script 104 , along with a corresponding tag (e.g., tags—“TRAN”, “SCEN”, “ACTN”, “CHAR”, “DIAG”, “PARN” or the like) identifying a determined element type associated with some, substantially all, or all of each of the lines or groupings of the lines.
- a structured/tagged document may include an Extensible Markup Language (XML) format, such as *.ASTX format used by certain products, such as those produced by Adobe Systems, Inc., having headquarters in San Jose, Calif. (hereinafter “Adobe”).
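The structured/tagged representation described above can be modeled as a simple record per line or line group. The following sketch is illustrative only; it is not the *.ASTX schema, and the record fields and example content are assumptions.

```python
# Minimal sketch of a structured/tagged script record: each line (or group of
# lines) carries a tag identifying its element type, as described above.
# This is NOT the actual *.ASTX format; the layout is an illustrative assumption.

from dataclasses import dataclass

ELEMENT_TAGS = {"TRAN", "SCEN", "ACTN", "CHAR", "DIAG", "PARN"}

@dataclass
class ScriptRecord:
    line_number: int   # sequential position within the script
    tag: str           # determined element type, one of ELEMENT_TAGS
    text: str          # the line's text (script words for DIAG records)

    def __post_init__(self):
        if self.tag not in ELEMENT_TAGS:
            raise ValueError(f"unknown element tag: {self.tag}")

records = [
    ScriptRecord(1, "SCEN", "INT. KITCHEN - NIGHT"),
    ScriptRecord(2, "CHAR", "MOTHER"),
    ScriptRecord(3, "DIAG", "Dinner is almost ready."),
]
```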
- document extractor 108 may obtain script 104 (e.g., a layout-preserved version of the document), perform a statistical analysis and/or feature matching of features contained within the document, identify document elements based on the statistical analysis and/or the feature matching, pass the identified document elements through a finite state machine to assess/determine/verify the identified document elements, and assess whether or not document elements are incorrectly identified. If it is determined that there are incorrectly identified document elements, at least a portion of the identification steps may be re-performed; if it is determined that there are no (or sufficiently few) incorrectly identified document elements, a structured/tagged (script) document or other form of document (script) data 110 may be generated/stored/output and provided to synchronization module 102 .
- document extractor 108 may employ various techniques for extracting and transcribing audio data, such as those described in U.S. patent application Ser. No. 12/713,008 entitled “METHOD AND APPARATUS FOR CAPTURING, ANALYZING, AND CONVERTING SCRIPTS”, filed Feb. 25, 2010, which is hereby incorporated by reference as though fully set forth herein.
- video content 106 is provided to an audio extractor 112 .
- Audio extractor 112 may generate a corresponding transcript 114 .
- Video content 106 may include video image data and corresponding audio soundtracks that include dialogue (e.g., character's spoken words or narrations), sound effects, music, and the like.
- Video content 106 for a movie may be produced in segments (e.g., clips) and then assembled together to form the final movie or video product during the editing process.
- a movie may include several scenes, and each scene may include a sequence of several different shots that typically specify a location and a sequence of actions and dialogue for the characters of the scene.
- the sequence of shots may include several video clips that are assembled into a scene, and multiple scenes may be combined to form the final movie product.
- a clip, including video content 106 may be recorded for each shot of a scene, resulting in a large number of clips for the movie.
- Tools such as Adobe Premiere Pro by Adobe Systems, Inc., may be used for editing and assembling clips from a collection of shots or video segments.
- in some embodiments, audio content alone (e.g., audio content such as that of a radio show) may be provided without corresponding video content.
- Audio extractor 112 may process video content 106 to generate a corresponding transcript that includes an interpretation of words (e.g., dialogue or narration) spoken in video content 106 .
- Transcript 114 may be provided as a transcribed document or transcribed data that is capable of being provided to other portions of system 100 for subsequent processing.
- audio extractor 112 includes a speech-to-text engine that takes an audio segment from video content 106 containing spoken dialogue, and uses speech-to-text (STT) technology to generate a time-code transcript of the dialogue.
- transcript 114 may indicate the timecode and duration for each spoken word that is identified by the audio extractor. Words of transcript 114 may be referred to as transcript words.
- speech-to-text (STT) technology may implement a custom language model such as that described herein.
- speech-to-text (STT) technology may implement a custom language model and/or an enhanced multicore STT transcription engine such as those described in U.S. patent application Ser. No. 12/332,297 entitled “ACCESSING MEDIA DATA USING METADATA REPOSITORY”, filed Nov. 13, 2009 and/or U.S. patent application Ser. No. 12/332,309 entitled “MULTI-CORE PROCESSING FOR PARALLEL SPEECH-TO-TEXT PROCESSING”, filed Dec. 10, 2008, which are hereby incorporated by reference as though fully set forth herein.
- a transcript 114 generated by audio extractor 112 may include a raw transcript.
- An exemplary raw transcript (e.g., STT transcript) 132 is depicted in FIG. 1C .
- Raw transcript 132 includes a sequential listing of identified transcript words having associated time code, duration, STT word estimate and additional comments regarding the transcription.
- the timecode may indicate at what point in time within the video content the word was spoken (e.g., transcript word “dad” was spoken 7165.21 seconds from the beginning of the associated video content), the duration may indicate the amount of time the word was spoken from start to finish (e.g., it took about 0.27 sec to say the word “dad”), and comments may indicate potential problems (e.g., that noise in the audio data may have generated an error).
- the raw transcript information may also include a confidence value that indicates the probability that the interpreted/indicated word is accurate.
- the raw transcript information may not include additional text features, such as punctuation, capitalization, and the like.
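A raw transcript record such as those in FIG. 1C can be modeled as shown below. This is an illustrative sketch; the exact fields and file layout produced by a given STT engine are assumptions here, following the description above (timecode, duration, word estimate, confidence, optional comment).

```python
# Minimal sketch of a raw STT transcript record like those shown in FIG. 1C.
# The field set follows the surrounding description; values are illustrative.

from dataclasses import dataclass
from typing import Optional

@dataclass
class TranscriptWord:
    timecode: float                # seconds from the start of the video content
    duration: float                # seconds taken to speak the word
    word: str                      # STT word estimate
    confidence: float              # probability that the interpreted word is correct
    comment: Optional[str] = None  # e.g., a note that noise may have caused an error

entry = TranscriptWord(timecode=7165.21, duration=0.27, word="dad", confidence=0.84)
print(f"{entry.word!r} spoken at {entry.timecode:.2f}s for {entry.duration:.2f}s")
```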
- document extraction and audio extraction may occur in parallel.
- document extractor 108 receives script 104 and generates script data 110 independent of audio extractor 112 receiving video content 106 and generating transcript 114 . Accordingly, these two processes may be performed in parallel with one another.
- document extraction and audio extraction may occur in series. For example, document extractor 108 may receive document 104 and generate document data 110 prior to audio extractor 112 receiving video content 106 and generating transcript 114 , or vice versa.
- Synchronization module 102 may generate time-aligned data 116 .
- Time-aligned data 116 may be provided as a document or raw data that is capable of being provided to other portions of system 100 for subsequent processing.
- Time-aligned data 116 may be based on script information (e.g., document data 110 ) and video content information (e.g., transcript 114 ).
- synchronization module 102 may compare transcript words in transcript 114 to script words in the document (script) data 110 to determine whether or not the transcribed words are accurate. The comparison may use various indicators to assess the accuracy. For example, a plurality of words and phrases with exact matches between transcript 114 and document data 110 may have high probabilities of being correct, and may be referred to as “hard reference points”.
- Words and phrases with partial matches may have a lower probability of being correct, and may be referred to as “soft reference points”.
- Words and phrases that do not appear to have matches may have a low probability of being correct.
- Words and phrases with a low probability of being correct may be subject to additional amounts of processing. For example, low probability matches may be subject to interpolation based on the hard and soft reference points. Words that are part of hard or soft reference points may be referred to as words having a match, whereas words that are not part of a hard or soft reference point may be referred to as unmatched words or words not having a match.
- the hard-alignment points may be used to partition the document data and the transcript into smaller segments that correspond to one another, and additional processing may be performed on the smaller segments in substantial isolation.
- the timecodes and other information associated with matched words may be used to derive (e.g., interpolate) timecode and other information about the unmatched words.
- Time aligned data 116 may include words (e.g., from the script words or transcript words) having a specific timecode associated therewith.
- time aligned data 116 may include words from both document data 110 and transcript data 114 used to generate a single script that accurately identifies words actually spoken in video content 106 along with corresponding timecode information for each spoken word of dialogue or other elements.
- the timecode for each word may be obtained directly from matching words of the transcript, or may be generated (e.g., via interpolation).
- Time aligned data 116 may be stored at a storage medium 118 (e.g., a database), displayed at a display device 120 (e.g., a graphical display viewable by a user), or provided to other modules 122 for processing.
- An exemplary time-aligned script data/document 134 is depicted in FIG. 1D .
- time-aligned data/document 134 includes spoken words 136 grouped with other spoken words of their respective script elements 137 , and provided along with their associated timecodes 138 .
- a start time 140 for each element grouping of lines is also provided.
- each of the script elements (and text of the script elements) is also assigned a corresponding time code.
- FIG. 2 is a block diagram that illustrates components and dataflow of system 100 in accordance with one or more embodiments of the present technique.
- synchronization module 102 includes a script reader 200 , a script analyzer 202 , a Speech-to-Text (STT) reader 204 , an STT analyzer 206 , a matrix aligner 208 , an interval generator/interpolator 210 , and a time-coded script generator 212 .
- FIG. 3 is a flowchart that illustrates a script time-alignment method 300 according to one or more embodiments of the present technique.
- Method 300 may provide alignment techniques using components and dataflow implemented at system 100 .
- method 300 includes providing script content, as depicted at block 302 , providing audio content, as depicted at block 304 , aligning the script content and audio content, as depicted at block 306 , and providing time-coded script data, as depicted at block 308 .
- providing script content includes inputting or otherwise providing a script 104 , such as a Hollywood Spec. Movie Script or dramatic screenplay script (e.g., a plain text document, such as a raw script document), to system 100 . Script 104 may be provided to script extractor 108 , which processes script 104 (e.g., to identify, structure, and extract the text of script 104 ) to generate script data 110 , such as a structured/tagged script document.
- Script extractor 108 may employ techniques for converting documents, such as those described in U.S. patent application Ser. No. 12/332,297 entitled “ACCESSING MEDIA DATA USING METADATA REPOSITORY”, filed Nov. 13, 2009, U.S. patent application Ser.
- Document data 110 may be provided to synchronization module 102 for subsequent processing, as described in more detail below.
- providing audio content includes inputting or otherwise providing video content 106 , such as a clip/shot of a Hollywood movie, having associated audio content that corresponds to a script 104 , to system 100 .
- Audio data may be extracted from video content 106 using various techniques. For example, an audio data track may be extracted from video content 106 using a Speech-to-Text (STT) engine and/or a custom language model.
- audio extractor 112 may employ an STT engine and/or custom language model to generate transcript 114 that includes a transcription of spoken words (e.g., audio dialogue or narration) of the Hollywood movie or other audio data.
- Audio extractor 112 may employ various techniques for extracting and transcribing audio data, such as those described below and/or those techniques described in U.S. patent application Ser. No. 12/332,297 entitled “ACCESSING MEDIA DATA USING METADATA REPOSITORY”, filed Nov. 13, 2009, and/or U.S. patent application Ser. No. 12/332,309 entitled “MULTI-CORE PROCESSING FOR PARALLEL SPEECH-TO-TEXT PROCESSING”, filed Dec. 10, 2008, which are both hereby incorporated by reference as though fully set forth herein.
- a resulting transcript 114 may be provided to synchronization module 102 for subsequent processing, as described in more detail below.
- aligning the script and audio content includes employing a matching technique to align the script words (e.g., dialogue or narrations) of script 104 to elements of the video content 106 . This may include aligning script words to corresponding transcript words.
- alignment includes synchronization module 102 implementing a two-level word matching system to align script words of script 110 to corresponding transcript words of transcript 114 .
- a first matching routine is executed to partition a matrix of script words vs. transcript words into sub-matrices. For example, an N-gram matching scheme may be used to identify high probability matches of a sequence of multiple words.
- N-gram matching may include attempting to exactly (or at least partially) match phrases of multiple transcript words with script words.
- the matched sequence of words may be referred to as hard-alignment points.
- the hard alignment points may include several matched words, and may be used to define boundaries of each sub-matrix. Thus, the hard-alignment points may define smaller matrices of script words vs. transcript words.
- Each of the smaller sub-matrices may, then, be processed (e.g., in series or parallel) using additional matching techniques to identify word matches within each of the sub-matrices.
- processing may be provided via multiple processors. For example, processing in series or parallel may be performed using multiple processors of one or more hosted services or cloud computing environments.
- each of the sub-matrices is processed independent of (e.g., in substantial isolation from) processing of the other sub-matrices.
- These resulting additional word matches may be referred to as soft alignment points.
- the timecode information associated with the words of the hard and soft alignment points may be used to assess timecode information for the unmatched words (e.g., via interpolation). For example, timecodes associated with the words that make up the matched points at the end and beginning of an interval of time may be used as references to interpolate time values for unmatched words that fall within the interval between the matched words. Alignment techniques that may be implemented by synchronization module 102 are discussed in more detail below. Further, techniques for matching are discussed in more detail below with respect to FIGS. 8A and 8B .
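A minimal sketch of the two-level matching just described: unique exact N-gram matches serve as hard alignment points, which partition the script and transcript word sequences into sub-ranges that can then be aligned independently (e.g., with an edit-distance matrix as sketched earlier, in series or in parallel). The function names and the simple monotonicity filter are illustrative choices, not the patent's exact algorithm.

```python
# Minimal sketch: find hard alignment points via unique N-gram matches, then
# partition the script/transcript word sequences into sub-ranges (sub-matrices).

def hard_alignment_points(script_words, transcript_words, n=4):
    """Return (script_index, transcript_index) pairs for N-grams unique to both sources."""
    def ngram_positions(words):
        positions = {}
        for i in range(len(words) - n + 1):
            gram = tuple(w.lower() for w in words[i:i + n])
            positions.setdefault(gram, []).append(i)
        return positions

    script_pos = ngram_positions(script_words)
    transcript_pos = ngram_positions(transcript_words)
    points = []
    for gram, s_hits in script_pos.items():
        t_hits = transcript_pos.get(gram, [])
        if len(s_hits) == 1 and len(t_hits) == 1:   # unambiguous in both sources
            points.append((s_hits[0], t_hits[0]))
    # Keep only points that advance monotonically through both word sequences.
    points.sort()
    monotone, last_t = [], -1
    for s_idx, t_idx in points:
        if t_idx > last_t:
            monotone.append((s_idx, t_idx))
            last_t = t_idx
    return monotone

def partition_into_submatrices(script_words, transcript_words, points):
    """Yield (script_segment, transcript_segment) ranges bounded by hard alignment points."""
    bounds = [(0, 0)] + points + [(len(script_words), len(transcript_words))]
    for (s0, t0), (s1, t1) in zip(bounds, bounds[1:]):
        yield script_words[s0:s1], transcript_words[t0:t1]
```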
- providing time-coded script data includes providing timecodes assigned to all dialogue and other script element types.
- synchronization module 102 after synchronization module 102 aligns word N-grams from script 110 with corresponding word N-grams of transcript 114 , it may output (e.g., to a client application) time information in the form of time-coded script data (e.g., time-aligned script data 116 ) that contains timecodes assigned to some or all dialogue and to some and/or all other script element types associated with script 104 .
- the data may be stored, displayed/presented or processed.
- Time-aligned script data 116 may be processed and used by other applications, such as the Script Align feature of Adobe Premiere Pro.
- processing may be implemented to time-align script elements other than dialogue (e.g., scene headings, action description words, etc.) directly to the video scene or full video content.
- timecodes of the script words may be used to determine a timecode of the script element.
- each of the script elements may be provided in the time-aligned script data in association with a timecode, as discussed above with regard to FIG. 1D .
- Providing time-coded script data may include providing the resulting time-aligned data 116 to a storage medium, display device, or other modules for processing, as described above with regard to FIG. 1A .
- FIG. 4 is a flowchart that illustrates a time-alignment method 400 according to one or more embodiments of the present technique.
- Method 400 may provide alignment techniques using components and dataflow implemented at synchronization module 102 .
- method 400 generally includes reading a script (SCR) file and a speech-to-text (STT) file, and processing the SCR and STT files using various techniques to generate an output that includes time-aligned script data.
- method 400 includes reading an SCR file, as depicted at block 402 .
- This may include reading script data, such as script data 110 , described above with respect to block 302 .
- reading an SCR file may include script reader 200 reading a generated SCR file (e.g., document data 110 ).
- the SCR file may include a record-format representation of a source Hollywood spec. script or dramatic screenplay script. Records contained in the SCR file may each include one complete script element.
- Script reader 200 may extract script element type and data values from each record and place these into an internal representation (e.g., a structured/tagged script document).
- method 400 includes reading an STT file, as depicted at block 404 .
- This may include reading STT data, such as transcript 114 , as described above with respect to block 304 .
- Transcript 114 may include an STT file having transcribed data, such as that of the STT word transcript data 132 depicted in FIG. 1C .
- the STT data may provide a timecode for each spoken word in the audio sound track which corresponds in time to video content 106 .
- method 400 includes building a SCR N-gram dictionary, as depicted at block 406 .
- building an SCR N-gram dictionary includes identifying all possible sequences of a given number of consecutive words.
- the number of words in the sequence may be represented by a number “N”.
- for example, for the phrase "The rain in Spain falls mainly on the plain" where N is set to a value of 3, the resulting 3-grams are: (The, rain, in), (rain, in, Spain), (in, Spain, falls), (Spain, falls, mainly), (falls, mainly, on), (mainly, on, the), and (on, the, plain).
- additional N-gram word sequences may be generated based on words that precede or follow a phrase. For example, where the first word of a following sentence is “Why”, an additional 3-gram may include (the, plain, why).
- the value of N may be set by a user. In some embodiments, the value of N is set to a predetermined value, such as four. For example, N may be automatically set to a default value of four, and the user may have the option to change the value of N to something other than four (e.g., one, two, three, five, etc.).
- script analyzer 202 may build a word N-gram “dictionary” of all words from script 110 and may record their relative positions within script 110 and/or STT analyzer 206 may build a word N-gram “dictionary” of all words from transcript 114 and may record their relative positions within transcript 114 .
- the resulting N-gram dictionaries may include an ordered table of 1-gram, 2-gram, 3-gram, or N-gram word sequences.
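- By way of illustration only, the N-gram dictionary described above can be sketched in a few lines of Python. The function name, the lowercasing of words, and the default value of N are assumptions made for this example and are not part of the disclosed system; the sketch simply records every N-word sequence together with the positions at which it starts:

      from collections import defaultdict

      def build_ngram_dictionary(words, n=3):
          """Map each n-word sequence to the list of positions at which it starts."""
          dictionary = defaultdict(list)
          for i in range(len(words) - n + 1):
              ngram = tuple(w.lower() for w in words[i:i + n])
              dictionary[ngram].append(i)
          return dictionary

      script_words = "The rain in Spain falls mainly on the plain".split()
      print(dict(build_ngram_dictionary(script_words, n=3)))
      # {('the', 'rain', 'in'): [0], ('rain', 'in', 'spain'): [1], ..., ('on', 'the', 'plain'): [6]}

  Building such a dictionary for both script 110 and transcript 114 would make the N-gram matching of block 408 a simple key lookup.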
- method 400 includes matching N-grams, as depicted at block 408 .
- matching N-grams may include attempting to match N-grams of the script 110 to corresponding N-grams of transcript 114 .
- SCR analyzer 202 and/or STT analyzer 206 may attempt to match all word N-grams of the N-gram dictionaries and may store the matches (e.g., in an internal table) in association with corresponding timecode information associated with the respective transcript word(s).
- the stored matching N-grams may indicate the potential for a matched sequence of words, and may be referred to as “candidate” N-grams for merging.
- a phrase from the script N-gram dictionary may be matched with a corresponding phrase in the transcript N-gram dictionary; however, due to the phrase being repeated several times within the script/video content, the match may not be accepted until the relative positions can be verified.
- method 400 includes merging N-grams, as depicted at block 410 .
- merging of N-grams may be provided by SCR analyzer 202 and/or STT analyzer 206 .
- merging N-grams includes merging some or all sequential N-gram matches into longer matched N-grams. For example, where two consecutive matching N-grams are identified, such as two consecutive 3-grams of (The, rain, in) and (rain, in, Spain), they may be merged together to form a single N-gram, referred to as a single 4-gram of (The, rain, in, Spain). Such a technique may result in merged N-grams of length N+1 after each iteration.
- the technique may be repeated (e.g., iteratively) to merge all consecutive N-grams to provide N-grams having higher values of N.
- N-grams with higher values of N may have higher probabilities of being an accurate match.
- the iterative process may continue until no additional N-gram matches are identified. For example, where there are at most ten consecutive words identified as matching, increasing to an 11-gram length may yield no matching results, thereby terminating the merging process. Further, techniques for N-gram matching are discussed in more detail below with respect to FIGS. 8A and 8B .
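- As a hedged sketch of the merge step (the tuple representation of the matches is an assumption made only for this example), the following Python fragment expands matching N-grams into matched word pairs and collapses maximal diagonal runs, so that two overlapping 3-gram matches become a single 4-gram match:

      def merge_ngram_matches(ngram_matches, n):
          """Merge matching N-grams into longer runs (hard-alignment candidates).

          `ngram_matches` is an iterable of (script_pos, transcript_pos) start
          positions of matching N-grams; this layout is assumed for illustration.
          """
          # Every word covered by a matching N-gram is a matched (script, transcript) pair.
          pairs = {(s + k, t + k) for s, t in ngram_matches for k in range(n)}
          merged = []
          for s, t in sorted(pairs):
              if (s - 1, t - 1) in pairs:
                  continue                        # not the start of a diagonal run
              length = 1
              while (s + length, t + length) in pairs:
                  length += 1
              merged.append((s, t, length))       # (script_start, transcript_start, run_length)
          return merged

      # Two consecutive matching 3-grams merge into a single 4-gram.
      print(merge_ngram_matches([(0, 5), (1, 6)], n=3))   # [(0, 5, 4)]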
- the resulting set of merged N-grams may provide a set of “hard alignment points”.
- each separate N-gram may indicate with relatively high certainty that a sequence of words in script 110 precisely matches a sequence of words in transcript 114 .
- the sequence of words may identify a hard-alignment point.
- a hard alignment point may include a series of matched words.
- the hard alignment points may include a series of words that each soft-align.
- timing data for each of the words of the matching N-grams may be correlated with the corresponding script words.
- timing data for other words (e.g., unmatched words or words having low probabilities of accurate matches) may be assessed and determined based on the timecode data of words associated with matched words (e.g., words that make up at least a portion of one or more alignment points). For example, interpolation may be used to assess and determine the position of a script word that occurs between matched script words (e.g., script words associated with alignment points).
- method 400 includes generating a sub-matrix, as depicted at block 412 .
- each hard alignment point may define a block of script text (e.g. a sequence of words in script 110 ) and a timecode indicative of where the hard alignment point occurs in the video.
- although script and transcript words associated with hard alignment points may be associated with timecode data, other script words (e.g., unmatched words between each hard alignment point) may still need to be aligned to corresponding transcript words to assess and determine their respective timecodes.
- each successive pair of hard/soft alignment points is used to create an alignment sub-matrix.
- the alignment sub-matrix may include script words (e.g., sub-set of script words) that occur between matched script words (e.g., script words associated with hard alignment points) and intermediate transcript words (e.g., a sub-set of transcript words) that occur between matched transcript words (e.g., transcript words associated with hard alignment points).
- the script words may be provided along one axis (e.g., the y or x-axis) of the sub-matrix, and the intermediate transcript words may be provided along the other axis (e.g., the x or y-axis) of the sub-matrix.
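- One possible way to derive the sub-matrices from the hard alignment points is sketched below; the data layout (index pairs and ranges) is an illustrative assumption rather than the internal format of synchronization module 102 :

      def sub_matrices(hard_points, n_script, n_transcript):
          """Return (script_range, transcript_range) spans bounded by successive
          hard alignment points; each span holds the words still to be aligned
          inside the corresponding sub-matrix."""
          bounds = [(-1, -1)] + sorted(hard_points) + [(n_script, n_transcript)]
          spans = []
          for (s0, t0), (s1, t1) in zip(bounds, bounds[1:]):
              script_range = range(s0 + 1, s1)        # unmatched script word indices
              transcript_range = range(t0 + 1, t1)    # unmatched transcript word indices
              if len(script_range) or len(transcript_range):
                  spans.append((script_range, transcript_range))
          return spans

      # Two hard alignment points partition the full matrix into three smaller spans.
      print(sub_matrices([(3, 4), (10, 12)], n_script=15, n_transcript=18))
      # [(range(0, 3), range(0, 4)), (range(4, 10), range(5, 12)), (range(11, 15), range(13, 18))]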
- FIG. 5A depicts an exemplary (full) alignment matrix 500 in accordance with one or more embodiments of the present technique.
- Alignment matrix 500 may include some or all of the script words aligned in sequence along the y-axis and some or all of the transcript words aligned in sequence along the x-axis, or vice versa.
- in an ideal case where the script and transcript are identical, the script words and transcript words would match exactly, resulting in a substantially straight line having a slope of about one or negative one.
- hard alignment points 502 are identified. Between each of the hard-alignment points 502 are a number of soft alignment points 504 (denoted by squares) and/or interpolated alignment points 506 (denoted by X's).
- Hard alignment points 502 may be determined as a result of matching/merging N-gram sequences as discussed above with respect to blocks 408 and 410 .
- Soft alignment points 504 may be determined as a result of additional processing, such as use of a standard/optimized Levenshtein algorithm, discussed in more detail below.
- Interpolated alignment points 506 may be determined as a result of additional processing, such as linear or non-linear interpolation between hard and/or soft alignment points, discussed in more detail below.
- Interpolation intervals 507 extend between adjacent soft alignment points 504 .
- alignment matrix 500 may include one or more alignment sub-matrices 508 a - 508 g (referred to collectively as sub-matrices 508 ).
- Sub-matrices 508 a - 508 g may be defined by the set of points (e.g., script words and transcript words) that are located between adjacent, respective, hard alignment points 502 .
- matrix 500 includes seven sub-matrices 508 a - 508 g .
- An exemplary sub-matrix 508 e is also depicted in detail in FIG. 5B .
- method 400 includes pre-processing a sub-matrix, as depicted at block 414 .
- Pre-processing of the sub-matrix may be provided at matrix aligner 208 .
- pre-processing the sub-matrix may include identifying the range of a particular sub-matrix (e.g., the range/sequence of associated script words and transcript words associated with the axis of the particular sub-matrix).
- script and transcript words that fall between two words contained in adjacent hard alignment points 502 may be identified as a matrix sub-set of script words (SCR word sub-set) 510 (represented by outlined triangles) and a corresponding matrix sub-set of transcript words (STT word sub-set) 512 (represented by solid triangles), as depicted in FIG. 5B with respect to sub-matrix 508 e .
- a timecode and position offset data structure used for bookkeeping is initialized prior to words of SCR word sub-set 510 being aligned to words of STT word sub-set 512 of sub-matrix 508 e .
- all special symbols and punctuation are removed from SCR word sub-set 510 . This may provide for a more accurate alignment, as symbols and punctuation are typically not present in transcript 114 and, thus, are not present in STT word sub-set 512 .
- sub-matrices 508 of the initial alignment matrix 500 are sequentially processed (e.g., in order of their location along the diagonal of the alignment matrix 500 ) to find the best time alignment for words between each pair of hard reference points 502 that define each respective sub-matrix 508 a - 508 g .
- where system 100 includes a single-core system used to process the sub-matrices, alignment of the sub-matrices 508 may be processed sequentially (e.g., in series, one after the other).
- where system 100 includes a multi-core system used to process the sub-matrices, alignment of some or all of sub-matrices 508 may be processed in parallel (e.g., simultaneously). Such parallel processing may be possible because the processing of each sub-matrix is independent of all of the other sub-matrices due to the bounding of the matrices with hard alignment points that are assumed to be accurate and that include known timecode information.
- method 400 includes aligning the sub-matrix, as depicted at block 416 .
- Aligning the sub-matrix may be provided at matrix aligner 208 .
- a sub-matrix may be aligned using an algorithm.
- An algorithm may employ a dynamic programming technique to assess multiple potential alignments for a sub-matrix, to determine the best fit alignment of the potential alignments, and employ the best fit alignment for the given sub-matrix.
- an algorithm may identify several possible solutions within the sub-matrix, and may select the solution having the lowest indication of possible error.
- the algorithm may include a Levenshtein Word Edit Distance algorithm.
- a dynamic programming algorithm for computing the Levenshtein distance may require the use of an (n+1) × (m+1) matrix, where n and m are the lengths of the two respective word sets (e.g., the SCR word set and the STT word set).
- the algorithm may be based on the Wagner-Fischer algorithm for edit distance.
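- The word-level edit-distance computation can be sketched as follows. This is a plain, unweighted Wagner-Fischer implementation with a backtrace that collects exactly matching word pairs (candidate soft alignment points); the uniform cost of 1 per insert/delete/substitute is an assumption made for the example and does not reflect the weighting penalties described below:

      def word_alignment(scr_words, stt_words):
          """Word edit distance over an (n+1) x (m+1) matrix, plus a backtrace
          that returns the (script_index, transcript_index) pairs of words that
          match exactly along the chosen alignment path."""
          n, m = len(scr_words), len(stt_words)
          d = [[0] * (m + 1) for _ in range(n + 1)]
          for i in range(n + 1):
              d[i][0] = i
          for j in range(m + 1):
              d[0][j] = j
          for i in range(1, n + 1):
              for j in range(1, m + 1):
                  cost = 0 if scr_words[i - 1].lower() == stt_words[j - 1].lower() else 1
                  d[i][j] = min(d[i - 1][j] + 1,         # delete a script word
                                d[i][j - 1] + 1,         # insert a transcript word
                                d[i - 1][j - 1] + cost)  # match / substitute
          matches, i, j = [], n, m
          while i > 0 and j > 0:
              cost = 0 if scr_words[i - 1].lower() == stt_words[j - 1].lower() else 1
              if d[i][j] == d[i - 1][j - 1] + cost:
                  if cost == 0:
                      matches.append((i - 1, j - 1))
                  i, j = i - 1, j - 1
              elif d[i][j] == d[i - 1][j] + 1:
                  i -= 1
              else:
                  j -= 1
          return d[n][m], list(reversed(matches))

      dist, soft_points = word_alignment(
          "I need to know your answer now".split(),
          "eye do know your answer now".split())
      print(dist, soft_points)   # 3 [(3, 2), (4, 3), (5, 4), (6, 5)]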
- an alignment path defines a potential sequence of words that may be used between hard alignment points.
- aligning the sub-matrix may include breaking alignment paths within each sub-matrix into discrete sections during processing to more accurately assess individual portions of the alignment path. Based on match probabilities/strengths of various portions of the alignment path, a single alignment path may be broken into separate discrete intervals that are assessed individually. For example, where an alignment path within a sub-matrix includes a first portion having a relatively high match probability and an adjacent second portion having a relatively low match probability, the first and second portions can be separated.
- the first portion may be identified as a sequence of words having a high probability of a match, and the second portion may be identified as a sequence of words having a low probability of a match. Accordingly, the first portion may be identified as an accurate match that can be relied on in subsequent processing, and the second portion may be identified as an inaccurate match that should not be relied on in subsequent processing.
- Such a technique may be used in place of merely identifying a mediocre match of the entire alignment path that may or may not be reliable for use in subsequent processing.
- aligning the sub-matrix may include weighting various processing operations to reflect operations that may be indicative of inaccuracies.
- aligning the sub-matrix may include assessing weighting penalties for matched words that are subject to an insert, delete, or substitute operation. Such a technique may help to adapt to false-positive word identifications produced by an STT engine.
- the algorithm may be modified in an attempt to improve alignment. For example, in some embodiments, timecode information recorded with each word of an STT word set is correlated with a matching word of a corresponding SCR word set.
- the matching word may include a single word or a continuous sequence of words, wherein the sequence of words includes less than the number (“N”) of words required by the selected N-gram.
- an algorithm such as a Levenshtein Word Edit Distance algorithm, may be used to identify soft-alignment points.
- the soft designation is used to indicate that because of noise, error artifacts, and the like in STT transcript 114 , these alignments may have a lower probability of being accurate than the multi-word, hard-alignment points that define the range/partition of the respective sub-matrix.
- soft-alignment points may be determined using heuristic and/or phonetic matching.
- aligning the sub-matrix may include heuristic filtering.
- Heuristic filtering of noise may include filtering (e.g., ignoring or removing) “stop words” (e.g., short articles such as “a”, “the”, etc.) that are typically inserted into an STT transcript when the STT engine is confused or otherwise unable to decipher the audio track.
- STT engines often insert articles such as “a”, “the”, etc. while various events other than dialogue occur, such as the presence of noise, music or sound effects.
- Such articles may also be inserted when dialogue is present but cannot be deciphered by the STT engine, such as when noise, music or sound effects drown out dialogue or narration.
- the STT transcript may include a sequence of “the the the the the the the the . . . ” indicative of a duration when music or other such events occur in the audio content.
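- A minimal sketch of one such heuristic, assuming a small illustrative stop-word list and an assumed minimum run length (neither of which is specified by the disclosure), simply flags long runs of a repeated stop word so that they can be ignored during alignment:

      from itertools import groupby

      STOP_WORDS = {"a", "an", "the", "uh", "um"}   # illustrative list only

      def flag_stop_word_runs(transcript_words, min_run=4):
          """Return indices of transcript words belonging to a long run of a
          repeated stop word (e.g. 'the the the the ...')."""
          flagged, index = [], 0
          for word, run in groupby(w.lower() for w in transcript_words):
              run = list(run)
              if word in STOP_WORDS and len(run) >= min_run:
                  flagged.extend(range(index, index + len(run)))
              index += len(run)
          return flagged

      print(flag_stop_word_runs("hello the the the the the mike".split()))
      # [1, 2, 3, 4, 5] -> the run of 'the' inserted over music can be ignored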
- heuristics may be used to identify portions of the transcript words that should be ignored, e.g., transcript words that should not be considered in the alignment process and/or should not be included in the resulting time-aligned script data.
- heuristics may be used to identify repetitive sequences of words, and to determine which of the repeated sequences of words, if any, need to be included or ignored in the resulting script document. For example, where a clip includes repetitive dialogue, such as where an actor repeats their lines several times in an attempt to get the line correct, transcript 114 may include several repetitions (e.g., "i'll be back i'll be back i'll be back"). A corresponding portion of script 110 may include a single recitation of the line (e.g., "I'll be back.").
- heuristics may be implemented to identify the repeated phrases, to identify one of the phrases of the transcript for use in aligning with script words, and to align the corresponding script words to the selected phrase of transcript 114 .
- the timecodes for words of one of the three phrases in transcript 114 may be associated with the corresponding script words of the phrase “I'll be back”.
- the other repeated phrases are ignored/deleted.
- ignored/deleted transcript words may not be considered in the alignment process, and/or may not be included in the resulting time-aligned script data. Ignoring/deleting the phrases may help to ensure that they do not create errors in aligning other portions of script 110 .
- alignment may attempt to match the other two repeated phrases (e.g., those not selected) with phrases preceding or following the corresponding phrase of script 110 .
- they can also be aligned as "alternate takes". For example, the system may not know which take will eventually be used in a finished edit, so regardless of which take is used, the correct script text and timing information may flow through to that portion of the recorded clip in use.
- a single portion of script text may be aligned to each of the repeated portions of the transcript text.
- aligning the sub-matrix may include matching based at least partially on phonetic characteristics of words. For example, a word/phrase of the SCR word set may be considered a match to a word/phrase of the STT word set when the two words/phrases sound similar.
- a special phonetic word comparator may be used to assess word/phrase matches.
- a phonetic comparator may include “fuzzy” encodings that provide for matching script words/phrases that may sound similar to a word identified in the STT transcript. Thus, a word/phrase may be considered a match if they fall within a specific phonetic match threshold.
- a script word may be considered a match to a transcript word if the transcript word is identified as being a phonetic equivalent to the word in script 110 , or vice versa.
- the terms “their” and “there” may be identified as phonetic matches although the terms do not exactly match one another.
- Such a technique may account for variations in spoken language (e.g., dialects) that may not be readily identified by an STT engine.
- Use of phonetic matching may be used in place of or in combination with an exact word/phrase match for each word/phrase.
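- The patent does not prescribe a particular phonetic encoding; as a hedged illustration only, a small Soundex-style comparator could be used to decide whether a script word and a transcript word "sound similar":

      def soundex(word):
          """Simplified Soundex-style code, for illustration only."""
          codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                   **dict.fromkeys("dt", "3"), "l": "4",
                   **dict.fromkeys("mn", "5"), "r": "6"}
          word = "".join(c for c in word.lower() if c.isalpha())
          if not word:
              return ""
          encoded, last = word[0].upper(), codes.get(word[0], "")
          for ch in word[1:]:
              code = codes.get(ch, "")
              if code and code != last:
                  encoded += code
              last = code
          return (encoded + "000")[:4]

      def phonetic_match(script_word, transcript_word):
          """Treat two words as a match when their phonetic codes agree."""
          return soundex(script_word) == soundex(transcript_word)

      print(phonetic_match("their", "there"))   # True
      print(phonetic_match("their", "answer"))  # False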
- method 400 includes generating and/or interpolating intervals, as depicted at block 418 .
- Generating and/or interpolating intervals may be provided at interval generator/interpolator 210 .
- generating and/or interpolating intervals may include identifying intervals between identified matched words (e.g., words of hard and/or soft reference points) and interpolating the relative positions of un-matched words between the matched words.
- An interpolated timecode for the un-matched words may be based on their interpolated position between the matched words and the known timecodes of the matched words.
- the sub-matrices are combined to form a list including script words and corresponding transcript words for each word associated with a hard or soft alignment point.
- at this point, all possible word alignment correspondences have been identified, leaving only unmatched script dialogue words (e.g., words that are not associated with either hard or soft reference points) and non-dialogue words within the script, such as scene action descriptions and other information.
- the timecode information for the unmatched script words is provided via linear timecode interpolation.
- Linear time code interpolation may include defining an interval that extends between two adjacent reference points, and spacing each of the unmatched words that occur between the two reference points across equal time spacing (e.g., sub-interpolation intervals) within the interval.
- a sub-interpolation interval may be defined as (t2 − t1)/(n + 1), where t1 is a timecode of a first reference point defining a first end of an interpolation interval, t2 is a timecode of a second reference point defining a second end of the interpolation interval, and n is the number of unmatched words.
- for example, where an interpolation interval extends from a reference point at 1 second to a reference point at 2 seconds and contains three unmatched words, the sub-interpolation interval is equal to (2 sec − 1 sec)/(3+1), or 0.25 sec; a first of the unmatched words may be determined to occur at 1.25 seconds, a second of the unmatched words at 1.50 seconds, and a third of the unmatched words at 1.75 seconds.
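- A minimal sketch of the sub-interpolation calculation (the function and parameter names are illustrative only):

      def interpolate_unmatched(t1, t2, n_unmatched):
          """Place n unmatched words at equal spacing across the interval
          bounded by reference timecodes t1 and t2 (in seconds)."""
          step = (t2 - t1) / (n_unmatched + 1)
          return [t1 + step * (k + 1) for k in range(n_unmatched)]

      # Interval from 1 s to 2 s with three unmatched words -> 0.25 s spacing.
      print(interpolate_unmatched(1.0, 2.0, 3))   # [1.25, 1.5, 1.75]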
- FIG. 5B illustrates interpolated points 506 for unmatched script words that are evenly spaced between soft alignment points in accordance with the above described linear interpolation technique. A similar technique may be repeated for each respective interpolation interval between hard/soft alignment points.
- method 400 includes assigning timecodes, as depicted at block 420 .
- Assigning timecodes may be provided at time-coded script generator 212 .
- assigning time codes includes assigning times for each of the script words based on the reference points and interpolated points. For example, in some embodiments, the entire list of soft alignment points is scanned and each successive pair of soft alignment points defines an interpolation interval. Upon defining each interpolation interval, sub-interpolation intervals are determined, and timecode data aligning with the sub-interpolation intervals is assigned to all of the script words of the respective script word set. For example, the unmatched words of the above described interpolation interval may be assigned timecodes of 1.25 seconds, 1.50 seconds, and 1.75 seconds, respectively. Further, techniques for interpolating are discussed in more detail below with respect to FIGS. 8A and 8B .
- a non-linear interpolation technique may be employed to assess and determine timecode information associated with words/phrases within a script document.
- non-linear interpolation or similar matching techniques may be used in place of or in combination with linear interpolation techniques employed to determine timecodes for script words.
- Non-linear interpolation may be useful to account for words that were not spoken at even rate between alignment points. For example, where two alignment points define an interval having matched words on either end and several unmatched words between them, linear interpolation may assign timecode information to the unmatched words assuming an even spacing across the interval as discussed above. The resulting timecodes may be reflective of someone speaking at a constant cadence across the interval. Unfortunately, the resulting timecode information may be inaccurate due to different rates of speech across the interval, pauses within the interval, or the like.
- non-linear interpolation of timecode information may include assessing an expected rate (or cadence) for spoken words and applying that expected rate to assess and determine timecode information for the unmatched words.
- non-linear interpolation may include, for a given script word, determining a rate of speaking for matched script words proximate the script word, and applying the rate of speaking to determine a timecode for the script word.
- FIG. 7A illustrates alignment of a script phrase 700 (e.g., a portion of script data 110 ) with a spoken phrase 701 (e.g., a portion of transcript 114 ) that may be accomplished using non-linear interpolation in accordance with one or more embodiments of the present technique.
- script phrase 700 is illustrated in association with an alignment 702 .
- Phrase 700 includes, “What is your answer to my question? I need to know your answer now!”
- Alignment 702 includes a series of word-match indicators (e.g., word associated with a hard alignment point (H) and words associated with a soft alignment point (S)) and words that are unmatched (U).
- the dots/points between the unmatched representations of “question” and “I” may indicate a pause between speaking of the words (e.g., a pause that would be indicated by timecode information differential between transcript words “position” and “eye” of spoken phrase 701 ).
- the word sequences "What is your answer to" and "know your answer now" include matches, and the words "my", "question", "I", "need" and "to" are unmatched.
- rates of speaking matched words proximate/adjacent (e.g., before or after) unmatched words may be used to assess and determine timecode information for the unmatched words.
- the rate of speaking "What is your answer to" may be used to assess and determine timecode information for the words "my" and "question." That is, if it is determined that "What is your answer to" is spoken at a rate of one word every 0.1 seconds (e.g., via timecode information provided in the transcript and/or prior alignment/matching), the following words "my question" may be assigned timecode information in accordance with the rate of one word every 0.1 seconds.
- timecodes associated with twenty-one minutes and one-tenth of a second (21:00.1) and twenty-one minutes and two-tenths of a second (21:00.2) may be assigned to the words “my” and “question”, respectively, in aligned script data 116 , for example.
- punctuation within the script may also be used to assess and determine timecode information.
- punctuation indicative of the end of a phrase may be used to determine the presence of a pause between words or phrases.
- the presence of the question mark in phrase 700 may indicate that the phrases “What is your answer to my question?” and “I need to know your answer now!” may be separated by a pause and, thus may each be spoken at different rates.
- Such a technique may be employed to assure that non-linear interpolation is applied to the individual phrases within a sub-matrix to account for an expected pause.
- the rate of speaking "know your answer now" may be used to assess and determine timecode information for the words "I", "need" and "to". That is, if it is determined that "know your answer now" was spoken at a rate of one word every 0.2 seconds (e.g., via timecode information provided in transcript 114 ), the preceding words "I need to" may be assigned timecode information in accordance with the rate of one word every 0.2 seconds.
- for example, where the word "know" is determined to have been spoken at exactly twenty-one minutes and ten seconds (21:10.0) within a movie, it may be determined that the word "to" was spoken at twenty-one minutes nine and eight-tenths of a second (21:09.8) and that the word "need" was spoken at twenty-one minutes nine and six-tenths of a second (21:09.6).
- Timecodes associated with twenty-one minutes nine and four-tenths of a second (21:09.4), twenty-one minutes nine and six-tenths of a second (21:09.6), and twenty-one minutes nine and eight-tenths of a second (21:09.8) may be assigned to the words "I", "need", and "to", respectively, in aligned script data 116 , for example. Accordingly, punctuation may be used to identify pauses or similar breakpoints that can be used to break words or phrases into discrete intervals such that respective rates of speaking (e.g., cadence) can be appropriately applied to each of the discrete intervals. Other indicators may be used to indicate characteristics of the spoken words. For example, "stopwords" present in the transcript may be indicative of a pause or break in speaking and may be interpreted as a pause and implemented as discussed above.
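- The rate-based assignment just described can be sketched as follows; the function signature and the representation of the adjacent phrases' speaking rates are assumptions made for the example:

      def rate_based_timecodes(prev_time, prev_rate, words_after,
                               next_time, next_rate, words_before):
          """Assign timecodes to unmatched words from the local speaking rate
          (seconds per word) of the adjacent matched phrases, rather than by
          spacing them evenly across the whole interval."""
          forward = {w: round(prev_time + prev_rate * (k + 1), 2)
                     for k, w in enumerate(words_after)}    # follow the preceding phrase
          backward = {w: round(next_time - next_rate * (len(words_before) - k), 2)
                      for k, w in enumerate(words_before)}  # lead into the following phrase
          return forward, backward

      # "to" matched at 21:00.0 (1260 s), spoken at ~0.1 s/word; "know" matched at
      # 21:10.0 (1270 s), spoken at ~0.2 s/word.
      fwd, bwd = rate_based_timecodes(1260.0, 0.1, ["my", "question"],
                                      1270.0, 0.2, ["I", "need", "to"])
      print(fwd)   # {'my': 1260.1, 'question': 1260.2}        -> 21:00.1, 21:00.2
      print(bwd)   # {'I': 1269.4, 'need': 1269.6, 'to': 1269.8} -> 21:09.4, 21:09.6, 21:09.8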
- the unmatched words may be assigned timecode information based on even spacing between the matched words, and thus, may not account for the pause or similar variations.
- the first of the words “to” is determined to have been spoken at exactly twenty-one minutes (21:00.0) and the word “know” is determined to have been spoken at exactly twenty-one minutes and ten seconds (21:10.0)
- the five unmatched words “my”, “question”, “I”, “need” and “to” would be evenly spaced across the ten second interval at 1.67 second intervals, not accounting for the pause. Although minor in these small increments, this could lead to increased alignment errors where a pause in dialogue occurs for several minutes, for example.
- a rate of speech may be based on machine learning. For example, a rate of speech may be based on other words spoken proximate to the words in question. In some embodiments, a rate of speech may be determined based on elements of the script. For example, a long description of an action item may be indicative of a long pause in the actual dialogue spoken.
- words of the script that occur proximate/between reference points may be aligned with unmatched words of the transcript that also occur proximate/between the same reference points.
- the four unmatched words "my", "question", "I" and "need" of script phrase 700 fall within the interval between matched words "to" and "know".
- the timecodes associated with the unmatched words of transcript phrase 701 may be assigned to the four unmatched words "my", "question", "I" and "need" of script phrase 700 , respectively. That is, the timecode of the first unmatched transcript word in the interval may be assigned to the first unmatched script word in the interval, the timecode of the second unmatched transcript word in the interval may be assigned to the second unmatched script word in the interval, and so forth.
- punctuation and/or capitalization from script text may be used to improve alignment. For example, if the first alignment point (hard or soft) occurs in the middle of the first sentence of the clip, it may be determined that the script words and transcript words preceding the alignment point in the script text and the corresponding transcript text should align with one another.
- the timecodes for the script words may be interpolated (e.g., linearly or non-linearly) across the time interval that extends from the beginning of speaking of the corresponding transcript words in the scene to the corresponding alignment point.
- the corresponding script words and transcript words may have a one-to one correspondence, and, thus, timecode information may be directly correlated.
- the first script word of the sentence may be associated with the timecode information of the first transcript word of the clip
- the second script word of the sentence may be associated with the timecode information of the second transcript word of the clip
- the beginning of a sentence may be identified by a capitalized word and the end of a sentence may be identified by a period, exclamation point, question mark, or the like.
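- A small sketch of using terminal punctuation to locate sentence boundaries in the script text (a capitalization check could be added in the same way; the regular expression and word-index representation are assumptions made for this example):

      import re

      def sentence_spans(script_words):
          """Return (start, end) word-index spans of sentences, using terminal
          punctuation (. ! ?) as the boundary cue described above."""
          spans, start = [], 0
          for i, word in enumerate(script_words):
              if re.search(r"[.!?]$", word):
                  spans.append((start, i))
                  start = i + 1
          if start < len(script_words):
              spans.append((start, len(script_words) - 1))
          return spans

      words = "What is your answer to my question? I need to know your answer now!".split()
      print(sentence_spans(words))   # [(0, 6), (7, 13)]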
- FIG. 7B is a depiction of multiple lines of text that include a script phrase, a transcript phrase and a corresponding representation of alignment in accordance with one or more embodiments of the present technique. More specifically, FIG. 7B illustrates alignment of a script text 703 (e.g., a portion of script 110 ) with a spoken dialog 704 (e.g., a portion of transcript 114 ) that may be accomplished with the aid of capitalization and punctuation in accordance with one or more embodiments of the present technique.
- Script text 703 includes a portion of a script that is spoken throughout a clip/scene.
- script text 703 includes the first sentence of the clip/scene (e.g., "It is good to see you again") and the last sentence of the clip/scene (e.g., "I will talk to you about it later tonight").
- Spoken dialog 704 may include transcript text of a corresponding clip (e.g., “get it could to see you again” and “i will talk with you house get gator flight”).
- script text 703 and transcript text 704 are illustrated in association with an alignment 705 .
- Alignment 705 includes a series of word-match indicators (e.g., word associated with a hard alignment point (H) and words associated with a soft alignment point (S)) and words that are unmatched (U).
- timecode for the script words at the beginning of the scene/clip that precede the first alignment point may be interpolated across the time interval that extends from the beginning of speaking of the corresponding transcript words in the scene/clip to the corresponding alignment point (e.g., interpolated between the timecode of the transcript words “get” and “to” in the transcript phrase 704 ).
- the number of corresponding unmatched script words and transcript words has a one-to-one correspondence, and, thus, timecode information may be directly correlated.
- the first three script words (“It”, “is” and “good”) may each be assigned timecodes of the first three transcript words (“get”, “it” and “could”), respectively.
- the location of the alignment points in the middle of the last sentence may enable the unmatched words “about”, “it”, “later”, and “tonight” that are located between the last alignment point of the scene/clip and the period indicative of the end of the scene/clip, to be interpolated across the interval between the transcript words “you” and “flight” and/or to each be assigned timecode information corresponding to transcript words “house”, “get”, “gator”, and “flight”, respectively.
- script elements may be used to identify the beginning or end of a sentence. For example, if between two lines of dialog, there is a parenthetical script element that corresponds to a sound effect, such as a car crash, the presence of the sound effect, indicated by a pause or stop words, may be used to identify the beginning or end of adjacent lines of dialog.
- the techniques described with regard to alignment points in the middle of a sentence at the beginning or end of a scene/clip may be employed.
- the timecodes for the unmatched words that occur between the alignment point and the identifiable script element may be interpolated across the corresponding interval or otherwise be determined. That is, the intermediate script element may be used in the same manner as capitalization and/or punctuation is used as described above.
- the density of the words in the transcript may be used to assess and determine timecode information associated with the words in the script. For example, in the illustrated embodiment of FIG. 7A , there are four unmatched transcript words in the interval of phrase 701 between matched words (e.g., "to" and "know") and there are five unmatched words (e.g., "my", "question", "I", "need" and "to") in the corresponding interval of phrase 700 between matched words (e.g., "to" and "know"). Based on the timecode information for the transcript words in the interval, it may be determined that two of the four unmatched transcript words are spoken at the beginning of the interval and that two of the four unmatched transcript words are spoken at the end of the interval.
- a corresponding percentage of the script words (e.g., approximately equal to the percentage of transcript words) will be provided over the respective portions of the interval. For example, in the embodiment of FIG. 7A , where the word "to" (in the first portion of phrase 700 ) that defines a start of the interval is determined to have been spoken at exactly twenty-one minutes (21:00.0), the word "know" defining an end of the interval is determined to have been spoken at exactly twenty-one minutes and ten seconds (21:10.0), the word "position" is determined to have been spoken at exactly twenty-one minutes and two-tenths of a second (21:00.2), and the word "eye" is determined to have been spoken at exactly twenty-one minutes nine and four-tenths of a second (21:09.4), the two unmatched script words "my" and "question" may be evenly spaced over the first portion of the interval from twenty-one minutes (21:00.0) to twenty-one minutes and two-tenths of a second (21:00.2), and the three unmatched words "I", "need" and "to" may be evenly spaced across the third portion of the interval from twenty-one minutes nine and four-tenths of a second (21:09.4) to twenty-one minutes and ten seconds (21:10.0).
- the distribution of script words within the interval is approximately equivalent to the distribution of transcript words in the corresponding interval. That is, about fifty percent of the script words in the interval are time aligned across the first portion of the interval before the pause and about fifty percent of the script words in the interval are time aligned across the third portion of the interval after the pause.
- a plurality of script words may be accepted for use in the time-aligned script data based on a confidence (e.g., high probability/density of word matches that were previously determined).
- Such a technique may enable blocks of text to be verified/imported from the script data to the time-aligned script data when matches within the blocks are indicative of a high probability that the corresponding script words are accurate. That is, the script data will be the text used in the time-aligned script data for those respective words of the script/dialogue.
- a block of script words may be imported when word matches (e.g., hard alignment points and/or soft alignment points) meet a threshold level.
- verifying/importing blocks of text may include using some individual script words having a match (e.g., associated with hard and/or soft alignment points) with words of the script, while importing/using unmatched transcript words (e.g., that are not associated with a soft and/or hard alignment points).
- verifying/importing script words may include importing text characteristics, such as capitalization, punctuation, and the like. In the embodiment of FIG. 7A , the script text may be used for the entire block of text in the aligned script document, including matched and unmatched words for use in the script-aligned data. For example, the block of corresponding script text "What is your answer to my question? I need to know your answer now!" may be used in the aligned script although all of the words do not have a match.
- the imported script words have incorporated the capitalization and punctuation of the corresponding text of the script document.
- Timecode information may be associated with each of the script and transcript words using any of the techniques described herein to properly time align the unmatched words of the phrase (e.g., to provide timecodes for the words “my question? I need to”).
- the transcript words, including those not matched, may be used in the resulting time-aligned script. Accordingly, if the transcript words of the phrase "What is your answer to by position eye do know your answer now!" have a high confidence level but are not all matched, the phrase may be used in the resulting text of the time-aligned script data. Note that both the matched and unmatched words of the raw STT transcript have been imported.
- Such a technique may facilitate use of transcript words in place of script words where the actor ad-libs or otherwise does not recite the exact wording of the script.
- a user could choose for themselves whether to use the script word(s) or STT transcript word(s), based on an indication, such as confidence level. For example, even if the confidence level suggests one is more accurate than the other, it may not be so, and the user may be provided an opportunity to correct this by switching use of one or the other in the script data. Also, the user can manually edit in a correction, and this correction could be automatically stamped with a 100% confidence label. In some embodiments, the automated changes/imports may be marked such that a user can readily identify them, and modify them as needed.
- confidence/probability information provided during STT operations may be employed to assess whether or not a word or block of words in a transcript meets threshold criteria, such that the transcript words may be used in the time-aligned script data in place of the corresponding script words.
- Such an embodiment may resolve discrepancies by using the transcript word in the aligned script data 116 where there is a high confidence that the transcript word is accurate and the corresponding script word is not (e.g., where an actor ad-libs a line such that the actual words spoken are different from the words in the script).
- an STT engine may provide a high confidence level (e.g., above 90%) for a given transcript word, and, thus, the transcript word is considered to meet the threshold criteria (e.g., 85% or above). That is, the word in the transcript may be more accurate than corresponding script words. As a result, the transcript word is provided in the aligned script data, in place of a corresponding script word.
- a confidence/probability provided by an STT operation may be used in combination with matching criteria. For example, where the STT engine provides a high confidence level (e.g., above 90%) for a transcript word, the transcript word may be provided in the aligned script data, in place of a corresponding script word. Conversely, where the STT engine provides a low confidence level (e.g., below 50%) for a corresponding transcript word, the script word may be provided in the aligned script data, in place of the corresponding transcript word.
- a portion of the script may be longer than a corresponding clip.
- the portion of the script that is actually spoken may be time aligned appropriately, and the unspoken portions of the script may be bunched together between aligned points.
- the bunching of words may result in timecode information being associated with the bunched words that indicates them being spoken at an extremely high rate, when in fact they may not have been spoken at all.
- a threshold is applied to ignore or delete words that appear to have been spoken too quickly such that bunched words may be ignored or deleted.
- a threshold word rate may be set to a value that is indicative of the fastest reasonable rate for a person to speak (e.g., about six words per second).
- the threshold word rate may be set to a default value, may be determined automatically, or may be user selected.
- a speaking rate may be customized based on the character speaking the dialogue. For example, one actor may speak slowly whereas another actor may speak much faster, and thus the slower speaking character's dialogue may be associated with a lower threshold rate, whereas the faster speaking character's dialogue may be associated with a higher threshold rate.
- Automatically determining a threshold word rate may include sampling other spoken portions of a script (e.g., other lines delivered by the same character) to determine a reasonable rate for words that are actually spoken, and the threshold rate may be set at that value or based off of that value.
- a maximum word rate threshold may be set to approximately twenty percent greater than that value (e.g., about six words per second). Such a cushion may account for natural variations in speaking rate that may occur while still identifying unlikely variations in speaking rate. In some embodiments, words having spacing that do not fall within the maximum word rate threshold are ignored or deleted, such that they are not aligned. For example, a script may read:
- the corresponding video content may only include an actor reciting Henry's lines, one after the other.
- the lines delivered for Henry may be provided accurate timecode information associated with the time periods in which the two lines are spoken; however, the line associated with Indy, which is not spoken, may be bunched into the pause between delivery of Henry's first and second lines. For example, if Henry's lines were delivered one after the other, with a half-second pause in between, the phrase "I like Indiana more than the name Henry Jones, Junior" may not be matched (because it was not actually spoken) and, thus, may be interpolated (e.g., linearly) over the half-second time frame between the lines in the script.
- Corresponding timecode information may indicate that “I like Indiana more than the name Henry Jones, Junior” was spoken at a rate of one word about every five one-hundredths of a second, or about twenty words per second. Where the maximum word threshold is set to about six words per second, the determined rate of about twenty words per second would exceed the maximum word threshold. Thus the phrase “I like Indiana more than the name Henry Jones, Junior” may be ignored/deleted, such that alignment may be provided for only the lines actually spoken (e.g., Henry's lines). The phrase “I like Indiana more than the name Henry Jones, Junior” may not be provided in the time-aligned data 116 .
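- A hedged sketch of the maximum-word-rate test (the list-of-timecodes input and the default of six words per second follow the discussion above but are otherwise assumptions):

      def exceeds_max_rate(word_times, max_words_per_sec=6.0):
          """Return True when an interpolated phrase appears to have been
          'spoken' faster than a plausible maximum rate, in which case it may
          be ignored/deleted rather than aligned."""
          if len(word_times) < 2:
              return False
          duration = max(word_times) - min(word_times)
          if duration == 0:
              return True
          return len(word_times) / duration > max_words_per_sec

      # Ten words interpolated across roughly half a second -> about 20 words/second.
      phrase = "I like Indiana more than the name Henry Jones Junior".split()
      times = [10.0 + 0.05 * i for i in range(len(phrase))]
      print(exceeds_max_rate(times))   # True -> the phrase may be ignored/deleted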
- words that were bunched at the beginning or end of dialogue may be identified and removed.
- the following lines at the beginning of the dialogue were linearly interpolated:
- ignoring/deleting words that appear to exceed a maximum threshold rate may also help to eliminate “stopwords” generated by an STT engine from being considered for alignment. For example, where an STT engine inserts a plurality of “the, the, the, . . . ” in place of music or sound effects, the high frequency of the words “the” may be identified and they may be ignored/deleted such that they are not aligned to words in the script.
- the stopwords may be flagged (e.g., not recognized) so that a user can take further action if desired.
- a clip may include audio content having extraneous spoken words that are not intended to be aligned with corresponding script words.
- extraneous words and phrases may include an operator calling out "Speed!" shortly before starting the camera rolling while audio is already being recorded, the director calling out "Action!" shortly before the characters begin to speak lines of dialogue, the director calling out "Cut!" at the end of a take, or conversations inadvertently recorded shortly before, after, or even in the middle of a take. These cues typically occur at the beginning and end of shots, and, thus, processing may be able to recognize these words based on their location and/or their audio waveforms that are recognized and provided in a corresponding STT transcript.
- synchronization module 102 may align the extraneous words of the transcript to script words, resulting in numerous errors.
- User defined words such as “Speed”, “Action” and “Cut” may be defined and can be recognized by their audio waveforms and provided in a corresponding STT transcript. The user defined words may be automatically flagged for the user or deleted.
- only a defined range of recorded dialogue is aligned to script text. Such a technique may be useful to ignore or eliminate extraneous recorded audio from the alignment analysis. For example, defining a range of recorded dialog may enable the analysis to ignore extraneous conversations or spoken words that are incidentally recorded just before or after a take for a given scene.
- an in/out range defines the portion of the audio that is aligned to a corresponding portion of the script.
- Defining an in/out range may define discrete portions of the script (e.g., script word) and/or audio content (e.g., transcript words) to analyze while also defining discrete portions of the audio content data to ignore during the alignment of transcript words with corresponding script words, thereby preventing extraneous words (e.g., transcript words) from inadvertently being aligned with script words.
- FIG. 7C is a depiction of a line of text and corresponding in/out ranges in accordance with one or more embodiments of the present technique. More specifically, FIG. 7C illustrates an exemplary in-range 710 and out-ranges 711 .
- the in-range 710 and out-ranges 711 limit analysis to only audio content of in-range 710 , referred to herein as audio content of interest 712 , and exclude audio content not located within in-range 710 (e.g., content located in out-ranges 711 ).
- Audio content of interest 712 may include the dialogue or narration spoken during the respective clip that falls within one or more specified in/out-ranges.
- Extraneous audio content 714 may include words captured on the audio that are not intended to be aligned with a corresponding portion of script document, and, thus, fall outside of the one or more specified in/out-ranges.
- in the illustrated embodiment, audio content of interest 712 includes the transcribed phrase "hello mike . . . ", and extraneous audio content 714 includes the phrases/words "are we ready speed action" spoken at the head of the clip, just before audio content of interest 712 , and "cut how did that look" spoken at the tail of the clip, just after audio content of interest 712 .
- in-range 710 is defined by an in-marker 710 a and an out-marker 710 b .
- In-marker 710 a defines a beginning of audio content of interest 712
- out-marker 710 b defines an end of audio content of interest 712 .
- extraneous content 714 at the head and tail of the clip is ignored during analysis, as indicated by the grayed out bar in FIG. 7C .
- embodiments may include multiple discrete ranges defined within a single clip.
- two additional in/out markers may be added within in-range 710 , thereby dividing it into two discrete in-ranges and providing an additional out-range embedded therein.
- the use of in/out-ranges may be employed to resolve issues normally associated with multiple takes of a given scene or clip.
- an out-range may be located at any portion of the clip.
- the in/out-ranges may be swapped, thereby ignoring extraneous audio data in the middle of the clip, while analyzing audio content of interest at the head and tail of the clip.
- markers 710 a and 710 b may be user defined. For example, a user may be presented with a display similar to that of FIG. 7C and may use a slider-type control to move markers 710 a and 710 b , thereby windowing in/out-ranges 710 and 711 . Thus, a user may view some or all of the text and may cut-out the extraneous audio content 714 using in/out-ranges.
- markers 710 a and 710 b may be defined as an offset of a given duration of time or number of words. For example, an offset of ten-seconds may exclude ten seconds of audio data at the head or tail of the clip.
- Such a technique may be of particular use where there is a consistent delay at the beginning or end of filming a clip.
- An offset of five words may exclude the first and/or last five words of spoken dialog at the head or tail of the clip.
- Such a technique may be of particular use where there is a consistent phrase or series of words spoken at the beginning or end of filming a clip.
- the offsets may be predetermined and/or user selectable. For example, a default offset value may be employed, but may be editable by a user (e.g. via a sliding window as described above).
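- The in/out-range filtering can be sketched as follows; the parameter names and the combination of time markers with head/tail word offsets are illustrative assumptions, not a disclosed API:

      def apply_in_out_range(timed_words, in_time=None, out_time=None,
                             head_words=0, tail_words=0):
          """Keep only transcript words inside the range of interest: words
          before `in_time` or after `out_time` (seconds), and the first
          `head_words` / last `tail_words` words, are excluded from alignment."""
          end = len(timed_words) - tail_words
          words = timed_words[head_words:end]
          return [(w, t) for w, t in words
                  if (in_time is None or t >= in_time)
                  and (out_time is None or t <= out_time)]

      clip = [("are", 0.5), ("we", 0.8), ("ready", 1.0), ("speed", 2.0), ("action", 3.0),
              ("hello", 5.0), ("mike", 5.4),
              ("cut", 20.0), ("how", 20.5), ("did", 20.8), ("that", 21.0), ("look", 21.2)]
      print(apply_in_out_range(clip, in_time=4.0, out_time=19.0))
      # [('hello', 5.0), ('mike', 5.4)] -> extraneous head/tail words are excluded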
- portions of the audio content may include extraneous audio other than spoken words, such as music or sound effects. If analyzed, the extraneous audio may create an additional processing burden on the system. For example, synchronization module 102 may dedicate processing in an attempt to match/align extraneous transcript words (e.g., stop words) to script words. In some embodiments, the extraneous audio content may be identified and ignored during alignment. Such a technique may enable processing to focus on dialogue portions of audio content, while skipping over segments of extraneous audio. In some embodiments, the audio content may be processed to classify segments of the audio content into one of a plurality of discrete audio content types.
- segments of the audio content identified as including dialogue may be classified as dialogue type audio
- segments of the audio content identified as including music may be classified as music type audio
- segments of the audio content identified as including sound effects may be classified as sound effect type audio.
- segments of transcript words that include a series of different words occurring one after another (e.g., "how are you doing") and/or that are not indicative of stop words may be classified as dialogue type audio; segments of transcript words that include a series of stop words of a long duration (e.g., "the the the the . . .") may be classified as music type audio; segments of transcript words that include a series of stop words of a short duration (e.g., "the the the the") may be classified as sound effect type audio; and segments of the audio content that cannot be identified as one of dialogue, music, or sound effect type audio may be categorized as unclassified type audio.
- each of the segments may or may not be subject to alignment or related processing based on their classification. For example, during alignment of transcript words to script words, the segments associated with dialogue type audio may be processed, whereas the segments associated with music and sound effect type audio may be ignored. By ignoring music and sound effect type segments, processing resources may be focused on the dialogue segments, and, thus, are not wasted attempting to align the transcript words associated with the music and sound effect to script words.
- unclassified type audio may be considered for alignment or may be ignored.
- what classifications are processed and what classifications are ignored may include a default setting and/or may be user selectable.
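- As one hedged illustration of such a classification, the sketch below splits a transcript into segments and tags stop-word runs as music or sound-effect type audio; the stop-word list and the run-length cut-offs are assumed defaults, not values from the disclosure:

      STOP_WORDS = {"a", "an", "the"}   # illustrative list only

      def classify_segments(transcript_words, music_run=10, sfx_run=4):
          """Tag each segment as 'dialogue', 'music' (long stop-word run), or
          'sound_effect' (short stop-word run)."""
          segments, i = [], 0
          while i < len(transcript_words):
              is_stop = transcript_words[i].lower() in STOP_WORDS
              j = i
              while j < len(transcript_words) and (transcript_words[j].lower() in STOP_WORDS) == is_stop:
                  j += 1
              run = transcript_words[i:j]
              if is_stop and len(run) >= music_run:
                  label = "music"
              elif is_stop and len(run) >= sfx_run:
                  label = "sound_effect"
              else:
                  label = "dialogue"
              segments.append((label, run))
              i = j
          return segments

      words = ("how are you doing " + "the " * 10 + "fine thanks").split()
      for label, run in classify_segments(words):
          print(label, len(run))
      # dialogue 4 / music 10 / dialogue 2 -> only the dialogue segments are aligned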
- a weighting value is assigned to each word based on the alignment type (e.g., interpolation, hard alignment, or soft alignment). Stronger alignments (e.g., hard and soft alignments) may have higher weighting than weaker alignments (e.g., interpolation).
- a total weighting is assessed for a window/interval that includes several consecutive words. The interval of several words is a sliding window that is moved to assess adjacent intervals/windows of words.
- where the total weighting (e.g., sum of weightings) for a given window meets a threshold, timecodes may be assigned to one or more of the words, thereby not ignoring/deleting the words in the window.
- Such a technique may be provided at the beginning and end of a set of dialogue to assess and determine the start and stop of the actual spoken dialogue and to ignore/delete the script dialogue that preceded/followed the spoken dialogue in the script but was not actually spoken (e.g., the script text that was linearly interpolated and was bunched before or after the dialogue actually spoken).
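- A minimal sketch of the sliding-window weighting, assuming example weights, window size, and threshold (none of which are specified by the disclosure):

      WEIGHTS = {"hard": 3, "soft": 2, "interpolated": 0}   # assumed example weights

      def spoken_windows(alignment_types, window=5, min_total=4):
          """For each window position, report whether the total alignment weight
          suggests the words were actually spoken; windows containing only
          interpolated words score 0 and can be ignored/deleted."""
          flags = []
          for i in range(len(alignment_types) - window + 1):
              total = sum(WEIGHTS[t] for t in alignment_types[i:i + window])
              flags.append(total >= min_total)
          return flags

      types = ["interpolated"] * 5 + ["hard", "hard", "soft", "interpolated", "hard"]
      print(spoken_windows(types))
      # [False, False, True, True, True, True] -> the bunched head of the dialogue is dropped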
- processing may be implemented to time-align script elements other than dialogue (e.g., scene headings, action description words, etc.) directly to the video scene or full video content.
- where a script element other than dialogue (e.g., a scene heading) occurs in the script between two dialogue words having assigned timecodes, the timecodes of the words may be used to determine a timecode of the intervening script element.
- for example, where the two words have timecodes of 21:00.00 and 21:10.00, a script element occurring in the script between the two words may be assigned a timecode between 21:00.00 and 21:10.00, such as 21:05.00.
- one or more script elements may have their timecodes determined via linear and/or non-linear interpolation, similar to that described above.
- the amount of content (e.g., the number of lines or number of words) of script elements may be used to assess a timecode for a given script element or plurality of script elements.
- for example, where a first script element between two words having timecodes includes half the amount of content of a second script element also located between the two words,
- the first script element may be assigned a timecode of 21:03.00 and the second script element may be assigned a time code of 21:05.00, thereby reflecting the smaller content and potentially shorter duration of the first element relative to the second element.
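- A minimal sketch of one possible content-weighted interpolation scheme is shown below; the function name and the use of word count as the content measure are assumptions, and this is only one of many schemes that could assign intermediate timecodes.

```python
# One possible content-weighted interpolation scheme (an illustrative
# assumption, not the only scheme contemplated above): each intervening
# element is placed at the midpoint of its proportional share of the gap
# between the two time-aligned words, measured here by word count.
def interpolate_element_times(t_start, t_end, elements):
    """t_start, t_end: timecodes (in seconds) of the two aligned words.
    elements: list of script-element strings occurring between them,
    in script order. Returns one timecode per element."""
    counts = [len(element.split()) for element in elements]
    total = sum(counts) or 1
    times, elapsed = [], 0.0
    for count in counts:
        midpoint = elapsed + count / 2.0
        times.append(t_start + (t_end - t_start) * midpoint / total)
        elapsed += count
    return times

# e.g., for aligned words at 21:00.00 (1260 s) and 21:10.00 (1270 s):
# interpolate_element_times(1260.0, 1270.0,
#     ["A short action.", "A much longer action element with more words."])
```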
- some or all of the script elements may be provided in the time-aligned script data in association with a timecode.
- timecodes are first assigned to the dialogue words during initial alignment, and timecodes are assigned to the other script elements in a subsequent alignment process based on the timecodes of the dialogue determined in the initial alignment (e.g., via interpolation).
- the resulting time aligned data 116 may include timecodes for some or all of the script elements of script 104 .
- method 400 includes generating a time-aligned script output, as depicted at block 422 , as discussed above.
- Generating time-aligned script output may be provided via time-coded script generator 212 .
- each word or element of the script and/or transcript may be associated with a corresponding timecode.
- the complete list of script word and/or transcript words that are associated with hard, soft and interpolated timecodes may be used to generate time-aligned data 116 , including a final TimeCodedScript (TCS) data file which contains some or all of the script elements with assigned time codes.
- the TCS data file may be provided to another application, such as the Adobe Script Align and Replace feature of Adobe Premiere Pro, for additional processing.
- time-aligned data 116 may be stored in a database for use by other applications, such as the Script Align feature of Adobe Premiere Pro.
- a graphical user interface may provide a graphical display that indicates where matches (e.g., hard and/or soft alignment points) or non-matches occur within a user interface.
- the user interface may include symbols or color coding to enable a user to readily identify various characteristics of the alignment. For example, hard alignments may be provided in red (or green) to indicate a good/high confidence, soft alignments in blue (or yellow) to indicate a lower confidence, and interpolated points in yellow (or red) to indicate an even lower confidence level.
- the user interface may enable a user to quickly scan the results to assess and determine where inaccuracies are most likely to have occurred.
- a user may commit review and proofing resources to portions of a time-aligned script that may be susceptible to errors (e.g., where no or few matches occur) and withhold such resources from portions that are unlikely to contain errors (e.g., where a large number of matches occur).
- a user may be presented with a chart, such as that illustrated in FIG. 5A .
- the chart may enable a user to readily identify portions of the script that do not include a high percentage of matches (e.g., the sub-matrix 508 located at the uppermost left portion of the chart).
- high confidence areas may include a similar visual indicator (e.g., grayed out) and portions that may require attention may have appropriate visual indicators (e.g., bright colors—not grayed out).
- a user may be provided the option to select whether or not to use the text from the raw STT analysis or the text from the written script. For example, a user may be provided a selection in association with the sub-matrix 508 located at the uppermost left portion of the chart that enables all, some, or individual words contained in the sub-matrix to use the text from the raw STT analysis or the text from the written script.
- the information may be returned to synchronization module 102 and processed in accordance with the user input. For example, where a user opts to use STT text in place of script text, synchronization module 102 may conduct additional processing to provide the corresponding time-aligned script data. In some embodiments, the user may be prompted for input while synchronization module 102 is performing the time alignment. For example, as the synchronization module 102 encounters a decision point, it may prompt the user for input.
- speech-to-text analysis may provide the option of creating a custom dictionary (e.g., custom language model).
- a custom dictionary may be generated for a given clip based on one or more reference scripts that have content that is the same or similar to the given script, or based on a single reference script that at least partially corresponds to the video content or exactly matches the audio portions of the video content.
- some or all words of the reference script may be used to define a custom dictionary
- a raw speech analysis may be performed to generate a transcript using words of the custom dictionary to transcribe words of the audio content, transcript words may then be matched against the script words of the reference script to find alignment points, and the words of the reference script text may be paired with the corresponding timecodes, thereby providing a time-aligned/coded version of the reference script.
- a custom language model is generated for one or more portions of video content. For example, where a movie or scene includes a plurality of clips, a custom language model may be provided for each clip to improve speech recognition accuracy.
- a custom language model is provided to a STT engine such that the STT engine may be provided with terms that are likely to be used in the clip that is being analyzed by the STT engine. For example, during STT transcription, the STT engine may at least partially rely on terms or speech patterns defined in the custom language model.
- a custom language model may be directed toward a certain sub-set of language. For example, the custom language model may specify a language (e.g., English, German, Spanish, French, etc.).
- the custom language model may specify a certain language segment.
- the custom language model may be directed to a certain profession or industry (e.g., a custom language model including common medical terms and phrases may be used for clips from a medical television series).
- the STT engine may weight words/phrases found in the associated custom language model over the standard language model. For example, if the STT engine associates a word with both a word that is present in the associated custom language model and a word that is present in a standard/default language model, the STT engine may select the word from the custom language model as opposed to the word present in the standard/default language model.
- a word identified in a transcript that is found in the selected custom language model may be assigned a higher confidence level than a similar word that is only found in the standard/default language model.
- a custom language model is generated from script text.
- script data 110 may include embedded script text (e.g., words and phrases) that can be extracted and used to define a custom language model.
- embedded metadata may be provided using various techniques, such as those described in U.S. patent application Ser. No. 12/168,522 entitled "SYSTEMS AND METHODS FOR ASSOCIATING METADATA WITH MEDIA USING METADATA PLACEHOLDERS", filed Jul. 7, 2008, which is hereby incorporated by reference as though fully set forth herein.
- a custom language model may include a word frequency table (e.g., how often each of the words in the custom language model is used within a given portion of the script) and a word tri-graph (e.g., indicative of other words that precede and follow a given word in a given portion of the script).
- all or some of the text identified in the script may be used to populate the custom language model.
- Such a technique may be particularly accurate because the script and resulting language model should include all or at least a majority of the words that are expected to be spoken in the clip.
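- As a rough sketch, a word frequency table and a simple word tri-graph could be derived from script text as shown below; the returned data layout is an assumption and is not the format required by any particular STT engine.

```python
# Rough sketch of deriving a word frequency table and a simple word
# "tri-graph" (preceding/following word context) from script text.
# The returned data layout is an assumption, not an STT engine format.
from collections import Counter, defaultdict
import re

def build_custom_language_model(script_text):
    words = re.findall(r"[a-z']+", script_text.lower())
    frequency = Counter(words)                       # word frequency table
    trigraph = defaultdict(Counter)                  # word -> (previous, next) contexts
    for prev, word, nxt in zip(words, words[1:], words[2:]):
        trigraph[word][(prev, nxt)] += 1
    return {"frequency": frequency, "trigraph": trigraph}

# clm = build_custom_language_model(open("scene_12_script.txt").read())
```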
- speech-to-text (STT) technology may implement a custom language model as described in U.S.
- metadata included in the script may be used to further improve accuracy of the STT analysis.
- where the script includes a clip identifier, such as a scene number, the scene number may be associated with the clip such that a particular custom language model is used for STT analysis of video content that corresponds to the associated portion of the script.
- for example, where a first portion of the script is associated with scene one and a second portion of the script is associated with scene two, a first custom language model may be extracted from the first portion of the script and a second custom language model may be extracted from the second portion of the script
- when analyzing a clip associated with scene one, the STT engine may automatically use the first custom language model, and when analyzing a clip associated with scene two, the STT engine may automatically use the second custom language model.
- when a clip contains only a few lines of dialogue in a short scene out of a very long script, knowing that the clip contains a specific scene number (e.g., harvested from the script metadata) allows focusing on the text in the script for that scene, rather than having to assess the entire script.
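- A hedged sketch of selecting a scene-specific custom language model from clip metadata follows; the metadata keys are hypothetical, and build_custom_language_model refers to the earlier illustrative sketch.

```python
# Hedged sketch of choosing a per-scene custom language model for a clip
# from a scene number harvested from the clip's metadata. The metadata
# keys are hypothetical; build_custom_language_model is the earlier sketch.
def language_model_for_clip(clip_metadata, script_scenes):
    """clip_metadata: dict such as {"scene": 12, ...}.
    script_scenes: dict mapping scene number -> scene script text."""
    scene_text = script_scenes.get(clip_metadata.get("scene"))
    if scene_text is None:
        return None                                  # fall back to the base language model
    return build_custom_language_model(scene_text)
```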
- FIG. 6 depicts a sequence of dialogs 600 in accordance with one or more embodiments of the present technique.
- a user may select a clip or group of clips, then choose "Analyze Content" from a Clip menu, initiating the sequence of dialogs 600.
- the Analyze Content dialog may allow a user to use embedded Adobe Story Script text if present for the speech analysis, or to add a reference script which will be used to improve speech analysis accuracy.
- the sequence of dialogs 600 includes content analysis dialogs that allow users to import a reference script to create a custom dictionary/language model for speech analysis.
- a reference script may include a text document containing dialogue text similar to the recorded content in the project (e.g., a series of nature documentary scripts, or a collection of scripts from a client's previous training videos).
- a user may choose Add from the Reference Script menu.
- in the File Open dialog 604, a user may navigate to the reference script text file, select it, and click OK.
- the Add Reference Script dialog 606 may open, where a user can name the reference script, choose a language, and view the text of the file below in a scrolling window.
- the “Script Text Matches Recorded Dialogue” option may be selected if the imported script exactly matches the recorded dialogue in the clips (e.g., a script the actors read their lines from).
- the analysis engine automatically sets the weighting of the reference script vs. the base language model based on length, frequency of key words, etc.
- a user may click the OK button, the Import Script dialog closes, and the analysis of the reference script may begin.
- the reference script is selected in the Analyze Content's Reference Script menu.
- when the user clicks the OK button, the selected clip's speech content is analyzed.
- where the reference script matches the recorded dialogue exactly (e.g., the script that was written for the project or transcriptions of interview sound bites),
- a user may select the “Script Text Matches Recorded Dialogue” option in the Add Reference Script dialog 606 , as discussed above. This may override the automatic weighting against the base language model and give the selected reference script a much higher weighting.
- Significantly higher accuracy can be achieved using matching reference scripts, although accuracy may be primarily dependent on the clarity of the spoken words and the quality of the recorded dialogue.
- an Adobe Story to Adobe OnLocation workflow may be used to embed the dialogue from each scene into a clip's metadata.
- a script written in Adobe Story may be imported into OnLocation, which may produce a list of shot placeholders for each scene.
- These placeholders may be recorded direct to disk using OnLocation during production or merged with clips that are imported into OnLocation after they were recorded on another device.
- the text for each scene from the original script may be embedded in the metadata of all the clips that were shot for that scene.
- Embedded metadata may be provided using various techniques, such as those described in U.S. patent application Ser. No. 12/168,522 entitled "SYSTEMS AND METHODS FOR ASSOCIATING METADATA WITH MEDIA USING METADATA PLACEHOLDERS", filed Jul. 7, 2008, which is hereby incorporated by reference as though fully set forth herein.
- the script text embedded in each of the clips may be automatically used as a reference script and, then, aligned with the recorded speech during the analysis.
- the analyzed speech text is replaced with the script text embedded in the source clip's extensible metadata platform (XMP) metadata.
- the “Use Embedded Adobe Story Script Option” of Analyze Content dialog 602 when the “Use Embedded Adobe Story Script Option” of Analyze Content dialog 602 is selected, Adobe Story script text embedded in an XMP will be used for analysis, and the Reference Script popup menu may be disabled. If the selected clip contains Adobe Story script embedded text, the “Use Embedded Adobe Story Script Option” may be checked by default. For mixed states in the selection (e.g., where at least one clip has Adobe Story script text embedded, and at least one clip does not), the dialog will open with the “Use Embedded Adobe Story Script Option” checkbox indicating a mixed state and the Reference Script popup menu may be enabled.
- if the analysis is run in this mixed state, the clip with the Adobe Story script embedded will be analyzed using the Adobe Story script and the clip without the Adobe Story script embedded will be analyzed using the reference script. Selecting the mixed state may generate a check in the "Use Embedded Adobe Story Script Option" checkbox and disable the "Reference Script" menu. If the analysis is run in this state, the result may be the same as above. Selecting the checkbox again may remove the check mark at the "Use Embedded Adobe Story Script Option" checkbox and may re-enable the "Reference Script" menu. If the analysis is run in this state, all clips may use the assigned reference script, and ignore any embedded Story Script text that may be in one or more of the selected clips.
- an STT engine may require that a custom language model include a minimum number of words (e.g., a minimum word count). That is, an STT engine may return an error and/or ignore a custom language model if the model does not include a minimum number of words. For example, if a portion of a script includes only ten words, a corresponding custom language model may include only the ten words. If the STT engine required a minimum of twenty-five words, the STT engine may not be able to use the custom language model having only ten words. In some embodiments, the words in the custom language model may be duplicated to meet the minimum word count.
- the ten words may be repeated two additional times in an associated document or file that defines the custom language model to generate a total of thirty words, thereby enabling the resulting custom language model to meet the minimum word requirement of twenty-five words. It is noted that if all of the words are replicated the same number of times, the word frequency table (e.g., how often each of the words in the custom language model is used), and the word tri-graph (e.g., indicative of other words that precede and follow a given word) of the custom language model should remain accurate. That is, the frequencies and the words that precede or follow a given word remain the same.
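- A minimal sketch of padding a short custom language model by replicating the whole word list (which preserves relative frequencies and precede/follow relationships) is shown below; the twenty-five-word minimum mirrors the example above.

```python
# Sketch of padding a short custom language model to a minimum word
# count by replicating the full word list whole, which preserves the
# relative frequencies and the precede/follow relationships.
import math

def pad_to_minimum(words, minimum=25):
    """Replicate the entire word list until the minimum count is met."""
    if not words or len(words) >= minimum:
        return list(words)
    repeats = math.ceil(minimum / len(words))
    return list(words) * repeats

# A ten-word model padded to a twenty-five-word minimum yields the ten
# words repeated three times (thirty words total), as in the example above.
```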
- various users (e.g., marketing personnel, advertisers, and legal personnel) may be interested in identifying and locating when specific entities (e.g., people, places, or things occurring in dialogue and events) appear in the final production video or film to enable, for example, identifying prominent entities that occur in a scene in order to perform contextual advertising (e.g., showing an advertisement for a certain type of car if the car appears in a crucial segment).
- the processed script, extracted entities, and time-aligned dialogue/entity metadata may enable third-party applications (e.g., contextual advertisers) to perform high relevancy ad placement.
- a method for identifying and aligning some or all entities within a script includes receiving script data, processing the script data, receiving video content data (e.g., video and audio data), processing the video content data, synchronizing the script data with the video content data to generate time-aligned script data, and categorizing each regular or proper noun entity within the time-aligned script data.
- receiving and processing script data and receiving and processing video content data are performed in series or in parallel prior to synchronizing the script data with the video content data, which is followed by categorizing each regular or proper noun entity within the time-aligned script data.
- Receiving script data may include processes similar to those above described with respect to document extractor 108 .
- receiving script data may include accepting a Hollywood "Spec." Movie Script or dramatic screenplay script document (e.g., document 104 ), converting this script into a specific structured and tagged representation (e.g., document data 110 ) via systematically extracting and tagging all key script elements (e.g., Scene Headings, Action Descriptions, Dialogue Lines), and then storing these elements as objects in a specialized document object model (DOM) (e.g., a structured/tagged document) for subsequent processing.
- Processing the script data may include extracting specific portions of the script. Extracted portions may include noun items.
- processing script data may include processing the objects (e.g., entire sentences tagged by script section) within the script DOM using an NLP engine that identifies, extracts, and tags the noun items identified by the system for each sentence. The extracted and tagged noun elements are then recorded into a specialized metadata database.
- Receiving video content data may include processes similar to those described above with respect to audio extractor 112 .
- receiving video content data may include receiving a video or audio file (e.g., video content 112 ) that contains spoken dialogue that closely but not necessarily exactly corresponds to the dialogue sections of the input script (e.g., document 104 ).
- the audio track in the provided video or audio file is then processed using a Speech-to-Text engine (e.g., audio extractor 112 ) to generate a transcription of the spoken dialogue (e.g., transcript 114 ).
- the transcription may include extremely accurate timecode information but potentially higher error rates due to noise and language model artifacts. All spoken words, along with timecode information indicating exactly at what point in time in the video or audio the words were spoken, are stored.
- Synchronizing the script data with the video content data to generate time-aligned script data may include processes similar to those described above with respect to synchronization module 102 .
- synchronizing the script data with the video content data to generate time-aligned script data may include analyzing and synchronizing the structured (but untimed) information in a tagged script document (e.g., document data 110 ) and the text resulting from the STT transcription stored in metadata repository (e.g., transcript 114 ) to generate a time-aligned script data (e.g., time aligned script data 116 ).
- the time-aligned script data is provided to a named Entity Recognition system to categorize each regular or proper noun entity contained within the time-aligned script data.
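- Purely as an illustration, a general-purpose NLP library such as spaCy could stand in for the named Entity Recognition step as sketched below; the (timecode, text) record format is an assumption, and this is not the NER system of the described embodiments.

```python
# Illustration only: spaCy stands in for the named Entity Recognition
# step; the described embodiments do not specify this engine, and the
# (timecode, text) record format is an assumption.
import spacy

nlp = spacy.load("en_core_web_sm")

def categorize_entities(time_aligned_lines):
    """time_aligned_lines: iterable of (timecode, text) pairs taken from
    the time-aligned script data. Returns (timecode, entity, category)."""
    results = []
    for timecode, text in time_aligned_lines:
        doc = nlp(text)
        for ent in doc.ents:                         # proper-noun style entities
            results.append((timecode, ent.text, ent.label_))
        for token in doc:                            # remaining regular nouns
            if token.pos_ == "NOUN" and not token.ent_type_:
                results.append((timecode, token.text, "NOUN"))
    return results
```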
- FIGS. 8A and 8B are block diagrams that illustrate components of and dataflow in a document time-alignment technique in accordance with one or more embodiments of the present technique. Note, the dashed lines indicate potential communication paths between various portions of the two block diagrams.
- System 800 may include features similar to that of previously described system 100 .
- script data is provided to system 800 .
- Script document/data 802 may be similar to document 104 .
- movie script documents, closed caption data, and source transcripts are presented as inputs to the system 100 .
- Movie scripts may be represented using a semi-structured Hollywood “Spec.” or dramatic screenplay format which provides descriptions of all scene, action, and dialogue events within a movie.
- script data 802 may be provided to a script converter 804 .
- Script converter 804 may be similar to document extractor 108 .
- script elements may be systematically extracted and imported into a standard structured format (e.g., XML, ASTX, etc.).
- Script converter 804 may enable all script elements (e.g., Scenes, Shots, Action, Characters, Dialogue, Parentheticals, and Camera transitions) to be accessible as metadata to applications (e.g., Adobe Story, Adobe OnLocation, and Adobe Premiere Pro) enabling indexing, searching, and organization of video by textual content.
- Script converter 804 may enable scripts to be captured from a wide variety of sources including: professional screenwriters using word processing or script writing tools, from fan-transcribed scripts of film and television content, and from legacy script archives captured by OCR.
- Script converter 804 may employ various techniques for capturing, analyzing, and converting scripts, such as those described in U.S. patent application Ser. No. 12/713,008 entitled "METHOD AND APPARATUS FOR CAPTURING, ANALYZING, AND CONVERTING SCRIPTS", filed Feb. 25, 2010, which is hereby incorporated by reference as though fully set forth herein.
- converted script data 805 (e.g., an ASTX format movie script) from script converter 804 may be provided to a script parser 806 .
- the script parser may be implemented as a portion of document extractor 108 .
- Spec. scripts captured and converted into a standard (e.g., Adobe) script format may be parsed by script parser 806 to identify and tag specific script elements such as scenes, actions, camera transitions, dialogue, and parentheticals.
- the ability to capture, analyze, and generate structured movie scripts may be used in certain time-alignment workflows (e.g., the Adobe Premiere Pro "Script Align" feature, where dialogue text within a movie script is automatically synchronized to the audio dialogue portion of video content).
- parsed script data is processed by a natural language processing (NLP) engine 808 .
- a filter 808 a analyzes dialogue and action text from the parsed script data. For example, the input text is normalized and then broken into individual sentences for further processing. Each sentence may form a basic information unit for lines of the script, such as lines of dialogue in the script, or descriptive sentences that describe the setting of a scene or the action within a scene.
- grammatical units of each sentence are tagged at a part-of-speech (POS) tagger 808 b .
- POS tagger 808 b may use a transformational grammar rules technique to first induce and learn a set of lexical and contextual grammar rules from an annotated and tagged reference corpus, and then apply the learned rules for performing the POS tagging step of submitted script sentences.
- tagged verb and noun phrases are submitted to a Named Entity Recognition (NER) system 808 c .
- NER system 808 c may then identify and classify entities and actions within each verb or noun phrase.
- NER 808 c may employ one or more external world-knowledge ontologies (API's) to perform the final entity tagging and classification.
- some or all extracted entities from NER system 808 c are then represented using a script Entity-Relationship (E-R) data model 810 that includes Scripts, Movie Sets, Scenes, Actions, Transitions, Characters, Parentheticals, Dialogue, and/or Entities.
- the instantiated model 810 may be physically stored into a relational database 812 .
- the instantiated model 810 may be mapped into an RDF-Triplestore 814 (see FIG. 8B ).
- a specialized relational database schema may be provided for certain applications (e.g., for Adobe Story).
- the schema may be used to record all script metadata and entities and the interrelationships between all entities.
- a relational database to RDF mapping processor 816 may then be used to automatically process the relational database schema representation of the E-R model 810 and transfer all script entities in relational database table rows into the RDF-Triplestore 814 .
- Mapping may include RDF mapping system and process techniques, such as those described in U.S. patent application Ser. No. 12/507,746 entitled "CONVERSION OF RELATIONAL DATABASES INTO TRIPLESTORES", filed Jul. 22, 2009, which is hereby incorporated by reference as though fully set forth herein.
- E-R model 810 may be saved to relational database 812 .
- Relational database 812 may implement E-R model 810 through a set of specially defined tables and primary key/foreign key referential integrity constraints between tables.
- an RDF-Triplestore 820 may be used to store the mapped relational database 812 using the output of relational database to RDF mapping processor 816 .
- RDF-Triplestore 820 may represent the relational information as a directed acyclic graph and may enable both sub-graph and inference chain queries needed by movie or script query applications that retrieve script metadata.
- Use of RDF-Triplestore 820 may allow video scene entities to be queried using an RDF query language such as SPARQL or a logic programming language, like Prolog.
- the RDF-Triplestore enables certain kinds of limited machine reasoning and inferences on the script entities (e.g., finding prop objects common to specific movie sets, classifying a scene entity using its IS_A generalization chain for a particular prop, or determining the usage and ownership rights to specific cartoon characters within a movie).
- Script dialogue data may be stored within RDF-Triplestore 820 .
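- The following sketch shows the kind of SPARQL query contemplated above, issued here through rdflib; the ex: vocabulary (ex:Scene, ex:hasSet, ex:Prop, ex:locatedIn) and the data file name are assumptions standing in for the actual triplestore schema.

```python
# Illustrative query against the triplestore: rdflib and the ex:
# vocabulary (ex:Scene, ex:hasSet, ex:Prop, ex:locatedIn) are assumptions
# standing in for RDF-Triplestore 820 and its actual schema.
from rdflib import Graph

graph = Graph()
graph.parse("script_entities.ttl", format="turtle")   # hypothetical export of the triplestore

# Find prop objects common to a specific movie set (cf. the reasoning
# examples mentioned above).
QUERY = """
PREFIX ex: <http://example.org/script#>
SELECT DISTINCT ?prop WHERE {
    ?scene a ex:Scene ;
           ex:hasSet ex:FerryTerminalBar .
    ?prop  a ex:Prop ;
           ex:locatedIn ex:FerryTerminalBar .
}
"""
for row in graph.query(QUERY):
    print(row.prop)
```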
- an application server 822 may be used to process incoming job requests and then communicate RDF-Triplestore data back to one or more client applications 824 , such as Adobe Story.
- Application server 822 may contain a workflow engine along with one or more optional web-servers. Script analysis requests or queries for video and script metadata may be processed by server 822 , and then dispatched to a workflow engine which invokes either the NLP analysis engine 808 or a multimodal video query engine 826 .
- Application server 822 may include a Triad/Metasky web server.
- client application 824 may be used to implement further processing.
- Adobe Story is a product that a client may use to leverage outputs of the workflows described herein, allowing script writers to edit and collaborate on movie scripts and to extract, index, and tag script entities such as people, places, and objects mentioned in the dialogue and action sections of a script.
- Adobe Story may include a script editing service.
- the above described steps may describe certain aspects of text processing.
- the following described steps may describe certain aspects of video and audio processing.
- video/audio content 830 is input and accepted by the workflow system 800 .
- Video/audio content 830 may be similar to that of video content 106 .
- Video/audio content 830 may provide video footage and corresponding dialogue sound tracks.
- the audio data may be analyzed and transcribed into text using an STT engine, such as those described herein.
- a resulting generated STT transcript (e.g., similar to transcript 114 ) may be aligned with converted textual movie scripts 805 .
- the STT transcript may be processed by the natural language analysis and entity extraction components for keyword searching of the video. These components may use multimodal video search techniques, such as those described in U.S.
- audio content is provided.
- input audio dialogue tracks may be directly provided by television or movie studios, or extracted from the provided video files using standard known extraction methods.
- the extracted audio may be converted to a mono channel format that uses 16-bit samples with a 16 kHz frequency response.
- an STT engine 832 is modified by use of a custom language model (CLM).
- STT engine 832 may employ transcription based at least partially or completely on a provided CLM.
- the CLM may be provided/built using certain methods, such as those described herein.
- STT engine 832 includes a multicore STT engine.
- the multicore STT engine may segment the source audio data and may provide STT transcriptions using parallel processing.
- speech-to-text (STT) technology may implement a custom language model and/or an enhanced multicore STT transcription engine such as those described in U.S.
- a metadata time synchronization service 834 aligns elements of the transcript generated by STT engine 832 with corresponding portions of script data 802 to generate time-aligned script data.
- Metadata time synchronization service 834 may be similar to synchronization module 102 .
- metadata time synchronization service 834 implements a specialized STT/Script alignment component to provide time alignment of non-timecoded words in the script with timecoded words in the STT transcript using a hybrid two-level alignment process, such as that described herein with regard to synchronization module 102 . For example, in level one processing, smaller regions or partitions of text and STT transcription keywords are accurately identified and prepared for detailed alignment.
- each script word may be assigned an accurate video timecode. This facilitates keyword search and time-indexing of the video by client applications such as the multimodal video search engine 826 , or other applications.
- a modified Viterbi and/or phonetic/text comparator is implemented by metadata time synchronization service 834 .
- the alignment process may also implement special override rules to resolve alignment option ties. As described herein, a decision as to whether or not an alignment is made may not rely only on precise text matches between the transcribed STT word and the script word, but rather, may rely on how closely words sound to each other; this may be provided for using a specialized phonetic encoding of the STT words and script words. Such a technique may be applicable to supplement a wide variety of STT alignment applications.
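- As an illustration, a simplified Soundex-style code can approximate the kind of phonetic comparison described; this is a stand-in sketch, not the specialized phonetic encoding of the described system.

```python
# A simplified Soundex-style code, used here only to illustrate how a
# phonetic comparison can supplement exact text matching; it is not the
# specialized phonetic encoding of the described system.
CONSONANT_CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                   **dict.fromkeys("dt", "3"), "l": "4",
                   **dict.fromkeys("mn", "5"), "r": "6"}

def phonetic_code(word, length=4):
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    code, prev = word[0].upper(), CONSONANT_CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CONSONANT_CODES.get(ch, "")
        if digit and digit != prev:                  # skip vowels and collapse repeats
            code += digit
        prev = digit
    return (code + "000")[:length]

def sounds_alike(stt_word, script_word):
    return phonetic_code(stt_word) == phonetic_code(script_word)

# sounds_alike("there", "their")  -> True, even though the text differs
```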
- data relating to the alignment may be provided to the user via a graphical display that presents source script dialogue, the resulting time-aligned words, and/or video content in association with one another (e.g., via a GUI/visualization element of an application such as the CS5 Premiere Pro Script Align feature).
- a user may search a video based on the corresponding words in the time-aligned script data.
- a multimodal video search engine may allow a user to search for specific segments of video based on provided query keywords.
- the search feature may implement various techniques, such as those described in U.S. patent application Ser. No. 12/618,353 entitled “ACCESSING MEDIA DATA USING METADATA REPOSITORY”, filed Nov. 13, 2009, which is hereby incorporated by reference as though fully set forth herein.
- using time-aligned script data (e.g., time aligned script data 116 as described with respect to FIGS. 1 and 2 ) provided by system 100, locations for the insertion of video descriptions can be identified, and video description content can be extracted from the script and automatically inserted into a time-aligned script and/or audio track.
- Video descriptions may include an audio track in a movie or television program containing descriptions of the setting and action. Video description narrations fill in the story gaps by describing visual elements and provide a more complete description of what's happening in the program. This may be of particular value to the blind or visually impaired by helping to describe visual elements that they cannot view.
- the video description may be inserted into the natural pauses in dialogue or between critical sound elements, or the video and audio may be modified to enable insertion of video descriptions that may otherwise be too long for the natural pauses.
- Video description content may be generated by extracting descriptive information and narrative content from a script written for the project, syncing and editing it to the video program for playback.
- Video description content may be extracted directly from descriptive text embedded in the script. For example, location settings, actor movements, non-verbal events, etc. that may be provided in script elements (e.g., title, author name(s), scene headings, action elements, character names, parentheticals, transitions, shot elements, dialogue/narrations, and the like) may be extracted as the video description content, aligned to the correct portion of scenes (e.g., to pauses in dialogue) using time alignment data, and the video description content may be manually or automatically edited (if needed) to fit into the spaces available between dialogue segments.
- the time aligned data acquired using system 100 may be used to identify the location of pauses within the audio content for embedding narrative content (e.g., action elements).
- the locations of the pauses in the audio content may be provided to a user as locations for inserting video description content.
- narrative content (e.g., action element descriptions embedded in the script) may be automatically inserted into corresponding pauses within the dialogue of the audio track to provide the corresponding video description content.
- the resulting video description content may be reviewable and editable by a user.
- a text version of the video description content can be used as a blueprint for recording by a human voiceover talent.
- the video description track can be created automatically using synthesized speech to read the video description content (e.g., without necessarily requiring any or at least a significant amount of human labor).
- a script may include a variety of script elements such as a scene heading, action, character, parenthetical, dialogue, transition, or other text that cannot be classified. Any or all of these and other script elements can be used to generate useful information for a video description track.
- a scene heading (also referred to as a “slugline”) includes a description of where the scene physically occurs.
- a scene heading may indicate that the scene takes place indoors (e.g., INT.) or outdoors (e.g., EXT.), or possibly both indoors and outdoors (e.g., INT./EXT.)
- a location name follows the description of where the scene physically occurs.
- “INT./EXT.” may be immediately followed by a more detailed description of where the scene occurs. (e.g., INT. KITCHEN, INT. LIVING ROOM, EXT. BASEBALL STADIUM, INT. AIRPLANE, etc.).
- the scene heading may also include the time of day (e.g., NIGHT, DAY, DAWN, EVENING, etc.). This information embedded in the script helps to “set the scene.”
- the scene type is typically designated as internal (INT.) or external (EXT.), and includes a period following the INT or EXT designation.
- a hyphen is typically used between other elements of the scene heading. For example, a complete scene heading may read, “INT. FERRY TERMINAL BAR—DAY” or “EXT. MAROON MOVIE STUDIO—DAY”.
- An action element typically describes the setting of the scene and introduces the characters in a scene. Action elements may also describe what will actually happen during the scene.
- a character name element may include an actual name (e.g., MS. SUTTER), description (e.g., BIG MAN) or occupation (e.g., BARTENDER) of a character. Sequence numbers are typically used to differentiate similar characters (e.g., COP #1 and COP #2). A character name is almost always inserted prior to a character speaking (e.g., just before dialog element), to indicate that the character's dialogue follows.
- a dialog element indicates what a character says when anyone on screen or off screen speaks. This may include conversation between characters, when a character speaks out loud to themselves, or when a character is off-screen and only their voice is heard (e.g., in a narration). Dialog elements may also include voice-overs or narration when the speaker is on screen but is not actively speaking on screen.
- a parenthetical typically includes a remark that indicates an attitude in dialog delivery, and/or specifies a verbal direction or action direction for the actor who is speaking the part of a character.
- Parentheticals are typically short, concise and descriptive statements located under the characters name.
- a transition typically includes a notation indicating an editing transition within the telling of a story.
- “DISSOLVE TO:” means the action seems to blur and refocus into another scene, as generally used to denote a passage of time. Transitions almost always follow an action element and precede a scene heading. Common transitions include: “DISSOLVE TO:”, “CUT TO:”, “SMASH CUT:”, “QUICK CUT:”, “FADE IN:”, “FADE OUT:”, and “FADE TO:”.
- a shot element typically indicates what the camera sees.
- a shot element that recites “TRACKING SHOT” generally indicates the camera should follow a character as he walks in a scene.
- “WIDE SHOT” generally indicates that every character appears in the scene.
- a SHOT tells the reader the focal point within a scene has changed.
- Example of shot elements include: “ANGLE ON . . . ”, “PAN TO . . . ”, “EXTREME CLOSE UP . . . ”, “FRANKIE'S POV . . . ”, and “REVERSE ANGLE . . . ”.
- script elements may be identified and extracted as described in U.S. patent application Ser. No. 12/713,008 entitled “METHOD AND APPARATUS FOR CAPTURING, ANALYZING, AND CONVERTING SCRIPTS”, filed Feb. 25, 2010, which is hereby incorporated by reference as though fully set forth herein.
- the script elements may be time aligned to provide time-aligned data 116 as described herein.
- the time aligned data may include dialogue as well as other script elements having corresponding timecodes that identify when each of the respective words/elements occur within the video/audio corresponding to the script.
- FIG. 9A illustrates an exemplary script document 900 in accordance with one or more embodiments of the present technique.
- Script document 900 depicts an exemplary layout of the above described script elements.
- script document 900 includes a transition element 902 , a scene heading element 904 , action elements 906 a , 906 b and 906 c , character name elements 908 , dialog elements 910 , parenthetical elements 912 , and shot element 914 .
- Script writers and describers often have closely aligned goals to describe onscreen actions succinctly, vividly and imaginatively.
- the action element text may be the most useful for creating video description content, as action elements typically provide the descriptions that clearly describe what has happened, is happening, or is about to happen in a scene.
- long text passages in a script describing major changes in the setting or complex action sequences translate to longer spaces between dialogue in the recorded program (often filled with music and sound effects) and provide opportunities for including longer segments of video description content.
- the action described under the scene heading 904 and action element 906 a is a wide establishing shot that follows the character out onto a busy studio lot.
- a user may have control over which script elements to use in creating a video description. For example, a user may select to use only action elements and shot elements and to ignore other elements of the script. In some embodiments, the selection may be done before or after the video description is generated. For example, a user may allow the system to generate a video description using all or some of the script elements, and may subsequently pick-and-choose which elements to keep after the initial video description is generated.
- FIG. 9B illustrates an exemplary portion of a video description script 920 that corresponds to the portion of script 900 of FIG. 9A .
- Video description script 920 includes a video description track 922 broken into discrete segments ( 1 - 9 ) provided relative to gaps and dialogue of an audio track (e.g., main audio program recorded dialogue) 924 that corresponds to spoken words of dialogue content of script 920 .
- the content of video description track 922 corresponds to action element text of action elements 906 a , 906 b and 906 c of script 900 of FIG. 9A .
- Each corresponding pause/gap in dialogue of audio track 924 is identified with a time of duration (e.g., "00:00:28:00 Gap" indicating a gap of twenty-eight seconds prior to the beginning of the script dialogue of segment 2 ).
- the corresponding content of video description 922 is provided adjacent the gap/pause, and is identified with a time of duration for the video description content (e.g., “00:00:27:00” indicating twenty-seven seconds for the video description content to be spoken) where applicable.
- the content of video description 922 may be modified to fit within the corresponding gap. For example, in the illustrated embodiment, a portion of the first segment of video description content is removed to enable the resulting video description content to fit within the duration of the gap when spoken.
- the entire video description content may be deleted or ignored where there is not a gap of sufficient length for the video description content.
- the video description content of segment 3 was deleted/ignored as the corresponding pause in dialogue was only about twelve frames (or half a second) in duration, too short for the insertion of the corresponding video description content.
- Video description script 920 and video description content 922 can be used as a blueprint for recording by a human voiceover talent. Thus, a voicer may simply have to read the corresponding narration content as opposed to having to manually search through a program, manually identify breaks in the dialog, and derive/record narrations to describe the video.
- the video description track can be created automatically using synthesized speech to read the video description content 922 (e.g., without necessarily requiring any or at least a significant amount of human labor).
- FIG. 9C is a flowchart that illustrates a method 950 of generating a video description in accordance with one or more embodiments of the present technique.
- Method 950 may provide video description techniques using components and dataflow implemented at system 100 .
- Method 950 generally includes identifying script elements, time aligning the script, identifying gaps/pauses in dialogue, aligning video description content to the gaps/pauses, generating a script with video description content, and generating a video description.
- Method 950 may include identifying script elements, as depicted at block 952 .
- Identifying script elements may include identifying some or all of the script elements contained within a script from which a video description is to be generated. For example, a script may be analyzed to provide script metadata that identifies a variety of script elements, such as scene headings, actions, characters, parentheticals, dialogue, transitions, or other text that cannot be classified.
- script elements may be identified and extracted as described in U.S. patent application Ser. No. 12/713,008 entitled “METHOD AND APPARATUS FOR CAPTURING, ANALYZING, AND CONVERTING SCRIPTS”, filed Feb. 25, 2010, which is hereby incorporated by reference as though fully set forth herein.
- the identification of the elements may not actually be performed as part of the method; previously identified elements may simply be provided or retrieved for analysis.
- Method 950 may also include time aligning the script, as depicted at block 954 .
- Time aligning the script may include using techniques, such as those described herein with regard to system 100 , to provide a timecode for some or all elements of the corresponding script.
- a script may be processed to provide a timecode for some or all of the words within the script, including dialogue or other script elements.
- the timecode information may provide stop and start time for various elements, including dialogue, which enables the identification of pauses between spoken words of dialogue.
- the time alignment may not actually be performed but may simply be provided. For example, a system generating a video description may be provided or retrieve time aligned script data 116 .
- Method 950 may also include identifying gaps/pauses in dialogue, as depicted at block 956 .
- identifying gaps/pauses in dialogue may include assessing timecode information for each word of spoken dialogue to identify the beginning and end of spoken lines of dialogue, as well as any pauses in the spoken lines of dialogue that may provide gaps for the insertion of video description content. For example, in video description script 920 of FIG. 9B , a pause of twenty-eight seconds was identified at segment 1 , prior to the start of recorded dialogue of segment 2 , a pause of 0.12 seconds was identified at segment 3 , and a pause of 4.06 seconds was identified at segment 7 .
- a gap threshold may be used to identify what pauses are of sufficient length to constitute a gap that may be of sufficient length to be used for inserting video description content. For example, a gap threshold of three seconds may be set, thereby ignoring all pauses of less than three seconds and identifying only pauses equal to or greater than three-seconds as gaps of sufficient length to be used for inserting video description content. Such a technique may be useful to ignore normal pauses in speech (e.g., between spoken words) or short breaks between characters lines of dialogue that may be so short that it would be difficult to provide any substantive video description within the pause.
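- A minimal sketch of gap identification from per-word timecodes follows; the (start, end) word-time representation is an assumption, and the three-second default mirrors the example threshold above.

```python
# Minimal sketch of identifying dialogue gaps long enough to hold video
# description content, given per-word (start, end) timecodes from the
# time-aligned script data; the representation is an assumption.
def find_gaps(word_times, gap_threshold=3.0):
    """word_times: list of (start, end) times in seconds for each spoken
    dialogue word, in playback order. A leading gap before the first word
    can be captured by prepending (0.0, 0.0). Returns (gap_start, gap_end)
    pairs whose duration meets the threshold."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(word_times, word_times[1:]):
        if next_start - prev_end >= gap_threshold:
            gaps.append((prev_end, next_start))
    return gaps
```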
- the gap threshold value may be user selectable.
- as depicted in FIG. 9B, segment 3 of recorded dialogue 924 includes an inserted statement of "No gap available", and the corresponding action text was deleted/ignored (as indicated by the strikethrough).
- the gap may be detected, but may be ignored.
- the user may be alerted to the gap, thereby enabling them to readily identify gaps that could be used for the insertion of additional video description content.
- video descriptions may be inserted into any available gaps, even out of sequence with their corresponding location in the script, according to rules or preferences provided by the user.
- at segment 3, there was no available gap for the video description that would normally be inserted at that point according to the script. However, if there were another available gap within a prescribed number of seconds before or after that segment (e.g., segment 3 ), the video description could be inserted at that other nearby location.
- Method 950 may also include aligning video description content to gaps/pauses, as depicted at block 958 .
- Aligning the video description content may include aligning the script elements with dialogue relative to where they occur within the script.
- each of the action elements 906 a , 906 b and 906 c are aligned relative to dialogue that occurs before or after the respective action elements.
- aligning video description content includes modifying the video description content and/or the recorded dialogue for merging of the video description content with the recorded dialogue where possible. For example, as depicted in FIG. 9B, the script action elements have been aligned to the recorded dialogue and the action element text from the script has been aligned with the available gaps when possible.
- Two gaps were identified at segments 1 and 7 for the insertion of corresponding video description content and one action element text segment was deleted because a gap/pause of sufficient length was not available between the lines of dialogue where it was located in the script.
- the user may be provided the opportunity to edit, rewrite, move, or delete the video description content, or the video description content may be automatically modified to fit within the provided gap or deleted.
- a user may have control over the resulting video description. For example, a user may modify a video description at their choosing, or may be provided an opportunity to select how to truncate a video description that does not fit within a gap. For example, in the illustrated embodiment of FIG. 9B , a user may select to remove the text of segment 1 (as indicated by the strikethrough) in an effort to make the video description fit within the corresponding gap.
- video description content may be automatically modified to fit within a given gap. If a gap is too short to fit the corresponding video description content, the video description content may be automatically truncated using rules of grammar.
- the last word(s) or entire last sentence(s) may be incrementally truncated/removed until the remaining video description content is short enough to fit within the gap.
- the last sentence “Maroon is leading an entourage of ASSISTANTS trying to keep up” may have been automatically removed, relieving the user of the need to manually modify the content.
- the user may have the opportunity to approve or modify the changes.
- the duration may be updated dynamically to indicate to the user whether the revised description will fit within an available gap.
- a gap in the recorded program may be created or the duration of a gap may be modified to provide for the insertion of video description content.
- the gap in the recorded audio may be increased (e.g., by inserting an additional amount of pause in the audio track between the end of segment 2 and the beginning of segment 4 ) to five seconds to enable the action element text to be fit within the resulting gap.
- Such a technique may be automatically applied at some or all instances where a gap is too short in duration to fit the corresponding video description content.
- although modifications of the dialogue may introduce delays or pauses within the corresponding video and, thus, may modify the video and dialogue of a traditional program, such a technique may be particularly helpful in the context of audio-only programs (e.g., books-on-tape or similar audio tracks produced for the blind or visually impaired).
- video description content may be allowed to overlap certain portions of the audio track.
- a user may have the option of modifying the video description content to overlap seemingly less important portions of the dialogue, music, sound effects, or the like.
- the main audio recorded dialogue, music, sound effects, or the like may be dipped (e.g., reduced) in volume so that the video description may be heard more clearly. For example, the volume of music may be lowered while the video description content is being recited.
- Method 950 may also include generating a script with video description content, as depicted at block 960 .
- Generating a script with video description content may include generating a script document that includes video description content, script/recorded dialogue, and/or other script elements aligned with respect to one another.
- FIG. 9B illustrates an exemplary video description script 920 that includes video description content 922 and recorded dialogue 924 .
- the modifications to the video description content are displayed.
- a “clean” version of the video description script may be provided.
- the clean video description script may incorporate some or all of the modifications without displaying the modifications themselves.
- a text version of the video description content can be used as a blueprint for recording by a human voiceover talent.
- a voicer may simply have to read the corresponding narration content as opposed to having to manually search through a program, manually identify breaks in the dialog, compose appropriate video descriptions of correct lengths, and/or derive/record narrations to describe the program.
- Method 950 may also include generating a video description, as depicted at block 962 .
- Generating the video description may include recording a reading of the video description content. For example, a reading by a voicer and/or a synthesized reading of the video description content may be recorded to generate a video description track.
- the video description track may be merged with the original audio of the program to generate a program containing both the original audio and the video description audio.
- a script may go through many revisions between the time the production team begins working on the project and the time the final edited program is completed. Scenes may be added or deleted, dialogue may be re-written or ad-libbed during recording, and shots may be reordered during the editing process. In certain scenarios, the script may not be updated until the final cut of the program has been approved and someone spends the time to manually revise the script such that it matches the actual edited program.
- Another scenario may include creating different versions or cuts of an edited program, with each of the versions including a unique set of variations from the original script. Thus, there may be multiple versions of the script, with each version being accurate to a specific matching “cut” of the edited program.
- a version may be created with one set of dialogue/video content that is appropriate for viewing by a restricted audience (e.g., adults only) and a different version with a different set of dialogue/video content that is appropriate for a broader audience (e.g., children).
- changes to a script and information relating to certain portions of the script may be recorded using script metadata.
- the script metadata may be updated to reflect changes that occur during the production process.
- the script metadata may be an accurate representation of the audio/video of a program, and may be used to generate an accurate final script.
- An accurate final script may require little or no time for review and may be useful in subsequent processing, such as time aligning the script or other processes as described herein.
- the final/revised script may be used in place of the original script as a source of script data (e.g., document (script) data 110 ) that is used for time aligning with a transcript (e.g., transcript 114 ) of corresponding video content (e.g., video content 106 ).
- FIG. 10A is a block diagram that illustrates components and dataflow of a system for processing a script workflow (workflow) 970 in accordance with one or more embodiments of the present technique.
- Workflow 970 includes a script 972 describing a plurality of scenes (e.g., scenes 1 - 6 ).
- Script 972 may include, for example, a written script similar to those described above with respect to FIGS. 1B and 9A .
- script 972 may include an original version of the script. Although the original version of a script is followed during production, there are typically changes during production and editing. For example, scenes may be edited, added, deleted, or reordered during production. An original version of the script may include a version of the script prior to changes made during production.
- Script 972 may include embedded metadata, such as script elements, that provide information related to a scene.
- script 972 includes metadata 974 (e.g., dialogue elements), associated with each scene.
- metadata 974 may be broken into smaller segments, such as segments of metadata associated with a particular scene or shot.
- script 972 may be processed to generate a structured/tagged script document including metadata 974 .
- metadata 974 may be extracted from script 972 and associated with one or more clips 976 that are shot during production of the program for script 972 .
- metadata 974 may be broken into smaller segments and may be distributed among various recorded clips 976 .
- Segments of metadata 974 from a portion of script 972 may be associated with a clip corresponding to the same portion of script 972.
- segments of metadata 974 from scenes ( 1 - 6 ) of script 972 are extracted and associated with one of a series of clips 976 that are associated with a particular scene ( 1 - 6 ) of script 972 .
- segments of metadata 974 may be associated with a plurality of clips.
- a segment of metadata for Scene 1 of script 972 may be associated with both Clips 1 A and 1 B where they are both clips of a portion of Scene 1 .
- Each of clips 976 may include electronic copies (e.g., files) of clips that are shot during production of the program for script 972 .
- Segments of metadata 974 may be embedded into each corresponding clip 976 (e.g., into the file containing the clip) such that the particular segment of metadata 974 travels with a corresponding clip 976 .
- the segment of metadata 974 embedded with Clip 1 may be accessible by an application that accesses Clip 1.
- clips 976 may act as metadata containers that enable segments of metadata to travel with a particular clip and be accessed independent of a source of the segment of metadata (e.g., metadata 974 of script 972 ).
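- A minimal sketch of the metadata-container idea is given below; it assumes a hypothetical in-memory clip record rather than any particular media file format (a real implementation might embed the metadata directly in the clip file, e.g., as sidecar or embedded metadata).

```python
# Sketch: treat each clip as a container that carries its own segment of script
# metadata, so the segment travels with the clip independently of the source
# script. The Clip class and its fields are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Clip:
    name: str                                     # e.g., "Clip 1A"
    scene: str                                    # scene of the script the clip covers, e.g., "Scene 1"
    metadata: dict = field(default_factory=dict)  # embedded segment of script metadata

def embed_scene_metadata(script_metadata, clips):
    """Copy the segment of script metadata for each clip's scene into the clip."""
    for clip in clips:
        clip.metadata = dict(script_metadata.get(clip.scene, {}))
    return clips

script_metadata = {"Scene 1": {"dialogue": "Hello there.", "shot": "1"}}
clips = embed_scene_metadata(script_metadata, [Clip("Clip 1A", "Scene 1"), Clip("Clip 1B", "Scene 1")])
print(clips[0].metadata)  # {'dialogue': 'Hello there.', 'shot': '1'}
```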
- one or more of clips 976 may be accessed to add metadata or modify existing metadata associated with one or more clips 976 .
- a user may access Clip 1 , to embed a particular segment of metadata 974 (e.g., Script elements of Scene 1 ) with Clip 1 .
- a user may access Clip 1 to provide revised metadata 978 that is embedded into Clip 1 .
- revised metadata 978 may include the revised line of dialogue
- Clip 1 may be accessed to replace a corresponding line of dialogue from Scene 1 embedded in Clip 1 with the revised line of dialogue contained in revised metadata 978 .
- Similar techniques may be employed to update clip numbers or other portions of metadata of clips 976 (e.g., a scene/shot number).
- revised metadata 978 may reflect changes made to a portion of script 972, and may be used to update some or all clips that refer to the changed portion of script 972.
- Clip 1 and any other clips that refer to the dialogue of Scene 1 may be automatically updated using corresponding revised metadata 978 that includes the changes to Scene 1 .
- Such an embodiment may enable “master” changes made to script 972 to be automatically applied to the metadata of all clips that rely on the changed segment of metadata 974 of script 972 .
- changes made in metadata of a particular clip may be applied to all related metadata, such as the metadata of other clips that reference the same metadata of script 972 (e.g., clips that reference the same line of dialogue in the script).
- further discussion of revising metadata is provided with respect to the embodiments described below regarding FIGS. 10B-10E.
- a user may be provided an opportunity to define how revised metadata 978 is applied. For example, a user may be provided the option of applying revised metadata 978 to a particular clip, or all clips that reference the same source of metadata.
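- The single-clip versus all-clips choice described above could be sketched as follows; the dict-based clip records, function name, and parameters are illustrative assumptions only.

```python
# Sketch: apply revised metadata either to a single clip or to every clip that
# references the same segment (scene) of the source script.

def apply_revision(clips, scene, revised_fields, target_clip=None):
    """clips: list of dicts like {"name": ..., "scene": ..., "metadata": {...}}.
    If target_clip is given, revise only that clip; otherwise propagate the
    "master" change to every clip that references the given scene."""
    for clip in clips:
        if clip["scene"] != scene:
            continue
        if target_clip is None or clip["name"] == target_clip:
            clip["metadata"].update(revised_fields)
    return clips

clips = [
    {"name": "Clip 1A", "scene": "Scene 1", "metadata": {"dialogue": "Hello there."}},
    {"name": "Clip 1B", "scene": "Scene 1", "metadata": {"dialogue": "Hello there."}},
]
apply_revision(clips, "Scene 1", {"dialogue": "Hello there, friend."})    # all takes of Scene 1
apply_revision(clips, "Scene 1", {"shot": "1B"}, target_clip="Clip 1B")   # a single take only
```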
- clips may be arranged into an edited sequence 980 that is provided to a script generator 982 to generate a revised script 984 .
- revised sequence 980 includes Clips 1 , 3 , 5 , and 6 , as well as a new clip, Clip 3 A.
- new Clip 3A, which is not present in script 972, has been added, and Clips 2 and 4 have been removed.
- Clip 3 A may include metadata that is similar to metadata 974 and revised metadata 978 .
- metadata of clip 3 A may include script elements, such as dialogue.
- Script generator 982 may compile the scene metadata in each clip of the edited sequence to generate revised script 984.
- revised script 984 may include an ordered script that is arranged in accordance with an order of revised sequence 980 .
- revised script 984 includes Scenes 1, 3, 3A, 5, and 6 in the same order as Clips 1, 3, 3A, 5, and 6 in revised sequence 980.
- script elements of revised script 984 are generated based on metadata embedded in each clip of revised sequence 980 .
- dialogue of Scene 1 , 3 , 3 A, 5 , and 6 may include the dialogue embedded in metadata of Clips 1 , 3 , 3 A, 5 , and 6 .
- the revised clip metadata may be used to generate revised script 984, as opposed to the metadata 974 provided in script 972.
- Script metadata may, thus, be embedded into each clip, and may be used to generate a corresponding script from any sequence of clips, irrespective of an order the clips are arranged and/or the source of the clips and their embedded metadata. Thus, a user may generate a revised script via combining any number of clips having embedded metadata.
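- One plausible (and deliberately simplified) reading of the script-generator step is sketched below: the embedded metadata of each clip in the edited sequence is compiled, in sequence order, into script text. The record layout and field names are assumptions; a real script generator would emit an industry-standard script layout.

```python
# Sketch: compile a revised script from the metadata embedded in an ordered
# (edited) sequence of clips. Scene order follows the sequence, not the
# original script.

def generate_revised_script(edited_sequence):
    lines = []
    for clip in edited_sequence:
        meta = clip["metadata"]
        lines.append(meta.get("scene_heading", clip["scene"]).upper())
        if "character" in meta:
            lines.append(meta["character"].upper())
        if "dialogue" in meta:
            lines.append(meta["dialogue"])
        lines.append("")  # blank line between scenes
    return "\n".join(lines)

sequence = [
    {"name": "Clip 1", "scene": "Scene 1",
     "metadata": {"scene_heading": "Int. Kitchen - Day", "character": "Ann", "dialogue": "Hello there, friend."}},
    {"name": "Clip 3A", "scene": "Scene 3A",
     "metadata": {"scene_heading": "Ext. Street - Night", "character": "Bob", "dialogue": "We should go."}},
]
print(generate_revised_script(sequence))
```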
- revised script 984 may include a written document that is provided in an industry standard format, similar to that of FIGS. 1B and 9A .
- revised script 984 may be provided as a structured/tagged document including revised script metadata 985 .
- script 984 may be provided to other modules for additional processing.
- some or all of revised script 984 and/or revised script metadata 985 associated with script 984 may be provided to synchronization module 102 (see FIGS. 1A and 2 ), in place of or in combination with document (script) data 110 .
- Such an embodiment may aid with time-alignment by providing script metadata that more accurately reflects actual video/audio content 106 and/or the corresponding transcript 114.
- FIG. 10B is a block diagram that illustrates components and dataflow of a system for providing script data (system) 1000 in accordance with one or more embodiments of the present technique.
- system 1000 implements a metadata reviser 1002 that provides for revision of script metadata 974 and/or clip metadata 1006 to generate revised clip metadata 1008 .
- metadata reviser 1002 may provide access to embedded metadata of Clip 1 for the revision thereof.
- the resulting revised clip metadata 1008 may be returned to metadata reviser 1002 for additional revisions, may be provided for display, and/or may be provided to another module for additional processing.
- the now revised metadata of Clip 1 may be subsequently accessed for additional revisions, or may be provided to another module, such as script generator 982 for use in generating revised script 984 .
- script metadata 974 may be provided via techniques similar to those described above. For example, script 972 (e.g., the same or similar to script document 104 described above) may be provided to a script extractor 1004 (e.g., the same or similar to document extractor 108 described above).
- Script extractor 1004 may generate corresponding script metadata 974, such as a tagged/structured document (e.g., the same or similar to script data 110 discussed above).
- script metadata 974 may include the program title, author names, scene headings, action elements, shot elements, character names, parentheticals, and dialogue.
- script metadata 974 may include additional information that is extracted or derived from script 972 or added by a user.
- script metadata 974 may include additional identifiers, such as scene numbers, shot numbers, and the like that are derived directly from the script by script extractor 1004 and/or are manually inserted by a user.
- script metadata 974 may be generated using various techniques for extracting and embedding metadata, such as those described in U.S. patent application Ser. No. 12/168,522 entitled “Systems and methods for Associating Metadata With Media Using Metadata Placeholders”, filed Jul. 7, 2008, which is hereby incorporated by reference as though fully set forth herein.
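- Purely by way of illustration, a very small script extractor might tag lines of a plain-text script using layout heuristics such as those below. The heuristics (INT./EXT. scene headings and all-caps character cues) are assumptions about common script formatting, not the extraction method of the incorporated application.

```python
import re

# Sketch: tag lines of a plain-text script as scene headings, character cues,
# dialogue, or action using simple layout heuristics. Real spec scripts carry
# richer structure (parentheticals, transitions, shot elements) than is handled here.

SCENE_RE = re.compile(r"^(INT\.|EXT\.)", re.IGNORECASE)

def extract_script_metadata(script_text):
    elements, current_character = [], None
    for line in script_text.splitlines():
        line = line.strip()
        if not line:
            current_character = None
            continue
        if SCENE_RE.match(line):
            elements.append({"type": "scene_heading", "text": line})
        elif line.isupper():
            current_character = line
            elements.append({"type": "character", "text": line})
        elif current_character:
            elements.append({"type": "dialogue", "character": current_character, "text": line})
        else:
            elements.append({"type": "action", "text": line})
    return elements

sample = "INT. KITCHEN - DAY\nAnn pours coffee.\n\nANN\nHello there, friend."
for element in extract_script_metadata(sample):
    print(element)
```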
- Script metadata 974 may be provided to various modules for processing and may be made available to users, such as production personnel on set, for viewing and revision.
- clip metadata 1006 may be extracted from script metadata 974 .
- segments of script metadata 974 may be associated with one or more clips (e.g., Clips 976 ) to generate a segment of clip metadata 1006 .
- Production personnel may modify clip metadata 1006 as changes are made to script 972, and/or may modify clip metadata 1006 after a scene is shot to reflect what actually happened in the scene (e.g., the actual dialogue spoken) or how clips for the scene were actually shot.
- a user may directly modify clip metadata.
- a file including clip metadata 1006 may be accessed via metadata reviser 1002, clip metadata 1006 may be modified (e.g., with revised metadata 978), and the file saved for later use.
- Metadata reviser 1002 may enable access to clip metadata 1006 for review and/or revision.
- metadata reviser 1002 may include a module that provides for presenting (e.g., displaying) clip metadata 1006 to a user and/or may enable the revision/editing/modifying of clip metadata 1006 , thereby generating revised clip metadata 1008 .
- clip metadata 1006 may be revised to reflect changes that have been made during various phases of production. For example, prior to shooting a scene, production personnel may wish to make changes to the clip metadata 1006 associated with a current version of script 972 . As a further example, during or after shooting of a scene or clip, production personnel may desire to update clip metadata 1006 to reflect what actually occurred in the recorded takes of the scene.
- the production personnel may simply access clip metadata 1006 associated with a particular clip via metadata reviser 1002 , and may make appropriate changes that are reflected in revised clip metadata 1008 .
- the production personnel may access an electronic version of clip metadata 1006 via a display of at least a portion of the dialogue associated with the clip, the production personnel may navigate to the portion of the clip of interest (e.g., the scene containing the line of dialogue), and the production personnel may edit the line of dialogue as appropriate.
- revised clip metadata 1008 is saved such that subsequent modifications to the clip metadata are based on the already revised clip metadata 1008, as represented in FIG. 10B by the arrow returning revised clip metadata 1008 to metadata reviser 1002 for subsequent processing.
- the techniques described herein may be applied to subsequent revisions/edits/modifications of clip metadata 1006 and/or revised clip metadata 1008.
- Revised clip metadata 1008 may include a current and accurate representation of the current version of a clip (with modifications) at any given time during production.
- revised clip metadata 1008 may be used to generate revised script 984 .
- one or more clips of revised sequence 980, including revised clip metadata 1008, may be provided to script generator 982, which compiles metadata of each of the clips, including revised clip metadata 1008, to generate revised script 984.
- revised script 984 may include a final script based on a final version of revised clip metadata 1008 and revised sequence 980.
- revised script 984 and/or corresponding revised script metadata of script 984 may be an accurate representation of one or more versions of the actual recorded program based on script 972 .
- revised clip metadata 1008 , revised script metadata 985 , and/or corresponding revised script 984 may be provided to a storage medium 1014 (e.g., the same or similar to storage medium 118 discussed above), a graphical user interface (GUI) 1016 (e.g., the same or similar to display device 120 discussed above), and/or may be provided to other modules 1018 (e.g., the same or similar to other modules 122 discussed above) for subsequent processing.
- other modules 1018 may include script generator 982 and/or synchronization module 102, as described above.
- revised script metadata 985 may be provided to synchronization module 102 (see FIGS. 1A and 2 ), in place of or in combination with document (script) data 110 .
- synchronization module 102 may provide for time alignment based on revised script metadata 985. For example, where a line of dialogue is replaced with a new line of dialogue and/or a scene is reordered during production, synchronization module 102 may align revised script metadata 985 to the resulting video content, as opposed to script metadata 974 that is reflective of the content of an original/unedited script 972. Where revised script metadata 985 has been updated to reflect changes that are present in the video content (e.g., video content 106), revised script metadata 985 may provide an accurate representation of the resulting video content 106 and may, thus, provide for an efficient and accurate revised script and alignment of the revised script to video/audio for a corresponding recorded version of the program.
- metadata reviser 1002 provides for a visual depiction of clip metadata 1006 and/or script metadata 974 via a graphical user interface (GUI) (e.g., graphical user interface 1016 discussed in more detail below).
- metadata reviser 1002 may provide for the display of a current version of a script, including modifications made during production.
- graphical user interface 1016 may enable a user to navigate through metadata to identify where modifications have been made to the script. A user may have the option of accepting the changes, rejecting the modifications and/or may make additional modifications.
- the GUI can also display alternate versions of content, such as the difference in a line of dialogue read in one take vs. another, and/or allow the user to see these differences and choose which one to use in the final edit.
- FIG. 10C is a diagram depicting an illustrative display of a graphical user interface (GUI) 1020 for viewing/revising metadata in accordance with one or more embodiments of the present technique.
- GUI 1020 may be displayed using a display device, such as display device 1016 .
- GUI 1020 includes a first (script) portion 1022 that provides for the display of a visual depiction of a current script 1024 , which may be a current version of script 972 . Where no modifications have been made to script 972 , script 1024 may be based on script metadata 974 . Where modifications have been made to script 972 , current script 1024 may be based on revised script metadata 985 .
- current script 1024 includes script elements 1026 a - 1026 h .
- Script elements 1026 a - 1026 h may include the program title, author names, scene headings, action elements, shot elements, character names, parentheticals, dialogue, scene numbers, shot numbers, and the like.
- script elements 1026 a , 1026 c and 1026 h may include action elements
- script elements 1026 b , 1026 d , 1026 f and 1026 g may include dialogue elements
- script element 1026 e may include a shot element.
- Current script 1024 may be displayed in accordance with an industry standard for scripts, similar to that of FIG. 1B .
- script portion 1022 of user interface 1020 may enable a user to navigate to various portions of current script 1024. For example, a user may scroll up/down through the entire current script 1024.
- GUI 1020 includes a second (video/audio content) portion 1030 that provides for the display of a visual depiction of information associated with recorded video/audio content.
- video/audio content portion 1030 includes graphical depictions indicative of a plurality of Clips 1032a-1032e, and their associated metadata.
- content displayed in audio/video content portion 1030 is categorized.
- Clips 1032 a - 1032 e are grouped in association with other clips from the same scene (e.g., scene 1 or scene 2 ).
- audio/video portion 1030 of user interface 1020 may enable a user to navigate to various portions of clips and scene information. For example, a user may scroll up/down through the entire listing of clips associated with current script 1024.
- a user may interact with one or both of script portion 1022 and audio/video content portion 1030 to modify corresponding script metadata. For example, in the illustrated embodiment, a user may “click” on Clip 1 (1032a) and, immediately thereafter, “click” on dialogue 1026b to associate Clip 1 with dialogue 1026b. Thus, in subsequent processing, metadata of Clip 1 may be associated/merged with metadata of dialogue 1026b. For example, during time-alignment of revised script 984, a transcript associated with Clip 1 (1032a) will be associated/matched with dialogue 1026b. In some embodiments, multiple clips may be associated with the same portion of script 1024.
- a user may then “click” on Clip 2 (1032b) and, immediately after, “click” on dialogue 1026b to also associate Clip 2 with dialogue 1026b.
- Such an embodiment may be of use where Clip 1 1032 a and Clip 2 1032 b are two overlapping takes of the same scene (e.g., scene 1 ).
- a transcript associated with Clip 1 1032 a may be aligned with dialogue 1026 b and a separate transcript associated with Clip 2 1032 b may be aligned with dialogue 1026 b.
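- The click-to-associate interaction described above amounts to maintaining a many-to-one mapping from clips to script elements. A hypothetical sketch of that bookkeeping, independent of any particular GUI toolkit, might look like this (the class, method names, and identifiers are assumptions):

```python
from collections import defaultdict

# Sketch: record associations made in the GUI between clips and script elements
# (e.g., a line of dialogue). Several clips -- such as overlapping takes of the
# same scene -- may be associated with the same dialogue element.

class ClipAssociations:
    def __init__(self):
        self._by_element = defaultdict(list)   # script element id -> ordered clip ids

    def associate(self, clip_id, element_id):
        """Called when the user clicks a clip and then a script element."""
        if clip_id not in self._by_element[element_id]:
            self._by_element[element_id].append(clip_id)

    def clips_for(self, element_id):
        return list(self._by_element[element_id])

assoc = ClipAssociations()
assoc.associate("Clip 1", "dialogue_1026b")
assoc.associate("Clip 2", "dialogue_1026b")   # second take of the same scene
print(assoc.clips_for("dialogue_1026b"))      # ['Clip 1', 'Clip 2']
```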
- a user may also review the association of various portions of script 1024 and the clips and scenes displayed in second portion 1030 .
- as a user selects (e.g., clicks on or hovers over) each of the items in GUI 1020, the corresponding items may be highlighted.
- FIG. 10D is a flowchart that illustrates a method 1100 of providing script data in accordance with one or more embodiments of the present technique.
- Method 1100 may employ various techniques described herein, including those discussed above with respect to components and dataflow implemented at system 1000 .
- Method 1100 generally includes providing clip metadata, revising clip metadata, providing revised clip metadata, providing a revised script based on revised clip metadata, displaying revised clip metadata, and processing revised clip metadata.
- Method 1100 may include providing clip metadata content, as depicted at block 1102 .
- Providing clip metadata may include embedding metadata information about a script within a document or file containing an associated clip.
- clip metadata 1006 may be provided via techniques similar to those described above with regard to FIG. 10B .
- clip metadata 1006 may be derived from script metadata 974 provided from script 972 via script extractor 1004 .
- script metadata 974 and/or clip metadata 1006 may simply be provided from a source, such as a user inputting the metadata.
- script metadata 974 may be generated using various techniques for extracting and embedding metadata, such as those described in U.S. patent application Ser. No. 12/168,522, which is incorporated by reference above.
- Method 1100 may include revising clip metadata, as depicted at block 1104 .
- Revising clip metadata may include modifying at least a portion of the clip metadata.
- revising clip metadata may include revising/editing/modifying clip metadata 1006 and/or revised clip metadata 1008 , as described above with respect to FIG. 10B .
- clip metadata may be modified via a user interface, such as that discussed above with respect to FIG. 10C .
- revising clip metadata is provided in response to receiving a request to modify clip metadata.
- clip metadata may be revised in accordance with a user request to add, delete or modify a portion of the current clip metadata via metadata reviser 1002 , as described above.
- Method 1100 may include providing revised clip metadata, as depicted at block 1106 .
- Providing revised clip metadata may include providing the revised clip metadata in a format that is accessible by other modules and/or a user.
- providing revised clip metadata may include providing revised clip metadata 1008 in a file format that can be compiled into revised script 984 via script generator 982 , can be opened and displayed on a graphical display, may be stored for later use, or may be used for subsequent processing.
- providing revised clip metadata includes providing metadata that reflects a current version of the clip, including some or all of the modifications to the clip metadata up until a given point in the production of the program.
- a revised clip metadata file may be dynamically updated as any changes are made such that the revised clip metadata accurately reflects all changes made to the script metadata during production up until the given point in time the revised clip metadata is accessed.
- Method 1100 may include providing a revised script based on revised clip metadata, as depicted at block 1108 .
- Providing a revised script may include generating a revised script based on revised script metadata that reflects a current version of the script, including some or all of the modifications to the script and clip metadata up until a given point in the production of the program.
- a version of revised clip metadata 1008 may be used to generate revised script 984 that is reflective of some or all of the revisions to script metadata 974 and/or clip metadata 1006 during production.
- revised script 984 may include a final script based on a final version of revised clip metadata 1008.
- Method 1100 may include displaying the revised script, as depicted at block 1110 .
- Displaying the revised script may include providing for the visualization of one or more portions of revised script 984 in a graphical user interface.
- metadata reviser 1002 may employ a display device 120/1016 to provide for a display similar to that of GUI 1020.
- Method 1100 may include processing the revised script metadata, as depicted at block 1112 .
- Processing revised script metadata may include performing one or more processing techniques using revised script metadata.
- revised script metadata 985 may be provided to synchronization module 102 (see FIGS. 1A and 2 ), in place of or in combination with document (script) data 110 .
- synchronization module 102 may provide for time alignment based on revised script metadata 985 .
- processing script metadata may include generating video descriptions, as described above, using revised script metadata 985 and/or revised script 984 .
- an action element of revised script metadata 985 and/or revised script 984 may be used in place of a corresponding action element present in script 972.
- revised script metadata 985 may provide an accurate representation of the resulting video content and may, thus, provide for an efficient and accurate final script and alignment of the final script and/or video descriptions that accurately represent the resulting video/audio content.
- FIG. 10E is a block diagram that illustrates components and dataflow for processing a script (workflow) 1120 in accordance with one or more embodiments of the present technique.
- Workflow 1120 may be accomplished using techniques discussed above with respect to FIG. 10A-10D .
- two different revised versions of a script are provided.
- script revisions may be made during preproduction that can be incorporated into an original script document.
- Clips may be generated based on the original script and metadata associated with the clips may be revised during production.
- the clips may be used to generate two separate sequences of clips.
- a first sequence of clips may be provided for a first version of the script
- a second sequence of clips may be provided for a second version of the script.
- two different versions may be provided in the form of two revised scripts, version #1 and version #2.
- Other embodiments may include any number of combinations of clips to provide any number of versions having different variations between them.
- Various components of embodiments of a document time-alignment technique as described herein may be executed on one or more computer systems, which may interact with various other devices. One such computer system, computer system 2000, is illustrated in FIG. 11.
- computer system 2000 includes one or more processors 2010 coupled to a system memory 2020 via an input/output (I/O) interface 2030 .
- Computer system 2000 further includes a network interface 2040 coupled to I/O interface 2030 , and one or more input/output devices 2050 , such as cursor control device 2060 , keyboard 2070 , audio device 2090 , and display(s) 2080 .
- embodiments may be implemented using a single instance of computer system 2000 , while in other embodiments multiple such systems, or multiple nodes making up computer system 2000 , may be configured to host different portions or instances of embodiments.
- some elements may be implemented via one or more nodes of computer system 2000 that are distinct from those nodes implementing other elements.
- computer system 2000 may be a uniprocessor system including one processor 2010 , or a multiprocessor system including several processors 2010 (e.g., two, four, eight, or another suitable number).
- processors 2010 may be any suitable processor capable of executing instructions.
- processors 2010 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs), such as the x86, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA.
- each of processors 2010 may commonly, but not necessarily, implement the same ISA.
- At least one processor 2010 may be a graphics processing unit.
- a graphics processing unit or GPU may be considered a dedicated graphics-rendering device for a personal computer, workstation, game console or other computer system.
- Modern GPUs may be very efficient at manipulating and displaying computer graphics and their highly parallel structure may make them more effective than typical CPUs for a range of complex graphical algorithms.
- a graphics processor may implement a number of graphics primitive operations in a way that makes executing them much faster than drawing directly to the screen with a host central processing unit (CPU).
- the methods disclosed herein may be implemented by program instructions configured for execution on one of, or parallel execution on two or more of, such GPUs.
- the GPU(s) may implement one or more application programmer interfaces (APIs) that permit programmers to invoke the functionality of the GPU(s).
- Suitable GPUs may be commercially available from vendors such as NVIDIA Corporation having headquarters in Santa Clara, Calif., ATI Technologies of AMD having headquarters in Sunnyvale, Calif., and others.
- System memory 2020 may be configured to store program instructions and/or data accessible by processor 2010 .
- System memory 2020 may include a tangible, non-transitory storage medium for storing program instructions and other data thereon.
- system memory 2020 may be implemented using any suitable memory technology, such as static random access memory (SRAM), synchronous dynamic RAM (SDRAM), nonvolatile/Flash-type memory, or any other type of memory.
- program instructions and data implementing desired functions, such as those described above for time-alignment methods, are shown stored within system memory 2020 as program instructions 2025 and data storage 2035, respectively.
- program instructions and/or data may be received, sent or stored upon different types of computer-accessible media or on similar media separate from system memory 2020 or computer system 2000 .
- a computer-accessible medium may include storage media or memory media such as magnetic or optical media, e.g., disk or CD/DVD-ROM coupled to computer system 2000 via I/O interface 2030 .
- Program instructions and data stored via a computer-accessible medium may be transmitted by transmission media or signals such as electrical, electromagnetic, or digital signals, which may be conveyed via a communication medium such as a network and/or a wireless link, such as may be implemented via network interface 2040 .
- I/O interface 2030 may be configured to coordinate I/O traffic between processor 2010 , system memory 2020 , and any peripheral devices in the device, including network interface 2040 or other peripheral interfaces, such as input/output devices 2050 .
- I/O interface 2030 may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory 2020 ) into a format suitable for use by another component (e.g., processor 2010 ).
- I/O interface 2030 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example.
- I/O interface 2030 may be split into two or more separate components. In addition, in some embodiments some or all of the functionality of I/O interface 2030 , such as an interface to system memory 2020 , may be incorporated directly into processor 2010 .
- Network interface 2040 may be configured to allow data to be exchanged between computer system 2000 and other devices attached to a network, such as other computer systems, or between nodes of computer system 2000 .
- network interface 2040 may support communication via wired or wireless general data networks, such as any suitable type of Ethernet network, for example; via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks; via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.
- Input/output devices 2050 may, in some embodiments, include one or more display terminals, keyboards, keypads, touchpads, scanning devices, voice or optical recognition devices, or any other devices suitable for entering or retrieving data by one or more computer systems 2000.
- Multiple input/output devices 2050 may be present in computer system 2000 or may be distributed on various nodes of computer system 2000 .
- similar input/output devices may be separate from computer system 2000 and may interact with one or more nodes of computer system 2000 through a wired or wireless connection, such as over network interface 2040 .
- memory 2020 may include program instructions 2025, configured to implement embodiments of the time-alignment and script data methods as described herein, and data storage 2035, comprising various data accessible by program instructions 2025.
- program instructions 2025 may include software elements of the methods illustrated in the above figures.
- Data storage 2035 may include data that may be used in embodiments, for example, input script and transcript data or output time-aligned script data. In other embodiments, other or different software elements and/or data may be included.
- computer system 2000 is merely illustrative and is not intended to limit the scope of the methods described herein.
- the computer system and devices may include any combination of hardware or software that can perform the indicated functions, including computers, network devices, internet appliances, PDAs, wireless phones, pagers, etc.
- Computer system 2000 may also be connected to other devices that are not illustrated, or instead may operate as a stand-alone system.
- the functionality provided by the illustrated components may in some embodiments be combined in fewer components or distributed in additional components. Similarly, in some embodiments, the functionality of some of the illustrated components may not be provided and/or other additional functionality may be available.
- instructions stored on a computer-accessible medium separate from computer system 2000 may be transmitted to computer system 2000 via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link.
- Various embodiments may further include receiving, sending or storing instructions and/or data implemented in accordance with the foregoing description upon a computer-accessible medium. Accordingly, the present invention may be practiced with other computer system configurations.
- portions of the techniques described herein (e.g., preprocessing of script and metadata) may be hosted in a cloud computing infrastructure.
- a computer-accessible storage medium may include a non-transitory storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD-ROM, volatile or non-volatile media such as RAM (e.g. SDRAM, DDR, RDRAM, SRAM, etc.), ROM, etc., as well as transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as network and/or a wireless link.
- such quantities may take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared or otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to such signals as bits, data, values, elements, symbols, characters, terms, numbers, numerals or the like. It should be understood, however, that all of these or similar terms are to be associated with appropriate physical quantities and are merely convenient labels. Unless specifically stated otherwise, as apparent from the discussion, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining” or the like refer to actions or processes of a specific apparatus, such as a special purpose computer or a similar special purpose electronic computing device.
- a special purpose computer or a similar special purpose electronic computing device is capable of manipulating or transforming signals, typically represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the special purpose computer or similar special purpose electronic computing device.
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Machine Translation (AREA)
Abstract
A method includes receiving script metadata extracted from a script for a program, wherein the script metadata includes clip metadata associated with a particular portion of the program, associating the clip metadata with a clip corresponding to the particular portion of the program, receiving a request to revise the clip metadata, revising the clip metadata in accordance with the request to revise the clip metadata to generate revised clip metadata associated with the clip, and generating a revised script using the revised clip metadata.
Description
- This patent application claims priority to U.S. Provisional Patent Application No. 61/323,121 entitled “Method and Apparatus for Time Synchronized Script Metadata” by Jerry R. Scoggins II, et al., filed Apr. 12, 2010, which is hereby incorporated by reference as though fully set forth herein.
- In a video production environment, a script serves as a roadmap to when and how elements of a movie/video will be produced. In addition to specifying dialogue to be recorded, scripts are a rich source of additional metadata and include numerous references to characters, people, places, and things. During the production process, directors, editors, sound engineers, set designers, marketing, advertisers, and other production personnel are interested in knowing which people, places, and things occurred or will occur in certain scenes. This information is often present in the script but is not typically directly correlated to the corresponding video content (e.g., video and audio) because timing information is missing from the script. That is, elements of the script are not correlated with a time in which they appear in the corresponding video content. Thus, it may be difficult to link script elements (e.g., spoken dialogue) with the time when they actually occur within the corresponding video. For example, although production personnel may know that a character speaks a certain line of dialogue in a scene based on the script, the production personnel may not be able to readily determine the precise time in the working or final video when the particular line was spoken. A full script can include several thousand script elements or entities. If one were to try to find the actual point in time when a particular event occurred (e.g., when a line was spoken) in a corresponding movie/video, the video content may have to be manually searched by a viewer to locate the event such that the corresponding timecode can be manually recorded. Thus, production personnel may not be able to easily search or index their scripts and video content. Further, during production, the actual recorded clip may vary from the script and, thus, the script and the actual recorded video and audio may not correlate well with one another. Typically, these changes are tracked manually, if at all. This can lead to increased difficulties in post-production operations, such as aligning the script with the recorded video and audio.
- When a known, written script text is time-matched to a raw speech transcript produced from an analysis of recorded dialogue, the script text is said to be “aligned” with the recorded dialogue, and the resulting script may be referred to as an “aligned script.” Aligned scripts may be useful as production personnel often desire to search or index video/audio content based on the text provided in the script. Moreover, production personnel may desire to generate closed caption text that is synchronized to actual spoken dialogue in video content. However, due to variations in spoken dialogue versus the corresponding written text, as well as gaps, pauses, sound effects, music, etc. in the recorded dialogue, time aligning is a difficult task to automate. Typically, the task of time-aligning textual scripts and metadata to actual video content is a tedious task that is accomplished by a manual process that can be expensive and time-consuming. For example, a person may have to view and listen to video content and manually transcribe the corresponding audio to generate an index of what took place and when, or to generate closed captioning text that is synchronized to the video. To manually locate and record a timecode for even a small fraction of the dialogue words and script elements within a full-length movie often requires several hours of manual work, and doing this for the entire script might require several days or more. Searching may be even more difficult in view of differences between the script and what was actually recorded and how it was ordered during production. Similar difficulties may be encountered while creating video descriptions for the blind or visually impaired. For example, a movie may be manually searched to identify gaps in dialogue for the insertion of video description narrations that describe visual elements (e.g., actions, settings) and provide a more complete description of what is taking place on screen.
- Although some automated techniques for time-synchronizing scripts and corresponding video have been implemented, such as using a word alignment matrix (e.g., script words vs. transcript words), they are traditionally slow and error-prone. These techniques often require a great deal of processing and may contain a large number of errors, rendering the output inaccurate. For example, due to noise or other non-dialogue artifacts, in speech-to-text transcripts the wrong time values, off by several minutes or more, are often assigned to script text. As a result, the output may not be reliable, thereby requiring additional time to identify and correct the errors, or causing users to shy away from its use altogether.
- Accordingly, it is desirable to provide a technique for providing efficient and accurate time-alignment of a script document and corresponding video content.
- Various embodiments of methods and apparatus for time aligning documents (e.g., scripts) to associated video/audio content (e.g., movies) are described. In some embodiments, provided is a method that includes providing script data that includes ordered script words indicative of dialogue and providing audio data corresponding to at least a portion of the dialogue. The audio data includes timecodes associated with dialogue. The method includes correlating the script data with the audio data, and generating time-aligned script data that includes time-aligned words indicative of dialogue spoken in the audio data and corresponding timecodes for time-aligned words.
- In some embodiments, provided is a computer implemented method that includes providing video content data corresponding to script data that includes ordered script words indicative of dialogue. The video content data includes audio data including a transcript that includes transcript words corresponding to at least a portion of the dialogue and timecodes associated with the transcript words. The method also includes correlating the script data with the video content data, and generating time-aligned script data that includes time-aligned words indicative of words spoken in the video content and corresponding timecodes for the time-aligned words.
- Provided in some embodiments is a method that includes receiving script metadata extracted from a script for a program, wherein the script metadata includes clip metadata associated with a particular portion of the program, associating the clip metadata with a clip corresponding to the particular portion of the program, receiving a request to revise the clip metadata, revising the clip metadata in accordance with the request to revise the clip metadata to generate revised clip metadata associated with the clip, and generating a revised script using the revised clip metadata.
- Provided in some embodiments is a non-transitory computer readable storage medium having program instructions stored thereon, wherein the program instructions are executable to cause a computer system to perform a method that includes receiving script metadata extracted from a script for a program, wherein the script metadata includes clip metadata associated with a particular portion of the program, associating the clip metadata with a clip corresponding to the particular portion of the program, receiving a request to revise the clip metadata, revising the clip metadata in accordance with the request to revise the clip metadata to generate revised clip metadata associated with the clip, and generating a revised script using the revised clip metadata.
- Provided in some embodiments is a computer system for receiving script metadata extracted from a script for a program, wherein the script metadata includes clip metadata associated with a particular portion of the program, associating the clip metadata with a clip corresponding to the particular portion of the program, receiving a request to revise the clip metadata, revising the clip metadata in accordance with the request to revise the clip metadata to generate revised clip metadata associated with the clip, and generating a revised script using the revised clip metadata.
-
FIG. 1A is a block diagram that illustrates components and dataflow for document time-alignment in accordance with one or more embodiments of the present technique. -
FIG. 1B is text that illustrates exemplary script data in accordance with one or more embodiments of the present technique. -
FIG. 1C is text that illustrates exemplary transcript data in accordance with one or more embodiments of the present technique. -
FIG. 1D is text that illustrates exemplary time-aligned script data in accordance with one or more embodiments of the present technique. -
FIG. 2 is a block diagram that illustrates components and dataflow for script time-alignment in accordance with one or more embodiments of the present technique. -
FIG. 3 is a flowchart that illustrates a script time-alignment method in accordance with one or more embodiments of the present technique. -
FIG. 4 is a flowchart that illustrates a script synchronization method in accordance with one or more embodiments of the present technique. -
FIG. 5A is a depiction of an exemplary alignment matrix in accordance with one or more embodiments of the present technique. -
FIG. 5B is a depiction of an exemplary alignment sub-matrix in accordance with one or more embodiments of the present technique. -
FIG. 6 is a depiction of an exemplary graphical user interface sequence in accordance with one or more embodiments of the present technique. -
FIG. 7A is a depiction of multiple lines of text that include a script phrase, a transcript phrase and a corresponding representation of alignment in accordance with one or more embodiments of the present technique. -
FIG. 7B is a depiction of multiple lines of text that include a script phrase, a transcript phrase and a corresponding representation of alignment in accordance with one or more embodiments of the present technique. -
FIG. 7C is a depiction of a line of text and corresponding in/out ranges in accordance with one or more embodiments of the present technique. -
FIGS. 8A and 8B are block diagrams that illustrate components and dataflow of a script time-alignment technique in accordance with one or more embodiments of the present technique. -
FIG. 9A is a depiction of an exemplary script document in accordance with one or more embodiments of the present technique. -
FIG. 9B is a depiction of a portion of an exemplary video description script in accordance with one or more embodiments of the present technique. -
FIG. 9C is a flowchart that illustrates a method of generating a video description in accordance with one or more embodiments of the present technique. -
FIG. 10A is a block diagram that illustrates a script workflow in accordance with one or more embodiments of the present technique. -
FIG. 10B is a block diagram that illustrates components and dataflow for providing script data in accordance with one or more embodiments of the present technique. -
FIG. 10C is diagram depicting an illustrative display of a graphical user interface for viewing/revising script metadata in accordance with one or more embodiments of the present technique. -
FIG. 10D is a flowchart that illustrates a method of providing script data in accordance with one or more embodiments of the present technique. -
FIG. 10E is a block diagram that illustrates components and dataflow for processing a script in accordance with one or more embodiments of the present technique. -
FIG. 11 is a block diagram that illustrates an example computer system in accordance with one or more embodiments of the present technique. - While the invention is described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that the invention is not limited to the embodiments or drawings described. It should be understood, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including, but not limited to. As used throughout this application, the singular forms “a”, “an” and “the” include plural referents unless the content clearly indicates otherwise. Thus, for example, reference to “an element” includes a combination of two or more elements.
- In the following detailed description, numerous specific details are set forth to provide a thorough understanding of claimed subject matter. However, it will be understood by those skilled in the art that claimed subject matter may be practiced without these specific details. In other instances, methods, apparatuses or systems that would be known by one of ordinary skill have not been described in detail so as not to obscure claimed subject matter.
- Speech-To-Text (STT)—a process by which source audio containing dialogue or narrative is automatically transcribed to a textual representation of the dialogue or narrative. The source audio may also contain music, noise, and/or sound effects that generally contribute to lower transcription accuracy.
- STT transcript—a document generated by a STT transcription engine containing the transcription of the dialogue or narrative of the audio source. Each word in the transcript may include an associated timecode which indicates precisely when the audio content associated with each word of the dialogue or narrative occurred. Timecodes are typically provided in hours, minutes, seconds and frames. Feature films are typically shot at 24 frames per second, thus twelve frames is about ½ second in duration.
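- For example, an HH:MM:SS:FF timecode of the kind found in an STT transcript can be converted to an absolute frame count or to seconds as sketched below (assuming 24 frames per second and ignoring drop-frame formats; the function names are illustrative).

```python
# Sketch: convert an HH:MM:SS:FF timecode to frames and to seconds at 24 fps.

def timecode_to_frames(timecode, fps=24):
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * fps + frames

def timecode_to_seconds(timecode, fps=24):
    return timecode_to_frames(timecode, fps) / fps

print(timecode_to_frames("01:02:03:12"))   # 89364 frames
print(timecode_to_seconds("01:02:03:12"))  # 3723.5 seconds (twelve frames = 1/2 second)
```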
- Script—a document that outlines all of the visual, audio, behavioral, and spoken elements required to tell the story in a corresponding video or movie. Dramatic scripts are often referred to as a “screenplay”. Scripts may not include timecode data, such that they may not provide information about when an element of the script actually occurs within corresponding video content (e.g., a script may not provide a relative time within the video content that indicates precisely when the audio content associated with each word of the dialogue or narrative occurred).
- Shooting Script—a version of a script that contains scene numbers, individual shots and other production notes that is used during production and recording of the program.
- Take—a recorded shot, usually repeated multiple times to get the best performance or offer different editing choices.
- Script dialogue/narrative—the script lines to be spoken in a corresponding video or movie. Each script line may include text that includes one or more words.
- Script alignment—a process by which a set of words of a dialogue or narrative in a script are matched to corresponding transcribed words of video content. Script alignment may include providing an output that is indicative of a relative time within the video content that words of dialogue or narrative contained in the script are spoken.
- Aligned Script—a script that outlines all of the visual, audio, behavioral, and spoken elements required to tell the story in a corresponding video or movie and includes timecode data indicative of when elements of the script actually occur within corresponding video content (e.g., a time aligned script may include a relative time within the video content that indicates precisely when the audio content associated with each word of the dialogue or narrative occurred).
- Word n-gram—a consecutive subsequence of N words from a given sequence. For example, (The, rain, in), (rain, in, Spain) and (in, Spain, falls) are valid 3-grams from the sentence, “The rain in Spain falls mainly on the plain.”
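- The word n-grams referred to above can be enumerated with a few lines of code; the short sketch below reproduces the 3-gram example from this definition.

```python
# Sketch: enumerate consecutive word n-grams from a sequence of words.

def word_ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The rain in Spain falls mainly on the plain".split()
print(word_ngrams(sentence, 3)[:3])
# [('The', 'rain', 'in'), ('rain', 'in', 'Spain'), ('in', 'Spain', 'falls')]
```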
- Alignment matrix—a mathematical structure used to represent how the words from a script source will align with the transcribed words of a transcript (e.g., an STT transcript generated via a speech-to-text (STT) process). For example, a vertical axis of the matrix may be formed of words in a script in the sequence/order in which they occur (e.g., ordered script words), and a horizontal axis of the matrix may be formed of words in the transcript in the sequence/order in which they occur (e.g., ordered transcript words). Each matrix cell at the intersection of a corresponding row/column may indicate the accumulated number of word insert, update, or delete operations needed to match the sequence of ordered script words to the sequence of ordered transcript words up to that (row, col) entry. A path with the lowest score through the matrix is indicative of the best word alignment.
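- A minimal sketch of such an alignment matrix, using the standard Levenshtein word edit distance recurrence, is shown below. It omits the partitioning into sub-matrices and the other optimizations described elsewhere in this document, and the function name and word normalization are assumptions.

```python
# Sketch: fill a word-level alignment (edit distance) matrix between ordered
# script words (rows) and ordered transcript words (columns). Each cell holds
# the accumulated number of insert/update/delete operations needed so far; the
# lowest-cost path through the matrix corresponds to the best word alignment.

def alignment_matrix(script_words, transcript_words):
    rows, cols = len(script_words) + 1, len(transcript_words) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        matrix[r][0] = r          # cost of deleting r script words
    for c in range(cols):
        matrix[0][c] = c          # cost of inserting c transcript words
    for r in range(1, rows):
        for c in range(1, cols):
            update = 0 if script_words[r - 1].lower() == transcript_words[c - 1].lower() else 1
            matrix[r][c] = min(matrix[r - 1][c] + 1,           # delete script word
                               matrix[r][c - 1] + 1,           # insert transcript word
                               matrix[r - 1][c - 1] + update)  # match or update
    return matrix

script = "the rain in Spain".split()
transcript = "the rain spain".split()
print(alignment_matrix(script, transcript)[-1][-1])   # 1 (one word, "in", is unmatched)
```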
- Natural Language Processing (NLP)—a technique in which natural language text is input and then sentences, part-of-speech, noun and verb phrases, and other semantics are automatically extracted. NLP may be provided as a component in processing unstructured or semi-structured text where a large quantity of rich metadata can be found (e.g., in spec. movie scripts and dramatic screenplays).
- Program—a visual and audio production that is recorded and played back to an audience, such as a movie, television show, documentary, etc.
- Edited Program—(sequence or cut) a visual and audio production that is recorded and played back to an audience, e.g.: a movie, television show, documentary, etc.
- Dialogue—the words spoken by actors or other on-screen talent during a program.
- Video Description (or Audio Description)—an audio track in a program containing descriptions of the setting and action. The video description may be inserted into the natural pauses in dialogue or between critical sound elements. A video description often includes narration to fill in the story gaps for the blind or visually impaired by helping to describe visual elements and provide a more complete description of what's happening (e.g., visually) in the program.
- Describer—a person who develops the description to be recorded by the voicer. In some cases, the describer is also the voicer.
- Voicer (or Voice Talent)—a person who voices the Video Description.
- Secondary Audio Program (SAP)—an auxiliary audio channel for analog television that is broadcast or transmitted both over the air and by cable TV. It is often used for an alternate language or Descriptive Video Service.
- Digital Television broadcasting (DTV)—Analog broadcasting ceased in the U.S. in 2009 and was replaced by DTV.
- Script GUI—a “what you see is what you get” (WYSIWYG) graphical representation of the written script. A Script GUI may provide a representation of the script in an industry standard format.
- Various embodiments of methods and apparatus for aligning features of a script document with features of corresponding video content are provided. Embodiments described herein facilitate aligning script data to the video content data and using the script data to improve the accuracy of a corresponding speech transcript (e.g., using the script data in place of the potentially inaccurate STT audio transcript from the video content data). In some embodiments, a document includes at least a portion of a script document, such as a movie or speculative script (e.g., dramatic screenplay), that outlines visual, audio, behavioral, and spoken elements required to tell a story. In certain embodiments, video content includes video and/or audio data that corresponds to at least a portion of the script document. In some embodiments, the audio data of the video content is transcribed into a textual format (e.g., spoken dialogue/narration is translated into words). In certain embodiments, the transcription is provided via a speech-to-text (STT) engine that automatically generates a transcript of words that correspond to the audio data of the video content. In some embodiments, the transcript includes timing information that is indicative of a point in time within the video content that one or more words were actually spoken. In certain embodiments, the words of the transcript (“transcript words”) are aligned with corresponding words of the script (“script words”). In some embodiments, aligning the transcript words with corresponding script words includes implementation of various processing techniques, such as matching sequences of words, assessing confidence/probabilities that the words identified are in fact correct, and substitution/replacement of script/transcript words with transcript/script words. In some embodiments, the resulting output includes time-aligned script data. In certain embodiments, the script data includes a time-aligned script document including an accurate representation of each of the words actually spoken in the video content, and timing information that is indicative of when the words of the script were actually spoken within the video content (e.g., a timecode associated with each word of dialogue/narration). In some embodiments, time-aligned data may include timecodes for other elements of the script, such as scene headings, action elements, character names, parentheticals, transitions, shot elements, and the like.
- In some embodiments, two source inputs are provided: (1) a script (e.g., plain dialogue text or a Hollywood Spec. Script/Dramatic screenplay) and (2) an audio track dialogue (e.g., an audio track dialogue from video content corresponding to the script). In certain embodiments, a coarse-grain alignment of blocks of text is performed by first matching identical or near-identical N-gram sequences of words to generate corresponding “hard alignment points”. The hard-alignment points may include matches between portions of the script and transcript (e.g., N-gram matches of a sequence of script words with a sequence of transcript words) which are used to partition an initial single alignment matrix (e.g., providing a correspondence of all ordered script words vs. all ordered transcript words) into a number of smaller sub-matrices (e.g., providing a correspondence of script words that occur between the hard alignment points vs. transcript words that occur at or between the hard alignment points). Using an algorithm, such as a standard or optimized Levenshtein word edit distance algorithm, additional word matches—between the words of the script and the transcript—may be identified as “soft alignment points” within each sub-matrix block of text. The soft alignment points may define multiple non-overlapping interpolation intervals. In some instances, unmatched words may be located between the matched words (e.g., between the hard alignment points and/or the soft alignment points). Knowing the time data (e.g., timecode) information for the matched words, an interpolation (e.g., linear or non-linear interpolation) may be performed to determine timecodes for each of the non-matched words (e.g., words that have not been assigned timecode information) occurring between the matched points. As a result, all words (e.g., matched and unmatched) are provided with corresponding timecode information, and the timecode information may be merged with the words of the script and/or transcript documents to generate a time-aligned script document that includes all of the words spoken and their corresponding timecode information to indicate when each of the words was actually spoken within the video content. Such a technique may benefit from combining the accuracy of the script words and the timecodes of the transcript words.
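A minimal sketch of this coarse-grain pass, assuming plain word lists and a default N of four, might look like the following Python fragment. It finds N-gram sequences that occur exactly once in both the script and the transcript, merges runs of consecutive matches into longer matched spans (hard alignment points), and derives the index ranges between those spans, each range corresponding to one sub-matrix that can then be aligned independently. The function names and the merging heuristic are illustrative, not the implementation described herein.

```python
# Illustrative sketch (not the implementation described herein) of the coarse-grain pass:
# unique N-gram matches become hard alignment points, consecutive matches are merged into
# longer spans, and the gaps between spans define the sub-matrices aligned later.
def ngram_positions(words, n):
    """Map each n-gram (lower-cased tuple) to the list of start positions where it occurs."""
    table = {}
    for i in range(len(words) - n + 1):
        table.setdefault(tuple(w.lower() for w in words[i:i + n]), []).append(i)
    return table

def hard_alignment_points(script, transcript, n=4):
    """(script_index, transcript_index) pairs for n-grams that are unique in both inputs."""
    scr, stt = ngram_positions(script, n), ngram_positions(transcript, n)
    return sorted((pos[0], stt[gram][0]) for gram, pos in scr.items()
                  if len(pos) == 1 and len(stt.get(gram, ())) == 1)

def merge_runs(points, n=4):
    """Collapse overlapping/consecutive n-gram matches into (scr_start, stt_start, length) spans."""
    runs = []
    for s, t in points:
        last = runs[-1] if runs else None
        if last and s == last[0] + last[2] - n + 1 and t == last[1] + last[2] - n + 1:
            last[2] += 1                      # the new match extends the current span by one word
        else:
            runs.append([s, t, n])            # start a new matched span of n words
    return [tuple(r) for r in runs]

def sub_matrix_ranges(runs, script_len, transcript_len):
    """Word-index ranges between successive matched spans; each range is one sub-matrix."""
    bounds = [(0, 0, 0)] + list(runs) + [(script_len, transcript_len, 0)]
    ranges = []
    for (s, t, k), (s2, t2, _) in zip(bounds, bounds[1:]):
        if (s + k, t + k) != (s2, t2):        # something unmatched lies between the two spans
            ranges.append(((s + k, s2), (t + k, t2)))
    return ranges

script = "the rain in spain falls mainly on the plain so they say".split()
transcript = "uh the rain in spain falls mainly on the the plain".split()
runs = merge_runs(hard_alignment_points(script, transcript))
print(runs)                                   # [(0, 1, 8)]: script[0:8] matches transcript[1:9]
print(sub_matrix_ranges(runs, len(script), len(transcript)))
# [((0, 0), (0, 1)), ((8, 12), (9, 11))]: a stray transcript word before, unmatched tails after
```

A finer alignment pass (e.g., a Levenshtein word edit distance) and timecode interpolation, sketched later in this description, would then operate within each returned range.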
- As described in more detail below, the techniques described herein may provide a mechanism by which all textual elements (e.g., dialogue/narration) of a script (e.g., a Hollywood movie script or dramatic screenplay script) can be automatically time-aligned to the specific points in time within corresponding video content, to identify when specific dialogue, text, or actions within the script actually occur within the video content. This enables identifying and locating when dialogue and important semantic metadata provided in a script actually occur within corresponding production video content. In some embodiments, time alignment may be applied to all elements of the script (e.g., scene headings, action elements, etc.) to enable a user to readily identify where various elements, not just dialogue words, occur within the video content. In certain embodiments, the timecode information may also be used to identify gaps in dialogue for the insertion of video description content that includes narrations to fill in the story gaps for the blind or visually impaired, thereby helping to describe visual elements and provide a more complete description of what's happening (e.g., visually) in the program.
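As one concrete illustration of the video description use mentioned above, the following hedged sketch scans time-aligned dialogue words for pauses longer than a threshold; the 2.0-second threshold, the tuple layout, and the sample values are assumptions chosen for illustration.

```python
# Illustrative only: locate natural pauses in time-aligned dialogue that could hold
# inserted video description narration.
def dialogue_gaps(timed_words, min_gap=2.0):
    """timed_words: list of (word, timecode_seconds, duration_seconds) in spoken order.
    Returns (gap_start, gap_end) pairs where no dialogue occurs for at least min_gap seconds."""
    gaps = []
    for (_, start, dur), (_, next_start, _) in zip(timed_words, timed_words[1:]):
        end_of_word = round(start + dur, 2)
        if next_start - end_of_word >= min_gap:
            gaps.append((end_of_word, next_start))
    return gaps

words = [("answer", 120.0, 0.4), ("now", 120.5, 0.3), ("well", 126.9, 0.2)]
print(dialogue_gaps(words))   # [(120.8, 126.9)]: a roughly 6-second pause suitable for description
```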
- The techniques described herein may be employed to automatically and accurately synchronize the written movie script (e.g., which may contain accurate text, but no time information) to a corresponding audio transcript (e.g., which contains accurate time information but may include very noisy or erroneous text). In certain instances, techniques may employ the transcript to identify actual words/phrases spoken that vary from the text of the script. The accuracy of the words in the script or transcript may, thus, be combined with accurate timing information in the transcript to provide an accurate time aligned script. The techniques described herein may demonstrate good tolerance to noisy transcripts or transcripts that have a large number of errors. By partitioning the alignment matrix into many smaller sub-matrices, the techniques described herein may also provide improved performance including increased processing speeds while maintaining significantly higher overall accuracy.
-
FIG. 1 is a block diagram that illustrates system components and dataflow of a system for implementing time-alignment (system) 100 in accordance with one or more embodiments of the present technique. In some embodiments,system 100 implements asynchronization module 102 to analyze adocument 104 andcorresponding video content 106. Based on the analysis,system 100 generates time-aligned data (e.g., time aligned script document) 116 that associates various portions ofdocument 104 with corresponding portions ofvideo content 106. Time aligneddata 116 may provide the specific points in time withinvideo content 106 that elements (e.g., specific dialogue, text, or actions) defined indocument 104 actually occur. - In the illustrated embodiment, document 104 (e.g., a script) is provided to a
document extractor 108.Document extractor 108 may generate acorresponding document data 110, such as a structured/tagged document. A structured/tagged document may include embedded script data that is provided tosynchronization module 102 for processing. - In some embodiments,
document 104 may include a script document, such as a movie script (e.g., a Hollywood script), a speculative script, a shooting script (e.g., a Hollywood shooting script), a closed caption (SRT) video transcript or the like. For simplicity,document 104 may be referred to as a “script” although it will be appreciated thatdocument 104 may include other forms of documents including dialogue text, as described herein. - A movie script may include a document that outlines all of the visual, audio, behavioral, and spoken elements required to tell a story. A speculative (“spec”) script or screenplay may include a preliminary script used in both film and television industries. A spec script for film generally includes an original screenplay and may be a unique plot idea, an adaptation of a book, or a sequel to an existing movie. A “television” spec script is typically written for an existing show using characters and storylines that have already been established. A “pilot” spec script typically includes an original idea for a new show. A television spec script is typically 20-30 pages for a half hour of programming, 40-60 pages for a full hour of programming, or 80-120 pages for two hours of programming. It will be appreciated that once a spec script is purchased, it may undergo a series of complete rewrites or edits before it is put into production. Once in “production”, the script may evolve into a “Shooting Script” or “Production Script” having a more complex format. Numerous scripts exist and new scripts are continually created and sold.
-
Script 104 may include a full script including several thousand script elements or entities, for instance, or a partial script including only a portion of the full script, such as a few lines, a full scene, or several scenes. For example,script 104 may include a portion of a script that corresponds to a clip provided asvideo content 106. Since film production is a highly collaborative process, the director, cast, editors, and production crew may use various forms of the script to interpret the underlying story during the production process. Further, since numerous individuals are involved in the making of a film, it is generally desirable that a script conform to specific standards and conventions that all involved parties understand (e.g., it will use a specific format w.r.t. layout, margins, notation, and other production conventions). Thus, a script document is intended to structure all of the script elements used in a screenplay into a consistent layout. Scripts generally include script elements embedded in the script document. Script elements often include a title, author name(s), scene headings, action elements, character names, parentheticals, transitions, shot elements, dialogue/narrations, and the like. An exemplary portion of ascript segment 130 is depicted inFIG. 1B .Script segment 130 includes a scene heading 130 a,action elements 130 b,character names 130 c,dialogues 130 d, and parentheticals 130 e. - Document (script)
extractor 108 may process script 104 to provide document (script)data 110, such as a structured/tagged script document. Words contained in the document (script) data may be referred to as script words. A structured/tagged (script) document may include a sequential listing of the lines of the document in accordance with their order inscript 104, along with a corresponding tag (e.g., tags—“TRAN”, “SCEN”, “ACTN”, “CHAR”, “DIAG”, “PARN” or the like) identifying a determined element type associated with some, substantially all, or all of each of the lines or groupings of the lines. In some embodiments, a structured/tagged document may include an Extensible Markup Language (XML) format, such as *.ASTX format used by certain products, such as those produced by Adobe Systems, Inc., having headquarters in San Jose, Calif. (hereinafter “Adobe”). In some embodiments,document extractor 108 may obtain script 104 (e.g., a layout preserved version of the document), perform a statistical analysis and/or feature matching of features contained within the document, identify document elements based on the statistical analysis and/or the feature matching, pass the identified document elements through a finite state machine to assess/determine/verify the identified document elements, assess whether or not document elements are incorrectly identified, and, if it is determined that there are incorrectly identified document elements, re-performing at least a portion of the identification steps, or, if it is determined that there are no (or sufficiently few) incorrectly identified document elements, and generate/store/output a structured/tagged (script) document or other forms of document (script)data 110 that is provided tosynchronization module 102. In some embodiments,document extractor 108 may employ various techniques for extracting and transcribing audio data, such as those described in U.S. patent application Ser. No. 12/713,008 entitled “METHOD AND APPARATUS FOR CAPTURING, ANALYZING, AND CONVERTING SCRIPTS”, filed Feb. 25, 2010, which is hereby incorporated by reference as though fully set forth herein. - In the illustrated embodiment,
video content 106 is provided to anaudio extractor 112.Audio extractor 112 may generate acorresponding transcript 114.Video content 106 may include video image data and corresponding audio soundtracks that include dialogue (e.g., character's spoken words or narrations), sound effects, music, and the like.Video content 106 for a movie may be produced in segments (e.g., clips) and then assembled together to form the final movie or video product during the editing process. For example, a movie may include several scenes, and each scene may include a sequence of several different shots that typically specify a location and a sequence of actions and dialogue for the characters of the scene. The sequence of shots may include several video clips that are assembled into a scene, and multiple scenes may be combined to form the final movie product. A clip, includingvideo content 106, may be recorded for each shot of a scene, resulting in a large number of clips for the movie. Tools, such as Adobe Premiere Pro by Adobe Systems, Inc., may be used for editing and assembling clips from a collection of shots or video segments. In some embodiments, audio content (e.g., without corresponding video content may be provided). For example, audio content, such as that of a radio show) may be provided toaudio extractor 112 in place of or along with content that includes video. Although a number of embodiments described here refer tovideo content 106 as including both video data and audio data, the techniques described herein may be applied to audio content in a similar manner. -
Audio extractor 112 may processvideo content 106 to generate a corresponding transcript that includes an interpretation of words (e.g., dialogue or narration) spoken invideo content 106.Transcript 114 may be provided as a transcribed document or transcribed data that is capable of being provided to other portions ofsystem 100 for subsequent processing. In some embodiments,audio extractor 112 includes a speech-to-text engine that takes an audio segment fromvideo content 106 containing spoken dialogue, and uses speech-to-text (STT) technology to generate a time-code transcript of the dialogue. Thus,transcript 114 may indicate the timecode and duration for each spoken word that is identified by the audio extractor. Words oftranscript 114 may be referred to as transcript words. - In some embodiments, speech-to-text (STT) technology may implement a custom language model such as that described herein. In some embodiments, speech-to-text (STT) technology may implement a custom language model and/or an enhanced multicore STT transcription engine such as those described in U.S. patent application Ser. No. 12/332,297 entitled “ACCESSING MEDIA DATA USING METADATA REPOSITORY”, filed Nov. 13, 2009 and/or U.S. patent application Ser. No. 12/332,309 entitled “MULTI-CORE PROCESSING FOR PARALLEL SPEECH-TO-TEXT PROCESSING”, filed Dec. 10, 2008, which are hereby incorporated by reference as though fully set forth herein. A
transcript 114 generated byaudio extractor 112 may include a raw transcript. An exemplary raw transcript (e.g., STT transcript) 132 is depicted inFIG. 1C .Raw transcript 132 includes a sequential listing of identified transcript words having associated time code, duration, STT word estimate and additional comments regarding the transcription. The timecode may indicate at what point in time within the video content the word was spoken (e.g., transcript word “dad” was spoken 7165.21 seconds from the beginning of the associated video content), the duration may indicate the amount of time the word was spoken from start to finish (e.g., it took about 0.27 sec to say the word “dad”), and comments may indicate potential problems (e.g., that noise in the audio data may have generated an error). In some embodiments, the raw transcript information may also include a confidence value that indicates the probability that the interpreted/indicated word is accurate. The raw transcript information may not include additional text features, such as punctuation, capitalization, and the like. - In some embodiments, document extraction and audio extraction may occur in parallel. For example, in the illustrated embodiment,
document extractor 108 receivesscript 104 and generatesscript data 110 independent ofaudio extractor 112 receivingvideo content 106 and generatingtranscript 114. Accordingly, these two processes may be performed in parallel with one another. In some embodiments, document extraction and audio extraction may occur in series. For example,document extractor 108 may receivedocument 104 and generatedocument data 110 prior toaudio extractor 112 receivingvideo content 106 and generatingtranscript 114, or vice versa. -
Synchronization module 102 may generate time-aligneddata 116. Time-aligneddata 116 may be provided as a document or raw data that is capable of being provided to other portions ofsystem 100 for subsequent processing. Time-aligneddata 116 may be based on script information (e.g., document data 110) and video content information (e.g., transcript 114). For example,synchronization module 102 may compare transcript words intranscript 114 to script words in the document (script)data 110 to determine whether or not the transcribed words are accurate. The comparison may use various indicators to assess the accuracy. For example, a plurality of words and phrases with exact matches betweentranscript 114 anddocument data 110 may have high probabilities of being correct, and may be referred to as “hard reference points”. Words and phrases with partial matches (e.g., single words or only a few matched words) may have a lower probability of being correct, and may be referred to as “soft reference points”. Words and phrases that do not appear to have matches may have a low probability of being correct. Words and phrases with a low probability of being correct may be subject to additional amounts of processing. For example, low probability matches may be subject to interpolation based on the hard and soft reference points. Words that are part of hard or soft reference pints may be referred to as words having a match, whereas words that are not part of a hard or soft reference point may be referred to as unmatched words or words not having a match. As described in more detail below, the hard-alignment points may be used to partition the document data and the transcript into smaller segments that correspond to one another, and additional processing may be performed on the smaller segments in substantial isolation. Further, as described in more detail below, the timecodes and other information associated with matched words may be used to derive (e.g., interpolate) timecode and other information about the unmatched words. - The results of the comparison may be used to generate time aligned
data 116. Time aligneddata 116 may include words (e.g., from the script words or transcript words) having a specific timecode associated therewith. In some embodiments, time aligneddata 116 may include words from bothdocument data 110 andtranscript data 114 used to generate a single script that accurately identifies words actually spoken invideo content 106 along with corresponding timecode information for each spoken word of dialogue or other elements. The timecode for each word may be obtained directly from matching words of the transcript, or may generated (e.g., via interpolation). Time aligneddata 116 may be stored at a storage medium 118 (e.g., a database), displayed at a display device 120 (e.g., a graphical display viewable by a user), or provided toother modules 122 for processing. An exemplary time-aligned script data/document 134 is depicted inFIG. 1D . As depicted, time-aligned data/document 134 includes spokenwords 136 grouped with other spoken words of theirrespective script elements 137, and provided along with their associatedtimecodes 138. Astart time 140 for each element grouping of lines is also provided. In the depicted time-aligned data/document, each of the script elements (and text of the script elements) is also assigned a corresponding time code. -
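A minimal sketch of the kind of merged output described here follows; the dictionary keys are assumptions rather than the file format used in this application, and the sample element types and words are illustrative. Words stay grouped under their script element, and each element grouping records the start time of its first spoken word.

```python
# Illustrative only; keys and sample data are assumptions, not this application's format.
def build_time_aligned_elements(elements):
    """elements: list of (element_type, [(word, timecode), ...]) in script order."""
    aligned = []
    for element_type, timed_words in elements:
        aligned.append({
            "element": element_type,
            "start": timed_words[0][1] if timed_words else None,   # start time of the grouping
            "words": [{"text": word, "timecode": tc} for word, tc in timed_words],
        })
    return aligned

clip = [("CHAR", [("FATHER", 7165.00)]),
        ("DIAG", [("dad", 7165.21), ("wait", 7165.60)])]
for element in build_time_aligned_elements(clip):
    print(element["element"], element["start"], [w["text"] for w in element["words"]])
```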
FIG. 2 is a block diagram that illustrates components and dataflow ofsystem 100 in accordance with one or more embodiments of the present technique. In the illustrated embodiment,synchronization module 102 includes ascript reader 200, ascript analyzer 202, a Speech-to-Text (STT)reader 204, anSTT analyzer 206, amatrix aligner 208, an interval generator/interpolator 210, and a time-codedscript generator 212. -
FIG. 3 is a flowchart that illustrates a script time-alignment method 300 according to one or more embodiments of the present technique.Method 300 may provide alignment techniques using components and dataflow implemented atsystem 100. In the illustrated embodiment,method 300 includes providing script content, as depicted atblock 302, providing audio content, as depicted atblock 304, aligning the script content and audio content, as depicted atblock 306, and providing time-coded script data, as depicted atblock 308. - In some embodiments, providing script content (block 302) includes inputting or otherwise providing a
script 104, such as a Hollywood Spec. Movie Script or dramatic screenplay script, tosystem 100. For example, a plain text document, such as a raw script document, may be provided in an electronic format to scriptextractor 108 which processes script 104 (e.g., to identify, structure, and extract the text of script 104) to generatescript data 110, such as a structured/tagged script document.Script extractor 108 may employ techniques for converting documents, such as those described in U.S. patent application Ser. No. 12/332,297 entitled “ACCESSING MEDIA DATA USING METADATA REPOSITORY”, filed Nov. 13, 2009, U.S. patent application Ser. No. 12/332,309 entitled “MULTI-CORE PROCESSING FOR PARALLEL SPEECH-TO-TEXT PROCESSING”, filed Dec. 10, 2008, and/or U.S. patent application Ser. No. 12/713,008 entitled “METHOD AND APPARATUS FOR CAPTURING, ANALYZING, AND CONVERTING SCRIPTS”, filed Feb. 25, 2010, are all hereby incorporated by reference as though fully set forth herein.Document data 110 may be provided tosynchronization module 102 for subsequent processing, as described in more detail below. - In some embodiments, providing audio content (block 304) includes inputting or otherwise providing
video content 106, such as a clip/shot of a Hollywood movie, having associated audio content that corresponds to ascript 104, tosystem 100. Audio data may be extracted fromvideo content 106 using various techniques. For example, an audio data track may be extracted fromvideo content 106 using a Speech-to-Text (STT) engine and/or a custom language model. In some embodiments,audio extractor 112 may employ an STT engine and/or custom language model to generatetranscript 114 that includes a transcription of spoken words (e.g., audio dialogue or narration) of the Hollywood movie or other audio data.Audio extractor 112 may employ various techniques for extracting and transcribing audio data, such as those described below and/or those techniques described in U.S. patent application Ser. No. 12/332,297 entitled “ACCESSING MEDIA DATA USING METADATA REPOSITORY”, filed Nov. 13, 2009, and/or U.S. patent application Ser. No. 12/332,309 entitled “MULTI-CORE PROCESSING FOR PARALLEL SPEECH-TO-TEXT PROCESSING”, filed Dec. 10, 2008, which are both hereby incorporated by reference as though fully set forth herein. A resultingtranscript 114 may be provided tosynchronization module 102 for subsequent processing, as described in more detail below. - In some embodiments, aligning the script and audio content (block 306) includes employing a matching technique to align the script words (e.g., dialogue or narrations) of
script 104 to elements of thevideo content 106. This may include aligning script words to corresponding transcript words. In some embodiments, alignment includessynchronization module 102 implementing a two-level word matching system to align script words ofscript 110 to corresponding transcript words oftranscript 114. In some embodiments, a first matching routine is executed to partition a matrix of script words vs. transcript words into a sub-matrix. For example, an N-gram matching scheme may be used to identify high probability matches of a sequence of multiple words. N-gram matching may include attempting to exactly (or at least partially) match phrases of multiple transcript words with script words. The matched sequence of words may be referred to as hard-alignment points. The hard alignment points may include several matched words, and may be used to define boundaries of each sub-matrix. Thus, the hard-alignment points may define smaller matrices of script words vs. transcript words. Each of the smaller sub-matrices may, then, be processed (e.g., in series or parallel) using additional matching techniques to identify word matches within each of the sub-matrices. In some embodiments, processing may be provided via multiple processors. For example, processing in series or parallel may be performed using multiple processors of one or more hosted services or cloud computing environments. In some embodiments, each of the sub-matrix is processed independent of (e.g., in substantial isolation from) processing of the other sub-matrices. These resulting additional word matches may be referred to as soft alignment points. Where unmatched words remain between the hard and/or soft alignment points, the timecode information associated with the words of the hard and soft alignment points may be used to assess timecode information for the unmatched words (e.g., via interpolation). For example, timecodes associated with the words that make up the matched points at the end and beginning of an interval of time may be used as references to interpolate time values for unmatched words that fall within the interval between the matched words. Alignment techniques that may be implemented bysynchronization module 102 are discussed in more detail below. Further, techniques for matching are discussed in more detail below with respect toFIGS. 8A and 8B . - In some embodiments, providing time-coded script data includes providing timecodes assigned to all dialogue and other script element types. For example, in some embodiments, after
synchronization module 102 aligns word N-grams fromscript 110 with corresponding word N-grams oftranscript 114, it may output (e.g., to a client application) time information in the form of time-coded script data (e.g., time-aligned script data 116) that contains timecodes assigned to some or all dialogue and to some and/or all other script element types associated withscript 104. As described above, the data may be stored, displayed/presented or processed. In some embodiments, using the alignment processes described herein, a script (e.g., a Hollywood Spec. script or dramatic screenplay script) and a corresponding STT audio transcript are merged together by aligning script words with transcript words to provide resulting time-alignedscript data 116. Time-alignedscript data 116 may be processed and used by other applications, such as the Script Align feature of Adobe Premiere Pro. In some embodiments, processing may be implemented to time-align script elements other than audio (e.g., scene headings, action description words, etc.) directly to the video scene or full video content. For example, where a script element, other than dialogue (e.g., a scene heading) occurs between two script words, the timecodes of the script words may be used to determine a timecode of the script element. In some embodiments, each of the script elements may be provided in the time-aligned script data in association with a timecode, as discussed above with regard toFIG. 1D . Providing time-coded script data (block (308) may include providing the resulting time-aligneddata 116 to a storage medium, display device, or other modules for processing, as described above with regard toFIG. 1A . -
FIG. 4 is a flowchart that illustrates a time-alignment method 400 according to one or more embodiments of the present technique.Method 400 may provide alignment techniques using components and dataflow implemented atsynchronization module 102. In the illustrated embodiment,method 400 generally includes reading a script (SCR) file and a speech-to-text (STT) file, and processing the SCR and STT files using various techniques to generate an output that includes time-aligned script data. - In the illustrated embodiment,
method 400 includes reading an SCR file, as depicted atblock 402. This may include reading script data, such asscript data 110, described above with respect to block 302. For example, reading an SCR file may includescript reader 200 reading a generated SCR file (e.g., document data 110). The SCR file may include a record-format representation of a source Hollywood spec. script of dramatic screenplay script. Records contained in the SCR file may each include one complete script element.Script reader 200 may extract script element type and data values from each record and place these into an internal representation (e.g., a structured/tagged script document). - In the illustrated embodiment,
method 400 includes reading an STT file, as depicted atblock 404. This may include reading STT data, such astranscript 114, as described above with respect to block 304.Transcript 114 may include an STT file having transcribed data, such as that of the STTword transcript data 132 depicted inFIG. 1C . The STT data may provide a timecode for each spoken word in the audio sound track which corresponds in time tovideo content 106. - In the illustrated embodiment,
method 400 includes building a SCR N-gram dictionary, as depicted atblock 406. In some embodiments, building an SCR N-gram dictionary includes identifying all possible sequences of a given number of consecutive words. The number of words in the sequence may be represented by a number “N”. For example, the sentence, “The rain in Spain falls mainly on the plain” may be used to generate the following N-gram word sequences, where N is set to a value of 3: (The, rain, in), (rain, in, Spain), (in, Spain, falls), (Spain, falls, mainly), (falls, mainly, on), (mainly, on, the), and (on, the, plain). Note that additional N-gram word sequences may be generated based on words that precede or follow a phrase. For example, where the first word of a following sentence is “Why”, an additional 3-gram may include (the, plain, why). In some embodiments, the value of N may be set by a user. In some embodiments, the value of N is set to a predetermined value, such as four. For example, N may be automatically set to a default value of four, and the user may have the option to change the value of N to something other than four (e.g., one, two, three, five, etc.). - In some embodiments, some or all of the possible sequences of N number of consecutive words are identified for the script and/or the transcript, and the respective sequences are stored for use in processing. For example,
script analyzer 202 may build a word N-gram “dictionary” of all words fromscript 110 and may record their relative positions withinscript 110 and/orSTT analyzer 206 may build a word N-gram “dictionary” of all words fromtranscript 114 and may record their relative positions withintranscript 114. The resulting N-gram dictionaries may include an ordered table of 1-gram, 2-gram, 3-gram, or N-gram word sequences. - In the illustrated embodiment,
method 400 includes matching N-grams, as depicted atblock 408. In some embodiments, matching N-grams may include attempting to match N-grams of thescript 110 to corresponding N-grams oftranscript 114. For example,SCR analyzer 202 and/orSTT analyzer 206 may attempt to match all word N-grams of the N-gram dictionaries and may store the matches (e.g., in an internal table) in association with corresponding timecode information associated with the respective transcript word(s). The stored matching N-grams may indicate the potential for a matched sequence of words, and may be referred to as “candidate” N-grams for merging. For example, a phrase from the script N-gram dictionary may be matched with a corresponding phrase the transcript N-gram dictionary, however, due to the phrase being repeated several time within the script/video content, the match may not be accepted until the relative positions can be verified. - In the illustrated embodiment,
method 400 includes merging N-grams, as depicted atblock 410. In some embodiments, merging of N-grams may be provided bySCR analyzer 202 and/orSTT analyzer 206. In some embodiments, merging N-grams includes merge some or all sequential N-gram matches into longer matched N-grams. For example, where two consecutive matching N-grams are identified, such as two consecutive 3-grams of (The, rain, in) and (rain in Spain), they may be merged together to form a single N-gram, referred to as a single 4-gram of (The, rain, in, Spain). Such a technique may result in merged N-grams of length N+1 after each iteration. The technique may be repeated (e.g., iteratively) to merge all consecutive N-grams to provide N-grams having higher values of N. N-grams with higher values of N may have higher probabilities of being an accurate match. The iterative process may continue until no additional N-gram matches are identified. For example, where there are at most ten consecutive words identified as matching, increasing to an 11-gram length may yield no matching results, thereby terminating the merging process. Further, techniques for N-gram matching are discussed in more detail below with respect toFIGS. 8A and 8B . - With merging complete, the resulting set of merged N-grams may provide a set of “hard alignment points”. For example, each separate N-gram may indicate with relatively high certainty that a sequence of words in
script 110 precisely matches a sequence of words intranscript 114. The sequence of words may identify a hard-alignment point. Thus, a hard alignment point may include a series of matched words. In some case, the hard alignment points may include a series of words that each soft-align. - Due to the high probability of hard alignment points including accurate matches of words within
script 110 and words withintranscript 114, the timing data for each of the words of the matching N-grams (e.g., the corresponding timecode for transcript words) may be correlated with the corresponding script words. As discussed in more detail below, timing data for other words (e.g., unmatched words or words having low probabilities of accurate matches) may be assessed and determined based on the timecode data of words associated with matched words (e.g., words that make up one or at least a portion of one or more alignment points). For example, interpolation may be used to assess and determine the position of a script word that occurs between matched script words (e.g., script words associated with alignment points). - Hard alignment points may be found every 30-60 seconds within video content. In some embodiments, if hard alignment points are not found with N=4 (e.g., there are no matches of four consecutive words between the script and the transcript), N is decremented and the process repeated (e.g., returning to block 408). When N=1, words are matched one-to-one. In some embodiments, a default value of N=4 may be used, although the value of N may be modified.
- In the illustrated embodiment,
method 400 includes generating a sub-matrix, as depicted atblock 412. As noted above each hard alignment point may define a block of script text (e.g. a sequence of words in script 110) and a timecode indicative of where the hard alignment point occurs in the video. Although script and transcript words associated with hard alignment points may be associated with timecode data, other script words (e.g., unmatched words between each hard alignment point) may still need to be aligned to corresponding transcript words to assess and determine their respective timecode. In some embodiments, each successive pair of hard/soft alignment points is used to create an alignment sub-matrix. The alignment sub-matrix may include script words (e.g., sub-set of script words) that occur between matched script words (e.g., script words associated with hard alignment points) and intermediate transcript words (e.g., a sub-set of transcript words) that occur between matched transcript words (e.g., transcript words associated with hard alignment points). The script words may be provided along one axis (e.g., the y or x-axis) of the sub-matrix, and the intermediate transcript words may be provided along the other axis (e.g., the x or y-axis) of the sub-matrix. -
FIG. 5A depicts an exemplary (full)alignment matrix 500 in accordance with one or more embodiments of the present technique.Alignment matrix 500 may include some or all of the script words aligned in sequence along the y-axis and all of some of the transcript words aligned in sequence along the x-axis, or vice versa. In an ideal alignment match (which may rarely be the case) script words and transcript words would match exactly, resulting in a substantially straight line having a slope of about one or negative one. - As depicted in the illustrated embodiment, several (e.g., eight) hard alignment points 502 (denoted by circles) are identified. Between each of the hard-
alignment points 502 are a number of soft alignment points 504 (denoted by squares) and/or interpolated alignment points 506 (denoted by X's). Hard alignment points 502 may be determined as a result of matching/merging N-gram sequences as discussed above with respect toblocks Interpolation intervals 507 extend between adjacent soft alignments points 504. - As depicted,
alignment matrix 500 may include one or more alignment sub-matrices 508 a-508 g (referred to collectively as sub-matrices 508). Sub-matrices 508 a-508 g may be defined by the set of points (e.g., script words and transcript words) that are located between adjacent, respective, hard alignment points 502. For example, in the illustrated embodiment,matrix 500 includes seven sub-matrices 508 a-508 g. An exemplary sub-matrix 508 e is also depicted in detail inFIG. 5B . - In some embodiments,
method 400 includes pre-processing a sub-matrix, as depicted atblock 414. Pre-processing of the sub-matrix may be provided atmatrix aligner 208. In some embodiments, pre-processing the sub-matrix may include identifying the range of a particular sub-matrix (e.g., the range/sequence of associated script words and transcript words associated with the axis of the particular sub-matrix). For example, script and transcript words that fall between two words contained in adjacent hard alignment points 502 may be identified as a matrix sub-set of script words (SCR word sub-set) 510 (represented by outlined triangles) and a corresponding matrix sub-set of transcript words (STT word sub-set) 512 (represented by solid triangles), as depicted inFIG. 5B with respect to sub-matrix 508 e. It will be appreciated that the triangles ofFIGS. 5A and 5B represent only sub-sets of the script and transcript words, as each axis may represent all of the words for a particular portion of a clip, scene or entire movie being aligned. - In some embodiments, prior to words of
SCR word sub-set 510 being aligned to words ofSTT word sub-set 512 of sub-matrix 508 e, a timecode and position offset data structure used for booking is initialized. In some embodiments, all special symbols and punctuation are removed fromSCR word sub-set 510. This may provide for a more accurate alignment as both symbols and punctuations are typically not present in atranscript 114, and, are, thus, not present inSTT word sub-set 512. - In some embodiments, sub-matrices 508 of the
initial alignment matrix 500 are sequentially processed (e.g., in order of their location along the diagonal of the alignment matrix 500) to find the best time alignment for words between each pair ofhard reference points 502 that define each respective sub-matrix 508 a-508 g. Wheresystem 100 includes a single core system used to process the sub-matrices, alignment of the sub-matrices 508 may be processed sequentially (e.g., in series—one after the other). Wheresystem 100 includes a multi-core system used to process sub-matrices, alignment of some or all of sub-matrices 508 may be processed in parallel (e.g., simultaneously). Such parallel processing may be possible as the processing of each sub-matrix is independent of all of the other sub-matrices due to the bounding of the matrices with hard alignment points that are assumed to be accurate and that include known timecode information. - In the illustrated embodiment,
method 400 includes aligning the sub-matrix, as depicted atblock 416. Aligning the sub-matrix may be provided atmatrix aligner 208. In some embodiments, a sub-matrix may be aligned using an algorithm. An algorithm may employ a dynamic programming technique to assess multiple potential alignments for a sub-matrix, to determine the best fit alignment of the potential alignments, and employ the best fit alignment for the given sub-matrix. For example, an algorithm may identify several possible solutions within the sub-matrix, and may select the solution having the lowest indication of possible error. In some embodiments the algorithm may include a Levenshtein Word Edit Distance algorithm. Where a traditional Levenshtein algorithm is employed, a dynamic programming algorithm for computing the Levenshtein distance may require the use of an (n+1)×(m+1) matrix, where n and m are the lengths of the two respective word sets (e.g., the SCR word set and the STT word set). The algorithm may be based on the Wagner-Fischer algorithm for edit distance. - In some embodiments, an alignment path defines a potential sequence of words that may be used between hard alignment points. In some embodiments, aligning the sub-matrix may include breaking alignment paths within each sub-matrix into discrete sections during processing to more accurately assess individual portions of the alignment path. Based on match probabilities/strengths of various portions of the alignment path, a single alignment path may be broken into separate discrete intervals that are assessed individually. For example, where an alignment path within a sub-matrix includes a first portion having a relatively high match probability and an adjacent second portion having a relatively low match probability, the first and second portions can be separated. That is, the first portion may be identified as a sequence of words having a high probability of a match, and the second portion may be identified as a sequence of words having a low probability of a match. Accordingly, the first portion may be identified as an accurate match that can be relied on in subsequent processing and the second portion may be identified as an inaccurate match that should not be relied on in subsequent processing. Such a technique may be used in place of merely identifying a mediocre match of the entire alignment path that may or may not be reliable for use in subsequent processing.
- In some embodiments, aligning the sub-matrix may include weighting various processing operations to reflect operations that may be indicative of inaccuracies. For example, in some embodiments, aligning the sub-matrix may include assessing weighting penalties for matched words that are subject to an insert, delete, or substitute operation. Such a technique may help to adapt to false-positive word identifications produced by an STT engine.
- In some embodiments, the algorithm may be modified in an attempt to improve alignment. For example, in some embodiments, timecode information recorded with each word of an STT word set is correlated with a matching word of a corresponding SCR word set. The matching word may include a single word or a continuous sequence of words, wherein the sequence of words includes less than the number (“N”) of words required by the selected N-gram. The resulting alignments from this process are referred to as “soft alignment points.” In some embodiments, an algorithm, such as a Levenshtein Word Edit Distance algorithm, may be used to identify soft-alignment points. The soft designation is used to indicate that because of noise, error artifacts, and the like in
STT transcript 114, these alignments may have a lower probability of being accurate than the multi-word, hard-alignment points that define the range/partition of the respective sub-matrix. In some embodiments, soft-alignment points may be determined using heuristic and/or phonetic matching. - In some embodiments, aligning the sub-matrix may include heuristic filtering. Heuristic filtering of noise may include filtering (e.g., ignoring or removing) “stop words” (e.g., short articles such as “a”, “the”, etc.) that are typically inserted into an STT transcript when the STT engine is confused or otherwise unable to decipher the audio track. For example, STT engines often insert articles such as “a”, “the”, etc. while various events other than dialogue occur, such as the presence of noise, music or sound effects. Such articles may also be inserted when dialogue is present but cannot be deciphered by the STT engine, such as when noise, music or sound effects drown out dialogue or narration. As a result, the STT transcript may include a sequence of “the the the the the the the . . . ” indicative of a duration when music or other such events occur in the audio content. Thus, heuristics may be used to identity portion transcript words that should be ignored. For example, transcript words that should not be considered in the alignment process, and/or should not be included in the resulting time-aligned script data.
- In some embodiments, heuristics may be used to identify repetitive sequences of words, and to determine which of the repeated sequence of words, if any need to be included or ignored in the resulting script document. For example, where a clip includes repetitive dialogue, such as where an actor repeats their lines several times in an attempt to get the line correct,
transcript 114 may include several repetitions (e.g., “i'll be back i'll be back i'll be back). A corresponding portion ofscript 110 may include a single recitation of the line (e.g., “I'll be back.”). In one embodiment, heuristics may be implemented to identify the repeated phrases, to identify one of the phrases of the transcript for use in aligning with script words, and to align the corresponding script words to the selected phrase oftranscript 114. For example, only the timecodes for words of one of the three phrases intranscript 114 may be associated with the corresponding script words of the phrase “I'll be back”. In some embodiments, the other repeated phrases are ignored/deleted. For example, ignored/deleted transcript words may not be considered in the alignment process, and/or may not be included in the resulting time-aligned script data. Ignoring/deleting the phrases may help to ensure that they do not create errors in aligning other portions ofscript 110. For example, if the additional phrases were not ignored/deleted, alignment may attempt to match the other two repeated phrases (e.g., those not selected) with phrases preceding or following the corresponding phrase ofscript 110. In some embodiments, instead of just throwing out (ignoring/deleting) the other repeated takes, they can also be aligned as “alternate takes”. For example, it may not know which take will eventually be used in a finished edit, so regardless of which take is used, the correct script text and timing information may flow through to that portion of the recorded clip in use. In some embodiments, a single portion script text may be aligned to each of the repeated portions of the transcript text. - In some embodiments, aligning the sub-matrix may include matching based at least partially on phonetic characteristics of words. For example, a word/phrase of the SCR word set may be considered a match to a word/phrase of the STT word set when the two words/phrases sound similar. In some embodiments, a special phonetic word comparator may be used to assess word/phrase matches. A phonetic comparator may include “fuzzy” encodings that provide for matching script words/phrases that may sound similar to a word identified in the STT transcript. Thus, a word/phrase may be considered a match if they fall within a specific phonetic match threshold. For example, a script word may be considered a match to a transcript word if the transcript word is a word identified as being an phonetic equivalent to the word in
script 110, or vice versa. For example, the terms “their” and “there” may be identified as phonetic matches although the terms do not exactly match one another. Such a technique may account for variations in spoken language (e.g., dialects) that may not be readily identified by an STT engine. Use of phonetic matching may be used in place of or in combination with an exact word/phrase match for each word/phrase. - In the illustrated embodiment,
method 400 includes generating and/or interpolating intervals, as depicted atblock 418. Generating and/or interpolating intervals may be provided at interval generator/interpolator 210. In some embodiments, generating and/or interpolating intervals may include identifying intervals between identified matched words (e.g., words of hard and/or soft reference points), interpolating the relative position of un-matched words between the matched words. An interpolated timecode for the un-matched words may be based on their interpolated position between the matched words and the known timecodes of the matched words. For example, after some or all of the sub-matrices are aligned, the sub-matrices are combined to form a list including script words and corresponding transcript words for each word associated with a hard or soft alignment point. At this stage of processing, all possible word alignment correspondences have been identified, leaving only unmatched script dialogue words (e.g., words that are not associated with hard nor soft reference points), and non-dialogue words within the script such as scene action descriptions and other information. These unmatched dialogue words still need to be assigned accurate timecodes to complete the script time-synchronization process. - In some embodiments, the timecode information for the unmatched script words is provided via linear timecode interpolation. Linear time code interpolation may include defining an interval that extends between two adjacent reference points, and spacing each of the unmatched words that occur between the two reference points across equal time spacing (e.g., sub-interpolation intervals) within the interval. A sub-interpolation interval may be defined as:
-
- Where t1 is a timecode of a first reference point defining a first end of an interpolation interval, t2 is a timecode of second reference point defining a second end of the interpolation interval, and n is the number of unmatched words.
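sub-interpolation interval=(t2-t1)/(n+1)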
- Where three unmatched words are identified in the script as being located between two matched words having timecodes of one second and two seconds, a first of the unmatched words may be determined to occur at 1.25 seconds, a second of the unmatched words may be determined to occur at 1.50 seconds, and a third of the unmatched words may be determined to occur at 1.75 seconds. In the above described embodiment, the sub-interpolation interval is equal to (2 sec-1 sec)/(3+1), or 0.25 sec.
FIG. 5B illustrates interpolatedpoints 506 for unmatched script words that are evenly spaced between soft alignment points in accordance with the above described linear interpolation technique. A similar technique may be repeated for each respective interpolation interval between hard/soft alignment points. - In the illustrated embodiment of
FIG. 4 ,method 400 includes assigning timecodes, as depicted atblock 420. Assigning timecodes may be provided at time-codedscript generator 212. In some embodiments, assigning time codes includes assigning times for each of the script words based on the reference points and interpolated points. For example, in some embodiments, the entire list of soft alignment points is scanned and each successive pair of soft alignment points defines an interpolation interval. Upon defining each interpolation interval, sub-interpolation intervals are determined, and timecode data aligning with the sub-interpolation intervals is assigned to all of the script words of the respective script word set. For example, the unmatched words of the above described interpolation interval may be assigned timecodes of 1.25 seconds, 1.50 seconds, and 1.75 seconds, respectively. Further, techniques for interpolating are discussed in more detail below with respect toFIGS. 8A and 8B . - In some embodiments, a non-linear interpolation technique may be employed to assess and determine timecode information associated with words/phrases within a script document. For example, non-linear interpolation or similar matching techniques may be used in place of or in combination with linear interpolation techniques employed to determine timecodes for script words. Non-linear interpolation may be useful to account for words that were not spoken at even rate between alignment points. For example, where two alignment points define an interval having matched words on either end and several unmatched words between them, linear interpolation may assign timecode information to the unmatched words assuming an even spacing across the interval as discussed above. The resulting timecodes may be reflective of someone speaking at a constant cadence across the interval. Unfortunately, the resulting timecode information may be inaccurate due to different rates of speech across the interval, pauses within the interval, or the like.
- In some embodiments, non-linear interpolation of timecode information may include assessing an expected rate (or cadence) for spoken words and applying that expected rate to assess and determine timecode information for the unmatched words. For example, non-linear interpolation may include, for a given script word, determining a rate of speaking for matched script words proximate the script word, and applying the rate of speaking to determine a timecode for the script word.
FIG. 7A illustrates alignment of a script phrase 700 (e.g., a portion of script data 110) with a spoken phrase 701 (e.g., a portion of transcript 114) that may be accomplished using non-linear interpolation in accordance with one or more embodiments of the present technique. In the illustrated embodiment, script phrase 700 is illustrated in association with analignment 702. Phrase 700 includes, “What is your answer to my question? I need to know your answer now!”Alignment 702 includes a series of word-match indicators (e.g., word associated with a hard alignment point (H) and words associated with a soft alignment point (S)) and words that are unmatched (U). The dots/points between the unmatched representations of “question” and “I” may indicate a pause between speaking of the words (e.g., a pause that would be indicated by timecode information differential between transcript words “position” and “eye” of spoken phrase 701). The sequence of four words “What is your answer to” and “know your answer now” include matches, and the words, “my”, “question”, “I”, “need” and “to” are unmatched. - In some embodiments, rates of speaking matched words proximate/adjacent (e.g., before or after) unmatched words may be used to assess and determine timecode information for the unmatched words. For example, in the illustrated embodiment, the rate of speaking “What is your answer to” may be used to assess and determine timecode information for the words “my” and “question.” That is, if it is determined that “What is your answer to” is spoken at a rate of one word every 0.1 seconds (e.g., via timecode information provided in the transcript and/or prior alignment/matching), the following words “my question” may be assigned timecode information in accordance with the rate of 0.1 words per second. For example, where the word “to” is determined to have been spoken at exactly twenty-one minutes (21:00.0) within a movie, it may be determined that the word “my” was spoken at twenty-one minutes and one-tenth of a second (21:00.1) and that the word “question” was spoken at twenty-one minutes and two-tenths of a second (21:00.2). Thus, timecodes associated with twenty-one minutes and one-tenth of a second (21:00.1) and twenty-one minutes and two-tenths of a second (21:00.2) may be assigned to the words “my” and “question”, respectively, in aligned
script data 116, for example. - In some embodiments, punctuation within the script may also be used to assess and determine timecode information. In one embodiment, for instance, punctuation indicative of the end of a phrase may be used to determine the presence of a pause between words or phrases. For example, the presence of the question mark in phrase 700 may indicate that the phrases “What is your answer to my question?” and “I need to know your answer now!” may be separated by a pause and, thus may each be spoken at different rates. Such a technique may be employed to assure that non-linear interpolation is applied to the individual phrases within a sub-matrix to account for an expected pause. For example, in the illustrated embodiment, the rate of speaking “know your answer now” may be used to assess and determine timecode information for the words “I”, “need” and “to”. That is, if it is determined that “know your answer now” was spoken at a rate of one word every 0.2 seconds (e.g., via timecode information provided in transcript 114), the preceding words “I need to” may be assigned timecode information in accordance with the rate of 0.2 words per second. For example, where the word “know” is determined to have been spoken at exactly twenty-one minutes and ten seconds (21:10.00) within a movie, it may be determined that the word “I” was spoken at twenty-one minutes nine and four-tenths of a second (21:09.4), that the word “need” was spoken at twenty-one minutes nine and six-tenths of a second (21:09.6), and the word “to” was spoken twenty-one minutes nine and eight-tenths of a second (21:09.8). Timecodes associated with twenty-one minutes nine and four-tenths of a second (21:09.4), twenty-one minutes nine and six-tenths of a second (21:09.6), and twenty-one minutes nine and eight-tenths of a second (21:09.8) may be assigned to the words “I”, “need”, and “know”, respectively, in aligned
script data 116, for example. Accordingly, punctuation may be used to identify pauses or similar breakpoints that can be used to break words or phrases into discrete intervals such that respective rates of speaking (e.g., cadence) can be appropriately applied to each of the discrete intervals. Other indicators may be used to indicate characteristics of the spoken words. For example, “stopwords” present in the transcript may be indicative of a pause or break in speaking and may be interpreted as a pause and implemented as discussed above. - It is noted that with some linear interpolation techniques, the unmatched words may be assigned timecode information based on even spacing between the matched words, and thus, may not account for the pause or similar variations. For example, in the embodiment of
FIG. 7A , where the first occurrence of the word "to" is determined to have been spoken at exactly twenty-one minutes (21:00.0) and the word "know" is determined to have been spoken at exactly twenty-one minutes and ten seconds (21:10.0), the five unmatched words "my", "question", "I", "need" and "to" would be evenly spaced across the ten-second interval at intervals of about 1.67 seconds, not accounting for the pause. Although minor in these small increments, this could lead to increased alignment errors where a pause in dialogue occurs for several minutes, for example. - In some embodiments, a rate of speech may be based on machine learning. For example, a rate of speech may be based on other words spoken proximate to the words in question. In some embodiments, a rate of speech may be determined based on elements of the script. For example, a long description of an action item may be indicative of a long pause in the actual dialogue spoken.
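- As an illustration of the rate-based (non-linear) interpolation described above, the following is a minimal sketch; the helper names and the flat per-run rate are assumptions for illustration only, not the claimed implementation. In practice, a separate rate would be estimated on each side of a detected pause, as described above.

```python
# Minimal sketch (not the patented implementation): assigns timecodes to unmatched
# script words by extending the speaking rate of an adjacent run of matched words,
# as in the "my question" / "I need to" examples above. All names are illustrative.

def rate_from_matched(matched):
    """matched: list of (word, timecode_seconds) for a contiguous matched run."""
    if len(matched) < 2:
        return None  # cannot estimate a rate from a single word
    span = matched[-1][1] - matched[0][1]
    return span / (len(matched) - 1)  # seconds per word

def extrapolate_forward(last_timecode, rate, unmatched_words):
    """Assign timecodes after a matched run (e.g., 'my', 'question' after 'to')."""
    return [(w, last_timecode + rate * (i + 1)) for i, w in enumerate(unmatched_words)]

def extrapolate_backward(first_timecode, rate, unmatched_words):
    """Assign timecodes before a matched run (e.g., 'I need to' before 'know')."""
    n = len(unmatched_words)
    return [(w, first_timecode - rate * (n - i)) for i, w in enumerate(unmatched_words)]

if __name__ == "__main__":
    # "What is your answer to" at one word every 0.1 s, with "to" at 21:00.0 (1260.0 s)
    lead_in = [("What", 1259.6), ("is", 1259.7), ("your", 1259.8), ("answer", 1259.9), ("to", 1260.0)]
    rate = rate_from_matched(lead_in)
    print(extrapolate_forward(1260.0, rate, ["my", "question"]))   # ~1260.1, ~1260.2
    # "know your answer now" at one word every 0.2 s, with "know" at 21:10.0 (1270.0 s)
    print(extrapolate_backward(1270.0, 0.2, ["I", "need", "to"]))  # ~1269.4, ~1269.6, ~1269.8
```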
- In some embodiments, words of the script that occur proximate/between reference points may be aligned with unmatched words of the transcript that also occur proximate/between the same reference points. For example, in the illustrated embodiment of
FIG. 7A , the four unmatched words "my", "question", "I" and "need" of script phrase 700 fall within the interval between matched words "to" and "know". Where four unmatched words of transcript phrase 701 also fall within the same interval, the timecodes associated with the unmatched words of transcript phrase 701 may be assigned to the four unmatched words "my", "question", "I" and "need" of script phrase 700, respectively. That is, the timecode of the first unmatched transcript word in the interval may be assigned to the first unmatched script word in the interval, the timecode of the second unmatched transcript word in the interval may be assigned to the second unmatched script word in the interval, and so forth. - In some embodiments, punctuation and/or capitalization from script text may be used to improve alignment. For example, if the first alignment point (hard or soft) occurs in the middle of the first sentence of the clip, it may be determined that the script words and transcript words preceding the alignment point in the script text and the corresponding transcript text should align with one another. In some embodiments, the timecodes for the script words may be interpolated (e.g., linearly or non-linearly) across the time interval that extends from the beginning of speaking of the corresponding transcript words in the scene to the corresponding alignment point. In some embodiments, the corresponding script words and transcript words may have a one-to-one correspondence, and, thus, timecode information may be directly correlated. For example, the first script word of the sentence may be associated with the timecode information of the first transcript word of the clip, the second script word of the sentence may be associated with the timecode information of the second transcript word of the clip, and so forth. The beginning of a sentence may be identified by a capitalized word and the end of a sentence may be identified by a period, exclamation point, question mark, or the like.
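- The positional (one-to-one) assignment described above can be sketched as follows; the helper name and data layout are illustrative assumptions, and the timecode values are made up for the example. Counts that do not line up would fall back to interpolation.

```python
# Illustrative sketch: when the number of unmatched transcript words between two alignment
# points equals the number of unmatched script words, copy the transcript timecodes over
# position by position. Timecode values below are illustrative (seconds from content start).
def assign_positional(unmatched_script_words, unmatched_transcript):
    """unmatched_transcript: list of (transcript_word, timecode) between the same points."""
    if len(unmatched_script_words) != len(unmatched_transcript):
        return None  # counts differ -> fall back to interpolation (see the sketch above)
    return [(sw, tc) for sw, (_, tc) in zip(unmatched_script_words, unmatched_transcript)]

pairs = assign_positional(
    ["my", "question", "I", "need"],
    [("by", 1260.1), ("position", 1260.2), ("eye", 1269.4), ("do", 1269.6)])
print(pairs)  # [('my', 1260.1), ('question', 1260.2), ('I', 1269.4), ('need', 1269.6)]
```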
-
FIG. 7B is a depiction of multiple lines of text that include a script phrase, a transcript phrase and a corresponding representation of alignment in accordance with one or more embodiments of the present technique. More specifically, FIG. 7B illustrates alignment of a script text 703 (e.g., a portion of script 110) with a spoken dialog 704 (e.g., a portion of transcript 114) that may be accomplished with the aid of capitalization and punctuation in accordance with one or more embodiments of the present technique. Script text 703 includes a portion of a script that is spoken throughout a clip/scene. More specifically, in the illustrated embodiment, script text 703 includes the first sentence of the clip/scene (e.g., "It is good to see you again") and the last sentence of the clip/scene (e.g., "I will talk to you about it later tonight"). Spoken dialog 704 may include transcript text of a corresponding clip (e.g., "get it could to see you again" and "i will talk with you house get gator flight"). In the illustrated embodiment, script text 703 and transcript text 704 are illustrated in association with an alignment 705. Alignment 705 includes a series of word-match indicators (e.g., words associated with a hard alignment point (H) and words associated with a soft alignment point (S)) and words that are unmatched (U). As depicted, the first alignment point occurs midway through the first sentence of the scene/clip, and the first three words of the scene/clip are unmatched. In some embodiments, timecode for the script words at the beginning of the scene/clip that precede the first alignment point (e.g., "It is good") may be interpolated across the time interval that extends from the beginning of speaking of the corresponding transcript words in the scene/clip to the corresponding alignment point (e.g., interpolated between the timecodes of the transcript words "get" and "to" in the transcript phrase 704). In the illustrated embodiment, the number of corresponding unmatched script words and transcript words has a one-to-one correspondence, and, thus, timecode information may be directly correlated. For example, there are three words in each of script phrase 703 and transcript phrase 704 that precede the first alignment point, and, thus, the first three script words ("It", "is" and "good") may each be assigned the timecodes of the first three transcript words ("get", "it" and "could"), respectively. Similarly, the location of the alignment points in the middle of the last sentence may enable the unmatched words "about", "it", "later", and "tonight", which are located between the last alignment point of the scene/clip and the period indicative of the end of the scene/clip, to be interpolated across the interval between the transcript words "you" and "flight" and/or to each be assigned timecode information corresponding to the transcript words "house", "get", "gator", and "flight", respectively. - In some embodiments, script elements may be used to identify the beginning or end of a sentence. For example, if between two lines of dialog, there is a parenthetical script element that corresponds to a sound effect, such as a car crash, the presence of the sound effect, indicated by a pause or stop words, may be used to identify the beginning or end of adjacent lines of dialog. In some embodiments, the techniques described with regard to alignment points in the middle of a sentence at the beginning or end of a scene/clip may be employed.
For example, where an alignment point within the dialog is preceded or followed by unmatched points and an identifiable script element (such as a sound effect), the timecodes for the unmatched words that occur between the alignment point and the identifiable script element may be interpolated across the corresponding interval or otherwise be determined. That is, the intermediate script element may be used in the same manner as capitalization and/or punctuation is used as described above.
- In some embodiments, the density of the words in the transcript may be used to assess and determine timecode information associated with the words in the script. For example, in the illustrated embodiment of
FIG. 7A , there are four unmatched transcript words in the interval of phrase 701 between matched words (e.g., "two" and "know") and there are five unmatched words (e.g., "my", "question", "I", "need" and "to") in the corresponding interval of phrase 700 between matched words (e.g., "to" and "know"). Based on the timecode information for the transcript words in the interval, it may be determined that two of the four unmatched transcript words are spoken at the beginning of the interval and that two of the four unmatched transcript words are spoken at the end of the interval. That is, about fifty percent of the spoken words were delivered in a first portion of the interval, no words were spoken in a second portion of the interval (e.g., during the pause) and about fifty percent of the words were spoken in a third portion of the interval. In one embodiment, a corresponding percentage of the script words (e.g., approximately equal to the percentage of transcript words) will be provided over the respective portions of the interval. For example, in the embodiment of FIG. 7A , where the word "to" (in the first portion of the phrase 700) that defines a start of the interval is determined to have been spoken at exactly twenty-one minutes (21:00.0), the word "know" defining an end of the interval is determined to have been spoken at exactly twenty-one minutes and ten seconds (21:10.0), the word "position" is determined to have been spoken at exactly twenty-one minutes and two-tenths of a second (21:00.2), and the word "eye" is determined to have been spoken at exactly twenty-one minutes and nine and four-tenths seconds (21:09.4), the two unmatched script words "my" and "question" may be evenly spaced over the first portion of the interval from twenty-one minutes (21:00.0) to twenty-one minutes and two-tenths of a second (21:00.2), and the three unmatched words "I", "need" and "to" may be evenly spaced across the third portion of the interval from twenty-one minutes and nine and four-tenths seconds (21:09.4) to twenty-one minutes and ten seconds (21:10.0). Thus, the distribution of script words within the interval is approximately equivalent to the distribution of transcript words in the corresponding interval. That is, about fifty percent of the script words in the interval are time aligned across the first portion of the interval before the pause and about fifty percent of the script words in the interval are time aligned across the third portion of the interval after the pause. - In some embodiments, a plurality of script words may be accepted for use in the time-aligned script data based on a confidence (e.g., a high probability/density of word matches that were previously determined). Such a technique may enable blocks of text to be verified/imported from the script data to the time-aligned script data when matches within the blocks are indicative of a high probability that the corresponding script words are accurate. That is, the script data will be the text used in the time-aligned script data for those respective words of the script/dialogue. In some embodiments, a block of script words may be imported when word matches (e.g., hard alignment points and/or soft alignment points) meet a threshold level. For example, at least a portion of a block of words may be verified/imported for use in the aligned script when at least fifty percent of the words in the block are associated with a match (e.g., associated with hard and/or soft alignment points).
In some embodiments, verifying/importing blocks of text may include using some individual script words having a match (e.g., associated with hard and/or soft alignment points) with words of the transcript, while importing/using unmatched transcript words (e.g., transcript words that are not associated with soft and/or hard alignment points). In some embodiments, verifying/importing script words may include importing text characteristics, such as capitalization, punctuation, and the like. In the embodiment of
FIG. 7A , more than fifty percent of the words of script phrase 700 are identified as having a hard and/or soft match. In some embodiments, upon determining that the script text and transcript text have a high enough percentage of matches (e.g., exceeding a block match threshold), the script text may be used for the entire block of text in the aligned script document, including matched and unmatched words, for use in the time-aligned script data. For example, the block of corresponding script text "What is your answer to my question? I need to know your answer now!" may be used in the aligned script even though not all of the words have a match. The imported script words incorporate the capitalization and punctuation of the corresponding text of the script document. Timecode information may be associated with each of the script and transcript words using any of the techniques described herein to properly time align the unmatched words of the phrase (e.g., to provide timecodes for the words "my question? I need to"). As discussed in more detail below, where a high confidence for a block of transcript words is provided, the transcript words (including those not matched) may be used in the resulting time-aligned script. Accordingly, if the transcript words of the phrase "What is your answer to by position eye do know your answer now!" have a high confidence level but are not all matched, the phrase may be used in the resulting text of the time-aligned script data. Note that both the matched and unmatched words of the raw STT transcript have been imported. Such a technique may facilitate use of transcript words in place of script words where the actor ad-libs or otherwise does not recite the exact wording of the script. - In some embodiments, a user could choose for themselves whether to use the script word(s) or STT transcript word(s), based on an indication, such as confidence level. For example, even if the confidence level suggests that one is more accurate than the other, that may not be the case, and the user may be provided an opportunity to correct this by switching use of one or the other in the script data. Also, the user can manually edit in a correction, and this correction could be automatically stamped with a 100% confidence label. In some embodiments, the automated changes/imports may be marked such that a user can readily identify them, and modify them as needed.
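- A minimal sketch of the block-import rule discussed above follows; the fifty-percent threshold mirrors the example, and the alignment tags and helper names are illustrative assumptions rather than a defined interface.

```python
# Illustrative sketch of the block-import rule described above: if enough words in a
# block carry hard/soft alignment points, use the script text (with its capitalization
# and punctuation) for the whole block; otherwise keep the raw transcript words.
# The 0.5 threshold and all names here are assumptions for illustration only.

def choose_block_source(alignment, block_match_threshold=0.5):
    """alignment: list of (script_word, transcript_word, tag) where tag is 'H', 'S' or 'U'."""
    matched = sum(1 for _, _, tag in alignment if tag in ("H", "S"))
    ratio = matched / len(alignment) if alignment else 0.0
    if ratio >= block_match_threshold:
        return "script", [sw for sw, _, _ in alignment if sw is not None]
    return "transcript", [tw for _, tw, _ in alignment if tw is not None]

alignment = [
    ("What", "what", "H"), ("is", "is", "H"), ("your", "your", "S"), ("answer", "answer", "H"),
    ("to", "two", "S"), ("my", "by", "U"), ("question?", "position", "U"),
    ("I", "eye", "U"), ("need", "do", "U"), ("to", None, "U"),
    ("know", "know", "H"), ("your", "your", "H"), ("answer", "answer", "H"), ("now!", "now", "H"),
]
source, words = choose_block_source(alignment)
print(source, " ".join(words))  # script text wins: more than half the words are matched
```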
- In some embodiments, confidence/probability information provided during STT operations may be employed to assess whether or not a word or block of words in a transcript meets threshold criteria, such that the transcript words may be used in the time-aligned script data in place of the corresponding script words. Such an embodiment may resolve discrepancies by using the transcript word in the aligned
script data 116 where there is a high confidence that the transcript word is accurate and the corresponding script word is not (e.g., where an actor ad-libs a line such that the actual words spoken are different from the words in the script). In one embodiment, an STT engine may provide a high confidence level (e.g., above 90%) for a given transcript word, and, thus, the transcript word is considered to meet the threshold criteria (e.g., 85% or above). That is, the word in the transcript may be more accurate than the corresponding script word. As a result, the transcript word is provided in the aligned script data, in place of a corresponding script word. In some embodiments, a confidence/probability provided by an STT operation may be used in combination with matching criteria. For example, where a low confidence level (e.g., below 50%) is provided for a script word as a result of matching/merging, and the STT engine provides a high confidence level (e.g., above 90%) for a corresponding transcript word, the transcript word may be provided in the aligned script data, in place of the corresponding script word. Conversely, where a high confidence level (e.g., above 90%) is provided for a script word as a result of matching/merging, and the STT engine provides a low confidence level (e.g., below 50%) for a corresponding transcript word, the script word may be provided in the aligned script data, in place of the corresponding transcript word. - In some embodiments, a portion of the script may be longer than a corresponding clip. As a result, the portion of the script that is actually spoken may be time aligned appropriately, and the unspoken portions of the script may be bunched together between aligned points. The bunching of words may result in timecode information being associated with the bunched words that indicates them being spoken at an extremely high rate, when in fact they may not have been spoken at all. In some embodiments, a threshold is applied such that words that appear to have been spoken too quickly (e.g., bunched words) may be ignored or deleted. For example, a threshold word rate may be set to a value that is indicative of the fastest reasonable rate for a person to speak (e.g., about six words per second). In some embodiments, the threshold word rate may be set to a default value, may be determined automatically, or may be user selected. A speaking rate may be customized based on the character speaking the dialogue. For example, one actor may speak slowly whereas another actor may speak much faster, and thus the slower speaking character's dialogue may be associated with a lower threshold rate, whereas the faster speaking character's dialogue may be associated with a higher threshold rate. Automatically determining a threshold word rate may include sampling other spoken portions of a script (e.g., other lines delivered by the same character) to determine a reasonable rate for words that are actually spoken, and the threshold rate may be set at that value or based on that value. For example, where one portion of a script includes an average word rate of five words per second, a maximum word rate threshold may be set to approximately twenty percent greater than that value (e.g., about six words per second). Such a cushion may account for natural variations in speaking rate that may occur while still identifying unlikely variations in speaking rate.
In some embodiments, words whose spacing does not fall within the maximum word rate threshold are ignored or deleted, such that they are not aligned. For example, a script may read:
-
- HENRY
- That's his name. Henry Jones, Junior.
-
- INDY
- I like Indiana more than the name Henry Jones, Junior.
-
- HENRY
- We named the dog Indiana.
- The corresponding video content (e.g., clip), however, may only include an actor reciting Henry's lines, one after the other. Thus, the lines delivered for Henry may be provided accurate timecode information associated with the time periods in which the two lines are spoken; however, the line associated with Indy, which is not spoken, may be bunched into the pause between delivery of Henry's first and second lines. For example, if Henry's lines were delivered one after the other, with a half-second pause in between, the phrase "I like Indiana more than the name Henry Jones, Junior" may not be matched (because it was not actually spoken) and, thus, may be interpolated (e.g., linearly) over the half-second time frame between the lines in the script. Corresponding timecode information may indicate that "I like Indiana more than the name Henry Jones, Junior" was spoken at a rate of one word about every five one-hundredths of a second, or about twenty words per second. Where the maximum word rate threshold is set to about six words per second, the determined rate of about twenty words per second would exceed the threshold. Thus, the phrase "I like Indiana more than the name Henry Jones, Junior" may be ignored/deleted, such that alignment may be provided for only the lines actually spoken (e.g., Henry's lines). The phrase "I like Indiana more than the name Henry Jones, Junior" may not be provided in the time-aligned
data 116. - In some embodiments, words that were bunched at the beginning or end of dialogue (e.g., the script text that was linearly interpolated and bunched before or after the dialogue was actually spoken) may be identified and removed. For example, the following lines at the beginning of the dialogue were linearly interpolated:
- 01:58:00:02 1:5938 1:5939 Scene ^EXT./01:58:00:02 ^ENTRANCE/01:58:00:02
- ^TO/01:58:00:02 ^MOUNTAIN/01:58:00:02
- ^TEMPLE/01:58:00:02 ^-/01:58:00:02
- ^Scene ^AFTERNOON/01:58:00:02
Bunching of the words is indicated by each of them having been assigned the same timecode, which may be a result of linearly interpolating over a very short period of time (e.g., prior to the start of the actual dialogue of "Indy" following the above lines at time 01:58:00:04). In some embodiments, the bunched words are deleted/ignored such that they are not included or indicated as being aligned in the resulting aligned script data. Thus, interpolated alignment of text that is located at the beginning or end of dialogue and that is bunched into a short duration may be deleted/ignored.
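- The maximum-word-rate filter described above might be sketched as follows; the six-words-per-second default mirrors the example, and the function name and data layout are assumptions for illustration.

```python
# Minimal sketch of the maximum-word-rate filter described above: words interpolated at an
# implausibly fast rate (bunched words) are dropped from the alignment. The default of six
# words per second mirrors the example above; all names are illustrative.

def drop_bunched_words(aligned_words, max_words_per_second=6.0):
    """aligned_words: list of (word, timecode_seconds), ordered by timecode.
    Returns only the words whose local speaking rate stays under the threshold."""
    min_gap = 1.0 / max_words_per_second
    kept = []
    prev_tc = None
    for word, tc in aligned_words:
        if prev_tc is not None and (tc - prev_tc) < min_gap:
            continue  # spoken "too fast" -> treat as bunched/unspoken and skip
        kept.append((word, tc))
        prev_tc = tc
    return kept

# Henry's two lines are half a second apart; Indy's unspoken line is squeezed between them.
words = [("Henry", 9.6), ("Jones,", 9.8), ("Junior.", 10.0),   # end of Henry's first line
         ("I", 10.05), ("like", 10.10), ("Indiana", 10.15),    # bunched (unspoken) words
         ("We", 10.5), ("named", 10.7), ("the", 10.9), ("dog", 11.1)]
print(drop_bunched_words(words))  # the bunched words are removed from the alignment
```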
- In some embodiments, ignoring/deleting words that appear to exceed a maximum threshold rate may also help to eliminate “stopwords” generated by an STT engine from being considered for alignment. For example, where an STT engine inserts a plurality of “the, the, the, . . . ” in place of music or sound effects, the high frequency of the words “the” may be identified and they may be ignored/deleted such that they are not aligned to words in the script. In some embodiments, the stopwords may be flagged (e.g., not recognized) so that a user can take further action if desired.
- In some instances, a clip may include audio content having extraneous spoken words that are not intended to be aligned with corresponding script words. For example, extraneous words and phrases may include an operator calling out "Speed!" shortly before starting the camera rolling while audio is already being recorded, the director calling out "Action!" shortly before the characters begin to speak lines of dialogue, the director calling out "Cut!" at the end of a take, or conversations inadvertently recorded shortly before, after, or even in the middle of a take. These cues typically occur at the beginning and end of shots, and, thus, processing may be able to recognize these words based on their location and/or their audio waveforms, which are recognized and provided in a corresponding STT transcript. If the entire recorded audio from the clip were to be analyzed, the extraneous/incidental words may present significant challenges during alignment. For example,
synchronization module 102 may align the extraneous words of the transcript to script words, resulting in numerous errors. User-defined words, such as "Speed", "Action" and "Cut", may be defined by a user and recognized by their audio waveforms and provided in a corresponding STT transcript. The user-defined words may be automatically flagged for the user or deleted. - In some embodiments, only a defined range of recorded dialogue is aligned to script text. Such a technique may be useful to ignore or eliminate extraneous recorded audio from the alignment analysis. For example, defining a range of recorded dialog may enable the analysis to ignore extraneous conversations or spoken words that are incidentally recorded just before or after a take for a given scene. In some embodiments, an in/out range defines the portion of the audio that is aligned to a corresponding portion of the script. Defining an in/out range may define discrete portions of the script (e.g., script words) and/or audio content (e.g., transcript words) to analyze while also defining discrete portions of the audio content data to ignore during the alignment of transcript words with corresponding script words, thereby preventing extraneous words (e.g., transcript words) from inadvertently being aligned with script words.
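- A minimal sketch of in/out-range filtering follows, using transcript words like those in the FIG. 7C example below; the range boundaries, timecodes, and helper names are illustrative assumptions.

```python
# Minimal sketch of in/out-range filtering as described above: only transcript words whose
# timecodes fall inside one of the user-specified in-ranges are kept for alignment.
# The range values and names are illustrative only.

def within_in_ranges(timecode, in_ranges):
    """in_ranges: list of (in_marker_seconds, out_marker_seconds) tuples."""
    return any(start <= timecode <= end for start, end in in_ranges)

def filter_transcript(transcript_words, in_ranges):
    """transcript_words: list of (word, timecode_seconds). Drops extraneous head/tail audio."""
    return [(w, tc) for w, tc in transcript_words if within_in_ranges(tc, in_ranges)]

transcript = [("are", 0.2), ("we", 0.4), ("ready", 0.6), ("speed", 1.0), ("action", 1.5),
              ("hello", 2.2), ("mike", 2.5), ("i", 3.0), ("am", 3.2), ("doing", 3.4),
              ("well", 3.6), ("also", 3.8),
              ("cut", 4.5), ("how", 4.8), ("did", 5.0), ("that", 5.2), ("look", 5.4)]
# a single in-range covering the dialogue of interest (multiple discrete ranges also work)
print(filter_transcript(transcript, in_ranges=[(2.0, 4.0)]))
```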
FIG. 7C is a depiction of a line of text and corresponding in/out ranges in accordance with one or more embodiments of the present technique. More specifically, FIG. 7C illustrates an exemplary in-range 710 and out-ranges 711. The in-range 710 and out-ranges 711 limit analysis to only audio content of in-range 710, referred to herein as audio content of interest 712, and exclude audio content not located within in-range 710 (e.g., content located in out-ranges 711). Audio content of interest 712 may include the dialogue or narration spoken during the respective clip that falls within one or more specified in/out-ranges. Extraneous audio content 714 may include words captured on the audio that are not intended to be aligned with a corresponding portion of the script document, and, thus, fall outside of the one or more specified in/out-ranges. In the illustrated embodiment, audio content of interest 712 includes the transcribed phrase "hello mike . . . I am doing well also" and extraneous audio content 714 includes the phrases/words "are we ready speed action" spoken at the head of the clip, just before audio content of interest 712, and "cut how did that look" spoken at the tail of the clip, just after audio content of interest 712. As depicted, in-range 710 is defined by an in-marker 710 a and an out-marker 710 b. In-marker 710 a defines a beginning of audio content of interest 712, and out-marker 710 b defines an end of audio content of interest 712. By specifying an in/out range, other portions of the dialog may be excluded from the analysis. For example, in the illustrated embodiment, extraneous content 714 at the head and tail of the clip is ignored during analysis, as indicated by the grayed-out bar in FIG. 7C . In the illustrated embodiment, only a single in-range 710 is depicted; however, embodiments may include multiple discrete ranges defined within a single clip. For example, two additional in/out markers may be added within in-range 710, thereby dividing it into two discrete in-ranges and providing an additional out-range embedded therein. In some embodiments, the use of in/out-ranges may be employed to resolve issues normally associated with multiple takes of a given scene or clip. For example, a user could select the desired portion of the take by selecting an in-range that includes the desired take and/or selecting an out-range that excludes the undesired takes. In some embodiments, an out-range may be located at any portion of the clip. For example, in a case opposite from that depicted, the in/out-ranges may be swapped, thereby ignoring extraneous audio data in the middle of the clip, while analyzing audio content of interest at the head and tail of the clip. - In some embodiments,
markers 710 a and 710 b may be positioned by a user. For example, a user may be presented with a graphical display similar to that of FIG. 7C and may use a slider-type control to move markers 710 a and 710 b to define the corresponding in/out-ranges. Thus, a user may manually exclude extraneous audio content 714 using in/out-ranges. In some embodiments, markers may be positioned automatically. - In some embodiments, portions of the audio content may include extraneous audio other than spoken words, such as music or sound effects. If analyzed, the extraneous audio may create an additional processing burden on the system. For example,
synchronization module 102 may dedicate processing in an attempt to match/align extraneous transcript words (e.g., stop words) to script words. In some embodiments, the extraneous audio content may be identified and ignored during alignment. Such a technique may enable processing to focus on dialogue portions of audio content, while skipping over segments of extraneous audio. In some embodiments, the audio content may be processed to classify segments of the audio content into one of a plurality of discrete audio content types. For example, segments of the audio content identified as including dialogue may be classified as dialogue type audio, segments of the audio content identified as including music may be classified as music type audio, and segments of the audio content identified as including sound effects may be classified as sound effect type audio. For example, segments of transcript words that include a series of different words occurring one after another (e.g., how are you doing) and/or that are not indicative of stop words may be classified as a dialogue type audio, segments of transcript words that include a series of stop words of a long duration (e.g., the the the the . . . ) may be classified as a music type audio, and segments of transcript words that include a series of stop words of a short duration (e.g., the the the) may be classified as a sound effect type audio. In some embodiments, segments of the audio content that cannot be identified as one of dialogue, music or sound effect type audio may be categorized as unclassified type audio. During subsequent processing, each of the segments may or may not be subject to alignment or related processing based on their classification. For example, during alignment of transcript words to script words, the segments associated with dialogue type audio may be processed, whereas the segments associated with music and sound effect type audio may be ignored. By ignoring music and sound effect type segments, processing resources may be focused on the dialogue segments, and, thus, are not wasted attempting to align the transcript words associated with the music and sound effect to script words. In some embodiments, unclassified type audio may be considered for alignment or may be ignored. In some embodiments, what classifications are processed and what classifications are ignored may include a default setting and/or may be user selectable. - In some embodiments, a weighting value is assigned to each word based on the alignment type (e.g., interpolation, hard alignment, or soft alignment). Stronger alignments (e.g., hard and soft alignments) may have higher weighting than weaker alignments (e.g., interpolation). In some embodiments, a total weighting is assessed for a window/interval that includes several consecutive words. The interval of several words is a sliding window that is moved to assess adjacent intervals/windows of words. When the total weighting (e.g., sum of weightings) of the words in a given interval/window meets a threshold value, it may be determined that the words are not merely bunched words, and timecodes may be assigned to one or more of the words, thereby, not ignoring/deleting the words in the window. 
Such a technique may be provided at the beginning and end of a set of dialogue to assess and determine the start and stop of the actual spoken dialogue and to ignore/delete the script dialogue that preceded/followed the spoken dialogue in the script but was not actually spoken (e.g., the script text that was linearly interpolated and was bunched before or after the dialogue that was actually spoken).
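- The sliding-window weighting described above might be sketched as follows; the per-type weights, window size, and threshold are illustrative assumptions only, not values specified by the present technique.

```python
# Illustrative sketch of the sliding-window weighting described above: each word gets a
# weight by alignment type, and a window of consecutive words is kept only if its total
# weight meets a threshold. The weights, window size, and threshold are assumptions.

ALIGNMENT_WEIGHTS = {"hard": 1.0, "soft": 0.6, "interpolated": 0.1}

def windows_meeting_threshold(word_alignments, window_size=5, threshold=3.0):
    """word_alignments: list of (word, alignment_type). Returns start indices of windows
    whose summed weight suggests genuinely spoken dialogue rather than bunched words."""
    kept = []
    for start in range(0, max(1, len(word_alignments) - window_size + 1)):
        window = word_alignments[start:start + window_size]
        total = sum(ALIGNMENT_WEIGHTS[a_type] for _, a_type in window)
        if total >= threshold:
            kept.append(start)
    return kept

words = [("EXT.", "interpolated"), ("MOUNTAIN", "interpolated"), ("TEMPLE", "interpolated"),
         ("Indy", "hard"), ("we", "hard"), ("named", "soft"), ("the", "hard"), ("dog", "hard")]
print(windows_meeting_threshold(words))  # [2, 3]: only windows dominated by matched words qualify
```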
- In some embodiments, processing may be implemented to time-align script elements other than dialogue (e.g., scene headings, action description words, etc.) directly to the video scene or full video content. For example, where a script element, other than dialogue (e.g., a scene heading) occurs between two words having timecodes associated therewith (e.g., dialogue words in the time-aligned script data) the timecodes of the words may be used to determine a timecode of the intervening script element. For example, where a last word of a scene includes a timecode of 21:00.00 and the first word of the next scene includes a timecode of 21:10.00, a script element occurring in the script between the two words may be assigned a timecode between 21:00.00 and 21:10.00, such as 21:05.00. In some embodiments, one or more script elements may have their timecodes determined via linear and/or non-linear interpolation, similar to that described above. For example, the amount of content (e.g., the number of lines or number of words) within script elements may be used to assess a timecode for a given script element or plurality of script elements. Where a first script element between two words having timecodes includes half the amount of content of a second script element also located between the two words, the first script element may be assigned a timecode of 21:03.00 and the second script element may be assigned a time code of 21:05.00, thereby reflecting the smaller content and potentially shorter duration of the first element relative to the second element. In some embodiments, some or all of the script elements may be provided in the time-aligned script data in association with a timecode. In some embodiments, timecodes are first assigned to the dialogue words during initial alignment, and timecodes are assigned to the other script elements in a subsequent alignment process based on the timecodes of the dialogue determined in the initial alignment (e.g., via interpolation). The resulting time aligned
data 116 may include timecodes for some or all of the script elements of script 104. - In the illustrated embodiment,
method 400 includes generating a time-aligned script output, as depicted at block 422, as discussed above. Generating time-aligned script output may be provided via time-coded script generator 212. In some embodiments, each word or element of the script and/or transcript may be associated with a corresponding timecode. For example, the complete list of script words and/or transcript words that are associated with hard, soft and interpolated timecodes may be used to generate time-aligned data 116, including a final TimeCodedScript (TCS) data file which contains some or all of the script elements with assigned timecodes. In some embodiments, the TCS data file may be provided to another application, such as the Adobe Script Align and Replace feature of Adobe Premiere Pro, for additional processing. In some embodiments, time-aligned data 116 may be stored in a database for use by other applications, such as the Script Align feature of Adobe Premiere Pro. - In some embodiments, a graphical user interface may provide a graphical display that indicates where matches (e.g., hard and/or soft alignment points) or non-matches occur within a user interface. The user interface may include symbols or color coding to enable a user to readily identify various characteristics of the alignment. For example, hard alignments may be provided in red (or green) to indicate a good/high confidence, soft alignments in blue (or yellow) to indicate a lower confidence, and interpolated points in yellow (or red) to indicate an even lower confidence level. The user interface may enable a user to quickly scan the results to assess and determine where inaccuracies are most likely to have occurred. Thus, a user may commit resources for review and proofing efforts on portions of a time-aligned script that may be susceptible to errors (e.g., where no or few matches occur) and may not commit such resources on portions of a time-aligned script that may not be susceptible to errors (e.g., where a large number of matches occur). For example, a user may be presented with a chart, such as that illustrated in
FIG. 5A . The chart may enable a user to readily identify portions of the script that do not include a high percentage of matches (e.g., the sub-matrix 508 located at the uppermost left portion of the chart). In some embodiments, high confidence areas may include a similar visual indicator (e.g., grayed out) and portions that may require attention may have appropriate visual indicators (e.g., bright colors—not grayed out). - In some embodiments, a user may be provided the option to select whether or not to use the text from the raw STT analysis or the text from the written script. For example, a user may be provided a selection in association with the sub-matrix 508 located at the uppermost left portion of the chart that enables all, some, or individual words contained in the sub-matrix to use the text from the raw STT analysis or the text from the written script.
- In some embodiments, upon receiving a user input, the information may be returned to
synchronization module 102 and processed in accordance with the user input. For example, where a user opts to use STT text in place of script text, synchronization module 102 may conduct additional processing to provide the corresponding time-aligned script data. In some embodiments, the user may be prompted for input while synchronization module 102 is performing the time alignment. For example, as the synchronization module 102 encounters a decision point, it may prompt the user for input. - Some embodiments may include additional features that help to improve the performance of
system 100. For example, in some embodiments, speech-to-text analysis (e.g., audio extractor 112 and/or the method of block 304) may provide the option of creating a custom dictionary (e.g., custom language model). In some embodiments, a custom dictionary may be generated for a given clip based on one or more reference scripts that have content that is the same as or similar to the given script, or based on a single reference script that at least partially corresponds to the video content or exactly matches the audio portions of the video content. In some embodiments, such as where the reference script exactly matches the audio content, some or all words of the reference script may be used to define a custom dictionary, a raw speech analysis may be performed to generate a transcript using words of the custom dictionary to transcribe words of the audio content, transcript words may then be matched against the script words of the reference script to find alignment points, and the words of the reference script text may be paired with the corresponding timecodes, thereby providing a time-aligned/coded version of the reference script. - In some embodiments, a custom language model is generated for one or more portions of video content. For example, where a movie or scene includes a plurality of clips, a custom language model may be provided for each clip to improve speech recognition accuracy. In some embodiments, a custom language model is provided to an STT engine such that the STT engine may be provided with terms that are likely to be used in the clip that is being analyzed by the STT engine. For example, during STT transcription, the STT engine may at least partially rely on terms or speech patterns defined in the custom language model. In some embodiments, a custom language model may be directed toward a certain sub-set of language. For example, the custom language model may specify a language (e.g., English, German, Spanish, French, etc.). In some embodiments, the custom language model may specify a certain language segment. For example, the custom language model may be directed to a certain profession or industry (e.g., a custom language model including common medical terms and phrases may be used for clips from a medical television series). In some embodiments, the STT engine may weight words/phrases found in the associated custom language model over the standard language model. For example, if the STT engine associates a word with a word that is present in the associated custom language model and a word that is present in a standard/default language model, the STT engine may select the word associated with the custom language model as opposed to the word present in the standard/default language model. In some embodiments, a word identified in a transcript that is found in the selected custom language model may be assigned a higher confidence level than a similar word that is only found in the standard/default language model.
- In some embodiments, a custom language model is generated from script text. For example,
script data 110 may include embedded script text (e.g., words and phrases) that can be extracted and used to define a custom language model. Embedded metadata may be provided using various techniques, such as those described in U.S. patent application Ser. No. 12/168,522 entitled "SYSTEMS AND METHODS FOR ASSOCIATING METADATA WITH MEDIA USING METADATA PLACEHOLDERS", filed Jul. 7, 2008, which is hereby incorporated by reference as though fully set forth herein. A custom language model may include a word frequency table (e.g., how often each of the words in the custom language model is used within a given portion of the script) and a word tri-graph (e.g., indicative of other words that precede and follow a given word in a given portion of the script). In some embodiments, all or some of the text identified in the script may be used to populate the custom language model. Such a technique may be particularly accurate because the script and resulting language model should include all or at least a majority of the words that are expected to be spoken in the clip. In some embodiments, speech-to-text (STT) technology may implement a custom language model as described in U.S. patent application Ser. No. 12/332,297 entitled "ACCESSING MEDIA DATA USING METADATA REPOSITORY", filed Nov. 13, 2009, which is hereby incorporated by reference as though fully set forth herein. - In some embodiments, metadata included in the script may be used to further improve accuracy of the STT analysis. For example, where the script includes a clip identifier, such as a scene number, the scene number may be associated with the clip such that a particular custom language model is used for STT analysis of video content that corresponds to the associated portion of the script. For example, where a first portion of the script is associated with scene one and a second portion of the script is associated with scene two, a first custom language model may be extracted from the first portion of the script, and a second custom language model may be extracted from the second portion of the script. Then, during STT analysis of the first scene, the STT engine may automatically use the first custom language model, and during STT analysis of the second scene, the STT engine may automatically use the second custom language model.
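- A minimal sketch of building the word frequency table and word tri-graph described above from a scene's dialogue text follows; the data layout is an assumption for illustration, not a defined custom-language-model file format.

```python
# Minimal sketch (an assumption, not the patent's actual data format): build a word
# frequency table and a word "tri-graph" (a word with its preceding and following words)
# from the dialogue text of one scene, for use as a per-clip custom language model.
from collections import Counter

def build_custom_language_model(scene_dialogue_text):
    words = scene_dialogue_text.lower().split()
    frequency_table = Counter(words)
    trigraphs = Counter()
    for i in range(1, len(words) - 1):
        trigraphs[(words[i - 1], words[i], words[i + 1])] += 1  # predecessor, word, successor
    return {"frequencies": frequency_table, "trigraphs": trigraphs}

model = build_custom_language_model(
    "What is your answer to my question I need to know your answer now")
print(model["frequencies"]["answer"])                # 2
print(model["trigraphs"][("your", "answer", "to")])  # 1
```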
- In some embodiments, when a clip contains only a few lines of dialogue in a short scene out of a very long script, knowing that the clip contains a specific scene number (e.g., harvested from the script metadata) allows focusing on the text in the script for that scene, and not having to assess the entire script.
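- A minimal sketch of scoping the analysis to a clip's scene, as described above, follows; the metadata key names and scene text are illustrative assumptions.

```python
# Illustrative sketch: when clip metadata carries a scene number, restrict the custom
# language model and the alignment search to that scene's script text instead of the
# full script. The metadata key names here are assumptions for illustration.

def script_text_for_clip(script_scenes, clip_metadata):
    """script_scenes: dict mapping scene number -> scene script text."""
    scene_number = clip_metadata.get("scene_number")
    if scene_number in script_scenes:
        return script_scenes[scene_number]       # focus on the single scene
    return " ".join(script_scenes.values())      # fall back to the whole script

scenes = {1: "It is good to see you again.", 2: "I will talk to you about it later tonight."}
print(script_text_for_clip(scenes, {"scene_number": 2}))
```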
-
FIG. 6 depicts a sequence of dialogs 600 in accordance with one or more embodiments of the present technique. In some embodiments, a user may select a clip or group of clips, then choose "Analyze Content" from a Clip menu, initiating the sequence of dialogs 600. The Analyze Content dialog may allow a user to use embedded Adobe Story script text, if present, for the speech analysis, or to add a reference script which will be used to improve speech analysis accuracy. The sequence of dialogs 600 includes content analysis dialogs that allow users to import a reference script to create a custom dictionary/language model for speech analysis. A reference script may include a text document containing dialogue text similar to the recorded content in the project (e.g., a series of nature documentary scripts, or a collection of scripts from a client's previous training videos). In the Analyze Content dialog 602, a user may choose Add from the Reference Script menu. In the File Open dialog 604, a user may navigate to the reference script text file, select it and click OK. The Add Reference Script dialog 606 may open, where a user can name the reference script, choose a language, and view the text of the file below in a scrolling window. The "Script Text Matches Recorded Dialogue" option may be selected if the imported script exactly matches the recorded dialogue in the clips (e.g., a script the actors read their lines from). When a reference script is used that doesn't exactly match the recorded dialogue in the clips, the analysis engine automatically sets the weighting of the reference script vs. the base language model based on length, frequency of key words, etc. A user may click the OK button, the Import Script dialog closes, and the analysis of the reference script may begin. When analysis is complete, the reference script is selected in the Analyze Content dialog's Reference Script menu. When a user clicks the OK button, the selected clip's speech content is analyzed. - Higher accuracy may be possible when the reference script matches the recorded dialogue exactly (e.g., the script that was written for the project or transcriptions of interview sound bites). In this scenario, a user may select the "Script Text Matches Recorded Dialogue" option in the Add
Reference Script dialog 606, as discussed above. This may override the automatic weighting against the base language model and give the selected reference script a much higher weighting. Significantly higher accuracy can be achieved using matching reference scripts, although accuracy may be primarily dependent on the clarity of the spoken words and the quality of the recorded dialogue. - High accuracy (e.g., up to 100%) may be achievable when additional associated software packages in the production workflow are used in conjunction with one another. For example, an Adobe Story to Adobe OnLocation workflow may be used to embed the dialogue from each scene into a clip's metadata. In such a workflow, a script written in Adobe Story may be imported into OnLocation, which may produce a list of shot placeholders for each scene. These placeholders may be recorded direct to disk using OnLocation during production or merged with clips that are imported into OnLocation after they were recorded on another device. In both cases, the text for each scene from the original script may be embedded in the metadata of all the clips that were shot for that scene. Embedded metadata may be provided using various techniques, such as those described in U.S. patent application Ser. No. 12/168,522 entitled "SYSTEMS AND METHODS FOR ASSOCIATING METADATA WITH MEDIA USING METADATA PLACEHOLDERS", filed Jul. 7, 2008, which is hereby incorporated by reference as though fully set forth herein. When the clips are imported into Adobe Premiere Pro, the script text embedded in each of the clips may be automatically used as a reference script and, then, aligned with the recorded speech during the analysis. When the number of hard alignment points reaches a minimum accuracy threshold, the analyzed speech text is replaced with the script text embedded in the source clip's extensible metadata platform (XMP) metadata. This may result in speech analysis text that is at or near 100% accurate relative to the original script. Correct spelling, proper names and punctuation may also be carried over from the script. Accuracy in this workflow may be dictated by the closeness of the match between the reference script text and the recorded dialogue.
- With regard to
FIG. 6 , in some embodiments, when the "Use Embedded Adobe Story Script Option" of Analyze Content dialog 602 is selected, Adobe Story script text embedded in a clip's XMP metadata will be used for analysis, and the Reference Script popup menu may be disabled. If the selected clip contains embedded Adobe Story script text, the "Use Embedded Adobe Story Script Option" may be checked by default. For mixed states in the selection (e.g., where at least one clip has Adobe Story script text embedded, and at least one clip does not), the dialog will open with the "Use Embedded Adobe Story Script Option" checkbox indicating a mixed state and the Reference Script popup menu may be enabled. If the analysis is run in this mixed state, the clip with the Adobe Story script embedded will be analyzed using the Adobe Story script and the clip without the Adobe Story script embedded will be analyzed using the reference script. Selecting the mixed state may generate a check in the "Use Embedded Adobe Story Script Option" checkbox and disable the "Reference Script" menu. If the analysis is run in this state, the result may be the same as above. Selecting the checkbox again may remove the check mark at the "Use Embedded Adobe Story Script Option" checkbox and may re-enable the "Reference Script" menu. If the analysis is run in this state, all clips may use the assigned reference script, and ignore any embedded Story script text that may be in one or more of the selected clips. - In some embodiments, an STT engine may require that a custom language model include a minimum number of words (e.g., a minimum word count). That is, an STT engine may return an error and/or ignore a custom language model if the model does not include a minimum number of words. For example, if a portion of a script includes only ten words, a corresponding custom language model may include only those ten words. If the STT engine requires a minimum of twenty-five words, the STT engine may not be able to use the custom language model having only ten words. In some embodiments, the words in the custom language model may be duplicated to meet the minimum word count. For example, the ten words may be repeated two additional times in an associated document or file that defines the custom language model to generate a total of thirty words, thereby enabling the resulting custom language model to meet the minimum word requirement of twenty-five words. It is noted that if all of the words are replicated the same number of times, the word frequency table (e.g., how often each of the words in the custom language model is used) and the word tri-graph (e.g., indicative of other words that precede and follow a given word) of the custom language model should remain accurate. That is, the relative frequencies and the words that precede or follow a given word remain the same.
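- The padding technique described above might be sketched as follows; the minimum of twenty-five words mirrors the example, and the helper name is an assumption for illustration.

```python
# Illustrative sketch of the padding trick described above: repeat the scene's words enough
# times to satisfy an STT engine's minimum word count while preserving relative word
# frequencies and word ordering (so the tri-graph statistics are essentially unchanged).
def pad_to_minimum_word_count(words, minimum_words=25):
    if not words:
        return []
    repeats = -(-minimum_words // len(words))  # ceiling division
    return words * max(1, repeats)

scene_words = "We named the dog Indiana".split()   # only five words
padded = pad_to_minimum_word_count(scene_words)    # repeated five times -> 25 words
print(len(padded), padded[:7])
```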
- In some embodiments, it may be desirable to automatically and systematically identify some or all entities (e.g., dialogue and events) of a script that are of interest to production personnel who work with the script. For example, it may be desirable to identify people, places, and thing/noun entities contained in the script. In the usage chain of video content, such as a movie, users (e.g., marketing personnel, advertisers, and legal personnel) may be interested in identifying and locating when specific people, places, or things occur in the final production video or film to enable, for example, identifying prominent entities that occur in a scene in order to perform contextual advertising (e.g., showing an advertisement for a certain type of car if the car appears in a crucial segment). Thus, the processed script, extracted entities, and time-aligned dialogue/entity metadata may enable third-party applications (e.g., contextual advertisers) to perform high relevancy ad placement.
- In some embodiments, a method for identifying and aligning some or all entities within a script includes receiving script data, processing the script data, receiving video content data (e.g., video and audio data), processing the video content data, synchronizing the script data with the video content data to generate time-aligned script data, and categorizing each regular or proper noun entity within the time-aligned script data. In some embodiments, receiving and processing the script data and receiving and processing the video content data are performed in series or in parallel prior to synchronizing the script data with the video content data, which is followed by categorizing each regular or proper noun entity within the time-aligned script data.
- Receiving script data may include processes similar to those described above with respect to
document extractor 108. For example, receiving script data may include accepting a Hollywood "Spec." Movie Script or dramatic screenplay script document (e.g., document 104), converting this script into a specific structured and tagged representation (e.g., document data 110) via systematically extracting and tagging all key script elements (e.g., Scene Headings, Action Descriptions, Dialogue Lines), and then storing these elements as objects in a specialized document object model (DOM) (e.g., a structured/tagged document) for subsequent processing. - Processing the script data may include extracting specific portions of the script. Extracted portions may include noun items. For example, for a given script DOM, processing the script data may include processing the objects (e.g., entire sentences tagged by script section) within the script DOM using an NLP engine that identifies, extracts, and tags the noun items for each sentence. The extracted and tagged noun elements are then recorded into a specialized metadata database.
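- A toy sketch of the noun-extraction step follows; the actual approach uses a trained POS tagger and named entity recognition (described below), so the tiny noun lexicon and record layout here are purely illustrative assumptions.

```python
# Toy sketch of the noun-extraction step (the actual system uses a trained POS tagger and
# a named entity recognizer, described below). The tiny lexicon is purely illustrative.
ILLUSTRATIVE_NOUNS = {"temple", "dog", "indiana", "henry", "entrance", "mountain"}

def extract_noun_items(tagged_sentences):
    """tagged_sentences: list of (script_element_type, sentence). Returns metadata records."""
    records = []
    for element_type, sentence in tagged_sentences:
        tokens = [t.strip(".,!?").lower() for t in sentence.split()]
        nouns = [t for t in tokens if t in ILLUSTRATIVE_NOUNS]
        if nouns:
            records.append({"element": element_type, "sentence": sentence, "nouns": nouns})
    return records

sentences = [("action", "Indy walks toward the mountain temple."),
             ("dialogue", "We named the dog Indiana.")]
for record in extract_noun_items(sentences):
    print(record)
```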
- Receiving video content data may include processes similar to those described above with respect to
audio extractor 112. For example, receiving video content data may include receiving a video or audio file (e.g., video content 112) that contains spoken dialogue that closely, but not necessarily exactly, corresponds to the dialogue sections of the input script (e.g., document 104). The audio track in the provided video or audio file is then processed using a Speech-to-Text engine (e.g., audio extractor 112) to generate a transcription of the spoken dialogue (e.g., transcript 114). The transcription may include extremely accurate timecode information but potentially higher error rates due to noise and language model artifacts. All spoken words, along with the timecode information of the transcript that indicates at exactly what point in time in the video or audio the words were spoken, are stored. - Synchronizing the script data with the video content data to generate time-aligned script data may include processes similar to those described above with respect to
synchronization module 102. For example, synchronizing the script data with the video content data to generate time-aligned script data may include analyzing and synchronizing the structured (but untimed) information in a tagged script document (e.g., document data 110) and the text resulting from the STT transcription stored in a metadata repository (e.g., transcript 114) to generate time-aligned script data (e.g., time-aligned script data 116). The time-aligned script data is provided to a Named Entity Recognition system to categorize each regular or proper noun entity contained within the time-aligned script data.
FIGS. 8A and 8B are block diagrams that illustrate components of and dataflow in a document time-alignment technique in accordance with one or more embodiments of the present technique. Note that the dashed lines indicate potential communication paths between various portions of the two block diagrams. System 800 may include features similar to those of previously described system 100. - In some embodiments, script data is provided to
system 800. Script document/data 802 may be similar to document 104. For example, movie script documents, closed caption data, and source transcripts are presented as inputs to the system 100. Movie scripts may be represented using a semi-structured Hollywood "Spec." or dramatic screenplay format which provides descriptions of all scene, action, and dialogue events within a movie. - In some embodiments,
script data 802 may be provided to a script converter 804. Script converter 804 may be similar to document extractor 108. For example, script elements may be systematically extracted and imported into a standard structured format (e.g., XML, ASTX, etc.). Script converter 804 may enable all script elements (e.g., Scenes, Shots, Action, Characters, Dialogue, Parentheticals, and Camera transitions) to be accessible as metadata to applications (e.g., Adobe Story, Adobe OnLocation, and Adobe Premiere Pro), enabling indexing, searching, and organization of video by textual content. Script converter 804 may enable scripts to be captured from a wide variety of sources including: professional screenwriters using word processing or script writing tools, fan-transcribed scripts of film and television content, and legacy script archives captured by OCR. Script converter 804 may employ various techniques for capturing, analyzing, and converting scripts, such as those described in U.S. patent application Ser. No. 12/713,008 entitled "METHOD AND APPARATUS FOR CAPTURING, ANALYZING, AND CONVERTING SCRIPTS", filed Feb. 25, 2010, which is hereby incorporated by reference as though fully set forth herein. - In some embodiments, converted script data 805 (e.g., an ASTX format movie script) from
script converter 804 may be provided to a script parser 806. In some embodiments, the parser may be implemented as a portion of document extractor 108. Spec. scripts captured and converted into a standard (e.g., Adobe) script format may be parsed by script parser 806 to identify and tag specific script elements such as scenes, actions, camera transitions, dialogue, and parentheticals. The ability to capture, analyze, and generate structured movie scripts may be used in certain time-alignment workflows (e.g., the Adobe Premiere Pro "Script Align" feature, where dialogue text within a movie script is automatically synchronized to the audio dialogue portion of video content). - In some embodiments, parsed script data is processed by a natural language processing (NLP) engine 808. In some embodiments, a
- In some embodiments, parsed script data is processed by a natural language processing (NLP) engine 808. In some embodiments, a filter 808 a analyzes dialogue and action text from the parsed script data. For example, the input text is normalized and then broken into individual sentences for further processing. Each sentence may form a basic information unit for lines of the script, such as lines of dialogue, or descriptive sentences that describe the setting of a scene or the action within a scene. - In some embodiments, grammatical units of each sentence are tagged by a part-of-speech (POS)
tagger 808 b. For example, a specialized POS tagger 808 b is then used to parse, identify, and tag the grammatical units of each sentence with a POS tag (e.g., noun, verb, article, etc.). POS tagger 808 b may use a transformational grammar rules technique to first induce and learn a set of lexical and contextual grammar rules from an annotated and tagged reference corpus, and then apply the learned rules when performing the POS tagging step on submitted script sentences.
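As a rough illustration of POS tagging of a script sentence, the sketch below uses NLTK's pretrained perceptron tagger as a stand-in for the rule-based tagger described above; the sentence and the filtering step are illustrative assumptions.

```python
import nltk

# One-time downloads of tokenizer and tagger models
# (resource names can differ slightly between NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The director follows the character across the studio lot."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # e.g. [('The', 'DT'), ('director', 'NN'), ('follows', 'VBZ'), ...]

# Keep noun and verb tokens as candidates for downstream entity recognition.
nouns_and_verbs = [(word, tag) for word, tag in tagged if tag.startswith(("NN", "VB"))]
print(nouns_and_verbs)
```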
- In some embodiments, tagged verb and noun phrases are submitted to a Named Entity Recognition (NER) system 808 c. NER system 808 c may then identify and classify entities and actions within each verb or noun phrase. NER system 808 c may employ one or more external world-knowledge ontologies (via APIs) to perform the final entity tagging and classification.
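For illustration, the sketch below runs a pretrained statistical NER model (spaCy) over a script sentence as a stand-in for the ontology-backed entity tagger described above; the model name, the example sentence, and the printed labels are assumptions for this example only.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Frankie walks onto the studio back lot in Los Angeles at dawn.")
for ent in doc.ents:
    # Each entity carries a text span and a coarse class label,
    # e.g. 'Los Angeles' may be tagged GPE (geopolitical entity).
    print(ent.text, ent.label_)
```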
- In some embodiments, some or all extracted entities from NER system 808 c are then represented using a script Entity-Relationship (E-R) data model 810 that includes Scripts, Movie Sets, Scenes, Actions, Transitions, Characters, Parentheticals, Dialogue, and/or Entities. The instantiated model 810 may be physically stored in a relational database 812. In some embodiments, the instantiated model 810 may be mapped into an RDF-Triplestore 814 (see FIG. 8B). In some embodiments, a specialized relational database schema may be provided for certain applications (e.g., for Adobe Story). For example, the schema may be used to record all script metadata and entities and the interrelationships between all entities. - In some embodiments, a relational database to
RDF mapping processor 816 may then be used to automatically process the relational database schema representation of the E-R model 810 and transfer all script entities in relational database table rows into the RDF-Triplestore 814. Mapping may include RDF mapping system and process techniques, such as those described in U.S. patent application Ser. No. 12/507,746 entitled "CONVERSION OF RELATIONAL DATABASES INTO TRIPLESTORES", filed Jul. 22, 2009, which is hereby incorporated by reference as though fully set forth herein.
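The essence of such a mapping is to turn each relational row into a set of subject-predicate-object triples. The sketch below, using the rdflib library, is a simplified illustration under assumed table columns and an assumed namespace; it is not the mapping processor described in the referenced application.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/script/")   # illustrative namespace
g = Graph()

# One relational row per scene, as it might come out of the E-R schema.
scene_rows = [
    {"scene_id": 1, "heading": "EXT. MAROON MOVIE STUDIO - DAY", "script_id": 42},
]

for row in scene_rows:
    scene = EX[f"scene/{row['scene_id']}"]
    g.add((scene, RDF.type, EX.Scene))                       # row -> typed entity
    g.add((scene, EX.heading, Literal(row["heading"])))      # column -> literal property
    g.add((scene, EX.partOf, EX[f"script/{row['script_id']}"]))  # foreign key -> relationship

for subject, predicate, obj in g:
    print(subject, predicate, obj)
```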
- In some embodiments, E-R model 810 may be saved to relational database 812. Relational database 812 may implement E-R model 810 through a set of specially defined tables and primary key/foreign key referential integrity constraints between tables. - In some embodiments, an RDF-
Triplestore 820 may be used to store the mapped relational database 812 content using the output of relational database to RDF mapping processor 816. RDF-Triplestore 820 may represent the relational information as a directed acyclic graph and may enable both sub-graph and inference chain queries needed by movie or script query applications that retrieve script metadata. Use of RDF-Triplestore 820 may allow video scene entities to be queried using an RDF query language, such as SPARQL, or a logic programming language, such as Prolog. Use of the RDF-Triplestore enables certain kinds of limited machine reasoning and inferences on the script entities (e.g., finding prop objects common to specific movie sets, classifying a scene entity using its IS_A generalization chain for a particular prop, or determining the usage and ownership rights to specific cartoon characters within a movie). Script dialogue data may be stored within RDF-Triplestore 820.
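As a small illustration of the kind of SPARQL query mentioned above (finding prop objects associated with a movie set), the sketch below builds a tiny in-memory graph with rdflib and queries it; the entity names and predicates (ex:Prop, ex:appearsOn, ex:StudioBackLot) are assumptions for this example, not part of the described schema.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/script/")
g = Graph()

# A few illustrative triples of the kind the mapping step might produce.
g.add((EX.StuntCar, RDF.type, EX.Prop))
g.add((EX.StuntCar, EX.appearsOn, EX.StudioBackLot))
g.add((EX.StuntCar, EX.label, Literal("stunt car")))

# Find every prop entity that appears on a given movie set.
query = """
    SELECT ?prop WHERE {
        ?prop a ex:Prop ;
              ex:appearsOn ex:StudioBackLot .
    }
"""
for row in g.query(query, initNs={"ex": EX}):
    print(row.prop)
```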
- In some embodiments, an application server 822 may be used to process incoming job requests and then communicate RDF-Triplestore data back to one or more client applications 824, such as Adobe Story. Application server 822 may contain a workflow engine along with one or more optional web servers. Script analysis requests or queries for video and script metadata may be processed by server 822, and then dispatched to a workflow engine which invokes either the NLP analysis engine 808 or a multimodal video query engine 826. Application server 822 may include a Triad/Metasky web server. - In some embodiments,
client application 824 may be used to implement further processing. For example, Adobe Story is a product that a client may use to leverage outputs of the workflows described herein to allow script writers to edit and collaborate on movie scripts, and to extract, index, and tag script entities such as people, places, and objects mentioned in the dialogue and action sections of a script. Adobe Story may include a script editing service. - The steps described above relate to text processing. The steps described below relate to video and audio processing.
- In some embodiments, video/
audio content 830 is input and accepted by the workflow system 800. Video/audio content 830 may be similar to video content 106. Video/audio content 830 may provide video footage and corresponding dialogue sound tracks. The audio data may be analyzed and transcribed into text using an STT engine, such as those described herein. A resulting generated STT transcript (e.g., similar to transcript 114) may be aligned with converted textual movie scripts 805. In the event scripts are not available for metadata and time-alignment, the STT transcript may be processed by the natural language analysis and entity extraction components for keyword searching of the video. These components may use multimodal video search techniques, such as those described in U.S. patent application Ser. No. 12/618,353 entitled "ACCESSING MEDIA DATA USING METADATA REPOSITORY", filed Nov. 13, 2009, which is hereby incorporated by reference as though fully set forth herein. - In some embodiments, audio content is provided. For example, input audio dialogue tracks may be directly provided by television or movie studios, or extracted from the provided video files using standard known extraction methods. For use with certain applications (e.g., the Adobe STT CLM and STT multicore applications), the extracted audio may be converted to a mono channel format that uses 16-bit samples with a 16 kHz frequency response.
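One way to perform the extraction and conversion described above is to invoke ffmpeg; the sketch below is a minimal example assuming ffmpeg is installed, and the file names are placeholders.

```python
import subprocess

def extract_dialogue_audio(video_path: str, wav_path: str) -> None:
    """Extract the audio track and convert it to mono, 16-bit PCM, 16 kHz."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",                    # drop the video stream
         "-ac", "1",               # mix down to a single (mono) channel
         "-ar", "16000",           # 16 kHz sample rate
         "-acodec", "pcm_s16le",   # 16-bit signed PCM samples
         wav_path],
        check=True)

extract_dialogue_audio("episode_101.mov", "episode_101_dialogue.wav")
```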
- In some embodiments, operation of an
STT engine 832 is modified by use of a custom language model (CLM). For example, STT engine 832 may employ transcription based at least partially or completely on a provided CLM. The CLM may be provided/built using certain methods, such as those described herein. In some embodiments, STT engine 832 includes a multicore STT engine. The multicore STT engine may segment the source audio data and may provide STT transcriptions using parallel processing. In some embodiments, speech-to-text (STT) technology may implement a custom language model and/or an enhanced multicore STT transcription engine such as those described in U.S. patent application Ser. No. 12/332,297 entitled "ACCESSING MEDIA DATA USING METADATA REPOSITORY", filed Nov. 13, 2009, and/or U.S. patent application Ser. No. 12/332,309 entitled "MULTI-CORE PROCESSING FOR PARALLEL SPEECH-TO-TEXT PROCESSING", filed Dec. 10, 2008, which are both hereby incorporated by reference as though fully set forth herein. - In some embodiments, a metadata
time synchronization service 834 aligns elements of the transcript produced by STT engine 832 with corresponding portions of script data 802 to generate time-aligned script data. Metadata time synchronization service 834 may be similar to synchronization module 102. For example, in some embodiments, metadata time synchronization service 834 implements a specialized STT/Script alignment component to provide time alignment of non-timecoded words in the script with timecoded words in the STT transcript using a hybrid two-level alignment process, such as that described herein with regard to synchronization module 102. For example, in level one processing, smaller regions or partitions of text and STT transcription keywords are accurately identified and prepared for detailed alignment. In level two processing, known alignment methods based on Viterbi or dynamic programming techniques for edit distance can be used to align the words within each partition. However, in some embodiments, a modified Viterbi method and hybrid phonetic/text comparator may be implemented, as described below. As a result, each script word may be assigned an accurate video timecode. This facilitates keyword search and time-indexing of the video by client applications, such as the multimodal video search engine 826, or other applications.
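For illustration, the sketch below shows a plain edit-distance (dynamic programming) word alignment of the kind usable in level two processing: it matches non-timecoded script words against timecoded STT words within one partition and copies the STT timecode onto each matched script word. The data classes, scoring, and example values are assumptions for this sketch, not the modified Viterbi implementation described herein.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SttWord:
    text: str
    start: float   # seconds into the video
    end: float

def normalize(word: str) -> str:
    return "".join(c for c in word.lower() if c.isalnum())

def align_partition(script_words: List[str], stt_words: List[SttWord]) -> List[Optional[float]]:
    """Return a start timecode for each script word, or None if unmatched."""
    n, m = len(script_words), len(stt_words)
    # dp[i][j] = minimal edit cost aligning first i script words to first j STT words
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if normalize(script_words[i - 1]) == normalize(stt_words[j - 1].text) else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # match / substitution
                           dp[i - 1][j] + 1,         # script word left unmatched
                           dp[i][j - 1] + 1)         # extra STT word skipped
    # Trace back to recover which script words matched which STT words.
    times: List[Optional[float]] = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        sub = 0 if normalize(script_words[i - 1]) == normalize(stt_words[j - 1].text) else 1
        if dp[i][j] == dp[i - 1][j - 1] + sub:
            if sub == 0:
                times[i - 1] = stt_words[j - 1].start
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return times

script = "closely but not necessarily exactly".split()
stt = [SttWord("closely", 10.0, 10.4), SttWord("but", 10.4, 10.5),
       SttWord("not", 10.5, 10.7), SttWord("exactly", 10.9, 11.4)]
print(align_partition(script, stt))   # [10.0, 10.4, 10.5, None, 10.9]
```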
- In some embodiments, a modified Viterbi method and/or phonetic/text comparator is implemented by metadata time synchronization service 834. Further, the alignment process may also implement special override rules to resolve alignment option ties. As described herein, a decision as to whether or not an alignment is made may not rely only on precise text matches between the transcribed STT word and the script word, but rather, may rely on how closely the words sound to each other; this may be provided for using a specialized phonetic encoding of the STT words and script words. Such a technique may be applicable to supplement a wide variety of STT alignment applications. - In some embodiments, data is provided to the user via a graphical display that presents source script dialogue, the resulting time-aligned words, and/or video content in association with one another. For example, a GUI/visualization element of an application (e.g., the CS5 Premiere Pro Script Align feature) may enable a user to see source script dialogue words time-aligned with video action.
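One simple, well-known phonetic encoding that can serve in a "sounds alike" comparator of the kind described above is Soundex; the simplified implementation below is only an illustration of the idea, not the specialized phonetic encoding used by the alignment process.

```python
def soundex(word: str) -> str:
    """Simplified Soundex: first letter plus up to three digit codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return ""
    encoded = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":            # h and w do not reset the previous code
            prev = code
    return (encoded + "000")[:4]

def sounds_alike(stt_word: str, script_word: str) -> bool:
    """Used to accept alignments where the STT hypothesis is a near-homophone."""
    return soundex(stt_word) == soundex(script_word)

print(sounds_alike("Smyth", "Smith"))   # True
```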
- In some embodiments, a user may search a video based on the corresponding words in the time-aligned script data. For example, a multimodal video search engine may allow a user to search for specific segments of video based on provided query keywords. The search feature may implement various techniques, such as those described in U.S. patent application Ser. No. 12/618,353 entitled “ACCESSING MEDIA DATA USING METADATA REPOSITORY”, filed Nov. 13, 2009, which is hereby incorporated by reference as though fully set forth herein.
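As a small illustration of keyword search over time-aligned script data, the sketch below returns playback windows around each word that matches a query; the data layout and the padding value are assumptions for this example, not the multimodal search engine itself.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AlignedWord:
    text: str
    start: float   # seconds into the video
    end: float

def search_video(aligned: List[AlignedWord], query: str, pad: float = 2.0) -> List[Tuple[float, float]]:
    """Return (start, end) windows around each occurrence of the query keyword."""
    q = query.lower()
    hits = []
    for word in aligned:
        if q in word.text.lower():
            hits.append((max(0.0, word.start - pad), word.end + pad))
    return hits

aligned = [AlignedWord("ferry", 12.1, 12.4), AlignedWord("terminal", 12.4, 12.9)]
print(search_video(aligned, "terminal"))   # [(10.4, 14.9)]
```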
- In some embodiments, locations for the insertion of video descriptions can be located, video description content can be extracted from the script and automatically inserted into a time aligned script and/or audio track using time aligned script data (e.g., time aligned
script data 116 as described with respect to FIGS. 1 and 2) provided by system 100. Video descriptions may include an audio track in a movie or television program containing descriptions of the setting and action. Video description narrations fill in the story gaps by describing visual elements and provide a more complete description of what is happening in the program. This may be of particular value to the blind or visually impaired by helping to describe visual elements that they cannot view. The video description may be inserted into the natural pauses in dialogue or between critical sound elements, or the video and audio may be modified to enable insertion of video descriptions that may otherwise be too long for the natural pauses. - Video description content may be generated by extracting descriptive information and narrative content from a script written for the project, then syncing and editing it to the video program for playback. Video description content may be extracted directly from descriptive text embedded in the script. For example, location settings, actor movements, non-verbal events, etc. that may be provided in script elements (e.g., title, author name(s), scene headings, action elements, character names, parentheticals, transitions, shot elements, dialogue/narrations, and the like) may be extracted as the video description content, aligned to the correct portion of scenes (e.g., to pauses in dialogue) using time alignment data, and the video description content may be manually or automatically edited (if needed) to fit into the spaces available between dialogue segments.
- In some embodiments, the time aligned data acquired using
system 100 may be used to identify the location of pauses within the audio content for embedding narrative content (e.g., action elements). The locations of the pauses in the audio content may be provided to a user as locations for inserting video description content. Thus, a user may be able to quickly identify the location of pauses for adding video description content. In some embodiments, narrative content (e.g., action element descriptions embedded in the script) may be automatically inserted into corresponding pauses within the dialogue of the audio track to provide the corresponding video description content. The resulting video description content may be reviewable and editable by a user. A text version of the video description content can be used as a blueprint for recording by a human voiceover talent. Thus, a voicer may simply have to read the corresponding narration content as opposed to having to manually search through a program, manually identify breaks in the dialog, and derive/record narrations to describe the video. In some embodiments, the video description track can be created automatically using synthesized speech to read the video description content (e.g., without necessarily requiring any or at least a significant amount of human labor). - As noted above, a script may include a variety of script elements such as a scene heading, action, character, parenthetical, dialogue, transition, or other text that cannot be classified. Any or all of these and other script elements can be used to generate useful information for a video description track. A scene heading (also referred to as a “slugline”) includes a description of where the scene physically occurs. For example, a scene heading may indicate that the scene takes place indoors (e.g., INT.) or outdoors (e.g., EXT.), or possibly both indoors and outdoors (e.g., INT./EXT.) Typically, a location name follows the description of where the scene physically occurs. For example, “INT./EXT.” may be immediately followed by a more detailed description of where the scene occurs. (e.g., INT. KITCHEN, INT. LIVING ROOM, EXT. BASEBALL STADIUM, INT. AIRPLANE, etc.). The scene heading may also include the time of day (e.g., NIGHT, DAY, DAWN, EVENING, etc.). This information embedded in the script helps to “set the scene.” The scene type is typically designated as internal (INT.) or external (EXT.), and includes a period following the INT or EXT designation. A hyphen is typically used between other elements of the scene heading. For example, a complete scene heading may read, “INT. FERRY TERMINAL BAR—DAY” or “EXT. MAROON MOVIE STUDIO—DAY”.
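Because scene headings follow the regular layout just described (INT./EXT. designation, location, and an optional time of day separated by a hyphen), they can be pulled apart with a simple pattern. The sketch below is an illustrative heuristic only; the field names and the regular expression are assumptions for this example.

```python
import re

SCENE_HEADING = re.compile(
    r"^(?P<type>INT\./EXT\.|INT\.|EXT\.)\s*"
    r"(?P<location>[^-\u2013\u2014]+?)"        # location runs up to the hyphen/dash
    r"(?:\s*[-\u2013\u2014]\s*(?P<time>[A-Z ]+))?\s*$"
)

def parse_scene_heading(line: str):
    match = SCENE_HEADING.match(line.strip())
    if not match:
        return None
    return {
        "scene_type": match.group("type"),
        "location": match.group("location").strip(),
        "time_of_day": (match.group("time") or "").strip() or None,
    }

print(parse_scene_heading("INT. FERRY TERMINAL BAR - DAY"))
# {'scene_type': 'INT.', 'location': 'FERRY TERMINAL BAR', 'time_of_day': 'DAY'}
```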
- An action element (also referred to as a description element) typically describes the setting of the scene and introduces the characters in a scene. Action elements may also describe what will actually happen during the scene.
- A character name element may include an actual name (e.g., MS. SUTTER), description (e.g., BIG MAN) or occupation (e.g., BARTENDER) of a character. Sequence numbers are typically used to differentiate similar characters (e.g.,
COP # 1 and COP #2). A character name is almost always inserted prior to a character speaking (e.g., just before dialog element), to indicate that the character's dialogue follows. - A dialog element indicates what a character says when anyone on screen or off screen speaks. This may include conversation between characters, when a character speaks out loud to themselves, or when a character is off-screen and only their voice is heard (e.g., in a narration). Dialog elements may also include voice-overs or narration when the speaker is on screen but is not actively speaking on screen.
- A parenthetical typically includes a remark that indicates an attitude in dialog delivery, and/or specifies a verbal direction or action direction for the actor who is speaking the part of a character. Parentheticals are typically short, concise and descriptive statements located under the characters name.
- A transition typically includes a notation indicating an editing transition within the telling of a story. For example, “DISSOLVE TO:” means the action seems to blur and refocus into another scene, as generally used to denote a passage of time. Transitions almost always follow an action element and precede a scene heading. Common transitions include: “DISSOLVE TO:”, “CUT TO:”, “SMASH CUT:”, “QUICK CUT:”, “FADE IN:”, “FADE OUT:”, and “FADE TO:”.
- A shot element typically indicates what the camera sees. For example, a shot element that recites “TRACKING SHOT” generally indicates the camera should follow a character as he walks in a scene. “WIDE SHOT” generally indicates that every character appears in the scene. A SHOT tells the reader the focal point within a scene has changed. Example of shot elements include: “ANGLE ON . . . ”, “PAN TO . . . ”, “EXTREME CLOSE UP . . . ”, “FRANKIE'S POV . . . ”, and “REVERSE ANGLE . . . ”.
- In some embodiments, script elements may be identified and extracted as described in U.S. patent application Ser. No. 12/713,008 entitled “METHOD AND APPARATUS FOR CAPTURING, ANALYZING, AND CONVERTING SCRIPTS”, filed Feb. 25, 2010, which is hereby incorporated by reference as though fully set forth herein. Moreover, the script elements may be time aligned to provide time-aligned
data 116 as described herein. The time aligned data may include dialogue as well as other script elements having corresponding timecodes that identify when each of the respective words/elements occur within the video/audio corresponding to the script. -
FIG. 9A illustrates anexemplary script document 900 in accordance with one or more embodiments of the present technique.Script document 900 depicts an exemplary layout of the above described script elements. For example,script document 900 includes atransition element 902, ascene heading element 904,action elements character name elements 908,dialog elements 910,parenthetical elements 912, and shotelement 914. - Script writers and describers often have closely aligned goals to describe onscreen actions succinctly, vividly and imaginatively. Often the action element text may be the most useful for creating video description content, as action elements typically provide the descriptions that clearly describe what has happened, is happening, or about to happen in a scene. Typically, long text passages in a script describing major changes in the setting or complex action sequences translate to longer spaces between dialogue in the recorded program (often filled with music and sound effects) and provide opportunities for including longer segments of video description content. For example, in the
script 900 ofFIG. 9A , the action described under the scene heading 904 andaction element 906 a is a wide establishing shot that follows the character out onto a busy studio lot. Since it describes a change of scene and establishes the new setting, there is a lot of descriptive text. The director filmed this shot on a crane, which swooped down from a high angle and followed the character through his action in this shot. Since there is a lot of information for the audience to take in during this lengthy transition shot, it begins without dialogue and continues for nearly half a minute. This gap in the dialogue provides a gap in which some or all of the descriptive action element text can be inserted. - Although some elements may be more useful than others, some or all of the script elements may be used to generate video description content. In some embodiments, a user may have control over which script elements to use in creating a video description. For example, a user may select to use only action elements and shot elements and to ignore other elements of the script. In some embodiments, the selection may be done before or after the video description is generated. For example, a user may allow the system to generate a video description using all or some of the script elements, and may subsequently pick-and-choose which elements to keep after the initial video description is generated.
-
FIG. 9B illustrates an exemplary portion of avideo description script 920 that corresponds to the portion ofscript 900 ofFIG. 9A .Video description script 920 includes avideo description track 922 broken into discrete segments (1-9) provided relative to gaps and dialogue of an audio track (e.g., main audio program recorded dialogue) 924 that corresponds to spoken words of dialogue content ofscript 920. In the illustrated embodiment, the content ofvideo description track 922 corresponds to action element text ofaction elements script 900 ofFIG. 9A . Each corresponding pause/gap in dialogue ofaudio track 922 is identified with a time of duration (e.g., “00:00:28:00 Gap” indicating a gap of twenty-eight seconds prior to the beginning of the script dialogue of segment 2). The corresponding content ofvideo description 922 is provided adjacent the gap/pause, and is identified with a time of duration for the video description content (e.g., “00:00:27:00” indicating twenty-seven seconds for the video description content to be spoken) where applicable. In some embodiments, the content ofvideo description 922 may be modified to fit within the corresponding gap. For example, in the illustrated embodiment, a portion of the first segment of video description content is removed to enable the resulting video description content to fit within the duration of the gap when spoken. In some embodiments, the entire video description content may be deleted or ignored where there is not a gap of sufficient length for the video description content. For example, the video description content ofsegment 3 was deleted/ignored as the corresponding pause in dialogue was only about twelve frames (or ½ a second) in duration—too short for the insertion of the corresponding video description content.Video description script 920 andvideo description content 922 can be used as a blueprint for recording by a human voiceover talent. Thus, a voicer may simply have to read the corresponding narration content as opposed to having to manually search through a program, manually identify breaks in the dialog, and derive/record narrations to describe the video. In some embodiments, the video description track can be created automatically using synthesized speech to read the video description content 922 (e.g., without necessarily requiring any or at least a significant amount of human labor). -
FIG. 9C is a flowchart that illustrates amethod 950 of generating a video description in accordance with one or more embodiments of the present technique.Method 950 may provide video description techniques using components and dataflow implemented atsystem 100.Method 950 generally includes identifying script elements, time aligning the script, identifying gaps/pauses in dialogue, aligning video description content to the gaps/pauses, generating a script with video description content, and generating a video description. -
Method 950 may include identifying script elements, as depicted atblock 952. Identifying script elements may include identifying some or all of the script elements contained within a script from which a video description is to be generated. For example, a script may be analyzed to provide script metadata that identifies a variety of script elements, such as scene headings, actions, characters, parentheticals, dialogue, transitions, or other text that cannot be classified. In some embodiments, script elements may be identified and extracted as described in U.S. patent application Ser. No. 12/713,008 entitled “METHOD AND APPARATUS FOR CAPTURING, ANALYZING, AND CONVERTING SCRIPTS”, filed Feb. 25, 2010, which is hereby incorporated by reference as though fully set forth herein. In some embodiments, the identification of the elements may not actually be performed but may simply be provided or retrieved for analysis. -
Method 950 may also include time aligning the script, as depicted atblock 954. Time aligning the script may include using techniques, such as those described herein with regard tosystem 100, to provide a timecode for some or all elements of the corresponding script. For example, a script may be processed to provide a timecode for some or all of the words within the script, including dialogue or other script elements. In some embodiments, the timecode information may provide stop and start time for various elements, including dialogue, which enables the identification of pauses between spoken words of dialogue. In some embodiments, the time alignment may not actually be performed but may simply be provided. For example, a system generating a video description may be provided or retrieve time alignedscript data 116. -
Method 950 may also include identifying gaps/pauses in dialogue, as depicted at block 956. In some embodiments, identifying gaps/pauses in dialogue may include assessing timecode information for each word of spoken dialogue to identify the beginning and end of spoken lines of dialogue, as well as any pauses in the spoken lines of dialogue that may provide gaps for the insertion of video description content. For example, invideo description script 920 ofFIG. 9B , a pause of twenty-eight seconds was identified atsegment 1, prior to the start of recorded dialogue ofsegment 2, a pause of 0.12 seconds was identified atsegment 3, and a pause of 4.06 seconds was identified atsegment 7. In some embodiments, a gap threshold may be used to identify what pauses are of sufficient length to constitute a gap that may be of sufficient length to be used for inserting video description content. For example, a gap threshold of three seconds may be set, thereby ignoring all pauses of less than three seconds and identifying only pauses equal to or greater than three-seconds as gaps of sufficient length to be used for inserting video description content. Such a technique may be useful to ignore normal pauses in speech (e.g., between spoken words) or short breaks between characters lines of dialogue that may be so short that it would be difficult to provide any substantive video description within the pause. In some embodiments, the gap threshold value may be user selectable. As depicted inFIG. 9B , the user may be provided with an indication that a gap is too short where there is a corresponding script element. For example,segment 3 of recordeddialogue 924 includes an inserted statement of “No gap available”, and the corresponding action text was deleted/ignored (as indicated by the strikethrough). Moreover, where there is no video description content (e.g., script elements) corresponding to a gap, the gap may be detected, but may be ignored. In some embodiments, the user may be alerted to the gap, thereby enabling them to readily identify gaps that could be used for the insertion of additional video description content. In some embodiments, video descriptions may be inserted into any available gaps, even out of sequence with their corresponding location in the script, according to rules or preferences provided by the user. For example, insegment 3, there was no available gap for the video description that would normally be inserted at that point according to the script. However, if there were another available gap within a prescribed number of seconds before or after that segment (e.g., segment 3), the video description could be inserted at that other location nearby within the prescribed number of seconds before or after that segment (e.g., segment 3). -
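For illustration, the sketch below detects gaps between time-aligned dialogue words using the kind of gap threshold discussed above (e.g., three seconds), and shows one crude way to shorten video description content sentence by sentence until its estimated spoken duration fits a gap. The data classes, the assumed narration speaking rate, and the sentence-splitting heuristic are assumptions for this example only.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DialogueWord:
    text: str
    start: float   # seconds
    end: float

@dataclass
class Gap:
    start: float
    end: float
    @property
    def duration(self) -> float:
        return self.end - self.start

def find_gaps(words: List[DialogueWord], threshold: float = 3.0) -> List[Gap]:
    """Pauses between consecutive dialogue words that are at least `threshold`
    seconds long and therefore candidates for video description content."""
    gaps = []
    for prev, nxt in zip(words, words[1:]):
        if nxt.start - prev.end >= threshold:
            gaps.append(Gap(prev.end, nxt.start))
    return gaps

WORDS_PER_SECOND = 2.5   # assumed narration speaking rate

def fit_description(description: str, gap: Gap) -> Optional[str]:
    """Drop trailing sentences until the estimated spoken duration fits the gap;
    return None if nothing fits (the description would then be skipped or moved)."""
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    while sentences:
        text = ". ".join(sentences) + "."
        if len(text.split()) / WORDS_PER_SECOND <= gap.duration:
            return text
        sentences.pop()
    return None
```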
Method 950 may also include aligning video description content to gaps/pauses, as depicted at block 958. Aligning the video description content may include aligning the script elements with dialogue relative to where they occur within the script. InFIG. 9B , each of theaction elements FIG. 9B the script action elements have been aligned to the recorded dialog and the action element text from the script has been aligned with the available gaps when possible. Two gaps were identified atsegments - In some embodiments, a user may have control over the resulting video description. For example, a user may modify a video description at their choosing, or may be provided an opportunity to select how to truncate a video description that does not fit within a gap. For example, in the illustrated embodiment of
FIG. 9B , a user may select to remove the text of segment 1 (as indicated by the strikethrough) in an effort to make the video description fit within the corresponding gap. In some embodiments, video description content may be automatically modified to fit within a given gap. If a gap is too short to fit the corresponding video description content, the video description content may be automatically truncated using rules of grammar. For example, the last word(s) or entire last sentence(s) may be incrementally truncated/removed until the remaining video content description is short enough to fit within the gap. In the illustrated embodiment ofFIG. 9B , the last sentence “Maroon is leading an entourage of ASSISTANTS trying to keep up” may have been automatically removed, relieving the user of the need to manually modify the content. Of course, even in the event of automatic modification of the video description content, the user may have the opportunity to approve or modify the changes. In some embodiments, as the video description content is edited, the duration may be updated dynamically to indicate to the user whether the revised description will fit within an available gap. - In some embodiments, a gap in the recorded program may be created or the duration of a gap may be modified to provide for the insertion of video description content. For example, at
segment 3, the gap in the recorded audio may be increased (e.g., by inserting an additional amount of pause in the audio track between the end ofsegment 2 and the beginning of segment 4) to five seconds to enable the action element text to be fit within the resulting gap. Such a technique may be automatically applied at some or all instances where a gap is too short in duration to fit the corresponding video description content. Although such modifications of the dialogue may introduce delays or pauses within the corresponding video and, thus, may modify the video and dialogue of a traditional program, it may be particularly helpful in the context of audio-only programs. For example, for books-on-tape or similar audio tracks produced for the blind or visually impaired. - In some embodiments, video description content may be allowed to overlap certain portions of the audio track. For example, a user may have the option of modifying the video description content to overlap seemingly less important portions of the dialogue, music, sound effects, or the like. In some embodiments, the main audio recorded dialogue, music, sound effects, or the like may be dipped (e.g., reduced) in volume so that the video description may be heard more clearly. For example, the volume of music may be lowered while the video description content is being recited.
-
Method 950 may also include generating a script with video description content, as depicted atblock 960. Generating a script with video content may include generating a script document that includes video description content; script/recorded dialogue, and/or other script elements aligned with respect to one another.FIG. 9B illustrates an exemplaryvideo description script 920 that includesvideo description content 922 and recordeddialogue 924. In the illustrated embodiment, the modifications to the video description content are displayed. In some embodiments, a “clean” version of the video description script may be provided. For example, clean video description script may incorporate some or all of the modifications that are not visible. A text version of the video description content can be used as a blueprint for recording by a human voiceover talent. Thus, a voicer may simply have to read the corresponding narration content as opposed to having to manually search through a program, manually identify breaks in the dialog, compose appropriate video descriptions of correct lengths, and/or derive/record narrations to describe the program. -
Method 950 may also include generating a video description, as depicted atblock 962. Generating the video description may include recording a reading of the video description content. For example, a reading by a voicer and/or a synthesized reading of the video description content may be recorded to generate a video description track. In some embodiments, the video description track may be merged with the original audio of the program to generate a program containing both the original audio and the video description audio. - A script may go through many revisions between the time the production team begins working on the project and the time the final edited program is completed. Scenes may be added or deleted, dialogue may be re-written or ad-libbed during recording, and shots may be reordered during the editing process. In certain scenarios, the script may not be updated until the final cut of the program has been approved and someone spends the time to manually revise the script such that it matches the actual edited program. Another scenario may include creating different versions or cuts of an edited program, with each of the versions including a unique set of variations from the original script. Thus, there may be multiple versions of the script, with each version being accurate to a specific matching “cut” of the edited program. For example, a version may be created with one set of dialogue/video content that is appropriate for viewing by a restricted audience (e.g., adults only) and a different version with a different set of dialogue/video content that is appropriate for a broader audience (e.g., children). In some embodiments, changes to a script and information relating to certain portions of the script may be recorded using script metadata. The script metadata may be updated to reflect changes that occur during the production process. Thus, the script metadata may be an accurate representation of the audio/video of a program, and may be used to generate an accurate final script. An accurate final script may require less or no time for review and may be useful in subsequent processing, such as time aligning the script or other processes as described herein. For example, the final/revised script may be used in place of the original script as a source of script data (e.g., document (script) data 110) that is used for time aligning with a transcript (e.g., transcript 114) of corresponding video content (e.g., video content 106).
-
FIG. 10A is a block diagram that illustrates components and dataflow of a system for processing a script workflow (workflow) 970 in accordance with one or more embodiments of the present technique.Workflow 970 includes ascript 972 describing a plurality of scenes (e.g., scenes 1-6).Script 972 may include, for example, a written script similar to those described above with respect toFIGS. 1B and 9A . In some embodiments,script 972 may include an original version of the script. Although the original version of a script is followed during production, there are typically changes during production and editing. For example, scenes may be edited, added, deleted, or reordered during production. An original version of the script may include a version of the script prior to changes made during production. -
Script 972 may include embedded metadata, such as script elements, that provide information related to a scene. For example, in the illustrated embodiment, script 972 includes metadata 974 (e.g., dialogue elements) associated with each scene. During processing of script 972, metadata 974 may be broken into smaller segments, such as segments of metadata associated with a particular scene or shot. For example, script 972 may be processed to generate a structured/tagged script document including metadata 974. During production, metadata 974 may be extracted from script 972 and associated with one or more clips 976 that are shot during production of the program for script 972. Thus, metadata 974 may be broken into smaller segments and may be distributed among various recorded clips 976. Segments of metadata 974 from a portion of script 972 may be associated with a clip corresponding to the same portion of script 972. For example, in the illustrated embodiment, segments of metadata 974 from scenes (1-6) of script 972 are extracted and associated with one of a series of clips 976 that are associated with a particular scene (1-6) of script 972. In some embodiments, segments of metadata 974 may be associated with a plurality of clips. For example, a segment of metadata for Scene 1 of script 972 may be associated with both Clips 1A and 1B where they are both clips of a portion of Scene 1. - Each of
clips 976 may include electronic copies (e.g., files) of clips that are shot during production of the program forscript 972. Segments ofmetadata 974 may be embedded into each corresponding clip 976 (e.g., into the file containing the clip) such that the particular segment ofmetadata 974 travels with acorresponding clip 976. For example, whenClip 1 is opened/accessed in an application, the segment ofmetadata 974 embedded withClip 1 may be accessible by the application. Thus, clips 976 may act as metadata containers that enable segments of metadata to travel with a particular clip and be accessed independent of a source of the segment of metadata (e.g.,metadata 974 of script 972). - In some embodiments, one or more of
clips 976 may be accessed to add metadata or modify existing metadata associated with one ormore clips 976. For example, a user may accessClip 1, to embed a particular segment of metadata 974 (e.g., Script elements of Scene 1) withClip 1. In some embodiments, a user may accessClip 1 to provide revisedmetadata 978 that is embedded intoClip 1. For example, where a line of dialogue inScene 1 has changed, revisedmetadata 978 may include the revised line of dialogue, andClip 1 may be accessed to replace a corresponding line of dialogue fromScene 1 embedded inClip 1 with the revised line of dialogue contained in revisedmetadata 978. Similar techniques may be employed to update clip numbers, or other portions of metadata ofclips 976. For example, where a scene is reordered metadata of clips 976 (e.g., a scene/shot number) may be modified to reflect the new position of one or more of theclips 976 relative to other clips. - In some embodiments, revised
metadata 978 may reflect changes made to a portion of script 972, and may be used to update some or all clips that refer to the changed portion of script 972. For example, where dialogue in Scene 1 is changed, Clip 1 and any other clips that refer to the dialogue of Scene 1 may be automatically updated using corresponding revised metadata 978 that includes the changes to Scene 1. Such an embodiment may enable "master" changes made to script 972 to be automatically applied to the metadata of all clips that rely on the changed segment of metadata 974 of script 972. In some embodiments, changes made in the metadata of a particular clip may be applied to all related metadata, such as the metadata of other clips that reference the same metadata of script 972 (e.g., clips that reference the same line of dialogue in the script). For example, where a line of dialogue contained in the metadata of Clip 1 is changed, the change may be reflected in the corresponding line of dialogue in metadata 974 of script 972 and/or in other clips whose metadata includes that line of dialogue. Thus, a change in metadata at one location may be reflected in some or all of the corresponding metadata provided in other locations. In some embodiments, revising metadata is provided in accordance with embodiments described below with respect to FIGS. 10B-10E. In some embodiments, a user may be provided an opportunity to define how revised metadata 978 is applied. For example, a user may be provided the option of applying revised metadata 978 to a particular clip, or to all clips that reference the same source of metadata. - As depicted in
FIG. 10A , clips may be arranged into an editedsequence 980 that is provided to ascript generator 982 to generate a revisedscript 984. For example, in the illustrated embodiment, revisedsequence 980 includesClips Clip 3A. Thus,new Clip 3A, not present inscript 972, has been added, andClips Clip 3A may include metadata that is similar tometadata 974 and revisedmetadata 978. For example, metadata ofclip 3A may include script elements, such as dialogue. -
Script generator 980 may compile the scene metadata in each clip of the edited sequence to generate revisedscript 984. In some embodiments, revisedscript 984 may include an ordered script that is arranged in accordance with an order of revisedsequence 980. For example, revisedscript 984 includesScenes Clips sequence 980. In some embodiments, script elements of revisedscript 984 are generated based on metadata embedded in each clip of revisedsequence 980. For example, dialogue ofScene Clips script 984, as opposed to themetadata 974 provided inscripts 972. - Script metadata may, thus, be embedded into each clip, and may be used to generate a corresponding script from any sequence of clips, irrespective of an order the clips are arranged and/or the source of the clips and their embedded metadata. Thus, a user may generate a revised script via combining any number of clips having embedded metadata.
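For illustration, the sketch below models clips as simple metadata containers and compiles their embedded script elements, in edited-sequence order, into a plain-text revised script. The field names and the placeholder scene content are assumptions for this example, not the format used by script generator 982.

```python
from typing import Dict, List

# Each clip carries its own embedded script metadata (illustrative layout).
edited_sequence: List[Dict] = [
    {"clip_id": "Clip 2", "scene": 2, "elements": [
        {"type": "scene_heading", "text": "INT. FERRY TERMINAL BAR - DAY"},
        {"type": "dialogue", "character": "BARTENDER", "text": "(placeholder revised line)"}]},
    {"clip_id": "Clip 1", "scene": 1, "elements": [
        {"type": "scene_heading", "text": "EXT. MAROON MOVIE STUDIO - DAY"},
        {"type": "action", "text": "(placeholder action description)"}]},
]

def generate_revised_script(sequence: List[Dict]) -> str:
    """Compile the metadata embedded in each clip, in the order the clips
    appear in the edited sequence, into a revised script document."""
    lines: List[str] = []
    for clip in sequence:
        lines.append(f"# {clip['clip_id']} (scene {clip['scene']})")
        for element in clip["elements"]:
            if element["type"] == "dialogue":
                lines.append(element["character"])
                lines.append("    " + element["text"])
            else:
                lines.append(element["text"])
        lines.append("")
    return "\n".join(lines)

# The output follows the edited sequence order, not the original script order.
print(generate_revised_script(edited_sequence))
```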
- In some embodiments, revised
script 984 may include a written document that is provided in an industry standard format, similar to that ofFIGS. 1B and 9A . In some embodiments, revisedscript 984 may be provided as a structured/tagged document including revisedscript metadata 985. In some embodiments,script 984 may be provided to other modules for additional processing. For example, some or all of revisedscript 984 and/or revisedscript metadata 985 associated withscript 984, may be provided to synchronization module 102 (seeFIGS. 1A and 2 ), in place of or in combination with document (script)data 110. Such an embodiment may aide with time-alignment by providing script metadata that more accurately reflects actual video/audio content 106 and/or thecorresponding transcript 114. -
FIG. 10B is a block diagram that illustrates components and dataflow of a system for providing script data (system) 1000 in accordance with one or more embodiments of the present technique. In some embodiments, techniques described with respect tosystem 1000 may be employed in combination with other techniques described herein, such as techniques described above with respect toFIG. 10A and techniques relating to time-alignment of documents and generation of video descriptions. In some embodiments,system 1000 implements ametadata reviser 1002 that provides for revision ofscript metadata 974 and/orclip metadata 1006 to generate revisedclip metadata 1008. For example, with regard toFIG. 10A ,metadata reviser 1002 may provide access to embedded metadata ofClip 1 for the revision thereof. The resulting revisedclip metadata 1008 may be returned tometadata reviser 1002 for additional revisions, may be provided for display, and/or may be provided to another module for additional processing. For example, with regard toFIG. 10A , after revisedmetadata 978 is embedded withinClip 1, the now revised metadata ofClip 1 may be subsequently accessed for additional revisions, or may be provided to another module, such asscript generator 982 for use in generating revisedscript 984. - In some embodiments,
script metadata 974 may be provided via techniques similar to those described above. For example, script 972 (e.g., the same or similar toscript document 104 described above) may be provided to a script extractor 1004 (e.g., the same or similar todocument extractor 108 described above). Script extractor 1004 may generate corresponding script metadata 1004, such as a tagged/structured document (e.g., the same or similar to scriptdata 110 discussed above). For example,script metadata 974 may include the program title, author names, scene headings, action elements, shot elements, character names, parentheticals, and dialogue). In some embodiments,script metadata 974 may include additional information that is extracted or derived fromscript 972 or added by a user. For example,script metadata 974 may include additional identifiers, such as scene numbers, shot numbers, and the like that are derived directly from the by script extractor 1004 and/or are manually inserted by a user. In some embodiments,script metadata 974 may be generated using various techniques for extracting and embedding metadata, such as those described in described in U.S. patent application Ser. No. 12/168,522 entitled “Systems and methods for Associating Metadata With Media Using Metadata Placeholders”, filed Jul. 7, 2008, which is hereby incorporated by reference as though fully set forth herein.Script metadata 974 may be provided to various modules for processing and may be made available to user, such as production personnel on set, for viewing and revision. - In some embodiments,
clip metadata 1006 may be extracted fromscript metadata 974. For example, segments ofscript metadata 974 may be associated with one or more clips (e.g., Clips 976) to generate a segment ofclip metadata 1006. Production personnel may modifyclip metadata 1006 as changes are made to script 974, and/or may modifyclip metadata 1006 after a scene is shot to reflect what actually happened in the scene (e.g., the actual dialogue spoken) or how clips for the scene were actually shot. In some embodiments, a user may directly modify clip metadata. For example, a file, includingclip metadata 1006 may be accessed viascript reviser 1002,clip metadata 1006 may be modified (e.g., with revised metadata 978), and the file saved for later use. -
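The sketch below illustrates one way a revision could be applied to clip metadata and propagated to every clip that references the same script segment, as described above; the dictionary layout and the `script_ref` key are assumptions for this example, not the metadata reviser's actual data model.

```python
import copy
from typing import Dict, List

def revise_clip_metadata(clips: List[Dict], script_ref: str, revised_line: str) -> List[Dict]:
    """Apply a revised line of dialogue to every clip whose embedded metadata
    references the same script segment (identified here by `script_ref`)."""
    updated = copy.deepcopy(clips)
    for clip in updated:
        for element in clip["elements"]:
            if element.get("script_ref") == script_ref:
                element["text"] = revised_line
    return updated

clips = [
    {"clip_id": "Clip 1", "elements": [
        {"script_ref": "scene1/dialogue3", "text": "(original line)"}]},
    {"clip_id": "Clip 1B", "elements": [
        {"script_ref": "scene1/dialogue3", "text": "(original line)"}]},
]
revised = revise_clip_metadata(clips, "scene1/dialogue3", "(line as actually spoken on set)")
```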
Metadata reviser 1002 may enable access toclip metadata 1006 for review and/or revision. For example,metadata reviser 1002 may include a module that provides for presenting (e.g., displaying)clip metadata 1006 to a user and/or may enable the revision/editing/modifying ofclip metadata 1006, thereby generating revisedclip metadata 1008. In some embodiments,clip metadata 1006 may be revised to reflect changes that have been made during various phases of production. For example, prior to shooting a scene, production personnel may wish to make changes to theclip metadata 1006 associated with a current version ofscript 972. As a further example, during or after shooting of a scene or clip, production personnel may desire to updateclip metadata 1006 to reflect what actually occurred in the recorded takes of the scene. In some embodiments, the production personnel may simply accessclip metadata 1006 associated with a particular clip viametadata reviser 1002, and may make appropriate changes that are reflected in revisedclip metadata 1008. For example, where production personnel desires to modify a line of dialogue after shooting a scene to reflect what was actually said in the recorded clip, the production personnel may access an electronic version ofclip metadata 1006 via a display of at least a portion of the dialogue associated with the clip, the production personnel may navigate to the portion of the clip of interest (e.g., the scene containing the line dialogue), and the production personnel may edit the line of dialogue as appropriate. - In some embodiments, revised
clip metadata 1008 is saved such that subsequent modifications to the clip metadata are based on the already revisedclip metadata 1008, as represented inFIG. 10B by the arrow returning revisedclip metadata 1002 tometadata reviser 1002 for subsequent processing. Thus, the techniques described herein may be applied to subsequent revisions/edits/modifications ofclip metadata 1006 and/or revisedscript metadata 1008. Revisedclip metadata 1008 may include a current and accurate representation of the current version of a clip (with modifications) at any given time during production. - In some embodiments, revised
clip metadata 1008 may be used to generate revisedscript 984. For example, as depicted inFIG. 10A , one or more ofclips 980 including revisedclip metadata 1008 may be provided toscript generator 982, which compiles metadata of each of the clips, including revisedclip metadata 1008, to generate a revised script. 984. Where production has completed, for example, revisedscript 984 may include a final script based off of a final version of revisedclip metadata 1008 and revisedsequence 980. Thus, revisedscript 984 and/or corresponding revised script metadata ofscript 984 may be an accurate representation of one or more versions of the actual recorded program based onscript 972. - In some embodiments, revised
clip metadata 1008, revisedscript metadata 985, and/or corresponding revisedscript 984 may be provided to a storage medium 1014 (e.g., the same or similar tostorage medium 118 discussed above), a graphical user interface (GUI) 1016 (e.g., the same or similar todisplay device 120 discussed above), and/or may be provided to other modules 1018 (e.g., the same or similar toother modules 122 discussed above) for subsequent processing. In some embodiments, other module 1012 may includescript generator 982 and/orsynchronization module 102, as described above. For example, revisedscript metadata 985 may be provided to synchronization module 102 (seeFIGS. 1A and 2 ), in place of or in combination with document (script)data 110. In some embodiments,synchronization module 102 may provide for time alignment based on revisedscript metadata 1006. For example, where a line of dialogue is replaced with a new line of dialogue and/or a scene is reordered during production,synchronization module 102 may align revisedscript metadata 985 to the resulting video content, as opposed toscript metadata 974 that is reflective of the content of an original/unedited script 972). Where revisedscript metadata 985 has been updated to reflect changes that are present in the video content (e.g., video content 106), revisedscript metadata 985 may provide an accurate representation of the resultingvideo content 106 and may, thus, provide for an efficient and accurate revised script and alignment of the revised script to video/audio for a corresponding recorded version of the program. - In some embodiments,
metadata reviser 1002 provides for a visual depiction ofclip metadata 1006 and/orscript metadata 974 via a graphical user interface (GUI) (e.g.,graphical user interface 1016 discussed in more detail below). Forexample metadata reviser 1002 may provide for the display of a current version of a script, including modifications made during production. In some embodiments,graphical user interface 1016 may enable a user to navigate through metadata to identify where modifications have been made to the script. A user may have the option of accepting the changes, rejecting the modifications and/or may make additional modifications. The GUI can also display alternate versions of content, such as the difference in a line of dialogue read in one take vs. another, and/or allow the user to see these differences and choose which one to use in the final edit. -
FIG. 10C is a diagram depicting an illustrative display of a graphical user interface (GUI) 1020 for viewing/revising metadata in accordance with one or more embodiments of the present technique.GUI 1020 may be displayed using a display device, such asdisplay device 1016. In the some embodiments,GUI 1020 includes a first (script)portion 1022 that provides for the display of a visual depiction of acurrent script 1024, which may be a current version ofscript 972. Where no modifications have been made to script 972,script 1024 may be based onscript metadata 974. Where modifications have been made to script 972,current script 1024 may be based on revisedscript metadata 985. In some embodiments,current script 1024 includes script elements 1026 a-1026 h. Script elements 1026 a-1026 h may include the program title, author names, scene headings, action elements, shot elements, character names, parentheticals, dialogue, scene numbers, shot numbers, and the like. For example,script elements script elements script element 1026 e may include a shot element.Current script 1024 may be displayed in accordance with an industry standard for scripts, similar to that ofFIG. 1B . In some embodiments,script portion 1022 ofuser interface 1020 may enable a user to navigate to various portions ofcurrent script 1024. For example a user may scroll up/down through the entirecurrent script 1024. - In the some embodiments,
GUI 1020 includes a second (video/audio content)portion 1030 that provides for the display of a visual depiction of information associated with recorded video/audio content. For example, video/audio content portion 1030 includes graphical depictions indicative of plurality of Clips 1032 a-1032 e, and their associated metadata. In some embodiments, audio/video content 1030 is categorized. For example, in the illustrated embodiment, Clips 1032 a-1032 e are grouped in association with other clips from the same scene (e.g.,scene 1 or scene 2). In some embodiments, audio/video portion 1030 ofuser interface 1020 may enable a user to navigate to various portions of clips and scene information. For example a user may scroll up/down through the entire listing of clips associated withcurrent script 1024. - In some embodiments, a user may interact with one or both of
script portion 1022 and audio/video content portion 1030 to modify corresponding script metadata. For example, in the illustrated embodiment, a user may “click” on Clip 1 (1032 a) and immediately thereafter, “click” ondialogue 1026 b toassociate Clip 1 todialogue 1026 b. Thus, in subsequent processing, metadata ofClip 1 may be associated/merged with metadata ofdialogue 1026 b. For example, during time-alignment of revised script 1012, a transcript associated withClip 1 1032 a will be associated/matched withdialogue 1026 b. In some embodiments, multiple clips may be associated with the same portion ofscript 1024. For example, in the illustrated embodiment, a user may then “click” on Clip 1 (1032 b) and immediately after, “click” ondialogue 1026 b to alsoassociate Clip 2 todialogue 1026 b. Such an embodiment may be of use whereClip 1 1032 a andClip 2 1032 b are two overlapping takes of the same scene (e.g., scene 1). For example, during time-alignment of revised script 1012, a transcript associated withClip 1 1032 a may be aligned withdialogue 1026 b and a separate transcript associated withClip 2 1032 b may be aligned withdialogue 1026 b. - A user may also review the association of various portions of
script 1024 and the clips and scenes displayed insecond portion 1030. When a user selects (e.g., clicks on or hovers over) each of the items inGUI 1020, the corresponding items may be highlighted. For example, wherescene 2 ofsecond portion 1030 is associated withscene 2 1028 b ofscript 1024, whereClip 3 1032 c is associated withdialogue 1026 d,Clip 4 1032 d is associated withdialogues Clip 5 1032 e is associated withdialogue 1026 g; when a user selectsClip 4 1032 d, bothdialogue dialogue 1026 f,Clip 4 1032 d may be highlighted. Such an interface may enable a user to easily navigate, view and modify script metadata. -
FIG. 10D is a flowchart that illustrates amethod 1100 of providing script data in accordance with one or more embodiments of the present technique.Method 1100 may employ various techniques described herein, including those discussed above with respect to components and dataflow implemented atsystem 1000.Method 1100 generally includes providing clip metadata, revising clip metadata, providing revised clip metadata, providing a revised script based on revised clip metadata, displaying revised clip metadata, and processing revised clip metadata. -
Method 1100 may include providing clip metadata content, as depicted atblock 1102. Providing clip metadata may include embedding metadata information about a script within a document or file containing an associated clip. In some embodiments,clip metadata 1006 may be provided via techniques similar to those described above with regard toFIG. 10B . For example,clip metadata 1006 may be derived fromscript metadata 974 provided fromscript 972 via script extractor 1004. In some embodiments,script metadata 974 and/orclip metadata 1006 may simply be provided from a source, such as a user inputting the metadata. In some embodiments, script metadata 1004 may be generated using various techniques for extracting and embedding metadata, such as those described in described in U.S. patent application Ser. No. 12/168,522 entitled “Systems and methods for Associating Metadata With Media Using Metadata Placeholders”, filed Jul. 7, 2008, which is hereby incorporated by reference as though fully set forth herein. -
Method 1100 may include revising clip metadata, as depicted at block 1104. Revising clip metadata may include modifying at least a portion of the clip metadata. For example, revising clip metadata may include revising/editing/modifying clip metadata 1006 and/or revised clip metadata 1008, as described above with respect to FIG. 10B. In some embodiments, clip metadata may be modified via a user interface, such as that discussed above with respect to FIG. 10C. In some embodiments, revising clip metadata is performed in response to receiving a request to modify clip metadata. For example, clip metadata may be revised in accordance with a user request to add, delete or modify a portion of the current clip metadata via metadata reviser 1002, as described above.
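Block 1104 can be pictured as applying an add/modify/delete request to the stored clip metadata. The request schema below (an "op" field plus a target field name) is an illustrative assumption rather than the claimed interface.

```python
def revise_clip_metadata(clip_metadata, request):
    """Apply one revision request and return the revised clip metadata.

    request is assumed to look like:
        {"op": "modify", "field": "dialogue", "value": "We leave at noon."}
    """
    revised = dict(clip_metadata)  # leave the previous version untouched
    op, field = request["op"], request["field"]
    if op in ("add", "modify"):
        revised[field] = request["value"]
    elif op == "delete":
        revised.pop(field, None)
    else:
        raise ValueError("unsupported revision op: {}".format(op))
    return revised
```
-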
Method 1100 may include providing revised clip metadata, as depicted at block 1106. Providing revised clip metadata may include providing the revised clip metadata in a format that is accessible by other modules and/or a user. For example, providing revised clip metadata may include providing revised clip metadata 1008 in a file format that can be compiled into revised script 984 via script generator 982, can be opened and displayed on a graphical display, may be stored for later use, or may be used for subsequent processing. In some embodiments, providing revised clip metadata includes providing metadata that reflects a current version of the clip, including some or all of the modifications to the clip metadata up until a given point in the production of the program. For example, a revised clip metadata file may be dynamically updated as changes are made, such that the revised clip metadata accurately reflects all changes made to the script metadata during production up until the point in time the revised clip metadata is accessed.
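Block 1106 then amounts to persisting each revision so that any later consumer (script generator, display, or other processing) reads the current state of the clip. A minimal write-through sketch, continuing the JSON sidecar assumption from the earlier examples:

```python
import json
from pathlib import Path


def provide_revised_clip_metadata(sidecar, revised_metadata):
    """Overwrite the sidecar so later readers always see the latest revision."""
    stamped = dict(revised_metadata)
    stamped["revision"] = stamped.get("revision", 0) + 1  # simple version counter
    Path(sidecar).write_text(json.dumps(stamped, indent=2))
    return stamped
```
-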
Method 1100 may include providing a revised script based on revised clip metadata, as depicted at block 1108. Providing a revised script may include generating a revised script based on revised script metadata that reflects a current version of the script, including some or all of the modifications to the script and clip metadata up until a given point in the production of the program. For example, as discussed above, at the conclusion of production, a version of revised clip metadata 1008 may be used to generate revised script 984 that is reflective of some or all of the revisions to script metadata 974 and/or clip metadata 1006 during production. Where production has completed, for example, revised script 984 may include a final script based on a final version of revised clip metadata 1008.
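One way to picture block 1108 (and the compilation recited in claim 2) is to walk an ordered sequence of clips and emit their revised metadata as script text. The output formatting below is an illustrative assumption only.

```python
def generate_revised_script(clip_sequence):
    """Compile revised clip metadata, in sequence order, into a revised script.

    clip_sequence is a list of (revised) metadata dicts, one per clip, already
    arranged in the order the clips appear in the program.
    """
    lines = []
    current_scene = None
    for clip in clip_sequence:
        if clip["scene"] != current_scene:
            current_scene = clip["scene"]
            lines.append("SCENE {}".format(current_scene))
        lines.append("{}: {}".format(clip["character"], clip["dialogue"]))
    return "\n".join(lines)
```
-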
Method 1100 may include displaying the revised script, as depicted at block 1110. Displaying the revised script may include providing for the visualization of one or more portions of revised script 984 in a graphical user interface. For example, script reviser 1002 may employ a display device 120/1016 to provide for a display similar to that of GUI 1020. -
Method 1100 may include processing the revised script metadata, as depicted at block 1112. Processing revised script metadata may include performing one or more processing techniques using the revised script metadata. For example, revised script metadata 985 may be provided to synchronization module 102 (see FIGS. 1A and 2), in place of or in combination with document (script) data 110. In some embodiments, synchronization module 102 may provide for time alignment based on revised script metadata 985. For example, where a line of dialogue is replaced with a new line of dialogue and/or a scene is reordered during production, synchronization module 102 may align revised script metadata 985 to the resulting video content, as opposed to script metadata 974, which reflects the content of an original/unedited script (e.g., script 104). In some embodiments, processing script metadata may include generating video descriptions, as described above, using revised script metadata 985 and/or revised script 984. For example, an action element of revised script metadata 1006 and/or revised script 984 may be used in place of a corresponding action element present in script 972. Where revised script metadata 1006 has been updated to reflect changes that are present in the video content (e.g., video content 106), revised script metadata 985 may provide an accurate representation of the resulting video content and may thus provide for an efficient and accurate final script, as well as alignment of the final script and/or video descriptions that accurately represent the resulting video/audio content.
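The time alignment discussed here matches script words from the revised metadata against transcript words that carry timecodes (e.g., from speech-to-text run on the recorded clip). For illustration only, the sketch below uses Python's difflib for a simple word-level match; the alignment techniques described elsewhere in this disclosure are not limited to this approach.

```python
import difflib


def time_align(script_words, transcript):
    """Attach transcript timecodes to matching script words.

    script_words is a list of words from the revised script metadata.
    transcript is a list of (word, start_time_seconds) pairs, e.g., produced
    by speech recognition on the recorded clip.
    """
    transcript_words = [word for word, _ in transcript]
    matcher = difflib.SequenceMatcher(
        a=[w.lower() for w in script_words],
        b=[w.lower() for w in transcript_words],
    )
    aligned = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            aligned.append((script_words[block.a + k], transcript[block.b + k][1]))
    return aligned


# Hypothetical example: one revised line of dialogue with a filler word in the take.
print(time_align(
    ["we", "leave", "at", "noon"],
    [("we", 1.2), ("uh", 1.5), ("leave", 1.7), ("at", 2.0), ("noon", 2.2)],
))
```
-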
FIG. 10E is a block diagram that illustrates components and dataflow for processing a script (workflow) 1120 in accordance with one or more embodiments of the present technique. Workflow 1120 may be accomplished using techniques discussed above with respect to FIGS. 10A-10D. In the illustrated embodiment, two different revised versions of a script are provided. For example, in the illustrated embodiment, script revisions may be made during preproduction and incorporated into an original script document. Clips may be generated based on the original script, and metadata associated with the clips may be revised during production. Further, the clips may be used to generate two separate sequences of clips. For example, a first sequence of clips may be provided for a first version of the script, and a second sequence of clips may be provided for a second version of the script. As a result, two different versions may be provided in the form of two revised scripts, version #1 and version #2. Other embodiments may include any number of combinations of clips to provide any number of versions having different variations between them.
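The two-version output of workflow 1120 can be pictured as running the same compilation step over two different clip orderings. A brief sketch follows, reusing the generate_revised_script helper assumed above and hypothetical clip identifiers and metadata.

```python
# Assumes generate_revised_script from the earlier sketch.
# clips maps hypothetical clip ids to their revised metadata dicts.
clips = {
    "clip_1": {"scene": 1, "character": "ALICE", "dialogue": "We leave at noon."},
    "clip_2": {"scene": 1, "character": "ALICE", "dialogue": "We leave at dawn."},
    "clip_3": {"scene": 2, "character": "BOB", "dialogue": "Then I will pack tonight."},
}

sequence_v1 = ["clip_1", "clip_3"]  # first cut of the program
sequence_v2 = ["clip_2", "clip_3"]  # alternate take of scene 1

script_version_1 = generate_revised_script([clips[c] for c in sequence_v1])
script_version_2 = generate_revised_script([clips[c] for c in sequence_v2])
```
- Various components of embodiments of a document time-alignment technique as described herein may be executed on one or more computer systems, which may interact with various other devices. One such computer system is illustrated by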
FIG. 11. In the illustrated embodiment, computer system 2000 includes one or more processors 2010 coupled to a system memory 2020 via an input/output (I/O) interface 2030. Computer system 2000 further includes a network interface 2040 coupled to I/O interface 2030, and one or more input/output devices 2050, such as cursor control device 2060, keyboard 2070, audio device 2090, and display(s) 2080. It is contemplated that embodiments may be implemented using a single instance of computer system 2000, while in other embodiments multiple such systems, or multiple nodes making up computer system 2000, may be configured to host different portions or instances of embodiments. For example, in one embodiment some elements may be implemented via one or more nodes of computer system 2000 that are distinct from those nodes implementing other elements. - In various embodiments,
computer system 2000 may be a uniprocessor system including one processor 2010, or a multiprocessor system including several processors 2010 (e.g., two, four, eight, or another suitable number). Processors 2010 may be any suitable processor capable of executing instructions. For example, in various embodiments, processors 2010 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs), such as the x86, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In multiprocessor systems, each of processors 2010 may commonly, but not necessarily, implement the same ISA.
- In some embodiments, at least one processor 2010 may be a graphics processing unit. A graphics processing unit or GPU may be considered a dedicated graphics-rendering device for a personal computer, workstation, game console or other computer system. Modern GPUs may be very efficient at manipulating and displaying computer graphics, and their highly parallel structure may make them more effective than typical CPUs for a range of complex graphical algorithms. For example, a graphics processor may implement a number of graphics primitive operations in a way that makes executing them much faster than drawing directly to the screen with a host central processing unit (CPU). In various embodiments, the methods disclosed herein may be implemented by program instructions configured for execution on one of, or parallel execution on two or more of, such GPUs. The GPU(s) may implement one or more application programmer interfaces (APIs) that permit programmers to invoke the functionality of the GPU(s). Suitable GPUs may be commercially available from vendors such as NVIDIA Corporation having headquarters in Santa Clara, Calif., ATI Technologies of AMD having headquarters in Sunnyvale, Calif., and others.
-
System memory 2020 may be configured to store program instructions and/or data accessible by processor 2010. System memory 2020 may include a tangible, non-transitory storage medium for storing program instructions and other data thereon. In various embodiments, system memory 2020 may be implemented using any suitable memory technology, such as static random access memory (SRAM), synchronous dynamic RAM (SDRAM), nonvolatile/Flash-type memory, or any other type of memory. In the illustrated embodiment, program instructions and data implementing desired functions, such as those described above for the time-alignment methods, are shown stored within system memory 2020 as program instructions 2025 and data storage 2035, respectively. In other embodiments, program instructions and/or data may be received, sent or stored upon different types of computer-accessible media or on similar media separate from system memory 2020 or computer system 2000. Generally speaking, a computer-accessible medium may include storage media or memory media such as magnetic or optical media, e.g., disk or CD/DVD-ROM, coupled to computer system 2000 via I/O interface 2030. Program instructions and data stored via a computer-accessible medium may be transmitted by transmission media or signals such as electrical, electromagnetic, or digital signals, which may be conveyed via a communication medium such as a network and/or a wireless link, such as may be implemented via network interface 2040. - In one embodiment, I/
O interface 2030 may be configured to coordinate I/O traffic between processor 2010, system memory 2020, and any peripheral devices in the device, including network interface 2040 or other peripheral interfaces, such as input/output devices 2050. In some embodiments, I/O interface 2030 may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory 2020) into a format suitable for use by another component (e.g., processor 2010). In some embodiments, I/O interface 2030 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. In some embodiments, the function of I/O interface 2030 may be split into two or more separate components. In addition, in some embodiments some or all of the functionality of I/O interface 2030, such as an interface to system memory 2020, may be incorporated directly into processor 2010. -
Network interface 2040 may be configured to allow data to be exchanged between computer system 2000 and other devices attached to a network, such as other computer systems, or between nodes of computer system 2000. In various embodiments, network interface 2040 may support communication via wired or wireless general data networks, such as any suitable type of Ethernet network, for example; via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks; via storage area networks such as Fibre Channel SANs; or via any other suitable type of network and/or protocol. - Input/
output devices 2050 may, in some embodiments, include one or more display terminals, keyboards, keypads, touchpads, scanning devices, voice or optical recognition devices, or any other devices suitable for entering or retrieving data by one or more computer systems 2000. Multiple input/output devices 2050 may be present in computer system 2000 or may be distributed on various nodes of computer system 2000. In some embodiments, similar input/output devices may be separate from computer system 2000 and may interact with one or more nodes of computer system 2000 through a wired or wireless connection, such as over network interface 2040. - As shown in
FIG. 11, memory 2020 may include program instructions 2025, configured to implement embodiments of the methods described herein, and data storage 2035, comprising various data accessible by program instructions 2025. In one embodiment, program instructions 2025 may include software elements of the methods illustrated in the above Figures. Data storage 2035 may include data that may be used in embodiments, for example input script documents or output data such as time-aligned scripts. In other embodiments, other or different software elements and/or data may be included. - Those skilled in the art will appreciate that
computer system 2000 is merely illustrative and is not intended to limit the scope of the methods described herein. In particular, the computer system and devices may include any combination of hardware or software that can perform the indicated functions, including computers, network devices, internet appliances, PDAs, wireless phones, pagers, etc. Computer system 2000 may also be connected to other devices that are not illustrated, or instead may operate as a stand-alone system. In addition, the functionality provided by the illustrated components may in some embodiments be combined in fewer components or distributed in additional components. Similarly, in some embodiments, the functionality of some of the illustrated components may not be provided and/or other additional functionality may be available.
- Those skilled in the art will also appreciate that, while various items are illustrated as being stored in memory or on storage while being used, these items or portions of them may be transferred between memory and other storage devices for purposes of memory management and data integrity. Alternatively, in other embodiments some or all of the software components may execute in memory on another device and communicate with the illustrated computer system via inter-computer communication. Some or all of the system components or data structures may also be stored (e.g., as instructions or structured data) on a computer-accessible medium or a portable article to be read by an appropriate drive, various examples of which are described above. In some embodiments, instructions stored on a computer-accessible medium separate from
computer system 2000 may be transmitted to computer system 2000 via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link. Various embodiments may further include receiving, sending or storing instructions and/or data implemented in accordance with the foregoing description upon a computer-accessible medium. Accordingly, the present invention may be practiced with other computer system configurations. In some embodiments, portions of the techniques described herein (e.g., preprocessing of script and metadata) may be hosted in a cloud computing infrastructure.
- Generally speaking, a computer-accessible storage medium may include non-transitory storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD-ROM, volatile or non-volatile media such as RAM (e.g., SDRAM, DDR, RDRAM, SRAM, etc.), ROM, etc., as well as transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link.
- Some portions of the detailed description provided herein are presented in terms of algorithms or symbolic representations of operations on binary digital signals stored within a memory of a specific apparatus or special purpose computing device or platform. In the context of this particular specification, the term specific apparatus or the like includes a general purpose computer once it is programmed to perform particular functions pursuant to instructions from program software. Algorithmic descriptions or symbolic representations are examples of techniques used by those of ordinary skill in the signal processing or related arts to convey the substance of their work to others skilled in the art. An algorithm is here, and is generally, considered to be a self-consistent sequence of operations or similar signal processing leading to a desired result. In this context, operations or processing involve physical manipulation of physical quantities. Typically, although not necessarily, such quantities may take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared or otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to such signals as bits, data, values, elements, symbols, characters, terms, numbers, numerals or the like. It should be understood, however, that all of these or similar terms are to be associated with appropriate physical quantities and are merely convenient labels. Unless specifically stated otherwise, as apparent from the discussion, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining” or the like refer to actions or processes of a specific apparatus, such as a special purpose computer or a similar special purpose electronic computing device. In the context of this specification, therefore, a special purpose computer or a similar special purpose electronic computing device is capable of manipulating or transforming signals, typically represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the special purpose computer or similar special purpose electronic computing device.
- Various methods as illustrated in the Figures and described herein represent examples of embodiments of methods. The methods may be implemented in software, hardware, or a combination thereof. The order of the methods may be changed, and various elements may be added, reordered, combined, omitted, modified, etc.
- Various modifications and changes may be made to the above techniques, as would be obvious to a person skilled in the art having the benefit of this disclosure. For example, although several embodiments are discussed with regard to dialogue/narrative elements of script documents, the techniques described herein may be applied to assess and determine data relating to other elements of a script document. It is intended that the invention embrace all such modifications and changes and, accordingly, the above description is to be regarded in an illustrative rather than a restrictive sense.
- Adobe and Adobe PDF are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and other countries.
Claims (21)
1. A method, comprising:
receiving script metadata extracted from a script for a program, wherein the script metadata comprises clip metadata associated with a particular portion of the program;
storing the clip metadata in a clip corresponding to the particular portion of the program;
receiving a request to revise the clip metadata stored in the clip;
revising the clip metadata in accordance with the request to revise the clip metadata to generate revised clip metadata stored in the clip; and
generating a revised script using the revised clip metadata stored in the clip.
2. The method of claim 1 , wherein generating a revised script using the revised clip metadata comprises compiling clip metadata from a sequence of a plurality of clips, wherein the revised script comprises the clip metadata from each of the plurality of clips arranged in accordance with the sequence of the plurality of clips.
3. The method of claim 1 , wherein the clip metadata comprises a segment of the script metadata.
4. The method of claim 1 , wherein the clip comprises a recording of audio/video that corresponds to the particular portion of the program.
5. The method of claim 1 , further comprising generating a time-aligned script using metadata contained in the revised script.
6. The method of claim 5 , wherein generating a time-aligned script using metadata contained in the revised script comprises time-aligning script words contained in the revised clip metadata with transcript words indicative of dialogue spoken in a corresponding recorded portion of the program based on the script.
7. The method of claim 1 , wherein the request to revise the script metadata comprises a request to revise the script to reflect changes to the script during production of the program.
8. A non-transitory computer readable storage medium having program instructions stored thereon, wherein the program instructions are executable to cause a computer system to perform a method, comprising:
receiving script metadata extracted from a script for a program, wherein the script metadata comprises clip metadata associated with a particular portion of the program;
storing the clip metadata in a clip corresponding to the particular portion of the program;
receiving a request to revise the clip metadata stored in the clip;
revising the clip metadata in accordance with the request to revise the clip metadata to generate revised clip metadata stored in the clip; and
generating a revised script using the revised clip metadata stored in the clip.
9. The storage medium of claim 8 , wherein generating a revised script using the revised clip metadata comprises compiling clip metadata from a sequence of a plurality of clips, wherein the revised script comprises the clip metadata from each of the plurality of clips arranged in accordance with the sequence of the plurality of clips.
10. The storage medium of claim 8 , wherein the clip metadata comprises a segment of the script metadata.
11. The storage medium of claim 8 , wherein the clip comprises a recording of audio/video that corresponds to the particular portion of the program.
12. The storage medium of claim 8 , further comprising generating a time-aligned script using metadata contained in the revised script.
13. The storage medium of claim 12 , wherein generating a time-aligned script using metadata contained in the revised script comprises time-aligning script words contained in the revised clip metadata with transcript words indicative of dialogue spoken in a corresponding recorded portion of the program based on the script.
14. The storage medium of claim 8 , wherein the request to revise the script metadata comprises a request to revise the script to reflect changes to the script during production of the program.
15. A computer system configured to:
receive script metadata extracted from a script for a program, wherein the script metadata comprises clip metadata associated with a particular portion of the program;
store the clip metadata in a clip corresponding to the particular portion of the program;
receive a request to revise the clip metadata stored in the clip;
revise the clip metadata in accordance with the request to revise the clip metadata to generate revised clip metadata stored in the clip; and
generate a revised script using the revised clip metadata stored in the clip.
16. The computer system of claim 15 , wherein generating a revised script using the revised clip metadata comprises compiling clip metadata from a sequence of a plurality of clips, wherein the revised script comprises the clip metadata from each of the plurality of clips arranged in accordance with the sequence of the plurality of clips.
17. The computer system of claim 15 , wherein the clip metadata comprises a segment of the script metadata.
18. The computer system of claim 15 , wherein the clip comprises a recording of audio/video that corresponds to the particular portion of the program.
19. The computer system of claim 15 , further comprising generating a time-aligned script using metadata contained in the revised script.
20. The computer system of claim 19 , wherein generating a time-aligned script using metadata contained in the revised script comprises time-aligning script words contained in the revised clip metadata with transcript words indicative of dialogue spoken in a corresponding recorded portion of the program based on the script.
21. The computer system of claim 15 , wherein the request to revise the script metadata comprises a request to revise the script to reflect changes to the script during production of the program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/789,749 US20130124984A1 (en) | 2010-04-12 | 2010-05-28 | Method and Apparatus for Providing Script Data |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US32312110P | 2010-04-12 | 2010-04-12 | |
US12/789,749 US20130124984A1 (en) | 2010-04-12 | 2010-05-28 | Method and Apparatus for Providing Script Data |
Publications (1)
Publication Number | Publication Date |
---|---|
US20130124984A1 true US20130124984A1 (en) | 2013-05-16 |
Family
ID=48280305
Family Applications (6)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/789,791 Active 2031-01-31 US9066049B2 (en) | 2010-04-12 | 2010-05-28 | Method and apparatus for processing scripts |
US12/789,720 Active 2031-02-08 US8825488B2 (en) | 2010-04-12 | 2010-05-28 | Method and apparatus for time synchronized script metadata |
US12/789,749 Abandoned US20130124984A1 (en) | 2010-04-12 | 2010-05-28 | Method and Apparatus for Providing Script Data |
US12/789,708 Active 2031-03-17 US8447604B1 (en) | 2010-04-12 | 2010-05-28 | Method and apparatus for processing scripts and related data |
US12/789,785 Active 2031-04-04 US8825489B2 (en) | 2010-04-12 | 2010-05-28 | Method and apparatus for interpolating script data |
US12/789,760 Active 2031-10-25 US9191639B2 (en) | 2010-04-12 | 2010-05-28 | Method and apparatus for generating video descriptions |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/789,791 Active 2031-01-31 US9066049B2 (en) | 2010-04-12 | 2010-05-28 | Method and apparatus for processing scripts |
US12/789,720 Active 2031-02-08 US8825488B2 (en) | 2010-04-12 | 2010-05-28 | Method and apparatus for time synchronized script metadata |
Family Applications After (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/789,708 Active 2031-03-17 US8447604B1 (en) | 2010-04-12 | 2010-05-28 | Method and apparatus for processing scripts and related data |
US12/789,785 Active 2031-04-04 US8825489B2 (en) | 2010-04-12 | 2010-05-28 | Method and apparatus for interpolating script data |
US12/789,760 Active 2031-10-25 US9191639B2 (en) | 2010-04-12 | 2010-05-28 | Method and apparatus for generating video descriptions |
Country Status (1)
Country | Link |
---|---|
US (6) | US9066049B2 (en) |
Cited By (125)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120092232A1 (en) * | 2010-10-14 | 2012-04-19 | Zebra Imaging, Inc. | Sending Video Data to Multiple Light Modulators |
US20120166175A1 (en) * | 2010-12-22 | 2012-06-28 | Tata Consultancy Services Ltd. | Method and System for Construction and Rendering of Annotations Associated with an Electronic Image |
US20130191440A1 (en) * | 2012-01-20 | 2013-07-25 | Gorilla Technology Inc. | Automatic media editing apparatus, editing method, broadcasting method and system for broadcasting the same |
US20130283143A1 (en) * | 2012-04-24 | 2013-10-24 | Eric David Petajan | System for Annotating Media Content for Automatic Content Understanding |
US20130308922A1 (en) * | 2012-05-15 | 2013-11-21 | Microsoft Corporation | Enhanced video discovery and productivity through accessibility |
US20140040713A1 (en) * | 2012-08-02 | 2014-02-06 | Steven C. Dzik | Selecting content portions for alignment |
US20140082091A1 (en) * | 2012-09-19 | 2014-03-20 | Box, Inc. | Cloud-based platform enabled with media content indexed for text-based searches and/or metadata extraction |
US20140114657A1 (en) * | 2012-10-22 | 2014-04-24 | Huseby, Inc, | Apparatus and method for inserting material into transcripts |
US20140139738A1 (en) * | 2011-07-01 | 2014-05-22 | Dolby Laboratories Licensing Corporation | Synchronization and switch over methods and systems for an adaptive audio system |
US20140201778A1 (en) * | 2013-01-15 | 2014-07-17 | Sap Ag | Method and system of interactive advertisement |
US8825488B2 (en) | 2010-04-12 | 2014-09-02 | Adobe Systems Incorporated | Method and apparatus for time synchronized script metadata |
US20140288922A1 (en) * | 2012-02-24 | 2014-09-25 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for man-machine conversation |
US20140303974A1 (en) * | 2013-04-03 | 2014-10-09 | Kabushiki Kaisha Toshiba | Text generator, text generating method, and computer program product |
US20150006152A1 (en) * | 2013-06-26 | 2015-01-01 | Huawei Technologies Co., Ltd. | Method and Apparatus for Generating Journal |
US20150121218A1 (en) * | 2013-10-30 | 2015-04-30 | Samsung Electronics Co., Ltd. | Method and apparatus for controlling text input in electronic device |
US9098474B2 (en) | 2011-10-26 | 2015-08-04 | Box, Inc. | Preview pre-generation based on heuristics and algorithmic prediction/assessment of predicted user behavior for enhancement of user experience |
US20150237298A1 (en) * | 2014-02-19 | 2015-08-20 | Nexidia Inc. | Supplementary media validation system |
US9117087B2 (en) | 2012-09-06 | 2015-08-25 | Box, Inc. | System and method for creating a secure channel for inter-application communication based on intents |
US9135462B2 (en) | 2012-08-29 | 2015-09-15 | Box, Inc. | Upload and download streaming encryption to/from a cloud-based platform |
US9195519B2 (en) | 2012-09-06 | 2015-11-24 | Box, Inc. | Disabling the self-referential appearance of a mobile application in an intent via a background registration |
US20150379098A1 (en) * | 2014-06-27 | 2015-12-31 | Samsung Electronics Co., Ltd. | Method and apparatus for managing data |
US20160042766A1 (en) * | 2014-08-06 | 2016-02-11 | Echostar Technologies L.L.C. | Custom video content |
US9280613B2 (en) | 2012-05-23 | 2016-03-08 | Box, Inc. | Metadata enabled third-party application access of content at a cloud-based platform via a native client to the cloud-based platform |
US9292833B2 (en) | 2012-09-14 | 2016-03-22 | Box, Inc. | Batching notifications of activities that occur in a web-based collaboration environment |
US9304985B1 (en) | 2012-02-03 | 2016-04-05 | Google Inc. | Promoting content |
US20160124909A1 (en) * | 2014-10-29 | 2016-05-05 | International Business Machines Corporation | Computerized tool for creating variable length presentations |
US20160133251A1 (en) * | 2013-05-31 | 2016-05-12 | Longsand Limited | Processing of audio data |
US20160156575A1 (en) * | 2014-11-27 | 2016-06-02 | Samsung Electronics Co., Ltd. | Method and apparatus for providing content |
US9378191B1 (en) | 2012-02-03 | 2016-06-28 | Google Inc. | Promoting content |
US9396216B2 (en) | 2012-05-04 | 2016-07-19 | Box, Inc. | Repository redundancy implementation of a system which incrementally updates clients with events that occurred via a cloud-enabled platform |
US9396245B2 (en) | 2013-01-02 | 2016-07-19 | Box, Inc. | Race condition handling in a system which incrementally updates clients with events that occurred in a cloud-based collaboration platform |
US9413587B2 (en) | 2012-05-02 | 2016-08-09 | Box, Inc. | System and method for a third-party application to access content within a cloud-based platform |
US20160269463A1 (en) * | 2011-11-25 | 2016-09-15 | Harry E. Emerson, III | Streaming the audio portion of a video ad to incompatible media players |
US9471551B1 (en) * | 2012-02-03 | 2016-10-18 | Google Inc. | Promoting content |
US9478059B2 (en) * | 2014-07-28 | 2016-10-25 | PocketGems, Inc. | Animated audiovisual experiences driven by scripts |
US9495364B2 (en) | 2012-10-04 | 2016-11-15 | Box, Inc. | Enhanced quick search features, low-barrier commenting/interactive features in a collaboration platform |
US9507795B2 (en) | 2013-01-11 | 2016-11-29 | Box, Inc. | Functionalities, features, and user interface of a synchronization client to a cloud-based environment |
US9535909B2 (en) | 2013-09-13 | 2017-01-03 | Box, Inc. | Configurable event-based automation architecture for cloud-based collaboration platforms |
US9535924B2 (en) | 2013-07-30 | 2017-01-03 | Box, Inc. | Scalability improvement in a system which incrementally updates clients with events that occurred in a cloud-based collaboration platform |
US9558202B2 (en) | 2012-08-27 | 2017-01-31 | Box, Inc. | Server side techniques for reducing database workload in implementing selective subfolder synchronization in a cloud-based environment |
US9575981B2 (en) | 2012-04-11 | 2017-02-21 | Box, Inc. | Cloud service enabled to handle a set of files depicted to a user as a single file in a native operating system |
US9633037B2 (en) | 2013-06-13 | 2017-04-25 | Box, Inc | Systems and methods for synchronization event building and/or collapsing by a synchronization component of a cloud-based platform |
US9632997B1 (en) * | 2012-03-21 | 2017-04-25 | 3Play Media, Inc. | Intelligent caption systems and methods |
US9652741B2 (en) | 2011-07-08 | 2017-05-16 | Box, Inc. | Desktop application for access and interaction with workspaces in a cloud-based content management system and synchronization mechanisms thereof |
US9665349B2 (en) | 2012-10-05 | 2017-05-30 | Box, Inc. | System and method for generating embeddable widgets which enable access to a cloud-based collaboration platform |
US9691051B2 (en) | 2012-05-21 | 2017-06-27 | Box, Inc. | Security enhancement through application access control |
US9704111B1 (en) | 2011-09-27 | 2017-07-11 | 3Play Media, Inc. | Electronic transcription job market |
US9712510B2 (en) | 2012-07-06 | 2017-07-18 | Box, Inc. | Systems and methods for securely submitting comments among users via external messaging applications in a cloud-based platform |
US9741337B1 (en) * | 2017-04-03 | 2017-08-22 | Green Key Technologies Llc | Adaptive self-trained computer engines with associated databases and methods of use thereof |
US9774911B1 (en) * | 2016-07-29 | 2017-09-26 | Rovi Guides, Inc. | Methods and systems for automatically evaluating an audio description track of a media asset |
US9773051B2 (en) | 2011-11-29 | 2017-09-26 | Box, Inc. | Mobile platform file and folder selection functionalities for offline access and synchronization |
US9794256B2 (en) | 2012-07-30 | 2017-10-17 | Box, Inc. | System and method for advanced control tools for administrators in a cloud-based service |
US9805050B2 (en) | 2013-06-21 | 2017-10-31 | Box, Inc. | Maintaining and updating file system shadows on a local device by a synchronization client of a cloud-based platform |
US20180018325A1 (en) * | 2016-07-13 | 2018-01-18 | Fujitsu Social Science Laboratory Limited | Terminal equipment, translation method, and non-transitory computer readable medium |
US9894119B2 (en) | 2014-08-29 | 2018-02-13 | Box, Inc. | Configurable metadata-based automation and content classification architecture for cloud-based collaboration platforms |
US9904435B2 (en) | 2012-01-06 | 2018-02-27 | Box, Inc. | System and method for actionable event generation for task delegation and management via a discussion forum in a web-based collaboration environment |
US9904505B1 (en) * | 2015-04-10 | 2018-02-27 | Zaxcom, Inc. | Systems and methods for processing and recording audio with integrated script mode |
US20180095713A1 (en) * | 2016-10-04 | 2018-04-05 | Descript, Inc. | Platform for producing and delivering media content |
US9953036B2 (en) | 2013-01-09 | 2018-04-24 | Box, Inc. | File system monitoring in a system which incrementally updates clients with events that occurred in a cloud-based collaboration platform |
US9959420B2 (en) | 2012-10-02 | 2018-05-01 | Box, Inc. | System and method for enhanced security and management mechanisms for enterprise administrators in a cloud-based environment |
US9965745B2 (en) | 2012-02-24 | 2018-05-08 | Box, Inc. | System and method for promoting enterprise adoption of a web-based collaboration environment |
US20180182374A1 (en) * | 2012-12-10 | 2018-06-28 | Samsung Electronics Co., Ltd. | Method and user device for providing context awareness service using speech recognition |
US20180203925A1 (en) * | 2017-01-17 | 2018-07-19 | Acoustic Protocol Inc. | Signature-based acoustic classification |
US10038731B2 (en) | 2014-08-29 | 2018-07-31 | Box, Inc. | Managing flow-based interactions with cloud-based shared content |
US10057537B1 (en) * | 2017-08-18 | 2018-08-21 | Prime Focus Technologies, Inc. | System and method for source script and video synchronization interface |
US20180286459A1 (en) * | 2017-03-30 | 2018-10-04 | Lenovo (Beijing) Co., Ltd. | Audio processing |
US10225621B1 (en) | 2017-12-20 | 2019-03-05 | Dish Network L.L.C. | Eyes free entertainment |
US20190079724A1 (en) * | 2017-09-12 | 2019-03-14 | Google Llc | Intercom-style communication using multiple computing devices |
US10235383B2 (en) | 2012-12-19 | 2019-03-19 | Box, Inc. | Method and apparatus for synchronization of items with read-only permissions in a cloud-based environment |
US10304458B1 (en) * | 2014-03-06 | 2019-05-28 | Board of Trustees of the University of Alabama and the University of Alabama in Huntsville | Systems and methods for transcribing videos using speaker identification |
US10354008B2 (en) * | 2016-10-07 | 2019-07-16 | Productionpro Technologies Inc. | System and method for providing a visual scroll representation of production data |
US10419818B2 (en) * | 2014-04-29 | 2019-09-17 | At&T Intellectual Property I, L.P. | Method and apparatus for augmenting media content |
US10452667B2 (en) | 2012-07-06 | 2019-10-22 | Box Inc. | Identification of people as search results from key-word based searches of content in a cloud-based environment |
US10509527B2 (en) | 2013-09-13 | 2019-12-17 | Box, Inc. | Systems and methods for configuring event-based automation in cloud-based collaboration platforms |
US10530854B2 (en) | 2014-05-30 | 2020-01-07 | Box, Inc. | Synchronization of permissioned content in cloud-based environments |
US10554426B2 (en) | 2011-01-20 | 2020-02-04 | Box, Inc. | Real time notification of activities that occur in a web-based collaboration environment |
US20200050677A1 (en) * | 2018-08-07 | 2020-02-13 | Disney Enterprises, Inc. | Joint understanding of actors, literary characters, and movies |
US10564817B2 (en) | 2016-12-15 | 2020-02-18 | Descript, Inc. | Techniques for creating and presenting media content |
US20200090701A1 (en) * | 2018-09-18 | 2020-03-19 | At&T Intellectual Property I, L.P. | Video-log production system |
US10599671B2 (en) | 2013-01-17 | 2020-03-24 | Box, Inc. | Conflict resolution, retry condition management, and handling of problem files for the synchronization client to a cloud-based platform |
US20200213478A1 (en) * | 2018-12-26 | 2020-07-02 | Nbcuniversal Media, Llc | Systems and methods for aligning text and multimedia content |
CN111462775A (en) * | 2020-03-30 | 2020-07-28 | 腾讯科技(深圳)有限公司 | Audio similarity determination method, device, server and medium |
US10725968B2 (en) | 2013-05-10 | 2020-07-28 | Box, Inc. | Top down delete or unsynchronization on delete of and depiction of item synchronization with a synchronization client to a cloud-based platform |
US10846074B2 (en) | 2013-05-10 | 2020-11-24 | Box, Inc. | Identification and handling of items to be ignored for synchronization with a cloud-based platform by a synchronization client |
US10854190B1 (en) * | 2016-06-13 | 2020-12-01 | United Services Automobile Association (Usaa) | Transcription analysis platform |
US10891489B2 (en) * | 2019-04-08 | 2021-01-12 | Nedelco, Incorporated | Identifying and tracking words in a video recording of captioning session |
US10896294B2 (en) | 2018-01-11 | 2021-01-19 | End Cue, Llc | Script writing and content generation tools and improved operation of same |
CN112256672A (en) * | 2020-10-22 | 2021-01-22 | 中国联合网络通信集团有限公司 | Database change approval method and device |
US10924636B1 (en) * | 2020-04-30 | 2021-02-16 | Gopro, Inc. | Systems and methods for synchronizing information for videos |
US10922489B2 (en) * | 2018-01-11 | 2021-02-16 | RivetAI, Inc. | Script writing and content generation tools and improved operation of same |
US11055348B2 (en) * | 2017-12-29 | 2021-07-06 | Facebook, Inc. | Systems and methods for automatically generating stitched media content |
US11094327B2 (en) * | 2018-09-28 | 2021-08-17 | Lenovo (Singapore) Pte. Ltd. | Audible input transcription |
US11107503B2 (en) | 2019-10-08 | 2021-08-31 | WeMovie Technologies | Pre-production systems for making movies, TV shows and multimedia contents |
US11122099B2 (en) * | 2018-11-30 | 2021-09-14 | Motorola Solutions, Inc. | Device, system and method for providing audio summarization data from video |
US11138970B1 (en) * | 2019-12-06 | 2021-10-05 | Asapp, Inc. | System, method, and computer program for creating a complete transcription of an audio recording from separately transcribed redacted and unredacted words |
US11166086B1 (en) | 2020-10-28 | 2021-11-02 | WeMovie Technologies | Automated post-production editing for user-generated multimedia contents |
WO2021225608A1 (en) * | 2020-05-08 | 2021-11-11 | WeMovie Technologies | Fully automated post-production editing for movies, tv shows and multimedia contents |
US11210610B2 (en) | 2011-10-26 | 2021-12-28 | Box, Inc. | Enhanced multimedia content preview rendering in a cloud content management system |
US11232481B2 (en) | 2012-01-30 | 2022-01-25 | Box, Inc. | Extended applications of multimedia content previews in the cloud-based content management system |
US11238886B1 (en) * | 2019-01-09 | 2022-02-01 | Audios Ventures Inc. | Generating video information representative of audio clips |
US11238899B1 (en) * | 2017-06-13 | 2022-02-01 | 3Play Media Inc. | Efficient audio description systems and methods |
US11245950B1 (en) * | 2019-04-24 | 2022-02-08 | Amazon Technologies, Inc. | Lyrics synchronization |
US20220101880A1 (en) * | 2020-09-28 | 2022-03-31 | TCL Research America Inc. | Write-a-movie: unifying writing and shooting |
US20220130424A1 (en) * | 2020-10-28 | 2022-04-28 | Facebook Technologies, Llc | Text-driven editor for audio and video assembly |
US11321639B1 (en) | 2021-12-13 | 2022-05-03 | WeMovie Technologies | Automated evaluation of acting performance using cloud services |
US11330154B1 (en) | 2021-07-23 | 2022-05-10 | WeMovie Technologies | Automated coordination in multimedia content production |
WO2022098759A1 (en) * | 2020-11-03 | 2022-05-12 | Capital One Services, Llc | Computer-based systems configured for automated computer script analysis and malware detection and methods thereof |
US20220256156A1 (en) * | 2021-02-08 | 2022-08-11 | Sony Group Corporation | Reproduction control of scene description |
US11430485B2 (en) * | 2019-11-19 | 2022-08-30 | Netflix, Inc. | Systems and methods for mixing synthetic voice with original audio tracks |
US20220279228A1 (en) * | 2019-08-29 | 2022-09-01 | BOND Co., Ltd. | Program production method, program production apparatus, and recording medium |
US20220284886A1 (en) * | 2021-03-03 | 2022-09-08 | Spotify Ab | Systems and methods for providing responses from media content |
US11521639B1 (en) | 2021-04-02 | 2022-12-06 | Asapp, Inc. | Speech sentiment analysis using a speech sentiment classifier pretrained with pseudo sentiment labels |
US11564014B2 (en) | 2020-08-27 | 2023-01-24 | WeMovie Technologies | Content structure aware multimedia streaming service for movies, TV shows and multimedia contents |
US20230027035A1 (en) * | 2021-07-09 | 2023-01-26 | Transitional Forms Inc. | Automated narrative production system and script production method with real-time interactive characters |
US11570525B2 (en) | 2019-08-07 | 2023-01-31 | WeMovie Technologies | Adaptive marketing in cloud-based content production |
US11580175B2 (en) * | 2012-10-05 | 2023-02-14 | Google Llc | Transcoding and serving resources |
US20230110905A1 (en) * | 2021-10-07 | 2023-04-13 | Nvidia Corporation | Unsupervised alignment for text to speech synthesis using neural networks |
US11729475B2 (en) * | 2018-12-21 | 2023-08-15 | Bce Inc. | System and method for providing descriptive video |
US11735186B2 (en) | 2021-09-07 | 2023-08-22 | 3Play Media, Inc. | Hybrid live captioning systems and methods |
US11736654B2 (en) | 2019-06-11 | 2023-08-22 | WeMovie Technologies | Systems and methods for producing digital multimedia contents including movies and tv shows |
US11763803B1 (en) | 2021-07-28 | 2023-09-19 | Asapp, Inc. | System, method, and computer program for extracting utterances corresponding to a user problem statement in a conversation between a human agent and a user |
US11763099B1 (en) | 2022-04-27 | 2023-09-19 | VoyagerX, Inc. | Providing translated subtitle for video content |
US11812121B2 (en) | 2020-10-28 | 2023-11-07 | WeMovie Technologies | Automated post-production editing for user-generated multimedia contents |
CN117240983A (en) * | 2023-11-16 | 2023-12-15 | 湖南快乐阳光互动娱乐传媒有限公司 | Method and device for automatically generating sound drama |
US12067363B1 (en) | 2022-02-24 | 2024-08-20 | Asapp, Inc. | System, method, and computer program for text sanitization |
Families Citing this family (130)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9141860B2 (en) | 2008-11-17 | 2015-09-22 | Liveclips Llc | Method and system for segmenting and transmitting on-demand live-action video in real-time |
US20100263005A1 (en) * | 2009-04-08 | 2010-10-14 | Eric Foster White | Method and system for egnaging interactive web content |
US20120029918A1 (en) * | 2009-09-21 | 2012-02-02 | Walter Bachtiger | Systems and methods for recording, searching, and sharing spoken content in media files |
US20130311181A1 (en) * | 2009-09-21 | 2013-11-21 | Walter Bachtiger | Systems and methods for identifying concepts and keywords from spoken words in text, audio, and video content |
US20130138637A1 (en) * | 2009-09-21 | 2013-05-30 | Walter Bachtiger | Systems and methods for ranking media files |
US8355903B1 (en) | 2010-05-13 | 2013-01-15 | Northwestern University | System and method for using data and angles to automatically generate a narrative story |
US11989659B2 (en) | 2010-05-13 | 2024-05-21 | Salesforce, Inc. | Method and apparatus for triggering the automatic generation of narratives |
US9208147B1 (en) | 2011-01-07 | 2015-12-08 | Narrative Science Inc. | Method and apparatus for triggering the automatic generation of narratives |
US8888494B2 (en) * | 2010-06-28 | 2014-11-18 | Randall Lee THREEWITS | Interactive environment for performing arts scripts |
US10095367B1 (en) * | 2010-10-15 | 2018-10-09 | Tivo Solutions Inc. | Time-based metadata management system for digital media |
US8826354B2 (en) * | 2010-12-01 | 2014-09-02 | At&T Intellectual Property I, L.P. | Method and system for testing closed caption content of video assets |
US9720899B1 (en) | 2011-01-07 | 2017-08-01 | Narrative Science, Inc. | Automatic generation of narratives from data using communication goals and narrative analytics |
US9576009B1 (en) | 2011-01-07 | 2017-02-21 | Narrative Science Inc. | Automatic generation of narratives from data using communication goals and narrative analytics |
US10657201B1 (en) | 2011-01-07 | 2020-05-19 | Narrative Science Inc. | Configurable and portable system for generating narratives |
US9697178B1 (en) | 2011-01-07 | 2017-07-04 | Narrative Science Inc. | Use of tools and abstraction in a configurable and portable system for generating narratives |
US9697197B1 (en) | 2011-01-07 | 2017-07-04 | Narrative Science Inc. | Automatic generation of narratives from data using communication goals and narrative analytics |
US10185477B1 (en) | 2013-03-15 | 2019-01-22 | Narrative Science Inc. | Method and system for configuring automatic generation of narratives from data |
US20120303643A1 (en) * | 2011-05-26 | 2012-11-29 | Raymond Lau | Alignment of Metadata |
US20130035936A1 (en) * | 2011-08-02 | 2013-02-07 | Nexidia Inc. | Language transcription |
JP5642037B2 (en) * | 2011-09-22 | 2014-12-17 | 株式会社東芝 | SEARCH DEVICE, SEARCH METHOD, AND PROGRAM |
US9002703B1 (en) * | 2011-09-28 | 2015-04-07 | Amazon Technologies, Inc. | Community audio narration generation |
US20130117012A1 (en) * | 2011-11-03 | 2013-05-09 | Microsoft Corporation | Knowledge based parsing |
US9069850B2 (en) * | 2011-11-08 | 2015-06-30 | Comcast Cable Communications, Llc | Content descriptor |
JP6045175B2 (en) * | 2012-04-05 | 2016-12-14 | 任天堂株式会社 | Information processing program, information processing apparatus, information processing method, and information processing system |
US9367745B2 (en) | 2012-04-24 | 2016-06-14 | Liveclips Llc | System for annotating media content for automatic content understanding |
US9002702B2 (en) | 2012-05-03 | 2015-04-07 | International Business Machines Corporation | Confidence level assignment to information from audio transcriptions |
US20140013268A1 (en) * | 2012-07-09 | 2014-01-09 | Mobitude, LLC, a Delaware LLC | Method for creating a scripted exchange |
US20140067394A1 (en) * | 2012-08-28 | 2014-03-06 | King Abdulaziz City For Science And Technology | System and method for decoding speech |
PL401346A1 (en) * | 2012-10-25 | 2014-04-28 | Ivona Software Spółka Z Ograniczoną Odpowiedzialnością | Generation of customized audio programs from textual content |
US20140142925A1 (en) * | 2012-11-16 | 2014-05-22 | Raytheon Bbn Technologies | Self-organizing unit recognition for speech and other data series |
US8977555B2 (en) * | 2012-12-20 | 2015-03-10 | Amazon Technologies, Inc. | Identification of utterance subjects |
US9208784B2 (en) * | 2013-01-08 | 2015-12-08 | C21 Patents, Llc | Methododolgy for live text broadcasting |
US10339452B2 (en) | 2013-02-06 | 2019-07-02 | Verint Systems Ltd. | Automated ontology development |
US9378739B2 (en) * | 2013-03-13 | 2016-06-28 | Nuance Communications, Inc. | Identifying corresponding positions in different representations of a textual work |
US9804729B2 (en) | 2013-03-15 | 2017-10-31 | International Business Machines Corporation | Presenting key differences between related content from different mediums |
US9495365B2 (en) * | 2013-03-15 | 2016-11-15 | International Business Machines Corporation | Identifying key differences between related content from different mediums |
JP6155821B2 (en) * | 2013-05-08 | 2017-07-05 | ソニー株式会社 | Information processing apparatus, information processing method, and program |
US20140379346A1 (en) * | 2013-06-21 | 2014-12-25 | Google Inc. | Video analysis based language model adaptation |
US8947596B2 (en) * | 2013-06-27 | 2015-02-03 | Intel Corporation | Alignment of closed captions |
US20150019206A1 (en) * | 2013-07-10 | 2015-01-15 | Datascription Llc | Metadata extraction of non-transcribed video and audio streams |
US9230547B2 (en) | 2013-07-10 | 2016-01-05 | Datascription Llc | Metadata extraction of non-transcribed video and audio streams |
US20150058006A1 (en) * | 2013-08-23 | 2015-02-26 | Xerox Corporation | Phonetic alignment for user-agent dialogue recognition |
US20150066506A1 (en) | 2013-08-30 | 2015-03-05 | Verint Systems Ltd. | System and Method of Text Zoning |
KR101747873B1 (en) * | 2013-09-12 | 2017-06-27 | 한국전자통신연구원 | Apparatus and for building language model for speech recognition |
US9477752B1 (en) * | 2013-09-30 | 2016-10-25 | Verint Systems Inc. | Ontology administration and application to enhance communication data analytics |
US10078689B2 (en) | 2013-10-31 | 2018-09-18 | Verint Systems Ltd. | Labeling/naming of themes |
US9232063B2 (en) | 2013-10-31 | 2016-01-05 | Verint Systems Inc. | Call flow and discourse analysis |
US9977830B2 (en) | 2014-01-31 | 2018-05-22 | Verint Systems Ltd. | Call summary |
US10037380B2 (en) | 2014-02-14 | 2018-07-31 | Microsoft Technology Licensing, Llc | Browsing videos via a segment list |
US9699404B2 (en) | 2014-03-19 | 2017-07-04 | Microsoft Technology Licensing, Llc | Closed caption alignment |
US9805125B2 (en) | 2014-06-20 | 2017-10-31 | Google Inc. | Displaying a summary of media content items |
US9946769B2 (en) | 2014-06-20 | 2018-04-17 | Google Llc | Displaying information related to spoken dialogue in content playing on a device |
US9838759B2 (en) | 2014-06-20 | 2017-12-05 | Google Inc. | Displaying information related to content playing on a device |
US10206014B2 (en) | 2014-06-20 | 2019-02-12 | Google Llc | Clarifying audible verbal information in video content |
US9575936B2 (en) * | 2014-07-17 | 2017-02-21 | Verint Systems Ltd. | Word cloud display |
US20160042765A1 (en) * | 2014-08-05 | 2016-02-11 | Avid Technology, Inc. | Media composition with timing blocks |
US10747823B1 (en) | 2014-10-22 | 2020-08-18 | Narrative Science Inc. | Interactive and conversational data exploration |
US11238090B1 (en) | 2015-11-02 | 2022-02-01 | Narrative Science Inc. | Applied artificial intelligence technology for using narrative analytics to automatically generate narratives from visualization data |
US11288328B2 (en) | 2014-10-22 | 2022-03-29 | Narrative Science Inc. | Interactive and conversational data exploration |
US11922344B2 (en) | 2014-10-22 | 2024-03-05 | Narrative Science Llc | Automatic generation of narratives from data using communication goals and narrative analytics |
US9332221B1 (en) | 2014-11-28 | 2016-05-03 | International Business Machines Corporation | Enhancing awareness of video conference participant expertise |
US11030406B2 (en) | 2015-01-27 | 2021-06-08 | Verint Systems Ltd. | Ontology expansion using entity-association rules and abstract relations |
US9914218B2 (en) | 2015-01-30 | 2018-03-13 | Toyota Motor Engineering & Manufacturing North America, Inc. | Methods and apparatuses for responding to a detected event by a robot |
US10217379B2 (en) | 2015-01-30 | 2019-02-26 | Toyota Motor Engineering & Manufacturing North America, Inc. | Modifying vision-assist device parameters based on an environment classification |
US10037712B2 (en) | 2015-01-30 | 2018-07-31 | Toyota Motor Engineering & Manufacturing North America, Inc. | Vision-assist devices and methods of detecting a classification of an object |
US9886423B2 (en) * | 2015-06-19 | 2018-02-06 | International Business Machines Corporation | Reconciliation of transcripts |
US11232268B1 (en) | 2015-11-02 | 2022-01-25 | Narrative Science Inc. | Applied artificial intelligence technology for using narrative analytics to automatically generate narratives from line charts |
US11222184B1 (en) | 2015-11-02 | 2022-01-11 | Narrative Science Inc. | Applied artificial intelligence technology for using narrative analytics to automatically generate narratives from bar charts |
US11188588B1 (en) | 2015-11-02 | 2021-11-30 | Narrative Science Inc. | Applied artificial intelligence technology for using narrative analytics to interactively generate narratives from visualization data |
US10349141B2 (en) | 2015-11-19 | 2019-07-09 | Google Llc | Reminders of media content referenced in other media content |
CN106782627B (en) * | 2015-11-23 | 2019-08-27 | 广州酷狗计算机科技有限公司 | Audio file rerecords method and device |
EP3182297A1 (en) * | 2015-12-18 | 2017-06-21 | Thomson Licensing | Method for generating semantic description of textual content and apparatus performing the same |
US10034053B1 (en) | 2016-01-25 | 2018-07-24 | Google Llc | Polls for media program moments |
US10169033B2 (en) * | 2016-02-12 | 2019-01-01 | International Business Machines Corporation | Assigning a computer to a group of computers in a group infrastructure |
BE1023431B1 (en) * | 2016-06-01 | 2017-03-17 | Limecraft Nv | AUTOMATIC IDENTIFICATION AND PROCESSING OF AUDIOVISUAL MEDIA |
US11409791B2 (en) | 2016-06-10 | 2022-08-09 | Disney Enterprises, Inc. | Joint heterogeneous language-vision embeddings for video tagging and search |
CN106331844A (en) * | 2016-08-17 | 2017-01-11 | 北京金山安全软件有限公司 | Method and device for generating subtitles of media file and electronic equipment |
US10474703B2 (en) | 2016-08-25 | 2019-11-12 | Lakeside Software, Inc. | Method and apparatus for natural language query in a workspace analytics system |
US10853583B1 (en) | 2016-08-31 | 2020-12-01 | Narrative Science Inc. | Applied artificial intelligence technology for selective control over narrative generation from visualizations of data |
US20180130484A1 (en) * | 2016-11-07 | 2018-05-10 | Axon Enterprise, Inc. | Systems and methods for interrelating text transcript information with video and/or audio information |
US10546063B2 (en) * | 2016-12-13 | 2020-01-28 | International Business Machines Corporation | Processing of string inputs utilizing machine learning |
US11069253B2 (en) | 2016-12-29 | 2021-07-20 | Tata Consultancy Services Limited | Method and system for language skill assessment and development |
US11954445B2 (en) | 2017-02-17 | 2024-04-09 | Narrative Science Llc | Applied artificial intelligence technology for narrative generation based on explanation communication goals |
US10943069B1 (en) | 2017-02-17 | 2021-03-09 | Narrative Science Inc. | Applied artificial intelligence technology for narrative generation based on a conditional outcome framework |
US10755053B1 (en) | 2017-02-17 | 2020-08-25 | Narrative Science Inc. | Applied artificial intelligence technology for story outline formation using composable communication goals to support natural language generation (NLG) |
US11568148B1 (en) | 2017-02-17 | 2023-01-31 | Narrative Science Inc. | Applied artificial intelligence technology for narrative generation based on explanation communication goals |
US11068661B1 (en) | 2017-02-17 | 2021-07-20 | Narrative Science Inc. | Applied artificial intelligence technology for narrative generation based on smart attributes |
US10699079B1 (en) | 2017-02-17 | 2020-06-30 | Narrative Science Inc. | Applied artificial intelligence technology for narrative generation based on analysis communication goals |
US11263489B2 (en) * | 2017-06-29 | 2022-03-01 | Intel Corporation | Techniques for dense video descriptions |
US11190855B2 (en) | 2017-08-30 | 2021-11-30 | Arris Enterprises Llc | Automatic generation of descriptive video service tracks |
WO2019055827A1 (en) * | 2017-09-15 | 2019-03-21 | Oneva, Inc. | Personal video commercial studio system |
GB201715753D0 (en) * | 2017-09-28 | 2017-11-15 | Royal Nat Theatre | Caption delivery system |
US11397855B2 (en) * | 2017-12-12 | 2022-07-26 | International Business Machines Corporation | Data standardization rules generation |
US11042708B1 (en) | 2018-01-02 | 2021-06-22 | Narrative Science Inc. | Context saliency-based deictic parser for natural language generation |
US11023689B1 (en) | 2018-01-17 | 2021-06-01 | Narrative Science Inc. | Applied artificial intelligence technology for narrative generation using an invocable analysis service with analysis libraries |
US11182556B1 (en) | 2018-02-19 | 2021-11-23 | Narrative Science Inc. | Applied artificial intelligence technology for building a knowledge base using natural language processing |
US10726838B2 (en) * | 2018-06-14 | 2020-07-28 | Disney Enterprises, Inc. | System and method of generating effects during live recitations of stories |
US10706236B1 (en) | 2018-06-28 | 2020-07-07 | Narrative Science Inc. | Applied artificial intelligence technology for using natural language processing and concept expression templates to train a natural language generation system |
US10558761B2 (en) * | 2018-07-05 | 2020-02-11 | Disney Enterprises, Inc. | Alignment of video and textual sequences for metadata analysis |
CN110364145B (en) * | 2018-08-02 | 2021-09-07 | 腾讯科技(深圳)有限公司 | Voice recognition method, and method and device for sentence breaking by voice |
CN109271495B (en) * | 2018-08-14 | 2023-02-17 | 创新先进技术有限公司 | Question-answer recognition effect detection method, device, equipment and readable storage medium |
AU2019366366A1 (en) | 2018-10-22 | 2021-05-20 | William D. Carlson | Therapeutic combinations of TDFRPs and additional agents and methods of use |
CN109547831B (en) * | 2018-11-19 | 2021-06-01 | 网宿科技股份有限公司 | Method and device for synchronizing white board and video, computing equipment and storage medium |
CN109584882B (en) * | 2018-11-30 | 2022-12-27 | 南京天溯自动化控制系统有限公司 | Method and system for optimizing voice to text conversion aiming at specific scene |
CN109840273B (en) * | 2019-01-18 | 2020-09-15 | 珠海天燕科技有限公司 | Method and device for generating file |
US10990767B1 (en) | 2019-01-28 | 2021-04-27 | Narrative Science Inc. | Applied artificial intelligence technology for adaptive natural language understanding |
US11769012B2 (en) | 2019-03-27 | 2023-09-26 | Verint Americas Inc. | Automated system and method to prioritize language model and ontology expansion and pruning |
US11429789B2 (en) * | 2019-06-12 | 2022-08-30 | International Business Machines Corporation | Natural language processing and candidate response identification |
CN112231275B (en) | 2019-07-14 | 2024-02-27 | 阿里巴巴集团控股有限公司 | Method, system and equipment for classifying multimedia files, processing information and training models |
US20220414349A1 (en) * | 2019-07-22 | 2022-12-29 | wordly, Inc. | Systems, methods, and apparatus for determining an official transcription and speaker language from a plurality of transcripts of text in different languages |
US11276419B2 (en) * | 2019-07-30 | 2022-03-15 | International Business Machines Corporation | Synchronized sound generation from videos |
US11295084B2 (en) | 2019-09-16 | 2022-04-05 | International Business Machines Corporation | Cognitively generating information from videos |
CN110675896B (en) * | 2019-09-30 | 2021-10-22 | 北京字节跳动网络技术有限公司 | Character time alignment method, device and medium for audio and electronic equipment |
US11410658B1 (en) * | 2019-10-29 | 2022-08-09 | Dialpad, Inc. | Maintainable and scalable pipeline for automatic speech recognition language modeling |
CN114846540A (en) * | 2019-11-04 | 2022-08-02 | 谷歌有限责任公司 | Using video clips as dictionary use examples |
US11562743B2 (en) * | 2020-01-29 | 2023-01-24 | Salesforce.Com, Inc. | Analysis of an automatically generated transcription |
US11570099B2 (en) | 2020-02-04 | 2023-01-31 | Bank Of America Corporation | System and method for autopartitioning and processing electronic resources |
US11360937B2 (en) | 2020-03-20 | 2022-06-14 | Bank Of America Corporation | System for natural language processing-based electronic file scanning for processing database queries |
CN111711853B (en) | 2020-06-09 | 2022-02-01 | 北京字节跳动网络技术有限公司 | Information processing method, system, device, electronic equipment and storage medium |
US11669295B2 (en) * | 2020-06-18 | 2023-06-06 | Sony Group Corporation | Multiple output control based on user input |
US11625928B1 (en) * | 2020-09-01 | 2023-04-11 | Amazon Technologies, Inc. | Language agnostic drift correction |
US11871138B2 (en) * | 2020-10-13 | 2024-01-09 | Grass Valley Canada | Virtualized production switcher and method for media production |
US11922943B1 (en) * | 2021-01-26 | 2024-03-05 | Wells Fargo Bank, N.A. | KPI-threshold selection for audio-transcription models |
EP4060519A1 (en) * | 2021-03-18 | 2022-09-21 | Prisma Analytics GmbH | Data transformation considering data integrity |
US20240203277A1 (en) * | 2021-05-10 | 2024-06-20 | Sony Group Corporation | Information processing device, information processing method, and information processing program |
CN115514987B (en) * | 2021-06-23 | 2024-10-18 | 视见科技(杭州)有限公司 | System and method for automated narrative video production through the use of script annotations |
US12118986B2 (en) * | 2021-07-20 | 2024-10-15 | Conduent Business Services, Llc | System and method for automated processing of natural language using deep learning model encoding |
US20230237990A1 (en) * | 2022-01-27 | 2023-07-27 | Asapp, Inc. | Training speech processing models using pseudo tokens |
EP4283490A1 (en) * | 2022-05-27 | 2023-11-29 | Tata Consultancy Services Limited | Systems and methods for rules-based mapping of answer scripts to markers |
JP7538574B1 (en) | 2024-04-19 | 2024-08-22 | 史睦 川口 | Video creation device, video creation method, video creation program, and video creation system |
Family Cites Families (50)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5677739A (en) * | 1995-03-02 | 1997-10-14 | National Captioning Institute | System and method for providing described television services |
US5900908A (en) | 1995-03-02 | 1999-05-04 | National Captioning Institute, Inc. | System and method for providing described television services
US5801685A (en) | 1996-04-08 | 1998-09-01 | Tektronix, Inc. | Automatic editing of recorded video elements synchronized with a script text read or displayed
US6172675B1 (en) | 1996-12-05 | 2001-01-09 | Interval Research Corporation | Indirect manipulation of data using temporally related data, with particular application to manipulation of audio or audiovisual data |
JP3690043B2 (en) | 1997-03-03 | 2005-08-31 | ソニー株式会社 | Audio information transmission apparatus and method, and audio information recording apparatus |
US6336093B2 (en) | 1998-01-16 | 2002-01-01 | Avid Technology, Inc. | Apparatus and method using speech recognition and scripts to capture, author and playback synchronized audio and video
US6219642B1 (en) | 1998-10-05 | 2001-04-17 | Legerity, Inc. | Quantization using frequency and mean compensated frequency input data for robust speech recognition |
US6473778B1 (en) * | 1998-12-24 | 2002-10-29 | At&T Corporation | Generating hypermedia documents from transcriptions of television programs using parallel text alignment |
US6370503B1 (en) | 1999-06-30 | 2002-04-09 | International Business Machines Corp. | Method and apparatus for improving speech recognition accuracy |
US6442518B1 (en) * | 1999-07-14 | 2002-08-27 | Compaq Information Technologies Group, L.P. | Method for refining time alignments of closed captions |
US6477493B1 (en) | 1999-07-15 | 2002-11-05 | International Business Machines Corporation | Off site voice enrollment on a transcription device for speech recognition |
GB0008537D0 (en) | 2000-04-06 | 2000-05-24 | Ananova Ltd | Character animation |
US6505153B1 (en) | 2000-05-22 | 2003-01-07 | Compaq Information Technologies Group, L.P. | Efficient method for producing off-line closed captions |
WO2001095631A2 (en) | 2000-06-09 | 2001-12-13 | British Broadcasting Corporation | Generation of subtitles or captions for moving pictures
GB0023930D0 (en) | 2000-09-29 | 2000-11-15 | Canon Kk | Database annotation and retrieval |
US6975985B2 (en) | 2000-11-29 | 2005-12-13 | International Business Machines Corporation | Method and system for the automatic amendment of speech recognition vocabularies |
US6925455B2 (en) | 2000-12-12 | 2005-08-02 | Nec Corporation | Creating audio-centric, image-centric, and integrated audio-visual summaries |
JPWO2002103591A1 (en) * | 2001-06-13 | 2004-10-07 | 富士通株式会社 | Agenda progress support device and agenda progress support program |
US7668718B2 (en) | 2001-07-17 | 2010-02-23 | Custom Speech Usa, Inc. | Synchronized pattern recognition source data processed by manual or automatic means for creation of shared speaker-dependent speech user profile |
US7231351B1 (en) * | 2002-05-10 | 2007-06-12 | Nexidia, Inc. | Transcript alignment |
US7260738B2 (en) | 2002-06-17 | 2007-08-21 | Microsoft Corporation | System and method for splitting an image across multiple computer readable media |
US20040001106A1 (en) * | 2002-06-26 | 2004-01-01 | John Deutscher | System and process for creating an interactive presentation employing multi-media components |
US7123696B2 (en) | 2002-10-04 | 2006-10-17 | Frederick Lowe | Method and apparatus for generating and distributing personalized media clips |
US7168953B1 (en) | 2003-01-27 | 2007-01-30 | Massachusetts Institute Of Technology | Trainable videorealistic speech animation |
US20050120391A1 (en) | 2003-12-02 | 2005-06-02 | Quadrock Communications, Inc. | System and method for generation of interactive TV content |
US20050228663A1 (en) | 2004-03-31 | 2005-10-13 | Robert Boman | Media production system using time alignment to scripts |
US7836389B2 (en) | 2004-04-16 | 2010-11-16 | Avid Technology, Inc. | Editing system for audiovisual works and corresponding text for television news |
AU2006214311A1 (en) * | 2005-02-14 | 2006-08-24 | Teresis Media Management, Inc. | Multipurpose media players |
US7672830B2 (en) | 2005-02-22 | 2010-03-02 | Xerox Corporation | Apparatus and methods for aligning words in bilingual sentences |
US8249871B2 (en) | 2005-11-18 | 2012-08-21 | Microsoft Corporation | Word clustering for input data |
US7623755B2 (en) | 2006-08-17 | 2009-11-24 | Adobe Systems Incorporated | Techniques for positioning audio and video clips |
US20080114603A1 (en) | 2006-11-15 | 2008-05-15 | Adacel, Inc. | Confirmation system for command or speech recognition using activation means |
US20080243503A1 (en) | 2007-03-30 | 2008-10-02 | Microsoft Corporation | Minimum divergence based discriminative training for pattern recognition |
US8345159B2 (en) | 2007-04-16 | 2013-01-01 | Caption Colorado L.L.C. | Captioning evaluation system |
US8170396B2 (en) | 2007-04-16 | 2012-05-01 | Adobe Systems Incorporated | Changing video playback rate |
KR101445869B1 (en) * | 2007-07-11 | 2014-09-29 | 엘지전자 주식회사 | Media Interface |
US8990848B2 (en) * | 2008-07-22 | 2015-03-24 | At&T Intellectual Property I, L.P. | System and method for temporally adaptive media playback |
JP2010074823A (en) | 2008-08-22 | 2010-04-02 | Panasonic Corp | Video editing system |
EP2109096B1 (en) | 2008-09-03 | 2009-11-18 | Svox AG | Speech synthesis with dynamic constraints |
US8219899B2 (en) | 2008-09-22 | 2012-07-10 | International Business Machines Corporation | Verbal description method and system |
US8131545B1 (en) | 2008-09-25 | 2012-03-06 | Google Inc. | Aligning a transcript to audio data |
US9049477B2 (en) | 2008-11-13 | 2015-06-02 | At&T Intellectual Property I, Lp | Apparatus and method for managing media content |
US8497939B2 (en) | 2008-12-08 | 2013-07-30 | Home Box Office, Inc. | Method and process for text-based assistive program descriptions for television |
US20100260482A1 (en) | 2009-04-14 | 2010-10-14 | Yossi Zoor | Generating a Synchronized Audio-Textual Description of a Video Recording Event |
US8701007B2 (en) * | 2009-04-30 | 2014-04-15 | Apple Inc. | Edit visualizer for modifying and evaluating uncommitted media content |
US20100332225A1 (en) * | 2009-06-29 | 2010-12-30 | Nexidia Inc. | Transcript alignment |
US8843368B2 (en) | 2009-08-17 | 2014-09-23 | At&T Intellectual Property I, L.P. | Systems, computer-implemented methods, and tangible computer-readable storage media for transcription alignment |
US8799953B2 (en) | 2009-08-27 | 2014-08-05 | Verizon Patent And Licensing Inc. | Media content distribution systems and methods |
US8281231B2 (en) | 2009-09-11 | 2012-10-02 | Digitalsmiths, Inc. | Timeline alignment for closed-caption text using speech recognition transcripts |
US9066049B2 (en) | 2010-04-12 | 2015-06-23 | Adobe Systems Incorporated | Method and apparatus for processing scripts |
2010
- 2010-05-28 US US12/789,791 patent/US9066049B2/en active Active
- 2010-05-28 US US12/789,720 patent/US8825488B2/en active Active
- 2010-05-28 US US12/789,749 patent/US20130124984A1/en not_active Abandoned
- 2010-05-28 US US12/789,708 patent/US8447604B1/en active Active
- 2010-05-28 US US12/789,785 patent/US8825489B2/en active Active
- 2010-05-28 US US12/789,760 patent/US9191639B2/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100005070A1 (en) * | 2002-04-12 | 2010-01-07 | Yoshimi Moriya | Metadata editing apparatus, metadata reproduction apparatus, metadata delivery apparatus, metadata search apparatus, metadata re-generation condition setting apparatus, and metadata delivery method and hint information description method |
US8064752B1 (en) * | 2003-12-09 | 2011-11-22 | Apple Inc. | Video encoding |
US7895037B2 (en) * | 2004-08-20 | 2011-02-22 | International Business Machines Corporation | Method and system for trimming audio files |
US20070233738A1 (en) * | 2006-04-03 | 2007-10-04 | Digitalsmiths Corporation | Media access system |
US20100262618A1 (en) * | 2009-04-14 | 2010-10-14 | Disney Enterprises, Inc. | System and method for real-time media presentation using metadata clips |
US20100299131A1 (en) * | 2009-05-21 | 2010-11-25 | Nexidia Inc. | Transcript alignment |
US20110239107A1 (en) * | 2010-03-29 | 2011-09-29 | Phillips Michael E | Transcript editor |
US8572488B2 (en) * | 2010-03-29 | 2013-10-29 | Avid Technology, Inc. | Spot dialog editor |
Non-Patent Citations (1)
Title |
---|
Cour et al., "Movie/Script: Alignment and Parsing of Video and Text Transcription", University of Pennsylvania, Philadelphia, PA 19104, pages 1-14. * |
Cited By (207)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8825489B2 (en) | 2010-04-12 | 2014-09-02 | Adobe Systems Incorporated | Method and apparatus for interpolating script data |
US9191639B2 (en) | 2010-04-12 | 2015-11-17 | Adobe Systems Incorporated | Method and apparatus for generating video descriptions |
US9066049B2 (en) | 2010-04-12 | 2015-06-23 | Adobe Systems Incorporated | Method and apparatus for processing scripts |
US8825488B2 (en) | 2010-04-12 | 2014-09-02 | Adobe Systems Incorporated | Method and apparatus for time synchronized script metadata |
US20120092232A1 (en) * | 2010-10-14 | 2012-04-19 | Zebra Imaging, Inc. | Sending Video Data to Multiple Light Modulators |
US20120166175A1 (en) * | 2010-12-22 | 2012-06-28 | Tata Consultancy Services Ltd. | Method and System for Construction and Rendering of Annotations Associated with an Electronic Image |
US9443324B2 (en) * | 2010-12-22 | 2016-09-13 | Tata Consultancy Services Limited | Method and system for construction and rendering of annotations associated with an electronic image |
US10554426B2 (en) | 2011-01-20 | 2020-02-04 | Box, Inc. | Real time notification of activities that occur in a web-based collaboration environment |
US8838262B2 (en) * | 2011-07-01 | 2014-09-16 | Dolby Laboratories Licensing Corporation | Synchronization and switch over methods and systems for an adaptive audio system |
US20140139738A1 (en) * | 2011-07-01 | 2014-05-22 | Dolby Laboratories Licensing Corporation | Synchronization and switch over methods and systems for an adaptive audio system |
US9652741B2 (en) | 2011-07-08 | 2017-05-16 | Box, Inc. | Desktop application for access and interaction with workspaces in a cloud-based content management system and synchronization mechanisms thereof |
US11657341B2 (en) | 2011-09-27 | 2023-05-23 | 3Play Media, Inc. | Electronic transcription job market |
US10748532B1 (en) | 2011-09-27 | 2020-08-18 | 3Play Media, Inc. | Electronic transcription job market |
US9704111B1 (en) | 2011-09-27 | 2017-07-11 | 3Play Media, Inc. | Electronic transcription job market |
US9098474B2 (en) | 2011-10-26 | 2015-08-04 | Box, Inc. | Preview pre-generation based on heuristics and algorithmic prediction/assessment of predicted user behavior for enhancement of user experience |
US11210610B2 (en) | 2011-10-26 | 2021-12-28 | Box, Inc. | Enhanced multimedia content preview rendering in a cloud content management system |
US9860295B2 (en) * | 2011-11-25 | 2018-01-02 | Radu Butarascu | Streaming the audio portion of a video ad to incompatible media players |
US20160269463A1 (en) * | 2011-11-25 | 2016-09-15 | Harry E. Emerson, III | Streaming the audio portion of a video ad to incompatible media players |
US11537630B2 (en) | 2011-11-29 | 2022-12-27 | Box, Inc. | Mobile platform file and folder selection functionalities for offline access and synchronization |
US11853320B2 (en) | 2011-11-29 | 2023-12-26 | Box, Inc. | Mobile platform file and folder selection functionalities for offline access and synchronization |
US9773051B2 (en) | 2011-11-29 | 2017-09-26 | Box, Inc. | Mobile platform file and folder selection functionalities for offline access and synchronization |
US10909141B2 (en) | 2011-11-29 | 2021-02-02 | Box, Inc. | Mobile platform file and folder selection functionalities for offline access and synchronization |
US9904435B2 (en) | 2012-01-06 | 2018-02-27 | Box, Inc. | System and method for actionable event generation for task delegation and management via a discussion forum in a web-based collaboration environment |
US9172983B2 (en) * | 2012-01-20 | 2015-10-27 | Gorilla Technology Inc. | Automatic media editing apparatus, editing method, broadcasting method and system for broadcasting the same |
US20130191440A1 (en) * | 2012-01-20 | 2013-07-25 | Gorilla Technology Inc. | Automatic media editing apparatus, editing method, broadcasting method and system for broadcasting the same |
US11232481B2 (en) | 2012-01-30 | 2022-01-25 | Box, Inc. | Extended applications of multimedia content previews in the cloud-based content management system |
US10579709B2 (en) | 2012-02-03 | 2020-03-03 | Google Llc | Promoting content |
US9378191B1 (en) | 2012-02-03 | 2016-06-28 | Google Inc. | Promoting content |
US9471551B1 (en) * | 2012-02-03 | 2016-10-18 | Google Inc. | Promoting content |
US10061751B1 (en) | 2012-02-03 | 2018-08-28 | Google Llc | Promoting content |
US9304985B1 (en) | 2012-02-03 | 2016-04-05 | Google Inc. | Promoting content |
US20140288922A1 (en) * | 2012-02-24 | 2014-09-25 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for man-machine conversation |
US10713624B2 (en) | 2012-02-24 | 2020-07-14 | Box, Inc. | System and method for promoting enterprise adoption of a web-based collaboration environment |
US9965745B2 (en) | 2012-02-24 | 2018-05-08 | Box, Inc. | System and method for promoting enterprise adoption of a web-based collaboration environment |
US9632997B1 (en) * | 2012-03-21 | 2017-04-25 | 3Play Media, Inc. | Intelligent caption systems and methods |
US9575981B2 (en) | 2012-04-11 | 2017-02-21 | Box, Inc. | Cloud service enabled to handle a set of files depicted to a user as a single file in a native operating system |
US20130283143A1 (en) * | 2012-04-24 | 2013-10-24 | Eric David Petajan | System for Annotating Media Content for Automatic Content Understanding |
US9413587B2 (en) | 2012-05-02 | 2016-08-09 | Box, Inc. | System and method for a third-party application to access content within a cloud-based platform |
US9396216B2 (en) | 2012-05-04 | 2016-07-19 | Box, Inc. | Repository redundancy implementation of a system which incrementally updates clients with events that occurred via a cloud-enabled platform |
US20130308922A1 (en) * | 2012-05-15 | 2013-11-21 | Microsoft Corporation | Enhanced video discovery and productivity through accessibility |
US9691051B2 (en) | 2012-05-21 | 2017-06-27 | Box, Inc. | Security enhancement through application access control |
US9280613B2 (en) | 2012-05-23 | 2016-03-08 | Box, Inc. | Metadata enabled third-party application access of content at a cloud-based platform via a native client to the cloud-based platform |
US9552444B2 (en) | 2012-05-23 | 2017-01-24 | Box, Inc. | Identification verification mechanisms for a third-party application to access content in a cloud-based platform |
US10452667B2 (en) | 2012-07-06 | 2019-10-22 | Box Inc. | Identification of people as search results from key-word based searches of content in a cloud-based environment |
US9712510B2 (en) | 2012-07-06 | 2017-07-18 | Box, Inc. | Systems and methods for securely submitting comments among users via external messaging applications in a cloud-based platform |
US9794256B2 (en) | 2012-07-30 | 2017-10-17 | Box, Inc. | System and method for advanced control tools for administrators in a cloud-based service |
US9799336B2 (en) | 2012-08-02 | 2017-10-24 | Audible, Inc. | Identifying corresponding regions of content |
US10109278B2 (en) * | 2012-08-02 | 2018-10-23 | Audible, Inc. | Aligning body matter across content formats |
US20140040713A1 (en) * | 2012-08-02 | 2014-02-06 | Steven C. Dzik | Selecting content portions for alignment |
US9558202B2 (en) | 2012-08-27 | 2017-01-31 | Box, Inc. | Server side techniques for reducing database workload in implementing selective subfolder synchronization in a cloud-based environment |
US9135462B2 (en) | 2012-08-29 | 2015-09-15 | Box, Inc. | Upload and download streaming encryption to/from a cloud-based platform |
US9450926B2 (en) | 2012-08-29 | 2016-09-20 | Box, Inc. | Upload and download streaming encryption to/from a cloud-based platform |
US9195519B2 (en) | 2012-09-06 | 2015-11-24 | Box, Inc. | Disabling the self-referential appearance of a mobile application in an intent via a background registration |
US9117087B2 (en) | 2012-09-06 | 2015-08-25 | Box, Inc. | System and method for creating a secure channel for inter-application communication based on intents |
US9292833B2 (en) | 2012-09-14 | 2016-03-22 | Box, Inc. | Batching notifications of activities that occur in a web-based collaboration environment |
US20140082091A1 (en) * | 2012-09-19 | 2014-03-20 | Box, Inc. | Cloud-based platform enabled with media content indexed for text-based searches and/or metadata extraction |
US10915492B2 (en) * | 2012-09-19 | 2021-02-09 | Box, Inc. | Cloud-based platform enabled with media content indexed for text-based searches and/or metadata extraction |
US9959420B2 (en) | 2012-10-02 | 2018-05-01 | Box, Inc. | System and method for enhanced security and management mechanisms for enterprise administrators in a cloud-based environment |
US9495364B2 (en) | 2012-10-04 | 2016-11-15 | Box, Inc. | Enhanced quick search features, low-barrier commenting/interactive features in a collaboration platform |
US11580175B2 (en) * | 2012-10-05 | 2023-02-14 | Google Llc | Transcoding and serving resources |
US9665349B2 (en) | 2012-10-05 | 2017-05-30 | Box, Inc. | System and method for generating embeddable widgets which enable access to a cloud-based collaboration platform |
US20140114657A1 (en) * | 2012-10-22 | 2014-04-24 | Huseby, Inc. | Apparatus and method for inserting material into transcripts
US9251790B2 (en) * | 2012-10-22 | 2016-02-02 | Huseby, Inc. | Apparatus and method for inserting material into transcripts |
US11721320B2 (en) * | 2012-12-10 | 2023-08-08 | Samsung Electronics Co., Ltd. | Method and user device for providing context awareness service using speech recognition |
US20220383852A1 (en) * | 2012-12-10 | 2022-12-01 | Samsung Electronics Co., Ltd. | Method and user device for providing context awareness service using speech recognition |
US11410640B2 (en) * | 2012-12-10 | 2022-08-09 | Samsung Electronics Co., Ltd. | Method and user device for providing context awareness service using speech recognition |
US20180182374A1 (en) * | 2012-12-10 | 2018-06-28 | Samsung Electronics Co., Ltd. | Method and user device for providing context awareness service using speech recognition |
US10395639B2 (en) * | 2012-12-10 | 2019-08-27 | Samsung Electronics Co., Ltd. | Method and user device for providing context awareness service using speech recognition |
US10832655B2 (en) * | 2012-12-10 | 2020-11-10 | Samsung Electronics Co., Ltd. | Method and user device for providing context awareness service using speech recognition |
US20190362705A1 (en) * | 2012-12-10 | 2019-11-28 | Samsung Electronics Co., Ltd. | Method and user device for providing context awareness service using speech recognition |
US10235383B2 (en) | 2012-12-19 | 2019-03-19 | Box, Inc. | Method and apparatus for synchronization of items with read-only permissions in a cloud-based environment |
US9396245B2 (en) | 2013-01-02 | 2016-07-19 | Box, Inc. | Race condition handling in a system which incrementally updates clients with events that occurred in a cloud-based collaboration platform |
US9953036B2 (en) | 2013-01-09 | 2018-04-24 | Box, Inc. | File system monitoring in a system which incrementally updates clients with events that occurred in a cloud-based collaboration platform |
US9507795B2 (en) | 2013-01-11 | 2016-11-29 | Box, Inc. | Functionalities, features, and user interface of a synchronization client to a cloud-based environment |
US20140201778A1 (en) * | 2013-01-15 | 2014-07-17 | Sap Ag | Method and system of interactive advertisement |
US10599671B2 (en) | 2013-01-17 | 2020-03-24 | Box, Inc. | Conflict resolution, retry condition management, and handling of problem files for the synchronization client to a cloud-based platform |
US20140303974A1 (en) * | 2013-04-03 | 2014-10-09 | Kabushiki Kaisha Toshiba | Text generator, text generating method, and computer program product |
US9460718B2 (en) * | 2013-04-03 | 2016-10-04 | Kabushiki Kaisha Toshiba | Text generator, text generating method, and computer program product |
US10725968B2 (en) | 2013-05-10 | 2020-07-28 | Box, Inc. | Top down delete or unsynchronization on delete of and depiction of item synchronization with a synchronization client to a cloud-based platform |
US10846074B2 (en) | 2013-05-10 | 2020-11-24 | Box, Inc. | Identification and handling of items to be ignored for synchronization with a cloud-based platform by a synchronization client |
US20160133251A1 (en) * | 2013-05-31 | 2016-05-12 | Longsand Limited | Processing of audio data |
US10877937B2 (en) | 2013-06-13 | 2020-12-29 | Box, Inc. | Systems and methods for synchronization event building and/or collapsing by a synchronization component of a cloud-based platform |
US9633037B2 (en) | 2013-06-13 | 2017-04-25 | Box, Inc | Systems and methods for synchronization event building and/or collapsing by a synchronization component of a cloud-based platform |
US9805050B2 (en) | 2013-06-21 | 2017-10-31 | Box, Inc. | Maintaining and updating file system shadows on a local device by a synchronization client of a cloud-based platform |
US11531648B2 (en) | 2013-06-21 | 2022-12-20 | Box, Inc. | Maintaining and updating file system shadows on a local device by a synchronization client of a cloud-based platform |
US8996360B2 (en) * | 2013-06-26 | 2015-03-31 | Huawei Technologies Co., Ltd. | Method and apparatus for generating journal |
US20150006152A1 (en) * | 2013-06-26 | 2015-01-01 | Huawei Technologies Co., Ltd. | Method and Apparatus for Generating Journal |
US9535924B2 (en) | 2013-07-30 | 2017-01-03 | Box, Inc. | Scalability improvement in a system which incrementally updates clients with events that occurred in a cloud-based collaboration platform |
US11822759B2 (en) | 2013-09-13 | 2023-11-21 | Box, Inc. | System and methods for configuring event-based automation in cloud-based collaboration platforms |
US11435865B2 (en) | 2013-09-13 | 2022-09-06 | Box, Inc. | System and methods for configuring event-based automation in cloud-based collaboration platforms |
US9535909B2 (en) | 2013-09-13 | 2017-01-03 | Box, Inc. | Configurable event-based automation architecture for cloud-based collaboration platforms |
US10509527B2 (en) | 2013-09-13 | 2019-12-17 | Box, Inc. | Systems and methods for configuring event-based automation in cloud-based collaboration platforms |
US20150121218A1 (en) * | 2013-10-30 | 2015-04-30 | Samsung Electronics Co., Ltd. | Method and apparatus for controlling text input in electronic device |
US20150237298A1 (en) * | 2014-02-19 | 2015-08-20 | Nexidia Inc. | Supplementary media validation system |
US9635219B2 (en) * | 2014-02-19 | 2017-04-25 | Nexidia Inc. | Supplementary media validation system |
US10304458B1 (en) * | 2014-03-06 | 2019-05-28 | Board of Trustees of the University of Alabama and the University of Alabama in Huntsville | Systems and methods for transcribing videos using speaker identification |
US10419818B2 (en) * | 2014-04-29 | 2019-09-17 | At&T Intellectual Property I, L.P. | Method and apparatus for augmenting media content |
US10530854B2 (en) | 2014-05-30 | 2020-01-07 | Box, Inc. | Synchronization of permissioned content in cloud-based environments |
US20150379098A1 (en) * | 2014-06-27 | 2015-12-31 | Samsung Electronics Co., Ltd. | Method and apparatus for managing data |
CN106471493A (en) * | 2014-06-27 | 2017-03-01 | 三星电子株式会社 | Method and apparatus for managing data |
US10691717B2 (en) * | 2014-06-27 | 2020-06-23 | Samsung Electronics Co., Ltd. | Method and apparatus for managing data |
US9478059B2 (en) * | 2014-07-28 | 2016-10-25 | PocketGems, Inc. | Animated audiovisual experiences driven by scripts |
US20160042766A1 (en) * | 2014-08-06 | 2016-02-11 | Echostar Technologies L.L.C. | Custom video content |
US10708321B2 (en) | 2014-08-29 | 2020-07-07 | Box, Inc. | Configurable metadata-based automation and content classification architecture for cloud-based collaboration platforms |
US9894119B2 (en) | 2014-08-29 | 2018-02-13 | Box, Inc. | Configurable metadata-based automation and content classification architecture for cloud-based collaboration platforms |
US10038731B2 (en) | 2014-08-29 | 2018-07-31 | Box, Inc. | Managing flow-based interactions with cloud-based shared content |
US11876845B2 (en) | 2014-08-29 | 2024-01-16 | Box, Inc. | Configurable metadata-based automation and content classification architecture for cloud-based collaboration platforms |
US11146600B2 (en) | 2014-08-29 | 2021-10-12 | Box, Inc. | Configurable metadata-based automation and content classification architecture for cloud-based collaboration platforms |
US10708323B2 (en) | 2014-08-29 | 2020-07-07 | Box, Inc. | Managing flow-based interactions with cloud-based shared content |
US11195544B2 (en) | 2014-10-29 | 2021-12-07 | International Business Machines Corporation | Computerized tool for creating variable length presentations |
US10360925B2 (en) * | 2014-10-29 | 2019-07-23 | International Business Machines Corporation | Computerized tool for creating variable length presentations |
US20160124909A1 (en) * | 2014-10-29 | 2016-05-05 | International Business Machines Corporation | Computerized tool for creating variable length presentations |
US20160156575A1 (en) * | 2014-11-27 | 2016-06-02 | Samsung Electronics Co., Ltd. | Method and apparatus for providing content |
US9904505B1 (en) * | 2015-04-10 | 2018-02-27 | Zaxcom, Inc. | Systems and methods for processing and recording audio with integrated script mode |
US10854190B1 (en) * | 2016-06-13 | 2020-12-01 | United Services Automobile Association (Usaa) | Transcription analysis platform |
US11837214B1 (en) | 2016-06-13 | 2023-12-05 | United Services Automobile Association (Usaa) | Transcription analysis platform |
US10489516B2 (en) * | 2016-07-13 | 2019-11-26 | Fujitsu Social Science Laboratory Limited | Speech recognition and translation terminal, method and non-transitory computer readable medium |
US10339224B2 (en) | 2016-07-13 | 2019-07-02 | Fujitsu Social Science Laboratory Limited | Speech recognition and translation terminal, method and non-transitory computer readable medium |
US20180018325A1 (en) * | 2016-07-13 | 2018-01-18 | Fujitsu Social Science Laboratory Limited | Terminal equipment, translation method, and non-transitory computer readable medium |
US9774911B1 (en) * | 2016-07-29 | 2017-09-26 | Rovi Guides, Inc. | Methods and systems for automatically evaluating an audio description track of a media asset |
US10674208B2 (en) | 2016-07-29 | 2020-06-02 | Rovi Guides, Inc. | Methods and systems for automatically evaluating an audio description track of a media asset |
US10154308B2 (en) | 2016-07-29 | 2018-12-11 | Rovi Guides, Inc. | Methods and systems for automatically evaluating an audio description track of a media asset |
US12118266B2 (en) | 2016-10-04 | 2024-10-15 | Descript, Inc. | Platform for producing and delivering media content |
US10445052B2 (en) * | 2016-10-04 | 2019-10-15 | Descript, Inc. | Platform for producing and delivering media content |
US20180095713A1 (en) * | 2016-10-04 | 2018-04-05 | Descript, Inc. | Platform for producing and delivering media content |
US11262970B2 (en) | 2016-10-04 | 2022-03-01 | Descript, Inc. | Platform for producing and delivering media content |
US10354008B2 (en) * | 2016-10-07 | 2019-07-16 | Productionpro Technologies Inc. | System and method for providing a visual scroll representation of production data |
US11294542B2 (en) | 2016-12-15 | 2022-04-05 | Descript, Inc. | Techniques for creating and presenting media content |
US10564817B2 (en) | 2016-12-15 | 2020-02-18 | Descript, Inc. | Techniques for creating and presenting media content |
US11747967B2 (en) | 2016-12-15 | 2023-09-05 | Descript, Inc. | Techniques for creating and presenting media content |
US20180203925A1 (en) * | 2017-01-17 | 2018-07-19 | Acoustic Protocol Inc. | Signature-based acoustic classification |
US20180286459A1 (en) * | 2017-03-30 | 2018-10-04 | Lenovo (Beijing) Co., Ltd. | Audio processing |
US9741337B1 (en) * | 2017-04-03 | 2017-08-22 | Green Key Technologies Llc | Adaptive self-trained computer engines with associated databases and methods of use thereof |
US11114088B2 (en) * | 2017-04-03 | 2021-09-07 | Green Key Technologies, Inc. | Adaptive self-trained computer engines with associated databases and methods of use thereof |
US20210375266A1 (en) * | 2017-04-03 | 2021-12-02 | Green Key Technologies, Inc. | Adaptive self-trained computer engines with associated databases and methods of use thereof |
US11238899B1 (en) * | 2017-06-13 | 2022-02-01 | 3Play Media Inc. | Efficient audio description systems and methods |
US10567701B2 (en) * | 2017-08-18 | 2020-02-18 | Prime Focus Technologies, Inc. | System and method for source script and video synchronization interface |
US20190058845A1 (en) * | 2017-08-18 | 2019-02-21 | Prime Focus Technologies, Inc. | System and method for source script and video synchronization interface |
US10057537B1 (en) * | 2017-08-18 | 2018-08-21 | Prime Focus Technologies, Inc. | System and method for source script and video synchronization interface |
US20190079724A1 (en) * | 2017-09-12 | 2019-03-14 | Google Llc | Intercom-style communication using multiple computing devices |
US10225621B1 (en) | 2017-12-20 | 2019-03-05 | Dish Network L.L.C. | Eyes free entertainment |
US10645464B2 (en) | 2017-12-20 | 2020-05-05 | Dish Network L.L.C. | Eyes free entertainment |
US11055348B2 (en) * | 2017-12-29 | 2021-07-06 | Facebook, Inc. | Systems and methods for automatically generating stitched media content |
US10922489B2 (en) * | 2018-01-11 | 2021-02-16 | RivetAI, Inc. | Script writing and content generation tools and improved operation of same |
US10896294B2 (en) | 2018-01-11 | 2021-01-19 | End Cue, Llc | Script writing and content generation tools and improved operation of same |
US11983183B2 (en) * | 2018-08-07 | 2024-05-14 | Disney Enterprises, Inc. | Techniques for training machine learning models using actor data |
US20200050677A1 (en) * | 2018-08-07 | 2020-02-13 | Disney Enterprises, Inc. | Joint understanding of actors, literary characters, and movies |
US10885942B2 (en) * | 2018-09-18 | 2021-01-05 | At&T Intellectual Property I, L.P. | Video-log production system |
US11605402B2 (en) | 2018-09-18 | 2023-03-14 | At&T Intellectual Property I, L.P. | Video-log production system |
US20200090701A1 (en) * | 2018-09-18 | 2020-03-19 | At&T Intellectual Property I, L.P. | Video-log production system |
US11094327B2 (en) * | 2018-09-28 | 2021-08-17 | Lenovo (Singapore) Pte. Ltd. | Audible input transcription |
US11122099B2 (en) * | 2018-11-30 | 2021-09-14 | Motorola Solutions, Inc. | Device, system and method for providing audio summarization data from video |
US11729475B2 (en) * | 2018-12-21 | 2023-08-15 | Bce Inc. | System and method for providing descriptive video |
US10785385B2 (en) * | 2018-12-26 | 2020-09-22 | NBCUniversal Media, LLC. | Systems and methods for aligning text and multimedia content |
US20200213478A1 (en) * | 2018-12-26 | 2020-07-02 | Nbcuniversal Media, Llc | Systems and methods for aligning text and multimedia content |
US11238886B1 (en) * | 2019-01-09 | 2022-02-01 | Audios Ventures Inc. | Generating video information representative of audio clips |
US10891489B2 (en) * | 2019-04-08 | 2021-01-12 | Nedelco, Incorporated | Identifying and tracking words in a video recording of captioning session |
US11245950B1 (en) * | 2019-04-24 | 2022-02-08 | Amazon Technologies, Inc. | Lyrics synchronization |
US11736654B2 (en) | 2019-06-11 | 2023-08-22 | WeMovie Technologies | Systems and methods for producing digital multimedia contents including movies and tv shows |
US11570525B2 (en) | 2019-08-07 | 2023-01-31 | WeMovie Technologies | Adaptive marketing in cloud-based content production |
US20220279228A1 (en) * | 2019-08-29 | 2022-09-01 | BOND Co., Ltd. | Program production method, program production apparatus, and recording medium |
US11659258B2 (en) * | 2019-08-29 | 2023-05-23 | BOND Co., Ltd. | Program production method, program production apparatus, and recording medium |
US11783860B2 (en) | 2019-10-08 | 2023-10-10 | WeMovie Technologies | Pre-production systems for making movies, tv shows and multimedia contents |
US11107503B2 (en) | 2019-10-08 | 2021-08-31 | WeMovie Technologies | Pre-production systems for making movies, TV shows and multimedia contents |
US11430485B2 (en) * | 2019-11-19 | 2022-08-30 | Netflix, Inc. | Systems and methods for mixing synthetic voice with original audio tracks |
US11138970B1 (en) * | 2019-12-06 | 2021-10-05 | Asapp, Inc. | System, method, and computer program for creating a complete transcription of an audio recording from separately transcribed redacted and unredacted words |
CN111462775A (en) * | 2020-03-30 | 2020-07-28 | 腾讯科技(深圳)有限公司 | Audio similarity determination method, device, server and medium |
US11399121B2 (en) | 2020-04-30 | 2022-07-26 | Gopro, Inc. | Systems and methods for synchronizing information for videos |
US10924636B1 (en) * | 2020-04-30 | 2021-02-16 | Gopro, Inc. | Systems and methods for synchronizing information for videos |
WO2021225608A1 (en) * | 2020-05-08 | 2021-11-11 | WeMovie Technologies | Fully automated post-production editing for movies, tv shows and multimedia contents |
US12014752B2 (en) | 2020-05-08 | 2024-06-18 | WeMovie Technologies | Fully automated post-production editing for movies, tv shows and multimedia contents |
US11315602B2 (en) | 2020-05-08 | 2022-04-26 | WeMovie Technologies | Fully automated post-production editing for movies, TV shows and multimedia contents |
US11943512B2 (en) | 2020-08-27 | 2024-03-26 | WeMovie Technologies | Content structure aware multimedia streaming service for movies, TV shows and multimedia contents |
US11564014B2 (en) | 2020-08-27 | 2023-01-24 | WeMovie Technologies | Content structure aware multimedia streaming service for movies, TV shows and multimedia contents |
US20220101880A1 (en) * | 2020-09-28 | 2022-03-31 | TCL Research America Inc. | Write-a-movie: unifying writing and shooting |
US11423941B2 (en) * | 2020-09-28 | 2022-08-23 | TCL Research America Inc. | Write-a-movie: unifying writing and shooting |
CN112256672A (en) * | 2020-10-22 | 2021-01-22 | 中国联合网络通信集团有限公司 | Database change approval method and device |
US11812121B2 (en) | 2020-10-28 | 2023-11-07 | WeMovie Technologies | Automated post-production editing for user-generated multimedia contents |
US20220130424A1 (en) * | 2020-10-28 | 2022-04-28 | Facebook Technologies, Llc | Text-driven editor for audio and video assembly |
US12087329B1 (en) | 2020-10-28 | 2024-09-10 | Meta Platforms Technologies, Llc | Text-driven editor for audio and video editing |
US11166086B1 (en) | 2020-10-28 | 2021-11-02 | WeMovie Technologies | Automated post-production editing for user-generated multimedia contents |
US12013921B2 (en) | 2020-11-03 | 2024-06-18 | Capital One Services, Llc | Computer-based systems configured for automated computer script analysis and malware detection and methods thereof |
WO2022098759A1 (en) * | 2020-11-03 | 2022-05-12 | Capital One Services, Llc | Computer-based systems configured for automated computer script analysis and malware detection and methods thereof |
US11675881B2 (en) | 2020-11-03 | 2023-06-13 | Capital One Services, Llc | Computer-based systems configured for automated computer script analysis and malware detection and methods thereof |
US11481475B2 (en) | 2020-11-03 | 2022-10-25 | Capital One Services, Llc | Computer-based systems configured for automated computer script analysis and malware detection and methods thereof |
US11729476B2 (en) * | 2021-02-08 | 2023-08-15 | Sony Group Corporation | Reproduction control of scene description |
US20220256156A1 (en) * | 2021-02-08 | 2022-08-11 | Sony Group Corporation | Reproduction control of scene description |
US20220284886A1 (en) * | 2021-03-03 | 2022-09-08 | Spotify Ab | Systems and methods for providing responses from media content |
US11887586B2 (en) * | 2021-03-03 | 2024-01-30 | Spotify Ab | Systems and methods for providing responses from media content |
US11521639B1 (en) | 2021-04-02 | 2022-12-06 | Asapp, Inc. | Speech sentiment analysis using a speech sentiment classifier pretrained with pseudo sentiment labels |
US20230027035A1 (en) * | 2021-07-09 | 2023-01-26 | Transitional Forms Inc. | Automated narrative production system and script production method with real-time interactive characters |
US11330154B1 (en) | 2021-07-23 | 2022-05-10 | WeMovie Technologies | Automated coordination in multimedia content production |
US11924574B2 (en) | 2021-07-23 | 2024-03-05 | WeMovie Technologies | Automated coordination in multimedia content production |
US11763803B1 (en) | 2021-07-28 | 2023-09-19 | Asapp, Inc. | System, method, and computer program for extracting utterances corresponding to a user problem statement in a conversation between a human agent and a user |
US11735186B2 (en) | 2021-09-07 | 2023-08-22 | 3Play Media, Inc. | Hybrid live captioning systems and methods |
US11769481B2 (en) * | 2021-10-07 | 2023-09-26 | Nvidia Corporation | Unsupervised alignment for text to speech synthesis using neural networks |
US11869483B2 (en) * | 2021-10-07 | 2024-01-09 | Nvidia Corporation | Unsupervised alignment for text to speech synthesis using neural networks |
US20230110905A1 (en) * | 2021-10-07 | 2023-04-13 | Nvidia Corporation | Unsupervised alignment for text to speech synthesis using neural networks |
US20230113950A1 (en) * | 2021-10-07 | 2023-04-13 | Nvidia Corporation | Unsupervised alignment for text to speech synthesis using neural networks |
US11790271B2 (en) | 2021-12-13 | 2023-10-17 | WeMovie Technologies | Automated evaluation of acting performance using cloud services |
US11321639B1 (en) | 2021-12-13 | 2022-05-03 | WeMovie Technologies | Automated evaluation of acting performance using cloud services |
US12067363B1 (en) | 2022-02-24 | 2024-08-20 | Asapp, Inc. | System, method, and computer program for text sanitization |
US11770590B1 (en) | 2022-04-27 | 2023-09-26 | VoyagerX, Inc. | Providing subtitle for video content in spoken language |
US11947924B2 (en) | 2022-04-27 | 2024-04-02 | VoyagerX, Inc. | Providing translated subtitle for video content |
US11763099B1 (en) | 2022-04-27 | 2023-09-19 | VoyagerX, Inc. | Providing translated subtitle for video content |
US12099815B2 (en) | 2022-04-27 | 2024-09-24 | VoyagerX, Inc. | Providing subtitle for video content in spoken language |
CN117240983A (en) * | 2023-11-16 | 2023-12-15 | 湖南快乐阳光互动娱乐传媒有限公司 | Method and device for automatically generating sound drama |
Also Published As
Publication number | Publication date |
---|---|
US8447604B1 (en) | 2013-05-21 |
US8825489B2 (en) | 2014-09-02 |
US9066049B2 (en) | 2015-06-23 |
US20130124212A1 (en) | 2013-05-16 |
US20130124213A1 (en) | 2013-05-16 |
US20130124203A1 (en) | 2013-05-16 |
US20130124202A1 (en) | 2013-05-16 |
US20130120654A1 (en) | 2013-05-16 |
US9191639B2 (en) | 2015-11-17 |
US8825488B2 (en) | 2014-09-02 |
Similar Documents
Publication | Title |
---|---|
US9191639B2 (en) | Method and apparatus for generating video descriptions
US11990132B2 (en) | Automated meeting minutes generator
US11545156B2 (en) | Automated meeting minutes generation service
JP7063348B2 (en) | Artificial intelligence generation of proposed document edits from recorded media
US7693717B2 (en) | Session file modification with annotation using speech recognition or text to speech
US7046914B2 (en) | Automatic content analysis and representation of multimedia presentations
US20200126583A1 (en) | Discovering highlights in transcribed source material for rapid multimedia production
US8966360B2 (en) | Transcript editor
US20100299131A1 (en) | Transcript alignment
US20200090661A1 (en) | Systems and Methods for Improved Digital Transcript Creation Using Automated Speech Recognition
EP1692629B1 (en) | System & method for integrative analysis of intrinsic and extrinsic audio-visual data
US20070244700A1 (en) | Session File Modification with Selective Replacement of Session File Components
US20200126559A1 (en) | Creating multi-media from transcript-aligned media recordings
CN101382937A (en) | Multimedia resource processing method based on speech recognition and on-line teaching system thereof
JP2009515253A (en) | Automatic detection and application of editing patterns in draft documents
US11922944B2 (en) | Phrase alternatives representation for automatic speech recognition and methods of use
US20130080384A1 (en) | Systems and methods for extracting and processing intelligent structured data from media files
JP3938096B2 (en) | Index creation device, index creation method, and index creation program
CN100538696C (en) | System and method for the analysis-by-synthesis of intrinsic and extrinsic audio-visual data
KR101783872B1 (en) | Video Search System and Method thereof
Lindsay et al. | Representation and linking mechanisms for audio in MPEG-7
Ordelman et al. | Towards affordable disclosure of spoken heritage archives
KR20070003778A (en) | System & method for integrative analysis of intrinsic and extrinsic audio-visual data
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: ADOBE SYSTEMS INCORPORATED, CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: KUSPA, DAVID A.; REEL/FRAME: 024454/0977; Effective date: 20100527 |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION |