PREPROCESSOR METHOD AND APPARATUS
CLAIM OF PRIORITY UNDER 35 U.S.C. §119
JOOOIl The present Application for Patent claims priority to Provisional Application No. 60/789,048 filed April 3, 2006, Provisional Application No. 60/789,266 filed April 4, 2006, and Provisional Application No. 60/789,377 filed April 4, 2006, all of which are assigned to the assignee hereof and hereby expressly incorporated by reference herein.
BACKGROUNB Field
}0002] The invention generally relates to multimedia data processing, and more particularly, to processing operations performed prior to or in conjunction with data compression processing.
Background
SUMMARY
[0003] Each of the inventive apparatuses and methods described herein has several aspects, no single one of which is solely responsible for its desirable attributes. Without limiting the scope of this invention, its more prominent features will now be discussed briefly. After considering this discussion, and particularly after reading the section entitled "Detailed Description" one will understand how the features of this invention provides improvements for multimedia data processing apparatuses and methods j0004| In one aspect, a method of processing multimedia data comprises receiving interlaced video frames, converting the interlaced video frames to progressive video, generating metadata associated with the progressive video, and providing the progressive video and at least a portion of the metadata to an encoder for use in encoding the progressive video. The method can further include encoding the progressive video using the metadata In some aspects, the interlaced video frames comprise NTSC video. Converting the video frames can include deinterlacing the interlaced video frames. JCM)OS) In some aspects, the metadata can include bandwidth information, bi-directional motion information, a bandwidth ratio, a complexity value such as a temporal or a spatial
complexity value or both, luminance information, and the spatial information can include luminance and/or chrominance information. The method can also include generating spatial information and bi-directional motion information for the interlaced video frames and generating the progressive video based on the interlaced video frames using the spatial and bi-directional motion information. In some aspects converting the interlaced video frames comprises inverse telecining 3/2 pulldown video frames, and/or resizing the progressive video. The method can further comprise partitioning the progressive video to determine group of picture information, where the partitioning can include shot detection of the progressive video. In some aspects, the method also includes progressive video with a denoising filter.
}0006| In another aspect, an apparatus for processing multimedia data can include a receiver configured to receive interlaced video frames, a deinterlacer configured to convert the interlaced video frames to progressive video, and a parti tioner configured to generate metadata associated with the progressive video and provide the progressive video and the metadata to an encoder for use in encoding the progressive video. In some aspects, the apparatus can further include an encoder configured to receive the progressive video from the communications module and encode the progressive video using the provided metadata. The deinterlacer can be configured to perform spatio- temporal dejnteriacing and/or inverse telecining. The parti tioner can be configured to perform shot detection and generate compression information based on the shot detection. In some aspects the parti tioner can be configured to generate bandwidth information. The apparatus can also include a resampler configured to resize a progressive frame. The metadata can include bandwidth information, bi-directional motion information, a bandwidth ratio, luminance information, a spatial complexity value related to content, and/or a temporal complexity value related to content. In some aspects, the deinterlacer is configured to generate spatial information and bi-directional motion information for the interlaced video frames and progressive video based on the interlaced video frames using spatial and bi-directional motion information.
|0007j Another aspect comprises an apparatus for processing multimedia data includes means for receiving interlaced video frames, means for converting the interlaced video frames to progressive video, means for generating metadata associated with the progressive video, and means for providing the progressive video and at least a portion of the metadata to an encoder for use in encoding the progressive video In some aspects the converting means comprises an inverse teleciner and/or a spatio- temporal deinterlacer In some aspects, the generating means is configured to perform shot detection and generate compression information based on the shot detection. In some aspects the generating means is configured to generate bandwidth information, In some aspects, the generating includes means for resampling to resize a progressive frame. fOøOSJ Another aspect comprises a machine readable medium comprising instructions for processing multimedia data that upon execution cause a machine to receive interlaced video frames, convert the interlaced video frames to progressive v ideo, generate metadata associated with the progressive video, and provide the progressive video and at least a portion of the metadata to an encoder for use in encoding the progressive video.
|0009| Another aspect includes a processor comprising a configuration to receive interlaced video, convert the interlaced video to progressive video, generate metadata associated with the progressive video, and provide the progressive video and at least a portion of the metadata to an encoder for use in encoding the progressive video. 'The conversion of the interlaced video can include a performing spatio-temporal deinterlacing. In some aspects, the conversion of the interlaced video comprises performing inverse telecine In some aspects, generation of metadata includes generating compression information based on detecting shot changes. In some aspects, generation of metadata includes determining compression information of the progressive video. In some aspects, the configuration includes a configuration to resample video to generate a resized a progressive frame. In some aspects, the metadata can include bandwidth information, bi-directional motion information, complexity information such as temporal or spatial complexity information based on content, and/or com pressi on i nforroati on
BRIEF DESCRIPTION OF THE DRAWINGS fø010| Figtiic 1 is a block diagram of a communications system for delivering streaming multimedia data, j 00111 Figure 2 is a block diagram of a digital transmission facility that includes a preprocessor.
|0012j Figure 3A is a bSock diagram of an illustrative aspect of a preprocessor,
{O0I3J Figure 3B is a flou diagram that illustrates a process for processing multimedia data, f0014| Figure 3C is a block diagram illustrating means for processing multimedia data, ffløtSJ Figure 4 is a block diagram illustrating operations of an exemplary preprocessor. f0016| Figure 5 is a diagram of phase decisions in an inverse telecme process,
|0017j Figure 6 is a How diagram illustrating a piocess of inverting telecined video,
[0018J Figure 7 is an illustration of a trellis showing phase transitions, fflølθj Figure 8 is a guide to identify the respective frames that are used to create a pi uralit) of matri ces,
|0020| Figure 9 is a flou diagiam illustrating how the metrics of Figure 8 are created.
|002I I Figure 10 is a flow diagram which shows the processing of the metrics to arrhe at an estimated phase,
{00221 Figure I i is a dataflow diagram illustrating a sj steni for generating decision variables,
|0O23j Figure 12 is a block diagram depicting variables that are used to exaluate the branch information,
{00241 Figures 13 A. 13B and S3C" are flow diagrams showing how lower envelopes arc computed,
|0025j Figure 14 is a flow diagram shoeing the operation of a consistency deteetoi,
[0026J Figure 1 5 is a flow diagram showing a process of computing an offset to a decision \ ariable that is used to compensate for inconsistency in phase decisions,
|00271 Figure 16 presents the operation of inverse telecine after the pull down phase has been estimated
[002&J figure 17 is a block diagram of a deinterlaeer device,
|002θ| Figure 18 is a block diagram of another deinterlacer dev ice,
[0030] Figure 19 is. drawing of a subsampling pattern of an interlaced picture.
{00311 Figure 20 is a block diagram of a deinteriacer dev ice that uses Wmed filtering motion estimation to generate a deintedaced frame,
[0032J Figure 21 illustrates one aspect of an aperture for determining static areas of multimedia data,
{0033 J Figure 22 is a diagram illustrating one aspect of an aperture for determining slow -motion areas of multimedia data,
[0034 f Hgϊire 23 is a diagram illustrating an aspect of motion estimation,
{0035J Figure 24 illustrates two motion vector maps used in determining motion compensation,
|0036| Figure 25 is a flow diagram illustrating a method of deinterlacing multimedia data,
|0037| Figure 26 is a flow diagram illustrating a method of generating a deinterlaced frame using spatio-temporal information,
{0038] Figure 27 is a flow diagram illustrating a method of performing motion compensation for deinterlacing,
[0039J Figure 28 is a block diagram of a preprocessor comprising a processor configured for shot detection and other preprocessing operations according to some aspects,
{00401 Figure 2V illustrates the relationship between encoding complexity C and allocated bits B,
[004 J ϊ Figure 30 is a flow diagram that illustrates a process that operates on a group of pictures and can be used in some aspects to encode video based on shot detection in video frames.
[0042 J Figure 31 is a flow diagram illustrating a process for shot detection,
[0043} Figure 32 is a Dow diagram illustrating a process foi determining different classifications of shots in video,
[0044] Figure 33 is a flow diagiam illustrating a process for assigning frame compression schemes to video frames based on shot detection results,
|0045j figure 34 is a flow diagram illustrating a process for determining abmpt scene changes,
|0046| Figure 35 is. a flow diagram illustrating a process for determining slowly- changing scenes,
|0047j figure 36 is a flow diagram illustrating a process for determining scenes containing camera flashes,
|004S| Figure 37 iiiustrates motion compensation \ ectors between a current frame and a previous frame MVp and a current frame and a next frame MV\, f0049| Figure 38 is a graph illustrating a relationship for a variable used in determining a frame difference metric, fflø50J Figure 39 is a block diagram illustrating encoding data and calculating residuals, fθOSlJ Figure 40 is a block diagram illustrating determining a frame difference metric,
[0052J Figure 41 is a flow diagram illustrating the procedure where compression types are assigned to frames,
{0053] Figure 42 illustrates an example of 1-D po!> phase resampling,
|0054| Figure 43 is a graphic illustrating a safe action area and a safe title aiea of a frame of data, and
|0055j Hgure 44 is a graphic illustrating a safe action area of a frame of data
DETAILED DESCRIPTION
|0056j The following description includes details to provide a thorough understanding of the examples However, it is understood b\ one of ordinary skill in the art that the examples may be practiced even if every detail of a process or device in an example or aspect is not described or illustrated herein hor example, electrical components nia> be shown in block diagrams that do not illustrate every electrical connection or ever> electrical element of the component in oidei not to obscure the examples in unnecessary detail In other instances, such components, other structures and techniques mav be shown in detail to furthei explain the examples fO057| Described herein are certain inventive aspects and aspects for preprocessors and preprocessoi operations methods that improve the performance of existing
preprocessing and encoding systems. Such preprocessors can process metadata and video in preparation for encoding, including performing deinteriacing, inverse telecining, filtering identifying shot types, processing and generating metadata, and generating bandwidth information References herein to "one aspect," "an aspect," some aspects." or "certain aspects" mean that one or more of a particular feature, structure, or characteristic described in connection with the aspect can be included in at least one aspect of a preprocessor system. The appearances of such phrases in various places in the specification are not necessarily all referring to the same aspect, nor are separate or alternative aspects mutually exclusive of other aspects. Moreover, various features are described which may be exhibited by some aspects and not by others. Similarly, various steps are described which may be steps for some aspects but not other aspects. f0058| '-Multimedia data" or "multimedia" as used herein is a broad term that includes video data (which can include audio data), audio data, or both video data and audio data "Video data" or "video" as used herein as a broad term, which refers to an image or one or more series or sequences of images containing text, image, and/ or audio data, and can be used to refer to multimedia data or the terms may be used interchangeably, unless otherwise specified.
[0059} Figure I is a block diagram of a communications system 100 for delivering streaming multimedia. Such system finds application in the transmission of digital compressed video to a multiplicity of terminals as shown in Figure 1 A digital video source can be, for example, a digital cable or satellite feed or an analog source that is digitized. The video source is processed in a transmission facility 120 where it is encoded and modulated onto a carrier for transmission through a network 140 to one or more terminals 160. The terminals 160 decode the received video and typically display at least a portion the \ideo The network 140 refers to any type of communication network, wired or wireless, suitable for the transmission, of encoded data For example, the network 140 can be a cell phone network, wired or wireless local area network (LAN) or a wide area network (WAN), or the Internet. The terminals 160 can be any type of communication device capable of receiving and displaying data, including, but not limited to, cell phones. PDA's, in-home or commercial video display equipment,
computers (portable, laptop, handheld, PCs, and larger server-based computer systems), and personal entertainment devices capable of using multimedia data. |0060| Figures 2 and 3 illustrate sample aspects of a preprocessor 202, In Figure 2, preprocessor 202 is in a digital transmission facility 120 A decoder 201 decodes encoded data from a digital video source and provides metadata 204 and video 205 to the preprocessor 202. The preprocessor 202 is configured to perform certain types of processing on the video 205 and the metadata 204 and provide processed metadata 206 (e g., base layer reference frames, enhancement layer reference frames, bandwidth information, content information) and video 207 to an encoder 203. Such preprocessing of multimedia data can improve the visual clarity, anti-aliasing, and compression efficiency of the data. Generally, the preprocessor 202 receives video sequences provided by the decoder 201 and converts the video sequences into progressive video sequences for further processing (e.g., encoding) by an encoder. In some aspects, the preprocessor 202 can be configured for numerous operations, including inverse teiecine, de-interlacing, filtering (e.g , artifact remov al, de-ringing, de-blocking, and de-noising), resizing (e g., spatial resolution down-sampling from standard definition to Quarter Video Graphics Array (QVGA)), and GOP structure generation fe.g , calculating complexity map generation, scene change detection, and fade/flash detection). føθόl j Figure 3 A illustrates a preprocessor 202 that is configured with modules or components (collectively referred to here as "modules") to perform its preprocessing operations on received metadata 204 and video 205, and then prov ide processed metadata 206 and progressive video 207 for further processing (e.g.. to an encoder). The modules can be implemented in hardware, software, firmware, or a combination thereof. The preprocessor 202 can include various modules, including one or more or the modules illustrated, which include inverse teiecine 301, deinterlacer 302, denoiser 303, alias suppressor 304, resampler 305. deblocker/derringer 306, and a GOP partitioner 307, all described further below. The preprocessor 202 can also include other appropriate modules that may be used to process the video and metadata, including memory 308 and a communications module 309. A software module may reside in RAM memory, Hash memory, ROM memory, EPROM memory, EBPROM memory, registers, a hard disk, a remov able disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor.
such that the processor can read information from, and write information to, the storage medium In the alternative, the storage medium ma) be integral to the processor The processor and the storage medium may reside in an ASIC The ASIC may reside in a user terminal In the alternative, the processor and the storage medium may reside as discrete components in a user terminal
|0062J Figure 3B is a flow diagram that illustrates a process 300 for processing of multimedia data Piocess 300 staπs and proceeds to block 320 where interlaced video is recehed Preprocessor 202 illustrated in f igure 2 and figure 3 can perform this step In some aspects, a decoder Ce g , decoder 2Oi Figure 2} can receive the interlaced data and the prox ide it to preprocessor 202 In some aspects, a data receiving module 330 shown in Figure 3 C which is a portion of a preprocessor 202 can pet form this step Process 300 then proceeds to block 322 where interlaced \ ideo is converted to progressive video Preprocessor 202 in Figure 2 and Figure 3 A, and module 332 of Figure 3€ can perform this step If the interlaced video has been telecined, block 322 processing can include performing inverse telecining to generate progressive \ ideo Process 300 then proceeds to block 324 to generate metadata associated with the progressive \ kleo The GOP Parti tioner 307 in figure 3Λ and a module 334 in Figure 3€ can perform such processing Process 300 then proceeds to block 326 where the progressive video and at least a portion of the metadata are provided to an encoder for encoding {e g , compression) Preprocessor 202 shown in Figure 2 and Figure 3 A. and module 336 in figure 3C can perform this step After providing the progressive \ kleo and associated metadata to another component for encoding, process 300 can end fβø63j figure 3C is a block diagram illustrating means for processing multimedia data Shown here such means are incorporated in a preprocessor 202 The preprocessor 202 includes means for receiving video such as module 330 The preprocessor 202 also includes means for converting interlaced data to progressive \ideo such as module 332 Such means can include, foi esample. a spatial-temporal deirucrlacer and/or an inv erse tcleciner The preprocessor 202 also includes means for genera ling metadata associated with the progressive video such as module 334 Such means can include the GOP paxtitioner 307 (Figure 3A) w hich can generate various types of metadata as described herein The preprocessor 202 can also include means for providing the progresske video and metadata to an encoder for encoding as illustrated by module 336 Such
IO
means can include a communications module 309 illustrated in Figure 3 A in some aspects. As will be appreciated by one skilled in the art, such means can be implemented in many standard ways,
J0064J The preprocessor 202 can use obtained metadata (e.g., obtained from the decoder 20i or from another source) for one or more of the preprocessing operations. Metadata can include information relating to. describing, or classifying the content of the multimedia data ("content information") hi particular the metadata can include a content classification. In some aspects, the metadata does not include content information desired for encoding operations In such cases, the preprocessor 202 can be configured to determine content information and use the content information for preprocessing operations and/or provides the content information to other components, e g., the decoder 203 In some aspects, the preprocessor 202 can use such content information to influence GOP partitioning, determine appropriate type of filtering, and/or determine encoding parameters that are communicated to an encoder. [0065J Figure 4 shows an illustrative example of process blocks that can be included in the preprocessor, and illustrates processing that can be performed by the preprocessor 202 In this example, the preprocessor 202 receives metadata and video 204, 205 and provides output data 206, 207 comprising (processed) metadata and video to the encoder 228 Typically, there are three types of video that is received by the preprocessor. First, the received video can be progressive video and deinterlacing does not have to be performed. Second, the video data can be telecined video, interlaced video converted from 24fps movie sequences, in which case the video. Third, the video can be non- telecined interlaced video. Preprocessor 226 can process these types of video as described below.
|0066| At block 401, the preprocessor 202 determines if the received video 204. 205 is progressive video In some cases, this can be determined from the metadata if the metadata contains such information, or by processing of the video itself. For example, an inverse teiechie process, described below, can determine if the received video 205 is progressive video. If it is, the process proceeds to block 407 where filtering operations are performed on the video to reduce noise, such as white Gaussian noise If the video is not progressive video, at block 401 the process proceeds to block 404 to a phase detector.
|0067j Phase detector 604 distinguishes between video thai originated in a telecine and that which began in a standard broadcast format If the decision is made that the video was telecined (the YES decision path exiting phase detector 404), the telecined video is returned to its original format in inverse telecine 406. Redundant fields are identified and eliminated and fields derived from the same video frame are rewoven into a complete image. Since the sequence of reconstructed film images were photographically recorded at regular intervals of 1/24 of a second, the motion estimation process performed in a GOP partitioner 412 or a decoder is more accurate using the inverse telecined images rather than the felecined data, which has an irregular time base,
|0068J In one aspect, the phase detector 404 makes certain decisions after receipt of a video frame. These decisions include (i) whether the present video from a telecine output and the 3.2 pull down phase is one of the five phases Po, P1, P2, P.*, and P4 shown in Figure 5; and (ii) the video was generated as conventional NTSC That decision is denoted as phase Pj These decisions appear as outputs of phase detector 404 shown in Figure 4. The path from phase detector 404 labeled "YES" actuates the inverse telecine 406, indicating that it has been provided with the correct pull down phase so that it can sort out the fields that were formed from the same photographic image and combine them. The path from phase detector 404 labeled "NO" similarly actuates the deinterlacer 405 to separate an apparent NTSC frame into fields for optimal processing Inverse telecine is further described in co-pending U.S. Patent Application [Attorney Docket No QFDM.021A {050943)1 entitled "INVLRSL Tj-u-ciNi- Ai (KiRIiHM BΛSPD ON S'I Λ ΓH MACH INE" which is owned by the assignee hereof and incorporated by reference herein in its entirety
|0069| The phase detector 404 can continuously analyze video frames that because different types of \ideo may be received at any time As an exemplary, video conforming to the NTSC standard may be inserted into the video as a commercial. After inverse telecine, the resulting progressive video is sent to a denoiser (filter) 407 which can be used to reduce white Gaussian noise.
|0070J When conventional NTSC video is recognized (the NO path from phase detector 40 Vk it is transmitted to deinterlacer 405 for compression The deinterlacer
405 transforms the interlaced fields to progressive video, and denoising operations can then be pei formed on the progressive video
|0071| After the appropriate in\erse leiecine or deinterlacing processing, at Mock 4OS the progressive v ideo is processed for alias suppressing and resampling (e g , resizing
|0072j After resampling, the progressive x'ideo then proceeds to block 410 where deb locker and dcringing operations are performed Two types of artifacts, "blocking" and "ringing," commonh occur in v ideo compression applications Blocking artifacts occur because compression algorithms divide each frame into blocks (e g , 8x8 blocks) Each block is reconstructed with some small errors, and the errors at the edges of a block often contrast with the errors at the edges of neighboring blocks, making block boundaries visible In contrast, ringing artifacts appear as disioπions around the edges of image features Ringing artifacts occur because the encoder discards too much information in quantizing the high-frequencv DCT coefficients In some illustrative examples, both deblocking and deringing can use low -pass FIR. (finite impulse response) filters to hide these visible artifacts
{0073] After deblocking and deringing, the progressive video is processed by a GOP parti ti oner 412 GOP positioning can include detecting shot changes, generating complexity maps (e g , temporal, spatial bandwidth maps), and adaptive GOP partitioning Shot detection relates to determining when a frame in a group of pictures ((3OP) exhibits data that indicates a scene change has occurred Scene change detection can be used for a video encoder to determine a proper GOP length and insert I-frames based on the GOP length, instead of inserting an ϊ-frame at a fixed in ten al The preprocessor 202 can also be configured to generate a bandwidth map which can be used for encoding the multimedia data In some aspects, a content classification module located external to the preprocessor generates the bandwidth map instead Adaptive GOP partitioning the can adapiή ely change the composition of a group of pictures coded together Illustrative examples of is the operations shown in Figure 4 are described below
Inverse Telecine
|0074| Inverse telecine processing is described below and an illustrative example of inverse telecine is provided in reference to Figures 4-16. Video compression gives best results when the properties of the source are known and used to select the ideally matching form of processing, Off-the-air video, for example, can originate in several ways Broadcast video that is conventionally generated - in video cameras, broadcast studios etc. - conforms in the United States to the NTSC standard. According to the standard, each frame is made up of two fields. One field consists of the odd lines, the other, the even lines. This may be referred to as an "interlaced" format. While the frames are generated at approximately 30 frames/sec, the fields are records of the television camera s image that are 1/60 sec apart. Film on the other hand is shot at 24 frames/sec, each frame consisting of a complete image. This may be referred to as a "progressive'* format. For transmission in NTSC equipment, '-progressive" video is converted into "interlaced" video format via a telecine process In one aspect, further discussed below, the system advantageously determines when video lias been telecmed and performs an appropriate transform to regenerate the original progressive frames {0075] Figure 4 shows the effect of telecining progressive frames that were converted to interlaced video, Fi, F;, Fjs, and F 4 are progressive images that are the input to a teleciner. The numbers " T" and ""2" below the respective frames are indications of either odd or even fields it is noted that some fields are repeated in view of disparities amongst the frame rates Figure 4 also shows puH-down phases Po, Pf, P;. \\ and P4 The phase PQ is marked by the first of two NTSC compatible frames which have identical first fields. The following four frames correspond to phases Pt, P2. P.*> and P4 Note that the frames marked by P^ and PH have identical second fields Because film frame F1 is scanned three times, two identical successive output NTSC compatible first fields are formed. All NTSC fields derived from film frame Fi are taken from the same film image and therefore are taken at the same instant of time. Other NTSC frames derived from the film may have adjacent fields 1/24 sec apart. |0076| The phase detector 404 illustrated in Figure 4 makes certain decisions after receipt of a video frame These decisions include' (i) whether the present video from a telecine output and the 3 2 pull down phase is one of the five phases Pc Pn P;, P^, and
P4 shown in definition 512 of Figure 5; and (H ) the video was generated as conventional NTSC — that decision is denoted as phase P?.
|0077| These decisions appear as outputs of phase detector 401 shown in Figure 4. The path from phase detector 4Oi labeled "YES" actuates the inverse telecine 406, indicating that it has been provided with the correct pull down phase so that it can sort out the fields that were formed from the same photographic image and combine them. The path from phase detector 401 labeled "NO" similarly actuates the deinterlacer block 405 to separate an apparent NI SC frame into fields for optima! processing f0078| Figure 6 is a flowchart illustrating a process 600 of inverse telecining a video stream, in one aspect the process 600 is performed by the inverse telecine 301 of Figure 3. Starting at a step 651, the inverse telecine 301 determines a plurality of metrics based upon the received video. In this aspect, four metrics are formed which are sums of differences between fields drawn from the same frame or adjacent frames. The four metrics are further assembled into a Euclidian measure of distance between the four metrics derived from the received data and the most likely values of these metrics for each of the six hypothesized phases. The Euclidean sums are called branch information: for each received frame there are six such quantities. Each hypothesized phase has a successor phase which, in the case of the possible pull down phases, changes with each received frame.
[0079 j The possible paths of transitions are shown in Figure 7 and denoted by 767 There are six such paths. The decision process maintains six measures equivalent to the sum of Euclidean distances for each path of hypothesized phases To make the procedure responsive to changed conditions each Euclidean distance in the sum is diminished as it gets older. The phase track whose sum of Euclidean distances is smallest is deemed to be the operative one. The current phase of this track is called the "applicable phase." inverse telecining based on the phase selected, so long as it is not P
5. can now take place. If P* is selected then the current frame is deinterlaced using the deinterlacer at block 405 (Figure 4). In summary, the applicable phase is either utilized as the current pull down phase, or as an indicator to command the de-interlace of a frame that has been estimated to have a valid NTSC format. fOΘSOJ For every frame received from the video input, a new value for each of four metrics is computed These are defined as-
fOOSlf The term SAD is an abbreviation of the term "summed absolute differences " The fields which are differenced to form the metrics are graphically shown in Figure 8 The subscript refers to the field number; the letter denotes either Previous (= P) or Current (:::C). The brackets in Figure 8 refers to the pair-wise differencing of the fields. SADts refers to differences between the field one of the current frame, labeled Ci, and field one of the previous frame, labeled P1, which are spanned by a bracket labeled FS in definition provided in Figure 8, SA(Xs refers to differences between the field two of the current frame, labeled CN, and field two of the previous frame, labeled P;, which are both spanned by a bracket labeled SS, SA Deo refers to differences between field 2 of the current frame labeled C^ and field one of the current frame, labeled Cj, which is spanned by a bracket labeled CO; and SADpo refers to differences between field one of the current frame and field 2 of the previous frame, which are both spanned by a bracket labeled PO, jQ082| The computational load to evaluate each SAD is described below. There are approximately 480 active horizontal lines in conventional NTSC For the resolution to be the same in the horizontal direction, with a 4 3 aspect ratio, there should be 4SO x 4/3 ~ 640 equivalent vertical lines, or degrees of freedom. The video format of 640x480 pixels is one of the formats accepted by the Adv anced Television Standards Committee. Thus, every 1/30 of a second, the duration of a frame, 640 x 480 ~ 307.200 new pixels are generated. New data is generated at a rate of 9.2 x 10fl pixels/sec, implying that the hardware or software running this system processes data at approximately a 10 MB rate or more. This is one of the high speed portions of the system. It can be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. The SAD calculator cυuld be a standalone component, incorporated as hardware, firmware, middleware in a component of another device, or be implemented in microcode or software that is executed on the processor, or a combination thereof. When
implemented in software, firmware, middleware or microcode, the program code or code segments that perform the calculation may be stored in a machine readable medium such as a storage medium Λ code segment may represent a pioceduie. a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements A code segment ma\ be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents fθOSSJ Flowchart rX)0 in Figure r> makes explicit the relationships in Figure 8 and is a graphical representation of Equations 1 -4 It show s storage locations 94 1. 942. 943. and 944 into which arc kept the most recent values of SADKv S A Den, SADSs and S ADR1 respectiv ely These are eacSi generated by four sum of absolute differences calculators 940, which process the luminance values of previous first field data 931 , luminance values of current fust field data 932, luminance values of current second field data 933 and luminance values of the previous second field data 934 In the summations that define the metrics, the term "\ a1ue(i,jV" is meant to be the value of the luminance at position i,j, the summation being o\er all active pixels, though summing over the a meaningful subset of active pixels is not excluded
|0ΘS4j Howchart 100 in Figure 10 is a detailed flowchart illustrating the process for detecting telecined \ ideo and inverting it to recover to the original scanned film image In step 1030 the metrics defined in Figure 9 are evaluated Continuing to step 1083, lower en\ elope values of the four metrics are found A low er env elope of a SAD metric is a dynamically determined quantity that is the highest numerical floor below which the SAD does not penetrate Continuing to step 1085 branch information quantities defined belovi in Equations 5-10 are determined, which can use the previously determined metrics, the lower en\ elope \ alues and an experimentally determined constant Λ Since the successive values of the phase may be inconsistent, a quantity Λ is determined to reduce this apparent instability in step 1087 The phase is deemed consistent when the sequence of phase decisions is consistent with the model of the problem shown in Figure 7 Following that step, the process proceeds to step iO8V to calculate the decision variables using the current value of Λ Decision \ ariables calculator 10S9 e\aluates decision variables using all the information generated in the blocks of 1080
that led to it Steps 1030, 1083, 1085, 1087. and 1080 are an expansion of metrics determination 65 ! in Figure 6. From these variables, the applicable phase is found by phase selector 1090 Decision step 1091 uses the applicable phase to either invert the telecined video or deinteriace it as shown It is a more explicit statement of the operation of phase detector 404 in Figure 4, In one aspect the processing of Figure 10 is performed by the phase detector 404 of Figure 4. Starting at step 1030, detector 404 determines a plurality of metrics by the process described above with reference to figure 8, and continues through steps 1083. 1085, 1087, 1089, jϋ90, and 1091 . fθOSSJ Flowchart 1000 illustrates a process for estimating the current phase. The flowchart at a step 1083 describes the use of the determined metrics and lower envelope values to compute branch information. The branch information may be recognized as the Euclidean distances discussed earlier Exemplary equations that may be used to generate the branch information are Equations 5-10 below. The Branch Info quantities are computed in block S 209 of Figure 12
[00S6J The processed video data can be stored in a storage medium which can include, for example, a chip configured storage medium (e g., ROM RAM) or a disctype storage medium (e.g., magnetic or optical) connected to processor In some aspects, the inverse telecine 406 and the deinteriacer 405 can each contain part or all of the storage medium The branch information quantities are defined by the following equations.
Branch InfcX4» (SΛϋ
Fs - HO
2 * (SADs. - Hs)
2 ' (SAD
PO - U?)
2 < (SΛDco - Lc)
2 (^)
Branch lnfo(5) = (SΛDre - Ls)~ ÷ (SAD8S - L8)2 ÷ (SADro - LP)" + (SADco - Lc)2 (10>
(0087] The fine detail of the branch computation is shown in bianch information calculator 1209 in Figure 12 As shown in calculator 1209 developing the branch information uses the quantities Ls, the lower envelope value of SAD^ and SAD^s, Lr, the lower envelope \alue of SADpo, and I <., the Sower envelope value of SADco E he lower env elopes a?e used as distance offsets in the branch information calculations, either alone or in conjunction with a predetermined constant A to create 11S, Hp and I L., Their values are kept up to date in lower envelope trackers described below The H offsets are defined to be
Hs = L8 - A ( U)
lha Lp * Λ (12)
H, - L0 - A ( 13)
(0088] Λ process of tracking the \ alues of Ls, Lp, and ϊ< is presented in figures 13 A. BB. and BC Consider, for example, the tracking algorithm for Lp 1300 shown at the top of Figure 1 1 \ The metric SADp0 is compaied with the current value of Lp plus a threshold IV in comparator 1305 If it exceeds it, the current \aiue of Lp is unchanged as shown in block 131 5 If it does not. the new value of Lp becomes a linear combination of SADpo and I ? as seen in block 1313 In another aspect for block 13 15 the new value of Lp is Lp ^ Tf
|0089j The quantities Ls and I <. in Figures 13B and 13C are similarly computed Processing blocks in Figures 13A, 13B, and 13C which have the same function are numbered identically but given primes {' or ") to show that thev opeiate on a different set of variables For example, when a linear combination of the SADμo and Lc are
formed, that operation is shown in block 1313". As is the case for Lp. another aspect for 1315' would replace Le by Lc 4- TV
|009O| In the case of Ls, however, the algorithm in Figure 13B processes SADrx and SADss alternately, in turn labeling each X. since this lower envelope applies to both variables. The alternation of SAD1-S and SADss values takes place when the current value of S ADj K in block 1308 is read into the location for X in block 1303. followed by the current value of S AD^ in 1307 being read into the location for X in block 1302 As is the case for Lp, another aspect for 131 5" would replace Ls by Ls -*- Ts. The quantity A and the threshold values used in testing the current lower envelope values are predetermined by experiment. fflø91J Figure 1 1 is a flowchart illustrating an exemplary process for performing step 1080 of Figure 10 Figure 1 1 generally shows a process for updating the decision variables. There the six decision variables (corresponding to the six possible decisions) are updated with new information derived from the metrics The decision variables are found as follows:
|0092J The quantity α is less than unity and limits the dependence of the decision variables on their past values; use of α is equivalent to diminishing the effect of each Euclidean distance as its data ages. In flowchart 1 162 the decision variables to be updated are listed on the left as available on lines 1 101, 1 102, 1 103, 1 104, 1105, and
! 106. Each of the decision variables on one of the phase transition paths is then multiplied by α, a number less than one in one of the blocks 1 100, then the attenuated value of the old decision variable is added to the current value of the branch info variable indexed by the next phase on the phase transition path that the attenuated decision variable was on. This takes place in block 1 1 10, Variable D? is offset by a quantity Λ in block 1 103: Λ is computed in block 1 1 12 As described below, the quantity is chosen to reduce an inconsistency in the sequence of phases determined by this system. The smallest decision variable is found in block 1 120. |0093| In summary, new information specific to each decision is added Io the appropriate decision variable's previous value that lias been multiplied by α, to get the current decision variable's value. A new decision can be made when new metrics are in hand; therefore this technique is capable of making a new decision upon receipt of fields 1 and 2 of every frame These decision variables are the sums of Euclidean distances referred to earlier.
[0094} 'The applicable phase is selected to be the one having the subscript of the smallest decision variable. A decision based on the decision variables is made explicitly in block 1090 of Figure 10 Certain decisions are allowed in decision space. As described in block 1091, these decisions are. (i) The applicable phase in not P*, - inverse telecine the video and (U) the applicable phase is P? — deinterlace the video [0095} There may be occasional errors in a coherent string of decisions, because the metrics are drawn from video, which is inherently variable. This technique detects phase sequences that are inconsistent with Figure 7 Its operation is outlined in Figure 14 The algorithm 1400 stores the subscript of the present phase decision (~ x) in block 1405 and the subscript of the previous phase decision (::yϊ in block 1406 In block 1410, if \=zy— 5 is tested, in block 141 1 the following values are tested: if
If either of the two tests is affirmative, the decisions are declared to be consistent in block 1420. If neither test is affirmative, an offset, shown in block 1 193 of Figure 1 1 is computed in Figure 15 and added to D
5, the decision variable associated with P
5. J0096] Hie modification to EK also appears in Figure 1 5 as part of process 1500, which provides corrective action to inconsistencies in a sequence of phases. Suppose the consistency test in block 1 510 in flowchart 1500 has failed. Proceeding along the "No" branch that leads from block 1510, the next test in block 1514 is whether D
5 > D
1 for all i < 5, or alternatively is at least one of the variables, D,, for i < 5, bigger than D? If the first case is valid, a parameter 0, whose initial value is δo, is changed to 36« in block 1516. If the second case is valid, then δ is changed to 4δ<
> in block 1517 In block 152B
5 the value of A is updated to be Λ
B, where
|0097| Returning again to block 15210, assume that the string of decisions is judged to be consistent. The parameter δ is changed to δ - in block 15215, defined by
|0098| 'The new value of δ is inserted into ΛΛ, the updating relationship for A in block 152A. This is
Then the updated value of A is added to decision variable D5 in block 1593.
|00991 Figure 16 shows how the inverse telecine process proceeds once the pull down phase is determined With this information, fields 1605 and 1605" are identified as representing the same field of video. The two fields are averaged together, and combined with field 1606 to reconstruct frame 1620. The reconstructed frame is 162O1.
A simϋar process would reconstruct frame 1622. Fields derived from frames 1621 and
1623 are not duplicated. These frames are reconstructed by reweaving their first and second fields together.
[010Oj hi the aspect described above, every time a new frame is received four new values of metrics are found and a six fold set of hypotheses is tested using newly
computed decision variables Other processing structures could be adapted to compute the decision variables. A Viterbi decoder adds the metrics of the brandies that make up the paths together to form the path metric. The decision variables defined here are formed by a similar rule: each is the "leaky"' sum of new information variables. (In a leak}- summation the previous value of a decision variable is multiplied by a number less than unity before new information data is added to it,) A Viterbi decoder structure could be modified to support the operation of this procedure.
JOIOIJ While the present aspect is described in terms of processing conventional video in which a new frame appears every 1/30 second, it is noted that this process may be applied to frames which are recorded and processed backwards in time. Hie decision space remains the same, but there are minor changes that reflect the time reversal of the sequence of input frames. For example, a string of coherent telecine decisions from the time-reversed mode (shown here) P4 Pi P; P3 P0
would also be reversed in time.
|0IO21 Using this variation on the first aspect would allows the decision process two tries ~ one going forward in time, the other backward ~ at making a successful decision.
While the two tries are not independent, they are different in that each try would process the metrics in a different order.
|0103| This idea could be applied in conjunction with a buffer maintained to store future video frames that may require additional. If a video segment is found to give uiiacceptably inconsistent results in the forward direction of processing, the procedure would draw future frames from the buffer and attempt to get over the difficult stretch of video by processing frames in the reverse direction.
|0104] Hie processing of video described in this patent can also be applied to video in die PAL format.
Deinterlacer
|0IO5J "Deinteriacer" as used herein is a broad term that can be used to describe a deinterlacing system, dev ice, or process (including for example, software, firmware, or
hardware configured to perform a process) that processes, in whole or in significant part, interlaced multimedia data to form progressive multimedia data, |0.J06| Broadcast video that Is conventionally generated - in video cameras, broadcast studios etc. - conforms in the United States to the N TSC standard. A common way to compress video is to interlace it. In interlaced data each frame is made up of one of two fields. One field consists of the odd lines of the frame, the other, the even lines. While the frames are generated at approximately 30 frames/sec, the fields are records of the television camera's image that are 1/60 sec apart. Each frame of an interlaced video signal shows even- other horizontal line of the image. As the frames are projected on the screen, the video signal alternates between showing even and odd lines. When this is done fast enough, e.g.. around 60 frames per second, the video image looks smooth to the human eye f0107| Interlacing has been used for decades in analog television broadcasts that are based on the NTSC (U.S.) and PAl. (Europe) formats Because only half the image is sent with each frame, interlaced video uses roughly half the bandwidth than it would sending the entire picture. The eventual display format of the video internal to the terminals 16 is not necessarily NTSC compatible and cannot readily display interlaced data Instead, modem pixel-based displays (e.g., LCD, DLP, LCOS, plasma, etc ) are progressive scan and display progressively scanned video sources (whereas many older video devices use the older interlaced scan technology). Examples of some commonly used deinterlacing algorithms are described in "Scan rate up-conversion using adaptive weighted median filtering,1' P. Haavisto, J. Juhola, and Y, Neuvo, Signal Processing of HDT!' H, pp. 703-710, 1990, and "Deinterlacing of HDTV images for Multimedia Applications," R. Sirøonetti, S. Carrato, G. Ramponi, and A. Polo FΪHsan, in Signal Processing of HDlT I T1 pp 765-772, 1993. føϊøS] Described below are examples of deinterlacing aspects for systems and methods that that can be used, solely or in combination, to improve the performance of deinterlacing and which can be used in the deinterlacer 405 (Figure 4). Such aspects can include deinterlacing a selected frame using spatio-temporal filtering to determine a first provisional deinteriaced frame, using bi-directional motion estimation and motion compensation to determine a second provisional deinteriaced frame from the selected frame, and then combining the first and second provisional frames to form a final
progressive frame. The spatio-temporal filtering can use a weighted median filter ("Wined") filter that can include a horizontal edge detector that prevents blurring horizontal or near horizontal edges. Spatio-temporal filtering of previous and subsequent neighboring fields to a "current" field produces an intensity motion-level map that categorizes portions of a selected frame into different motion levels, for example, static, slow-motion, and fast motion.
(0109] ϊn some aspects, the intensity map is produced by Wmed filtering using a filtering aperture that includes pixels from five neighboring fields (two previous fields, the current field, and two next fields). The Wmed filtering can determine forward, backward, and bidirectional static area detection which can effectively handle scene changes and objects appearing and disappearing In various aspects, a Wmed filter can be utilized across one or more fields of the same parity in an inter-field filtering mode, and switched to an intra-field filtering mode by tweaking threshold criteria. In some aspects, motion estimation and compensation uses iuroa (intensity or brightness of the pixels) and chroma data (color information of the pixels) to improve deinterlacing regions of the selected frame where the brightness level is almost uniform but the color differs A denoising filter can be used to increase the accuracy of motion estimation. The denoising filter can be applied to Wmed deinterlaced provisional frames to remove alias artifacts generated by Wmed filtering. The deinterlacing methods and systems described below produce good deinterlacing results and have a relatively low computational complexity that allow fast running deinterlacing implementations, making such implementations suitable for a wide variety of deinterlacing applications, including systems that are used to provide data to cell phones, computers and other types of electronic or communication devices utilizing a display
|0J 10| The aspects of a deinterϊacer and deinterlacing methods are described herein with reference to various components, modules and/or steps that are used to deinterlace multimedia data
[Oil I j Figure 1 ? is a block diagram illustrating one aspect of a deinterlacer 1700 that can be used as the deinterlacer 405 in Figure 4. The deinterlacer 1722 includes a spatial filter 1730 that spatially and temporally ("spatio-temporal") filters at least a portion of the interlaced data and generates spatio-temporal information For example, Wmed can be used in the spatial filter 1730. In some aspects the deinterlacer 1700 also
includes a denoting filter (not shown), for example, a Wei tier filter or a wavelet shrinkage filler The deinterlacer 1700 also includes a motion estimator 1732 which provides motion estimates and compensation of a selected frame of interlaced data and generates motion information A combiner 1734 receives and combines the spatio- temporal information and the motion information to form a progressive frame |0i 12| Figure 18 is another block diagram of the deinterlacer 1700. A processor 1836 in the deinterlacer 1700 includes a spatial filter module 1838, a motion estimation module 1840, and a combiner module 18 42 interlaced multimedia data from an external source 48 can be provided to a communications module 44 in the deinterlacer 1700. The deinterlacer, and components or steps thereof, can be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof For example, a deinterlacer may be a standalone component, incorporated as hardware, firmware, middleware in a component of another device, or be implemented in microcode or software that is executed on the processor, or a combination thereof When implemented in software, firmware, middleware or microcode, the program code or code segments that perform the deinterlacer tasks may be stored in a machine readable medium such as a storage medium. A code segment may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents.
[0113} The received interlaced data can be stored in the deinterlacer 1700 in a storage medium 1846 which can include, for example, a chip configured storage medium (e.g., ROM, RAM) or a disc-type storage medium (e g., magnetic or optical) connected to the processor 1836. In some aspects, the processor 1836 can contain part or all of the storage medium. The processor 1836 is configured to process the interlaced multimedia data to form progressive frames which are then provided to another device or process.
|0I I4J Traditional analog video devices like televisions render video in an interlaced manner, i.e., such devices transmit even-numbered scan lines (even field), and odd-
numbered scan lines (odd field) From the signal sampling point of \ iew this is equivalent to a spatio-temporal subsampling in a pattern described by
where Θ stands for the original frame picture. I stands for the interlaced field, and U,bt',«) represents the horizontal, vertical, and temporal position of a pixel respectively, [OU 5] Without loss of generality, it can be assumed n ~ 0 is an even field throughout this disclosure so that Equation 2 J abo\e is simplified as
[0116} Since decimation is not conducted in the horizontal dimension, the sub- saraphng pattern can be depicted in the next n -y coordinate Jn Figure fQ, both circles and stars represent positions where the original full -frame piciuie has a sample pixe! The interlacing process decimates the star pixels, while leaving the circle pixels intact It should be noted that we index vertical positions starting from /era therefore the even Held is the top field, and the odd field is the bottom Held
|01 l7| The goal of a deinteilacei is to transform interlaced video (a sequence of fields) into non-interlaced progressive frames (a sequence of frames) In other words, interpolate cv en and odd fields to "iccover'" or generate full-frame pictures This can be represented by tquation 25
where F represent deinterlacmg results for missing pixels
|βπ&j Hgure 20 is a Mock diagram illustrating certain aspects of an aspect of a deinteriacer that uses Wmed filtering and motion estimation to generate a progressive frame from interlaced multimedia data The upper part of Figure 20 shows a motion intensity map 2052 that can be generated using information from a current field, two previous fields (PP Field and P Held), and two subsequent fields {Next Field and Next Next field) The motion intensity map 2052 categorizes, or partitions, the current frame into two or more different motion levels, and can be generated by spatio-temporal filtering, described in further detail hereinbelow In some aspects, the motion intensm map 2052 is generated to identify static areas, slow-motion areas, and fast-motion areas, as described in reference to Kquations 4-8 below A spatio-temporal filter, e g , W med filter 2054, filters the interlaced multimedia data using criteria based on the motion intensity map, and produces a spatio-temporal provisional dehiteriaced frame In some aspects, the Wmed filtering process involves a horizontal a neighborhood of [-1 , 1], a vertical neighborhood of [-3, 3J, and a temporal neighborhood of fh e adjacent fields, which are represented by the five fields (PP Field, P Field. Current Field. Next Field. Next Next Field) illustrated in Figure 20, with Z ' representing a delay of one field Relative to the Current Field, the Next Held and the P Field are non~parit> fields and the PP Field and the Next Next Field are parity fields The "neighborhood" used for spatio-temporal filtering refers to the spatial and temporal location of fields and pixels actually used during the filtering operation, and can be illustrated as an "aperture" as shown, for example, in Figures 2 ! and 22
|0119| f hε deinteriacer can also include a denoiser (denoising filter) 2056 The denoiser 2056 is configured to filter the spatio-temporal provisional deinterlaced frame generated b\ the W med filter 2056 Denoising the spatio-temporal provisional deinterlaced frame makes the subsequent motion search process more accurate especially if the source interlaced multimedia data sequence is contaminated by white noise it can also at least parti j remove alias between even and odd rows in a Wmed picture The denoiser 2056 can be implemented as a \ ariety of filters including a wavelet shrinkage and wavelet Wiener filter based denoiser which arc also described further hereinbelow
|0ϊ 20] llie bottom part of Figure 20 illustrates an aspect for determining motion information (e g , motion vector candidates, motion estimation, motion compensation)
of interlaced multimedia data. In particular. Figure 20 illustrates a motion estimation and motion compensation scheme that is used to generate a motion compensated provisional progressive frame of the selected frame, and then combined with the Wmed provisional frame to form a resulting "final" progressive frame, shown as de-interlaced current frame 2064 In some aspects, motion vector CMV") candidates (or estimates) of the interlaced multimedia data are provided to the deinterlacer from external motion estimators and used to provide a starting point for bi-directional motion estimator and compensator ("ME/MC") 2068 in some aspects, a MV candidate selector 2072 uses previously determined MVs for neighboring blocks for MV candidates of the blocks being processed, such as the MVs of previous processed blocks, for example blocks in a deinteriaced previous frame 2070. The motion compensation can be done bi-directional, based on the previous deinteriaced frame 70 and a next (e g., future) Wmed frame 2058. A current Wmed frame 2060 and a motion compensated ("MC") current frame 2066 are merged, or combined, by a combiner 2062 A resulting deinteriaced current frame 2064, now a progressive frame, is provided back to the ME/MO 2068 to be used as a deinteriaced previous frame 2070 and also communicated externa! to the deinterlacer for further processing, e.g., compression and transmission to a display terminal 'The various aspects shown in Figure 20 are described in more detail below. |0I2I I Figure 25 illustrates a process 2500 for processing multimedia data to produce a sequence of progressive frames from a sequence of interlaced frames. In one aspect, a progressive frame is produced by the deinterlacer 405 illustrated in Figure 4. At block 2502, process 2500 (process "'A") generates spatio-temporal information for a selected frame Spatio-temporal information can include information used to categorize the motion levels of the multimedia data and generate a motion intensity map. and includes the Wmed provisional deinteriaced frame and information used to generate the frame (e.g , information used in Equations 26-33). This process can be performed by the Wmed filter 2054, as illustrated in the upper portion of Figure 20, and its associated processing, which is described in further detail below. In process A. illustrated in Figure 26, regions are classified into fields of different motion levels at block 2602, as further described below fθl22| Next, at block 2504 (process "B), process 2500 generates motion compensation information for a selected frame. In one aspect, the bi-directional motion
estimator/motion compensator 2068, illustrated in the lower portion of Figure 20, can perform this process. The process 2500 then proceeds to block 2506 where it deinterlaces fields of the selected frame based on the spatio-temporal information and the motion compensation information to form a progressive frame associated with the selected frame This can he performed by the combiner 2062 illustrated in the lower portion of Figure 20.
Motion Intensity Map fO123| For each frame, a motion intensity 2052 map can be determined by processing pixels in a current field to determine areas of different "motion." An illustrative aspect of determining a three category motion intensity map h described below with reference to Figures 2 ! -24 The motion intensity map designates areas of each frame as static areas, slow-motion areas, and fast motion areas based on comparing pixels in same-parity fields and different parity fields
Static Areas
JOl 24] Determining static areas of the motion map can comprise processing pixels in a neighborhood of adjacent fields to determine if luminance differences of certain pixel (s) meet certain criteria. In some aspects, determining static areas of the motion map comprises processing pixels in a neighborhood of five adjacent fields (a Current Field (C), two fields temporally before the current field, and two frames temporally after the Current Field) to determine if luminance differences of certain pixel(s) meet certain thresholds. These five fields are illustrated in Figure 20 with Z ! representing a delay of one field. In other words, the fι\e adjacent would typically be displayed in such a sequence with a Z ! time delay.
(0125] Figure 21 illustrates an aperture identifying certain pixels of each of the five fields that can be used for the spatio-temporal filtering, according to some aspects. The aperture includes, from left to right. 3x3 pixel groups of a Previous Previous Field (PP)5 a Previous Field (P), the Current Field (C), a Next Field (N), and a Next Next Field (NN). In some aspects, an area of the Current Field is considered static in the motion map if it meets the criteria described in the Equations 26-28, the pixel locations and corresponding fields being illustrated in Figure 21 ■
/ ' is the Luminance of a pixel P located in the P Held,
/ s is the luminance of a pixel N located in the N Field,
/ » is the Luminance of a pixel B located in the Current Field.
L, is the Luminance of a pixel E located in the Current Field,
/ ,jpp is the Luminance of a pixel Bpj located in the PP Field,
/ 1,>- is the Luminance of a pixel Ep] ■ located in the PI* Field,
/ \'\\ is the luminance of a pixel Bw located in the NN Field, and
/, w is the Luminance of a pixel Kw located in the NK Field
|OI 26| Threshold I1 can he predetermined and set at a particular value, determined by a process other than deiπterfacing and prcnided (for example, as metadata for the video being deinterlaccd) or it can be dynamically determined during deinferlacing |0127| The static area criteria described above in Equation 26, 27. and 28 use more fields than conventional deinteriacing techniques for at least two reasons hirst, comparison between same-parity fields has lower alias and phase-mismatch than comparison between different-parity fields Howev er, the least time difference (hence correlation) between the field being processed and its most adjacent same-paiitv field neighbors is two fields, larger than that from its different-parit> field neighbors Λ combination of more reliable different-parity fields and lower-alias same-parity fields can impro\e the accuracy of the static area detection
{01281 to addition, the five fields can be distributed s> mrøetrically in the past and in the future relative to a pixel X in the Current Frame C, as shown in Figure 21 T he static area can he sub-dh ided into three categories forward static (static relative to the
previous frame), backward static (static relative to the next frame), or bi-directional (if both the forward and the backward criteria are satisfied) This finer categorization of the static areas can Improve performance especially at scene changes and object appear! ng/d i sappearing
Slow-Motion Areas
|0J 29J An area of the motion-map can be considered a slow -motions area in the motion-map if the luminance values of certain pixels do not meet the criteria to be designated a static area but meet criteria to be designated a slow-motion area. Equation 29 below defines criteria that can be used to determine a slow- motion area. Referring to Figure 22, the locations of pixels Ia. Ic, Ja, Jc, Ka, Kc. La
5 Lx, P and N identified in Equation 29 are shown in an aperture centered around pixel .V. The aperture includes a 3x7 pixel neighborhood of the Current Field (C) and 3x5 neighborhoods of the Next Field (N) a Previous Field (P) Pixel ,V is considered to be part of a slow-motion area if it does not meet the above-listed criteria for a static area and if pixels in the aperture meet the following criteria shown in Equation 29-
where T: is a threshold, and
re luminance values for pixels
Ia, Ic, Ja, Jc. Ka, Kc, La, Lc, P and N, respectively.
{01301 The threshold 72 can also be predetermined and set at a particular value, determined by a process other than deiπterlacing and provided (for example, as metadata for the video being deinteriaced) or it can be dynamically determined during deinterlacing.
J0131] It should be noted that a filter can blur edges that are horizontal (e.g., more than 45° from vertically aligned) because of the angle of its edge detection capability. For example, the edge detection capability of the aperture {filter} illustrated in Figure 22 is affected by the angle formed by pixel "A" and "F", or "C" and "D". Any edges more horizontal than such an angle that will not be interpolated optimally and hence staircase artifacts may appear at those edges In some aspects, the slow-motion category can be divided into two sub-categories, "Horizontal Edge" and "otherwise" to account for this
edge detection effect. The slow-motion pixel can be categorized as a Horizontal fudge if the criteria in Equation 30, shown below, is satisfied, and to a so-called "Otherwise" category if the criteria in Equation 30 is not satisfied
|(/ A + ΛB •+■ AC) - (I D + LE + LF)I <■ 1 \ (30)
where 7', is a threshold, and LA, AB, AC LD, /.E.. and AF are the luminance values of pixels A5 B. C, D. Xi. and F
|0132| Different interpolation methods can used for each of the Horizontal Edge and the Otherwise category.
Fast-Motion Areas
|0133| If the criteria for a static area and the criteria for the slow-motion area are not met, the pixel can be deemed to be in a fast-motion area jG134| Having categorized the pixels in a selected frame, process A (Figure 26) then proceeds to block 2604 and generates a provisional deinterlaced frame based upon the motion intensity map. In this aspect, Wmed filter 2054 (Figure 20) filters the selected field and the necessary adjacent fieids(s) to provide a candidate full-frame image F(, which can be defined as follows.
where a, (t = 0, 1, 2, 3) are integer weights calculated as below.
The Wmed filtered provisional deinferlaced frame is provided for further processing in conjunction with motion estimation and motion compensation processing, as illustrated in the lower portion of Figure 20
|0135| As described above and shown in Equation 31 , the static interpolation comprises inter-field interpolation and the slow-motion and fast-motion interpolation comprises intra-field interpolation In certain aspects where temporal (e.g., inter-field) interpolation of same parity fields is not desired, temporal interpolation can be "disabled" by setting the threshold f? (Equations 4-6) to zero (7, ~ 0) . Processing of the current field with temporal interpolation disabled results in categorizing no areas of the motion-level map as static, and the Wmed filter 2054 (Figure 20) uses the three fields illustrated in the aperture in Figure 22 which operate on a current field and the two adjacent non-parity fields
Den oising
|0I36| In certain aspects, a denoiser can be used to remove noise from the candidate Wmed frame before it is further processed using motion compensation information. A denoiser can remove noise that is present in the Wmed frame and retain the signal present regardless of the signal's frequency content. Various types of de noising filters can be used, including wavelet filters. Wavelets are a class of functions used to localize a given signal in both space and scaling domains Hie fundamental idea behind wavelets is to analyze the signal at different scales or resolutions such that small changes in the wavelet representation produce a correspondingly small change in the original signal.
|0137| In some aspects, a deπoising filter is based on an aspect of a (4, 2) bi- orthogona! cubic B-spline wavelet filter One such filter can be defined by the following forward and inverse transforms.
|0138J Application of a denoisiπg filler can increase the accuracy of motion compensation in a noisy environment Noise in the video sequence is assumed Io be additive white Gaussian. The estimated variance of the noise is denoted by σ It can be estimated as the median absolute deviation of the highest-frequency subband coefficients divided by O 6745 Implementations of such filters are described further in "Ideal spatial adaptation by wavelet shrinkage," D, L. Donoho and LM Johnstone, B/omelrika, vol 8, pp 425-455. 1994. which is incorporated by reference herein in its entirety.
|0J39J A wavelet shrinkage or a wavelet Wiener filter can be also be applied as the denoiser. Wavelet shrinkage denoising can involve shrinking in the wavelet transform domain, and typically comprises three steps, a linear forward wavelet transform, a nonlinear shrinkage denoising, and a linear inverse wavelet transform. The Wiener filter is a MSE-optima! linear filter which can be used to improve images degraded by additive noise and blurring. Such filters are generally known in the art and are described, for example, in "Ideal spatial adaptation by wavelet shrinkage," referenced above, and by S P Ghael, A. M. Sayeed. and R G Baraπiuk, "Improvement Wavelet denoising via empirical Wiener filtering," Proceedings of SPfE, \ol 3 16<-λ pp 389-399, San Diego, July 1997
Motion Compensation
|0140j Referring to Figure 27, at block 2702 process B performs bi-directional motion estimation, and then at block 104 uses motion estimates to perform motion compensation, which is illustrated further illustrated in Figure 20, and described in an illustrative aspect hereinbelow There is a one field "lag" between the Wrøed filter and the motion-compensation based deiπterlacer Motion compensation information for the "missing" data (the non-original rows of pixel data) of the Current Field "C" is being predicted from information in both the previous frame "P" and the next frame "N" as shown in Figure 23 In the Current Field (Figure 23) solid lines represent rows where
original pixel data exist and dashed lines represent rows where Wmed-interpoiated pixel data exist In cenain aspects, motion compensation is perfoimed in a 4-iovv by 8- colurnn pixel neighboihood Howe\er, this, pixel neighborhood is an example for purposes of explanation, and it will be apparent to those skilled in the art that motion compensation ma\ be performed in other aspects based on a pixel neighborhood comprising a different number rows and a different number of columns, the choice of which can be based on many factors including, for example, computational speed, axaiiable processing power, or characteristics of the multimedia data being deinterlaced Because the Current Field only has half of the rows, the four rows to be matched actually correspond to an 8-pixel by 8-pkeS area
|0141J Referring to Figure 20. the bi-directional MFAIC 2068 can use sum of squared errors (SSR) can be used to measure the similarity between a predicting block and a predicted block for the Wmcd current frame 2060 relative to the Wmed next frame 2058 and the deinterlaced current frame 2070 The generation of the motion compensated current frame 2066 then uses pixel information from the most similar matching blocks to fill in the missing data between the original pixel lines In some aspects, the bi-directional MIi MC 2068 biases or gives more weight to the pixel information from the deinteriaced pievious frame 2070 information because it uas generated by motion compensation information and W med information, while the Wmed next frame 2058 is only deinterlaced b\ spatio-temporal filtering |0J42| In some aspects, to improve matching performance in regions of fields that hax'e similar-iυma regions but different-chroma regions, a metric can be used that includes the contribution of pixel values of one or more luma group of pixels (e g , one 4 -row b\ 8-column luma block) and one or more chroma group of pixels (e g , two 2- row b\ 4-column chroma blocks U and V) Such approaches effectively reduces mismatch at color sensitive regions
|0143j Motion Vectors (MVs) have granularity of ''; pixels in the vertical dimension, and either ' ; or 1A pixels in the horizontal dimension To obtain fractional- pixel samples, interpolation filters can be used For example, some filters that can be used to obtain half-pixel samples include a bilinear filtei (1, 1 ), an inteipolatioπ filter recommended by H 263 'AVC (L -5, 20, 20, -5. I), and a six-tap Hamming windowed
sine function filter (3, -21 , 147, 147. -2i, 3). Vpixel samples can be generated from full and half pixel sample by applying a bilinear filter
|0J44| In some aspects, motion compensation can use various types of searching processes to match data {e.g , depicting an object) at a certain location of a current frame to corresponding data at a different location in another frame (e.g , a next frame or a previous frame), the difference in location within the respective frames indicating the object's motion. For example, the searching processes use a full motion search which may cover a larger search area or a fast motion search which can use fewer pixels, and/or the selected pixels used in the search pattern can have a particular shape, e.g., a diamond shape. For fast motion searches, the search areas can be centered around motion estimates, or motion candidates, which can be used as a starting point for searching the adjacent frames. In some aspects. MV candidates can be generated from external motion estimators and provided to the deinterlacer. Motion vectors of a macroblock from a corresponding neighborhood in a previously motion compensated adjacent frame can also he used as a motion estimate. In some aspects, MV candidates can be generated from searching a neighborhood of macrobiocks (e.g., a 3-roacroblock by 3-niacroblock) of the corresponding previous and next frames |0145| Figure 24 illustrates an example of two MV maps, MVp and MV\, that could be generated during motion estimation/compensation by searching a neighborhood of the previous frame and the next frame, as show in Figure 23. In both MVp and MVN the block to be processed to determine motion information is the center block denoted by "X." In both MVp and JV1V\, there are nine MV candidates that can be used during motion estimation of the current block X being processed. In this example, four of the MV candidates exist in the same field from earlier performed motion searches and are depicted by the lighter-colored blocks in MVp and MV\ (Figure 24). Five other MV candidates, depicted by the darker-colored blocks, were copied from the motion information (or maps) of the previously processed frame. |0146j After motion estimation/compensation is completed, two interpolation results may result for the missing rows (denoted by the dashed lines in Figure 23)' one interpolation result generated by the Wmed filter (Wmed Current Frame 2060 Figure 20) and one interpolation result generated by motion estimation processing of the motion compensator (MC Current Frame 2066). A combiner 2062 typically merges the
Wmed Current Frame 2060 and the MC Current Frame 2066 by using at least a portion of the Wined Current Frame 2060 and the MC Current Frame 2066 to generate a Current Deinferlaced Frame 2064. However, under certain conditions, the combiner 2062 may generate a Current Deinterlaced frame using only one of the Current Frame 2060 or the MC Current Frame 2066 In one example, the combiner 2062 merges the Wmed Current Frame 2060 and the MC Current Frame 2066 to generate a deinterlaced output signal as shown in Equation 36;
where h {JJI) is used for the luminance value in field //
/ at position x (x, v/ with ' for transpose. Using a clip function defined as
where Cj is a robustness parameter, and Dijf is the luraa difference between the predicting frame pixel and the available pixel in the predicted frame (taken from the existing field). By appropriately choosing ( 1 , it is possible to tune the relative importance of the mean square error k: can be calculated as shown in Equation 39;
where x ~ (.\*, v), ,vκ - (0,1), D is the motion vector, e> is a small constant to prevent division by zero. Deinteriacing using clipping functions for filtering is further described in "De-interlacing of video data," G. D. Haan and E B. Bellers, HlFJi Transactions on Consumer Electronics, Vol. 43, No. 3, pp. 819 -825. 1997, which is incorporated herein in its entirety.
|0I47| In some aspects, the combiner 2062 can be configured to try and maintain the following equation to achieve a high PSNR and robust results.
J\ , (A\ « ) - /Cw (.Ϋ, «)j - μ-;, (J? - j\. , « ) - /Cw (.Ϋ - j,. , nj (40)
[0148J It is possible to decouple deinterlacing prediction schemes comprising inter- field interpolation from intra-field interpolation with a Wmed ÷ MC deinterlacing scheme In other words, the spatio-temporal Wmed filtering can be used mainly for intra-fiβld interpolation purposes, while inter-field interpolation can be performed during motion compensation. This reduces the peak signal-to-noise ratio of the Wmed result, but the visual quality after motion compensation is applied is more pleasing, because bad pixels from inaccurate inter-field prediction mode decisions will be removed from the Wmed filtering process
|0149] Chroma handling can be consistent with the collocated iuma handling. In terms of motion map generation, the motion level of a chroma pixel is obtained by observing the motion level of its four collocated iuma pixeis. The operation can be based on voting (chroma motion level borrows the dominant Iuma motion level). However, we propose to use a conservative approach as follows If any one of the four Iuma pixels has a fast motion level, the chroma motion level shall be fast-motion; other wise, if any one of the four iuma pixels has a slow motion level, the chroma motion level shall be slow-motion; otherwise the chroma motion level is static The conservative approach may not achieve the highest PSNR, but it avoids the risk of using IN TER prediction wherever there is ambiguity in chroma motion level [015Oj Multimedia data sequences were deinterlaced using the described Wmed algorithm described alone and the combined Wmed and motion compensated algorithm described herein The same multimedia data sequences were also deinterlaced using a pixel blending (or averaging) algorithm and a "no-deiπterlacing"' case where the fields were merely combined without any interpolation or blending. The resulting frames were analyzed to determine the PSNR and is shown in the following table.
|0151J Even though there is only marginal PSNR improvement by deuHerlacmg using the MC in addition to Wraeci, the visual quality of the deintedaced image produced by combining the Wmed and MC interpolation results is more visually pleasing to because as mentioned above, combining the Wroed results and the MC results suppresses alias and noise between even and odd fields.
|OΪ52| In some resampling aspects, a poly-phase resampler is implemented for picture size resizing In one example of downsarnpling. the ratio between the original and the resized picture can be p i <■/, where // and q are relatively prime integers. The total number of phases is p The cutoff frequency of the poly-phase filter in some aspects is 0.6 for resizing factors around 0.5. The cutoff frequency does not exactly match the resizing ratio in order to boost the high-frequency response of the resized sequence. This inevitably allows some aliasing. However, it is well-known that human eyes prefer sharp but a HtOe aiiased pictures to blurry and alias-free pictures. |0153j Figure 42 illustrates an example of poly-phase resampling, showing the phases if the resizing ration is J-J. The cutoff frequency illustrated in Figure 42 is % also. Original pixels are illustrated in the above Figure 42 with vertical axes. A sine function is also drawn centered around the axes to represent the filter waveform. Because we choose the cutoff frequency to be exactly the same as the resampling ration, the zeros of the sine function overlap the position of the pixels after resizing, illustrated in Figure 42 with crosses. To find a pixel value after resizing, the contribution can be summed up from the original pixels as shown in the following equation:
where /c is the cutoff frequency. The above 1-D poly-phase filter can be applied to both the horizontal dimension and the vertical dimension
|0154| Another aspect of resampling (resizing) is accounting for overscan. In an NTSC television signal, an image has 486 scan lines, and in digital video could have
720 pixels on each scan line. However, not all of the entire image is visible on the television due to mismatches between the size and the screen format The part of the image that is not visible is called overscan.
{01551 To help broadcasters put useful information in the area visible by as many televisions as possible, the Society of Motion Picture & 'Television Engineers (SMPI E) defined specific sizes of the action frame called the safe action area and the safe title area. See SMPTE recommended practice RP 27.3-1989 on Specifications for Safe Action and Safe Title Areas Test Pattern for Television Systems. The safe action area is defined by the SMPTE as the area in which "all significant action must take place." The safe title area is defined as the area where "all the useful information can be confined to ensure visibility on the majority of home television receivers *" For example, as illustrated in Figure 43, the safe action area 4310 occupies the center 90% of the screen, giving a 5% border all around The safe title area 4305 occupies the center 80% of the screen, giving a 10% border Figure
[0156J Referring now to Figure 44, because the safe title area is so small, to add more contents in the image, some broadcasts will include text in the safe action area, which, is inside the white rectangular w indow 4415. Usually black borders may be seen in the overscan. For example, in Figure 44, black borders appear at the upper side 4420 and lower side 4425 of the image. These black borders can be removed in the overscan, because H.264 video uses boundary extension in motion estimation. Extended black borders can increase the residual Conservatively, we can cut the boundary by 2%, and then do the resizing The filters for resizing can be generated accordingly. Truncation is performed to remove the overscan before poly-phase downsampiing.
Deblocking / Deringing
|0157] In one example of deblocking processing, a deblocking filter can be applied to all die 4x4 block edges of a frame, except edges at the boundary of the frame and any edges for which the deblocking filter process is disabled. This filtering process shall be performed on a macroblock basis after the completion of the frame construction process with all macrobiocks in a frame processed in order of increasing macroblock addresses. For each macroblock, vertical edges are filtered first, from left to right, and then horizontal edges are filtered from top to bottom The luma deblocking filter process is
4!
performed on four l ό-sample edges and the deblocking filler process for each chroma component is performed on two S-sample edges, for the horizontal direction and for the vertical direction, as shown in Figure 39. Sample values above and to the left of the current macroblock that may have already been modified by the deblocking process operation on previous macroblocks shall be used as input to the deblocking filter process on the current macroblock arid may be further modified during the filtering of the current macroblock- Sample values modified during filtering of vertical edges can be used as input for the filtering of the horizontal edges for the same macrobiock. A deblocking process can be invoked for the luma and chroma components separately. [0158} In an example of deringing processing, a 2-D filter can be adaptively applied to smooth out areas near edges. Edge pixels undergo little or no filtering in order to avoid blurring.
GOP Partitioncr
|0!59| Illustrative examples of processing is described below including bandwidth map generation, shot detection, and adaptive GOP partitioning, than can be included in the GOP parti tioner.
Bandwidth Map Generation følόOj Human visual quality V can be a function of both encoding complexity C and allocated bits B {also referred to as bandwidth). Figure 29 is a graph illustrating this relationship. It should be noted that the encoding complexity metric C considers spatial and temporal frequencies from the human vision point of view. For distortions more sensitive to human eyes, the complexity value is correspondingly higher. It can typically be assume that V is monotonically decreasing in C, and monotonicaϊly increasing in B.
|0161 J To achieve constant visual quality, a bandwidth (B1) is assigned to the i" object (frame or MB) to be encoded that satisfies the criteria expressed in the two equations immediately below
hi the two equations immediately above, C; is the encoding complexity of the I
th object, B Is the total available bandwidth, and V is the achieved visual quality for an object. J01C2] Human visual quality is difficult to formulate as an equation. Therefore, the above equation set is not precisely defined. However, if it is assumed that the 3-D
mode! is continuous in all variables, bandwidth ratio { V / ^ ) can be treated as unchanged within the neighborhood of a (C, V) pair, The bandwidth ratio βi is defined in the equation shown below:
|0163| Bit allocation can then be defined as expressed in the following equations:
where & indicates the "neighborhood."
fø!64J The encoding complexity is affected by human visual sensitivity, both spatial and temporal. Gi rod's human vision model is an example of a model that can be used to define the spatial complexity. This model considers the local spatial frequency and ambient lighting. The resulting metric is called D0<,lt. At a pre-processing point in the process, whether a picture is to be intia-coded or inter-coded is not known and bandwidth ratios for both are generated. Bits are allocated according to the ratio between β,κlRΛ of different video objects. For intra-coded pictures, the bandwidth ratio is expressed in the following equation:
fø!65| In the equation above, Y is the average luminance component of a macroblock, o??.\m-s is a weighing factor for the luminance square and /.\-
Ujf term
following it, /W/*.; is a normalization factor to guarantee For example, a
value for <r
;;V7R,
~ 4 achieves good visual qualify. Content information (e.g., a content classification) can be used to set αr
ATΛ i to a value that corresponds to a desired good visual quality level for the particular content of the video, hi one example, if the video content comprises a "talking head" news broadcast, the visual quality level may be set lower because the information image or displayable portion of the video may be deemed of less importance than the audio portion, and less bits can be allocated to encode the data In another example, if the video content comprises a sporting event, content information may be used to set α,
sτs.- to a value that corresponds to a higher visual quality level because the displayed images may be more important to a viewer, and accordingly more bits can be allocated to encode the data.
(0166] To understand this relationship, it should be noted that bandwidth is allocated logarithmically with encoding complexity The luminance squared term J'* reflects the fact that coefficients with larger magnitude use more bits to encode. To prevent the logarithm from getting negative values, unity is added to the term in the parenthesis.
Logarithms with other bases can also be used.
|0i67J The temporal complexity is determined by a measure of a frame difference metric, which measures the difference between two consecutive frames taking into account the amount of motion (e.g., motion vectors) along with a frame difference metric such as the sum of the absolute differences (SAD). jø!68| Bit allocation for inter-coded pictures can consider spatial as well as temporal complexity. This is expressed below:
[0169} In the above equation, MVp and MVN are the forward and the backward motion vectors for the current MB. It can be noted that Y2 in the intra-coded bandwidth formula is replaced by sum of squared differences (SSD) To understand the role of
|Λ'/rr f A/Ϊ'v j ~ in the above equation, note the next characteristics of human visual
system: areas undergoing smooth, predictable motion (small ji/V/f'V t Mϊ'
v ' ) attract
attention and can he tracked by the eye and typically cannot tolerate any more distortion than stationary regions. However, areas undergoing fast or unpredictable motion (large cannot be tracked and can tolerate significant quantization.
Experiments show that (x.-
XTPR ~ 1 , f ~ 0 00) achieves good visual quality
Shot Detection
|0170| An illustrative example of shot detection is described below Such components and process can be included in the GOP parti tioner 412 {Figure 4) f0J 7IJ The motion compensator 23 can be configured to determine bi-directional motion information about frames in the video The motion compensator 23 can also be configured to determine one or more difference metrics, for example, the sum of absolute differences (SAD) or the sum of absolute differences (SSD). and calculate other information including luminance information for one or more frames (e g., macroblock (MB) luminance averages or differences), a luminance histogram difference, and a frame difference metric, examples of which are described in reference to Equations 1-3 The shot classifier can be configured to classify frames in the video into two or more categories of "shots" using information determined by the motion compensator The encoder is configured to adaptively encode the plurality of frames based on the shot classifications. The motion compensator, shot classifier, and encoder are described below in reference to Equations 1-10.
J0172J Figure 28 is a block diagram of a preprocessor 202 comprising a processor 2831 configured for shot detection and other preprocessing operations according to some aspects. A digital video source can be provided by a source externa! to the preprocessor 202 as shown in Figure 4 and communicated to a communications module 2836 in the preprocessor 202. The preprocessor 202 contains a storage medium 2S25 which communicates w ith the processor 2831 , both of which communicate with the communications module 2836. The processor 2831 includes a motion compensator 2032, a shot classifier 2833, and other modules for preprocessing 2034, which can operate to generate motion information, classify shots in frames of the video data, and perform other preprocessing tests as described herein The motion compensator, shot classier, and other modules can contain processes similar to corresponding modules in
Figure 4, and can process \ kleo to determine information described below In particular, the processor 283 ] can have a configuration to obtain metrics indicative of a difference between adjacent frames, of a plurality of video frames, the metrics comprising hi -directions! motion information and luminance information, determine shot changes in the plurality of video frames based on said metrics, and adaptive!}' encode the plurality of frames based on the shot changes In some aspects, the metrics can be calculated by a device or process externa! to the processor 2831 , which can also be external to the preprocessor 202, and communicated to the processor 2831, either directly or indirectly via another de\ ice or memory The metrics can also be calculated b> the processor 283 1. for example, by the motion compensator 2832 |017JJ The preprocessor 202 provides video and metadata for further processing, encoding, and transmission to other devices, for example, terminals 6 (Figure 1 ) The encoded video can be, in some aspects, scalable multi-layered encoded video which can comprise a base layer and an enhancement laver Scalable laver encoding is further described in co-pending IJ S Patent Application No [Attorney docket no 050078] entitled "SC ΎLΛBEJ VΠ S<O COUINI. Wn H Tw o LAY* K EM.ΌE >INI. AND SJNGI \ LAYI- R D)VOl >i W owned by the assignee hereof, and which is incorporated by reference in its entirety herein
|0174j The various illustrative logical blocks, components, modules, and circuits described in connection with Figure 28, and other examples and figures disclosed herein may be implemented or performed, in some aspects, with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array {I -PGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein A general purpose processor such as the one shown in Hgure 28 ma> be a microprocessor, hut in the alternative, the processor ma> be any conventional processor, controller, microcontioller, or state machine A processor may also be implemented as a combination of computing devices, e g . a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration fθl75| Video encoding usually operates on a stmctured group of pictures (GOP) A Gi)P norma!3> stalls with an intra-coded frame (l-llame), followed by a series of P
(predictive) or B (bi-directional) frames Typically, an I -frame can store all the data to display the frame, a β -frame relies on data in the preceding and following frames (e g., only containing data changed from the preceding frame or is different from data in the next frame), and a P-frame contains data that has changed from the preceding frame. |0176j In common usage, 1-frames are interspersed with P-frames and B-frames in encoded video. In terms of size (e.g., number of bits used to encode the frame), I- frames are typically much larger than P-frames, which in turn are larger than B-frames. for efficient encoding, transmission and decoding processing, the length of a GOF should be long enough to reduce the efficient loss from big I-frames, and short enough to tight mismatch between encoder and decoder, or channel impairment In addition, macro blocks (MB) in P frames can be intra coded for the same reason. (0177J Scene change detection can be used for a video encoder to determine a proper GOP length and insert I-frames based on the GOP length, instead of inserting an I-frame at a fixed interval. In a practical streaming video system, the communication channel is usually impaired by bit errors or packet losses Where to place 1 frames or I MBs may significantly impact decoded video quality and viewing experience. One encoding scheme is to use intra-coded frames for pictures or portions of pictures that have significant change from collocated previous pictures or picture portions Normally these regions cannot be predicted effectively and efficiently with motion estimation, and encoding can be done more efficiently if such regions are exempted from inter-frame coding techniques (e.g., encoding using B-frames and P-frames) In the context of channel impairment, those regions are likely to suffer from error propagation, which can be reduced or eliminated (or nearly so) by intra-frame encoding. |0178J Portions of the GOP video can be classified into two or more categories, where each region can have different intra-frame encoding criteria that may depend on the particular implementation. As an example, the video can be classified into three categories' abrupt scene changes, cross-fading and other slow scene changes, and camera flashlights. Abrupt scene changes includes frames that are significantly different from the previous frame, usually caused by a camera operation Since the content of these frames is different from that of the previous frame, the abrupt scene change frames should be encoded as I frames. Cross- fading and other slow scene changes includes slow switching of scenes, usually caused by computer processing of
camera shots. Gradual blending of two different scenes may look more pleasing to human eyes, but poses a challenge to video coding. Motion compensation cannot reduce the bitrate of those frames effectively, and more intra MBs can be updated for these frames.
|0179j Camera flashlights, or camera flash events, occur when the content of a frame includes camera flashes. Such flashes are relatively short in duration (e.g., one frame) and extremely bright such that the pixels in a frame portraying the flashes exhibit unusually high luminance relative to a corresponding area on an adjacent frame. Camera flashlights shift the luminance of a picture suddenly and swiftly Usually the duration of a camera flashlight is shorter than the temporal masking duration of the human vision system (HVS), which is typically defined to be 44 ms. Human eyes are not sensitive to the quality of these short bursts of brightness and therefore they can be encoded coarsely. Because the flashlight frames cannot be handled effectively with motion compensation and they are bad prediction candidate for future frames, coarse encoding of these frames does not reduce the encoding efficiency of future frames Scenes classified as flashlights should not be used to predict other frames because of the "artificial" high luminance, and other frames cannot effectively he used to predict these frames for the same reason. Once identified, these frames can be taken out because they can require a relatively high amount of processing. One option is to remove the camera flashlight frames and encode a DC coefficient in their place; such a solution is simple, computationally fast and saves many bits
{01801 When any of the above frames are detected, a shot event is declared. Shot detection is not only useful to improve encoding quality, it can also aid in identifying \ ideo content searching and indexing. One aspect of a scene detection process is described hereinbelovv. fϋlSl] Figure 30 illustrates a process 3000 that operates on a GOP and can be used in some aspects to encode video based υn shot detection in video frames, where portions of the process 3000 {or sub-processes) are described and illustrated with reference to Figures 30-40. The processor 283 1 can be configured to incorporate process 3000. After process 3000 starts, it proceeds to block 3042 where metrics (information) are obtained for the video frames, the metrics including information indicative of a difference between adjacent frames. The metrics includes bidirectional motion
information and luminance-based information that is subsequently to determine changes that occurred between adjacent frames which can be used for shot classification. Such metrics can be obtained from another device or process, or calculated by. for example, processor 2831. Illustrative examples of metrics generation are described in reference to process A in Figure 31
|01S2j Process 3000 then proceeds to block 3044 where shot changes in the video are determined based on the metrics. A video frame can be classified into two or more categories of what type of shot is contained in the frame, for example, an abrupt scene change, a slowly changing scene, or a scene containing high luminance values (camera flashes). Certain implementations encoding may necessitate other categories. An illustrative example of shot classification is described in reference to process B in Figure 32, and in more detai! with reference to processes D, E5 and F in Figures 34-36, respectively
|0183| Once a frame is classified, process 3000 proceeds to block 3046 where the frame can be encoded, or designated for encoding, using the shot classification results Such results can influence whether to encode the frame with an intra-coded frame or a predictive frame (e.g , P-frame or B-frame). Process O in Figure 33 shows an example of an encoding scheme using the shot results.
|0IS4| Figure 31 illustrates an example of a process for obtaining metrics of the \ ideo Figure 31 illustrates certain steps that occur in block 3042 of Figure 30 Referring still to Figure 31. in block 3152, process A obtains or determines bidirectional motion estimation and compensation information of the video. The motion compensator 2832 of Figure 28 can be configured to perform bi-directional motion estimation on the frames and determine motion compensation information that can be used for subsequent shot classification. Process A then proceeds to block 31 54 where it generates luminance information including a luminance difference histogram for a current or selected frame and one or more adjacent frames Lastly, process A then continues to block 3156 where a metric is calculated that indicative of the shot contained in the frame. One such metric is a frame difference metric which is shown in two examples in Equations 4 and 10 Illustrative examples of determining motion information, luminance information, and a frame difference metric are described below Motion Compensation
|0IS5| To perform bi-directional motion estimation/compensation, a video sequence can be preprocessed with a bi-directional motion compensator that matches every 8x8 biock of the current frame with blocks in two of the frames most adjacent neighboring frames, one in the past, and one in the future Hie motion compensator produces motion vectors and difference metrics for every block. Figure 37 illustrates this concept, showing an example of matching pixels of a current frame C to a past frame P and a future (or next) frame N, and depicts motion vectors to the matched pixels (past motion vector MV,, and future motion vector MVN. A brief description of an illustrative aspect of bi-directional motion vector generation and related encoding follows below. [0186J Figure 40 illustrates an example of a motion vector determination process and predictive frame encoding in, for example, MPEG-4. The process described in Figure 40 is a more detailed illustration of an example process that can take place in block 3152 of Figure 31. In Figure 40, current picture 4034 is made up of 5 x 5 macroblocks, where the number of macroblocks in this example is arbitrary. A macroblock is made up of 16 x 16 pixels Pixels can be defined by an 8-bit luminance value (Y) and two 8-biι chrominance values (Cr and Cb).
|<JIS7] In MPKG, Y. Cr and Cb components can be stored in a 4;2 0 format, where the Cr and Cb components are down-sampled by 2 in the X and the Y directions. Hence, each macroblock would consist of 256 Y components, 64 Cr components and 64 Cb components. Macroblock 4036 of current picture 4034 is predicted from reference picture 4032 at a different time point than current picture 4034 A search is made in reference picture 4032 to locate best matching macroblock 4038 that is closest, in terms of Y, Cr and Cb values to current macroblock 4036 being encoded. Hie location of best matching macroblock 138 in reference picture 4032 is encoded in motion vector 4040. Reference picture 4032 can be an 1 -frame or P-frame that a decoder will have reconstructed prior to the construction of current picture 4034 Best matching macroblock 4038 is subtracted from current macrob1oek40(a difference for each of the Y. Cr and Cb components is calculated) resulting in residual error 4042 Residua! error 4042 is encoded with 2D Discrete Cosine Transform (DCT) 4044 and then quantized 4046 Quantization 4046 can be performed to provide spatial compression by, for example, allotting fewer bits to the high frequency coefficients while allotting more bits to the low frequency coefficients. The quantized coefficients of residual error 4042,
along with motion vector 4040 and reference picture 4034 identifying information, are encoded information representing current macroblock 4036. The encoded information can be stored in memory for future use or operated on for purposes of, for example, error correction or image enhancement, or transmitted over network 140 |0188j 'The encoded quantized coefficients of residual error 4042. along with encoded motion vector 4040 can be used to reconstruct current macroblock 4036 in the encoder for use as part of a reference frame for subsequent motion estimation and compensation. The encoder can emulate the procedures of a decoder for this P-frame reconstruction. The emulation of the decoder will result in both the encoder and decoder working with the same reference picture. The reconstruction process, -whether done in an encoder, for further inter-coding, or in a decoder, is presented here. Reconstruction of a P-frame can be started after the reference frame (or a portion of a picture or frame that is being referenced) is reconstructed. The encoded quantized coefficients are dequantized 4050 and then 2D Inverse DCT, or IDCT, 4052 is performed resulting in decoded or reconstructed residual error 4054. Encoded motion vector 4040 is decoded and used to locate the already reconstructed best matching macroblock 4056 in the already reconstructed reference picture 4032 deconstructed residual error 4054 is then added to reconstructed best matching raacroblock 4056 to form reconstructed macrob5ock 4058. Reeonstmeted macroblock 4058 can be stored in memory, displayed independently or in a picture with other reconstructed macrobiocks, or processed further for image enhancement.
|0189| Encoding using B-frames (or any section coded with bi-directional prediction) can exploit temporal redundancy between a region in a current picture and a best matching prediction region in a previous picture and a best matching prediction region in a subsequent picture The subsequent best matching prediction region and the previous best matching prediction region are combined to form a combined bidirectional predicted region.. The difference between the current picture region and the best matching combined bi-directional prediction region is a residual error (or prediction error). The locations of the best matching prediction region in the subsequent reference picture and the best matching prediction region in the previous reference picture can be encoded in two motion vectors. Luminance Histogram Difference
5 !
|β!9ϋj The motion compensator can produce a difference metric for every Mock, The difference metric can be a sum of square difference (SSD) or a sum of absolute difference (SAD) Without loss of generality, here SAD is used as an example {01911 F°r every frame, a SAD ratio is calculated as below:
where SADp and SADx are the sum of absolute differences of the forward and the backward difference metric, respectively. It should be noted that the denominator contains a small positive number ε to prevent Ui e "divide-by-zero" error. The nominator also contains an ε to balance the effect of the unity in the denominator. For example, if the previous frame, the current frame, and the next frame are identical, motion search should yield SADp ~ SA D\ ~ 0 in this case, the above calculation generators γ -•-•-■ S instead of 0 or infinity
|β!92J A luminance histogram can be calculated for ever}' frame. Typically the multimedia images have a luminance depth (e g., number of "bins") of eight bits The luminance depth used for calculating the luminance histogram according tυ some aspects can be set to 16 to obtain the histogram In other aspects, the luminance depth can be set to an appropriate number which may depend upon the type of data being processed, the computational power available, or other predetermined criteria. In some aspects, the luminance depth can be set dynamically based on a calculated or received metric, such as the content of the data
|0193j Equation 49 illustrates one example of calculating a luminance histogram difference (lambda):
where Np1 is the number of blocks in the • ith 1 bin for the previous frame, and Nc; is the number of blocks in the f' bin for the current frame, and N is the total number of blocks
in a frame If the luminance histogram difference of the pre\ious and the current frame aic completely dissimilar (or disjoint), then λ - 2 føl94j Λ frame difference metiic D. discussed in reference to Mock 56 of Figure 5, can be calculated . as shown in Equation 50
,■ ■ , where Λ is a constant chosen bv application, and
|0195j figure 32 iHustrates an example of a process B which determines three categories of shot (or scene) changes using metrics obtained or determined for the \ idco Figure 32 illustrates certain steps occurring in one aspect of block 3044 of Figure 30 Again referring to Figure 32, in block 3262, process B first determines if the frame meets criteria to be designated an abrupt scene change Process D in Figure 34 illustrates an example of this determination Process B then proceeds to block 3264 where it determines of the frame is part of a slowly changing scene Process C m Figure 35 illmtiates an example of determining a slow changing scene Finally, at block 3366 process B determines if the frame contains camera flashes, in other words, large luminance \alues differing from the previous fiarne Process F in Figure 36 illustrates an example of determining a frame containing camera flashes An illustrative example of these processes are described below
Abrupt Scene Change
|0I96| Figme 34 is a flow diagram illustrating a process of determining abrupt scene changes Hgure 34 further elaborates certain steps that can occur in some aspects of block 3262 of Figure 32, At block 3482 checks if the frame difference metric D meets the criterion shown in Equation 51
33
where A is a constant chosen by application, and 1) is a threshold. If the criteria is met, at block 34S4 process D designates the frame as an abrupt scene change and, in this example, no further shot classification is necessary.
[0197} In one example simulation shows, setting A :: 1 , and T] ::: 5 achieve good detection performance. If the current frame is an abrupt scene change frame, then y,-
should be large and v, should be small. The rati can be used instead of ;/,, atone
so that the metric is normalized to the activity level of the context. fOJ98J It should be noted that the above criterion uses the luminance histogram difference lambda (λ) in a non-linear way Figure 39 illustrates λ * (2 λ -M) is a convex function When λ is small (e.g , close to zero), it is barely preemphasis. The larger λ becomes, the more emphasis is conducted by the function. With this pre-emphasis, for any λ larger than 1.4, an abrupt scene change is detected if the threshold T
1 is set to be
Cross-Fading and Slow Scene Changes
[0199} Figure 35 further illustrates further details of some aspects that can occur in block 3264 of Figure 32. Referring to Figure 35. at block 3502 process E determines if the frame is part of a series of frames depicting a slow scene change. Process E determines that the current frame is a cross-fading or other slow scene change if the frame difference metric D is less than the first threshold value T> and greater or equal to a second threshold value /> as illustrated in Equation 52:
for a certain number of continuous frames, where 7'- is the same threshold used above and 1 ':: is another threshold value. Typically, the exact value of 7; and 7> are determined by normal experimentation because of the difference in implementations that are possible. If the criteria is met, at block 94 process E classifies the frame as part of a slow changing scene shot classification for the selected frame ends.
Camera Flashlight Events
|0200| Process F shown in Figure 36 is an example of a process that can determine if the current frame comprises camera flashlights. In this illustrative aspect camera, the luminance histogram statistics are used to detenu ine if the current frame comprises camera flashlights. Process F determines camera flash events are in the selected frame by first determining if the luminance of a current frame is greater than the luminance of the previous frame and the luminance of the next frame, shown at block 3602. if not, the frame is not a camera flash event; but if so it may be. At block 3604, Process F determines whether the backwards difference metric is greater than a threshold Ti, and if the forwards difference metric is greater than a threshold T.i; if both these conditions are satisfied, at block 3606 process F classifies the current frame as having camera flashlights, ϊn one example, at block 3602, process F determines if the average luminance of the current frame minus the average luminance of the previous frame is equal of exceeds a threshold 7), and process F determines if the average luminance of minus the average luminance of the next frame is greater than or equal to the threshold T;,, as shown in Equations 53 and 54:
[020 I i if the criterion is not met. the current frame is not classified as comprising camera flashlights and process F returns If the criterion is met, process F proceeds to block 3604 where it determines if a backwards difference metric SADt- and a forward difference metric SADs are greater than a certain threshold T4, as illustrated in liquations 55 and 56 below:
where Yv is the average luminance of the current frame, ϊp is the average luminance of the previous frame, Ys is the average luminance of the next frame, and SADp and SA D \ are the forward and backward difference metrics associated with the current frame. If the criterion is not met, process F returns
|0202| Values of T* are typically determined by normal experimentation as the implementation of the described processes can result in differences in operating parameters including threshold values. SAD values are included in the determination because camera flashes typically take only one frame, and due to the luminance difference, this frame cannot he predicted well using motion compensation from both the forward and the backward direction.
(0203] ϊn some aspects, one or more of the threshold values Tj, T2, T3, and T4 are predetermined and such values are incorporated into the shot classifier in the encoding device Typically, these threshold values are selected through testing of a particular implementation of shot detection, In some aspects, one or more of the threshold values Tj, T3, T*, and T4 can be set during processing (e.g , dynamically) based on using information (e.g , metadata) supplied to the shot classifier or based on information calculated by the shot classifier itself
|0204| Referring now to Figure 33 which shows a process C for determining encoding parameters for the video, or for encoding the video, based on the shot classification of the selected frame. At block 3370 process C determines if the selected frame was classified as an abrupt scene change If so, at block 3371 the current frame is classified as an abrupt scene change, and the frame can be encoded as an I -frame and a GOP boundary can be determined. If not process C" proceeds to block 3372; if the current frame is classified as a portion of a slowly changing scene, at block 3373 the current frame, and other frames in the slow changing scene can be encoded as a predictive frame (e.g., P-frame or B-frame). Process C then proceeds to block 3374 where it checks if the current frame was classified as a flashlight scene comprising camera flashes. If so, at block 3375 the frame can be identified for special processing, for example, removal, replication of a previous frame or encoding a particular coefficient for the frame If not, no classification of the current frame was made and the selected frame can be encoded in accordance with other criteria, encoded as an 1-franie, or dropped Process C can be implemented in an encoder.
|0205| In the above-described aspect, the amount of difference between the frame to be compressed and its adjacent two frames is indicated by a frame difference metric D. If a significant amount of a one-way luminance change is detected, it signifies a cross- fade effect in the frame The more prominent the cross-fade is, the more gain may be
achieved by using B frames. In some aspects, a modified frame difference metric is used as shown in Equation 57 below.
where t/p ~ jϊV - IVj and Js ~ |^c ~ ϊ \j are the Uinia difference between the current frame and the previous frame, and the luma difference between the current frame and the next frame, respectively, Δ represents a constant that can be determined in normal experimentation as it can depend on the implementation, and a is a weighting variable having a value between 0 and 1.
|0206| The modified frame difference metric D< is only different from the original frame difference metric D if a consistent trend of luma shift is observed and the shift strength is large enough /); is equal to or less than /), If the change of luma is steady (c/p = c/\), the modified frame difference metric !)• is lower than the original frame difference metric D with the lowest ratio of (1 ~ct)
[0207 j Table 1 below shows performance improvement by adding abaipt scene change detection. The total number of ϊ-fraπies in both the non-scene-change (NSC) and the scene-change ( SC) cases are approximately the same. In the NSC case. 1 -frames are distributed uniformly among the whole sequence, while in the SC case, I-frames are only assigned to abrupt scene change frames.
[0208 j It can be seen that typically 0 2 •- 0.3 dB improvement can be achieve PSNR- vvise. Simulation results show that the shot detector is very accurate in determining the shot events above-mentioned Simulation of five clips w-ith normal cross-fade effect shows that at Λ ~ S.5 andar - 0.4, a PSNR gain of 0.22603 1 dB is achieved at the same bitrate.
Table !: Simulation Results Of Abrupt Scene Change Detection
Adaptive GOP Structure
{0209] An illustrative example of adaptive GOP structure operations are described below. Such operations can be included in the GOP parti tioner 412 of Figure 412. MPEG2, an older video compression standard, does not require that the GOP have a regular structure, though one can be imposed. The MPEG2 sequence always begins with an 1 frame, i.e., one which has been encoded without reference to previous pictures. The MPEG2 GOP format is usually prearranged at the encoder by fixing the spacing in the GOP of the P or predictive pictures that follow the I frame. P frames are pictures that have been in part predicted from previous 1 or P pictures. The frames between the starting I frame, and the succeeding P frames are encoded as B frames. A "B" frame (B stands for bi-directional) can use the previous and next ϊ or P pictures either individually or simultaneously as reference. The number of bits used to encode an 1-fraine on the average exceeds the number of bits used to encode a P-frame; likewise the number of bits used to encode a P-frame on the average exceeds that of a B-fraroe A skipped frame, if it is used, may use no bits for its representation.
|0210| One benefit of using P-frames and B-frames, and in more recent compression algorithms, the skipping of frames is that it is possible to reduce video transmission sizes. When temporal redundancy is high - e.g., when there is little change from picture to picture - use of P5 B, or skipped pictures efficiently represents the video stream, because 1 or P pictures decoded earlier are used later as references to decode other P or B pictures.
[021 I f A group of pictures parti tioner adapt! vely encodes frames to minimize temporal redundancy. Differences between frames are quantified and a decision to
represent the picture by a L P. B, or skipped frame is automatically made after suitable tests are performed on the quantified differences. The processing in a GOP parti tioner and is aided by other operations of the preprocessor 202, which provides filtering for noise removal.
|0212j Adaptive encoding process has advantages not available in a "fixed" encoding process A fixed process ignores the possibility that little change in content has taken place; however, an adaptive procedure allows far more B frames to be inserted between each 1 and P, or two P frames, thereby reducing the number of bits used to adequately represent the sequence of frames. Conversely, e g., in a fixed encoding process, when the change in \ideo content is significant, the efficiency of P frames is greatly reduced because the difference between the predicted and the reference frames is too large. Under these conditions, matching objects may fall out of the motion search regions, or the similarity between matching objects is reduced due to distortion caused by changes in camera angle. An adaptive encoding process may beneficially be used to optionally determine when P frames should be encoded. f02l3J In the system disclosed herein, the types of conditions described above are automatically sensed The adaptive encoding process described herein is flexible and is made to adapt to these changes in content The adaptive encoding process evaluates a frame difference metric, which can be thought of as measure of distance between frames, with the same additive properties of distance In concept, given frames FY F'2, and F? having the inter-frame distances ά\z and da;?, the distance between Ff and F.Ϊ is taken as being at least dι; ÷ d;.v Frame assignments are made on the basis of this distance-ϋke metric and other measures
|0214j The GOP partitioner 412 operates by assigning picture types to frames as they are received. The picture type indicates the method of prediction that may be used to code each block
|0215J 1-pictυres are coded without reference to other pictures. Since they stand alone they provide access points in the data stream where decoding can begin An 1 encoding type is assigned to a frame if the "distance" to its predecessor frame exceeds a scene change threshold. fO216| P-pictυres can use the previous I or P pictures for motion compensated prediction They use blocks in the previous fields or frames that may be displaced from
the block being predicted as a basis for encoding. After the reference block is subtracted from the block being considered, the residual block is encoded, typically using the discrete cosine transform for the elimination of spatial redundancy, A P encoding types is assigned to a frame if the "distance" between it and the last frame assigned to be a F frame exceeds a second threshold, which is typically less than the first.
|O2I7| B-frame pictures can use the previous and next P- or l-pictuies for motion compensation as described above. A block in a B picture can be forward, backward or bi-directionally predicted, or it could be intra-coded without reference to other frames. In H 264 a reference block can be a linear combination of as many as 32 blocks from as many frames. If the frame cannot be assigned to be an 1 or P type, it is assigned to be a B type, if the "distance" from it to its immediate predecessor is greater than a third threshold, which typically is less than the second threshold. If the frame cannot be assigned to become a B-frame encoded, it is assigned to "skip frame'" status. This frame can be skipped because it is virtually a copy of a previous frame
|02IS| Evaluating a metric that quantifies the difference between adjacent frames in the display order is the first part of this processing that takes place in CJOP parti tioner 412 This metric is the distance referred to above, with it, even1 frame is evaluated for its proper type. Thus, the spacing between the i and adjacent P, or two successive P frames, can be variable. Computing the metric begins by processing the video frames with a block-based motion compensator, a block being the basic unit of video compression, composed usually of 16x16 pixels, though other block sizes such as 8x8, 4x4 and 8x16 are possible. For frames consisting of two deniteilaced fields that are present at the output, the motion compensation is done on a field basis, the search for the reference blocks taking place in fields rather than frames For a block in the first field of the current frame a forward reference block is found in fields of the frame that follows it; likewise a backward reference block found in fields of the frame that immediately precedes the current field The current blocks are assembled into a compensated field. The process continues with the second field of the frame. The two compensated fields are combined to form a forward and a backward compensated frame.
[0219J For frames created in the inverse telecine 406, the search for reference blocks may be on a frame basis only, since only reconstructed film frames are generated. Two reference blocks and two differences, forward and backward, are found, leading also to a forward and backward compensated frame. In summary, the motion compensator produces motion vectors and difference metrics for every block Note that the differences in the metric are evaluated between a block in the field or frame being considered and a block that best matches it, either in a preceding field or frame or a field or frame that immediately follows it, depending on whether a forward or backward difference is being evaluated. Only luminance values enter into this calculation [022Of The motion compensation step thus generates two sets of differences. These are between blocks of current values of luminance and the luminance values in reference blocks taken from frames that are immediately ahead and immediately behind the current one in time. The absolute value of each forward and each backward difference is determined for each pixel in a block and each is separately summed over the entire frame. Both fields are included in the two summations when the deinterlaced NTSC fields that comprise a frame are processed. In this way, SADp, and SADs, the summed absolute values of the forward and backward differences are found f 022 i I For every frame a SAD ratio is calculated using the relationship,
where SATJp and SAD\ are the summed absolute values of the forward and backward differences respectively A small positive number is added to the numerator *
■; to prevent the "divide-by-zero" error A similar ε term is added to the denominator, further reducing the sensitivity of γ when either SADp or SAD^ is close to zero. [02221 In an alternate aspect, the difference can be the SSD, the sum of squared differences, and SAD, the sum of absolute differences, or the SATD, in which the blocks of pixel values are transformed by applying the two dimensional Discrete Cosine Transform to them before differences in block elements are taken The sums are evaluated over the area of active video, though a smaller area may be used in other aspects
|0223| The luminance histogram of every frame as received {non-motion compensated) is also computed. The histogram operates on the DC coefficient, i.e., the
(0.0) coefficient, in the 16x 16 array of coefficients that is the result of applying the two dimensional Discrete Cosine Transform to the block of luminance values if it were available, Equivalentiy the average value of the 256 values of luminance in the 16x16 block may be used in the histogram. For images whose luminance depth is eight bits, the number of bins is set at 16, The next metric evaluates the histogram difference
|0224j In the above, Np, is the number of blocks from the previous frame in the i'h bin, and NC! is the number of blocks from the current frame that belong in the /'* bin, N is the total number of blocks in a frame. jβ225| These intermediate results are assembled to form the current frame difference metric as
where y
c is the SAD ratio based on the current frame and γp is the SAD ratio based on the previous frame If a scene has smooth motion and its luma histogram barely change, then A/ - 1. If the current frame displays an abrupt scene change, then γ
c will be large v , and yi> should be small The ratio ^- instead of y
t. alone is used so that the metric is
normalized to the activity level of the context, f0226| Dataflow 4100 in Figure 40 illustrates certain components that may be used to compute the frame difference metric. Preprocessor 4125 delivers interlaced fields in the case of video having a NTSC source, and frames of film images when the source of the video is the result of inverse telecine to the bi-directional motion compensator 4133. The bi-directional motion compensator 4133 operates on a field (or frame in the case of a cinematic source of video) by breaking it into blocks of 16x 16 pixels and comparing each block to all 16x16 blocks in a defined area of a field of the previous frame. The block which provides the best match is selected and subtracted from the current block. The absolute values of the differences is taken and the result summed over the 256 pixels that comprise the current block. When this is done for all current blocks of the field, and then for both fields the quantity SAD\, the backward difference metric has been computed by a backward difference module 4137. A similar procedure may be
performed by a forward difference module 4136. The forward difference module 4136 uses the frame which is immediately ahead of the current one in time as a source of reference blocks to develop the SADp5 the forward difference metric The same estimation process, albeit done using the recovered film frames, takes place when the inpυt frames are formed in the inverse teleάne The histograms that can be used to complete the computation of the frame difference metric may be formed in histogram difference module 4141 Each 16x 16 block is assigned to a bin based on the average value of its luminance. This information is formed by adding all 256 pixel luminance values in a block together, normalizing it by 256 if desired, and incrementing the count of the bin into which the average value would have been placed The calculation is done once for each pre-motion compensated frame, the histogram for the current frame becoming the histogram for the previous frame when a new current frame arrives The two histograms are differenced and normalized by the number of blocks in histogram di tTerence module 4141 Io form λ, defined by Equation 59 These results are combined in frame difference combiner 4143, which uses the inteπnediale results found in histogram difference module 413Q, forward and backward difference modules 4136 and 4136 to evaluate the current frame difference defined in Fxjυation 60 [0227 f The system of flowchart 4100 and components or steps thereof, can be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. Each functional component of flowchart 4100, including the preprocessor 4135. the bidirectional motion compensator 4133, the toward and backward difference metric modules 4136 and 4137. the histogram difference module 4141, and the frame difference metric combiner 4143, may be realized as a standalone component, incorporated as hardware, firmware, middleware in a component of another device, or be implemented in microcode or software that is executed on the processor, or a combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments that perform the desired tasks may be stored in a machine readable medium such as a storage medium. A code segment may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a
hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents.
|0228| The received and processed data can be stored in a storage medium which can include, for example, a chip configured storage medium (e.g , ROM, RAM) or a disc-type storage medium (e.g , magnetic or optical) connected to a processor. In some aspects, the combiner 4143 can contain part or all of the storage medium. Flowchart 4200 in Figure 4 ! illustrates a process of assigning compression types to frames In one aspect M, the current frame difference defined in Equation 3, is the basis for a! I decisions made with respect to frame assignments. As decision block 4253 indicates, if a frame under consideration is the first in a sequence, the decision path marked YES is followed to block 4255, thereby declaring the frame to be an 1 frame. The accumulated frame differences is set to zero in block 4257, and the process returns (in block 4258) to the start block 4253 If the frame being considered is not the first frame in a sequence, the path marked NO is followed from block 4253 where the decision was made, and in test block 425*5 the current frame difference is tested against the scene change threshold If the current frame difference is larger than that threshold, the decision path marked YKS is followed to block 4255, again leading to the assignment of an I -frame. If the current frame difference is less than the scene change threshold, the NO path is followed to block 426! where the current frame difference is added the accumulated frame difference
|0229| Continuing through the flowchart at decision block 4263, the accumulated frame difference is compared with threshold t, which is in general less than the scene change threshold. If the accumulated frame difference is larger than t, control transfers to block 4265. and the frame is assigned to be a P frame; the accumulated frame difference is then reset to zero in step 4267. if the accumulated frame difference is less than t, control transfers from block 4263 to block 4269 There the current frame difference is compared with T, which is less than t If the current frame difference is smaller than τ, the frame is assigned to be skipped in block 4273, if the current frame difference is larger than τ, the frame is assigned to be a β frame
|0230J In an alternate aspect another frame encoding complexity indicator M* is defined as
where α is a scaler, SAl )p is the SAP with forward motion compensation, Afl ~
P is the sum of lengths measured in pixels of the motion \ ectojs from the forward motion compensation, and Λ and m are two threshold numbers that render the frame encoding complexity indicator to zero if SADp is lower than .v or Λ/ϊ> is lower than m. M* would he used in place of the cuπent frame difference in flowchart 4200 of Figure 41 Λs can be seen, M* is different from M onl\ if the forward motion compensation shows a low le\ el of rno\ ement In this case, Λ/;s smaller than Λ/
|02311 It is noted that the shot detection and encoding aspects described herein may be described as a process which is depicted as a flowchart, a flow diagiam, a strυciuie diagram, or a block diagram Although the flowcharts shown in the figures rna\ describe operations as a sequential process, many operations can be performed in parallel or concurrently In addition, the order of operations may be re-arranged. Λ process is typically terminated when its operations are completed A process may correspond to a method, a function, a procedure, a subroutine, a subprogiam, etc When a process corresponds to a function, its termination corresponds to a return of the function to the calling function or the main function
[0232J It should also be apparent to those skilled in the art that one or more elements of a device disclosed herein may be rearranged without affecting the opeiation of the device Similarly, one or more elements of a de\ice disclosed herein may be combined without affecting the operation of the device. Those of ordinary skill in the art would understand that infojmation and multimedia data may be represented using am of a v ariety of different technologies and techniques Those of ordinary skill would further appreciate that the \ arious illustraine logical blocks, modules, and algorithm steps described in connection with the examples disclosed herein may be implemented as electronic hardware, firmware, computei softwate, middleware, microcode, or combinations thereof Fo clearly illustrate this interchangeabil'ity of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described abo\e generalh in terms of their functionality . Whether such functionalit\ is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system Skilled artisans may implement the described functionality m \arying vva> s for each particular application, but such
implementation decisions should not he interpreted as causing a departure from the scope of the disclosed methods
|0233| For example, the steps of a method or algorithm described in connection with the shot detection and encoding examples and Figures disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The methods and algorithms are particularly applicable to communication technology including wireless transmissions of video to cell phones, computers, laptop computers, PDA's and all tvφes of personal and business communication devices, software module may reside in RAM memory, flash memory, ROM memory, EPROM memory. EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an Application Specific. Integrated Circuit (ASIC). The ASIC may reside in a wireless modem In the alternative, the processor and the storage medium may reside as discrete components in the wireless modem.
|0234| In addition, the various illustrative logical blocks, components, modules, and circuits described in connection with the examples disclosed herein may he implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASK"), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g , a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
|0235J The previous description of the disclosed examples is provided to enable any person of ordinary skill in the art to make or use the disclosed methods and apparatus. Various modifications to these examples will be readily apparent to those skilled in the
art, and the principles defined herein may be applied to other examples and additional elements may be added without departing from the spirit or scope of the disclosed method and apparatus The description of the aspects is intended to be illustrative, and not to limit the scope of the claims