JP2005102192A

JP2005102192A - Content receiving apparatus, video/audio output timing control method, and content providing system

Info

Publication number: JP2005102192A
Application number: JP2004256203A
Authority: JP
Inventors: Ikuo Tsukagoshi; 郁夫塚越; Shinji Takada; 信司高田; Koichi Goto; 晃一後藤
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2003-09-02
Filing date: 2004-09-02
Publication date: 2005-04-14
Anticipated expiration: 2024-09-02
Also published as: JP4882213B2

Abstract

PROBLEM TO BE SOLVED: To reliably adjust lip sync between video and audio on the decoder side without giving the viewer a sense of incongruity.

SOLUTION: A plurality of video frames VF1 and a plurality of audio frames AF1 are obtained and accumulated by receiving and decoding a plurality of encoded video frames, each with a video time stamp VTS attached, and a plurality of encoded audio frames, each with an audio time stamp ATS attached. Renderers 37 and 67 calculate the time difference caused by the deviation between the reference clock on the encoder side and the system time clock stc on the decoder side. In accordance with this time difference, and using as a reference the audio frame output timing at which the audio frames AF1 are output sequentially frame by frame, the video frame output timing of the video frames VF1 is adjusted frame by frame. Lip sync can thus be achieved while maintaining the continuity of the audio.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a content receiving apparatus, a video/audio output timing control method, and a content providing system, and is suitable for application to, for example, eliminating lip-sync deviation between video and audio on the decoder side that receives content.

Conventionally, when a content receiving apparatus receives content from a server on the encoder side and decodes it, the apparatus separates the content into the video packets and audio packets that constitute it, decodes each, and then outputs video frames and audio frames based on the video time stamps attached to the video packets and the audio time stamps attached to the audio packets. The output timing of video and audio is thereby matched, that is, lip-synced (see, for example, Patent Document 1 and Patent Document 2).
Japanese Patent Laid-Open No. 8-280008; Japanese Patent Laid-Open No. 2004-15553

In a content receiving apparatus of this configuration, however, the system time clock on the decoder side and the reference clock on the encoder side are not necessarily synchronized with each other. In addition, clock jitter and the like in the system time clock on the decoder side may cause a slight deviation in clock frequency from the reference clock on the encoder side.

Furthermore, because video frames and audio frames differ in data length, when the system time clock on the decoder side and the reference clock on the encoder side are not completely synchronized, the output timing of video and audio does not match even if the video frames and audio frames are output based on the video time stamps and audio time stamps, and lip sync is lost.

The present invention has been made in view of the above points, and proposes a content receiving apparatus, a video/audio output timing control method, and a content providing system that can reliably adjust lip sync between video and audio on the decoder side without causing the user, who is the viewer, to feel any incongruity.

To solve this problem, the present invention provides: decoding means for receiving, from a content providing apparatus on the encoder side, a plurality of encoded video frames to which video time stamps based on a reference clock on the encoder side are sequentially attached and a plurality of encoded audio frames to which audio time stamps based on the reference clock are sequentially attached, and for decoding them; storage means for accumulating the plurality of video frames and the plurality of audio frames obtained by the decoding means decoding the encoded video frames and encoded audio frames; calculation means for calculating the time difference caused by the deviation between the clock frequency of the reference clock on the encoder side and the clock frequency of the system time clock on the decoder side; and timing adjustment means for adjusting, in accordance with the time difference, the video frame output timing at which the plurality of video frames are output sequentially frame by frame, using as a reference the audio frame output timing at which the plurality of audio frames are output sequentially frame by frame.

By adjusting the video frame output timing with the audio frame output timing as a reference, in accordance with the time difference caused by the clock frequency deviation between the reference clock on the encoder side and the system time clock on the decoder side, the difference in clock frequency between the encoder side and the decoder side is absorbed, and the video frame output timing can be matched to the audio frame output timing to achieve lip sync.

The present invention also provides: a decoding step of causing decoding means to receive, from a content providing apparatus on the encoder side, a plurality of encoded video frames to which video time stamps based on a reference clock on the encoder side are sequentially attached and a plurality of encoded audio frames to which audio time stamps based on the reference clock are sequentially attached, and to decode them; a storage step of causing storage means to accumulate the plurality of video frames and the plurality of audio frames obtained by decoding the encoded video frames and encoded audio frames in the decoding step; a difference calculation step of causing calculation means to calculate the time difference caused by the deviation between the clock frequency of the reference clock on the encoder side and the clock frequency of the system time clock on the decoder side; and a timing adjustment step of causing timing adjustment means to adjust, in accordance with the time difference, the video frame output timing at which the plurality of video frames are output sequentially frame by frame, using as a reference the audio frame output timing at which the plurality of audio frames are output sequentially frame by frame.

The present invention further provides a content providing system having a content providing apparatus and a content receiving apparatus. The content providing apparatus comprises encoding means for generating a plurality of encoded video frames to which video time stamps based on a reference clock on the encoder side are attached and a plurality of encoded audio frames to which audio time stamps based on the reference clock are attached, and transmission means for sequentially transmitting the plurality of encoded video frames and the plurality of encoded audio frames to the content receiving apparatus. The content receiving apparatus comprises: decoding means for receiving, from the content providing apparatus on the encoder side, the plurality of encoded video frames to which the video time stamps are sequentially attached and the plurality of encoded audio frames to which the audio time stamps are sequentially attached, and for decoding them; storage means for accumulating the plurality of video frames and the plurality of audio frames obtained by the decoding means decoding the encoded video frames and encoded audio frames; calculation means for calculating the time difference caused by the deviation between the clock frequency of the reference clock on the encoder side and the clock frequency of the system time clock on the decoder side; and timing adjustment means for adjusting, in accordance with the time difference, the video frame output timing at which the plurality of video frames are output sequentially frame by frame, using as a reference the audio frame output timing at which the plurality of audio frames are output sequentially frame by frame.

As described above, according to the present invention, the video frame output timing is adjusted with the audio frame output timing as a reference, in accordance with the time difference caused by the clock frequency deviation between the reference clock on the encoder side and the system time clock on the decoder side. The difference in clock frequency between the encoder side and the decoder side is thereby absorbed, and the video frame output timing can be matched to the audio frame output timing to achieve lip sync. It is thus possible to realize a content receiving apparatus, a video/audio output timing control method, and a content providing system that can reliably adjust lip sync between video and audio on the decoder side without causing the user, who is the viewer, to feel any incongruity.

An embodiment of the present invention is described in detail below with reference to the drawings.

(1) Overall Configuration of Content Providing System
In FIG. 1, reference numeral 1 denotes the content providing system of the present invention as a whole. It broadly consists of a content providing apparatus 2 on the content distribution side, and a first content receiving apparatus 3 and a second content receiving apparatus 4 on the content receiving side.

In the content providing system 1, the content providing apparatus 2 and a Web server 14 are connected to the first content receiving apparatus 3 via the Internet 5. A Web browser 15 in the first content receiving apparatus 3 analyzes the URL (Uniform Resource Locator) of the content distribution source and the metadata relating to the content, both obtained from the Web server 14 via the Internet 5, and supplies the metadata and URL to a streaming decoder 9.

The streaming decoder 9 accesses a streaming server 8 of the content providing apparatus 2 based on the URL analyzed by the Web browser 15, and issues a distribution request for the content desired by the user.

The content providing apparatus 2 encodes in advance, with an encoder 7, the content data corresponding to the content desired by the user, packetizes the resulting elementary stream with the streaming server 8, and distributes it to the first content receiving apparatus 3 via the Internet 5.

The content providing system 1 can thereby realize pre-encoded streaming such as video on demand (VOD), in which the content desired by the user is distributed from the content providing apparatus 2 in response to a request from the first content receiving apparatus 3.

The first content receiving apparatus 3 restores the original video and audio by decoding the elementary stream with the streaming decoder 9, and outputs the original video and audio from a monitor 10.

In the content providing system 1, the first content receiving apparatus 3 and the second content receiving apparatus 4 are also connected by a wireless LAN 6 conforming to a standard such as IEEE (Institute of Electrical and Electronics Engineers) 802.11a/b/g. The first content receiving apparatus 3 can encode externally supplied content in real time with a real-time streaming encoder 11 and then relay it wirelessly to the second content receiving apparatus 4. Such content includes terrestrial digital, BS (Broadcast Satellite)/CS (Communication Satellite) digital, or terrestrial analog broadcasts, as well as content from a DVD (Digital Versatile Disc), a VideoCD, or an ordinary video camera.

Incidentally, the first content receiving apparatus 3 and the second content receiving apparatus 4 do not necessarily have to be connected by the wireless LAN 6, and may instead be connected by a wired LAN.

The second content receiving apparatus 4 performs streaming playback by decoding the content received from the first content receiving apparatus 3 with a real-time streaming decoder 12, and outputs the playback result to a monitor 13.

Thus, between the first content receiving apparatus 3 and the second content receiving apparatus 4, externally supplied content is encoded in real time by the real-time streaming encoder 11 in the first content receiving apparatus 3, transmitted to the second content receiving apparatus 4, and played back by streaming in the second content receiving apparatus 4, whereby live streaming can be realized.

(2) Configuration of Content Providing Apparatus
As shown in FIG. 2, the content providing apparatus 2 consists of the encoder 7 and the streaming server 8. A video signal VS1 taken in from outside is converted to digital form via a video input unit 21 and then sent to a video encoder 22 as video data VD1.

The video encoder 22 compresses and encodes the video data VD1 by a predetermined compression encoding method conforming to, for example, the MPEG-1/2/4 (Moving Picture Experts Group) standards, or by any of various other compression encoding schemes, and sends the resulting video elementary stream VES1 to a video ES storage unit 23, which is a ring buffer.

The video ES storage unit 23 temporarily accumulates the video elementary stream VES1 and then sends it to a packet generation unit 27 and a video frame counter 28 of the streaming server 8.

The video frame counter 28 counts the video elementary stream VES1 in units of the frame frequency (29.97 Hz, 30 Hz, 59.94 Hz, or 60 Hz), converts the count value into a value in 90 kHz units based on the reference clock, and sends it to the packet generation unit 27 as a 32-bit video time stamp VTS (VTS1, VTS2, VTS3, ...) for each video frame.
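The conversion performed by the video frame counter 28 can be illustrated with a minimal sketch. The function name, the zero-based frame index, and the treatment of "29.97 Hz" as exactly 30000/1001 are assumptions made for illustration; the text only states that a frame count is converted to 90 kHz units and expressed in 32 bits.

```python
from fractions import Fraction

TICKS_PER_SECOND = 90_000   # 90 kHz unit named in the text
TS_MODULUS = 1 << 32        # time stamps are expressed in 32 bits


def video_time_stamp(frame_index: int, frame_rate: Fraction) -> int:
    """Convert a frame count into 90 kHz ticks, wrapped to 32 bits.

    frame_index is assumed to start at 0 (hypothetical convention).
    """
    ticks = Fraction(frame_index) * TICKS_PER_SECOND / frame_rate
    return int(ticks) % TS_MODULUS


# Assumption: 29.97 Hz is taken to mean the exact ratio 30000/1001.
NTSC_RATE = Fraction(30000, 1001)

vts = [video_time_stamp(n, NTSC_RATE) for n in range(4)]
print(vts)  # [0, 3003, 6006, 9009]
```

At 30 Hz the same function yields a step of 3000 ticks per frame, so successive time stamps advance by one frame period expressed in reference-clock ticks.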

The content providing apparatus 2 also converts an audio signal AS1 taken in from outside to digital form via an audio input unit 24 of the encoder 7, and then sends it to an audio encoder 25 as audio data AD1.

The audio encoder 25 compresses and encodes the audio data AD1 by a predetermined compression encoding method conforming to the MPEG-1/2/4 audio standards, or by any of various other compression encoding schemes, and sends the resulting audio elementary stream AES1 to an audio ES storage unit 26, which is a ring buffer.

The audio ES storage unit 26 temporarily accumulates the audio elementary stream AES1 and then sends it to the packet generation unit 27 and an audio frame counter 29 of the streaming server 8.

Like the video frame counter 28, the audio frame counter 29 converts the count value of the audio frames into a value in 90 kHz units based on the reference clock shared with video, expresses it in 32 bits as an audio time stamp ATS (ATS1, ATS2, ATS3, ...) for each audio frame, and sends it to the packet generation unit 27.

The packet generation unit 27 divides the video elementary stream VES1 into packets of a predetermined data size and generates video packets by adding video header information to each packet. Likewise, it divides the audio elementary stream AES1 into packets of a predetermined data size and generates audio packets by adding audio header information to each packet.

As shown in FIG. 3, each audio packet and video packet consists of an IP (Internet Protocol) header for host-to-host communication in the Internet layer, a TCP (Transmission Control Protocol) header for transmission control in the transport layer, an RTP (Real-time Transport Protocol) header for real-time data transfer control, and an RTP payload. The audio time stamp ATS or video time stamp VTS described above is written into the 4-byte time stamp field in the RTP header.
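As a rough illustration of where the 4-byte time stamp sits, the following sketch packs and reads the 12-byte fixed RTP header using Python's standard `struct` module. The field layout follows the general RTP fixed header; the function names, the payload type 96, and the SSRC value are assumptions for illustration and do not come from the patent text.

```python
import struct


def pack_rtp_header(payload_type: int, sequence: int,
                    timestamp: int, ssrc: int, marker: bool = False) -> bytes:
    """Build a 12-byte fixed RTP header (no CSRC list, no extension)."""
    byte0 = 2 << 6                                   # version 2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1,
                       sequence & 0xFFFF,
                       timestamp & 0xFFFFFFFF,       # 4-byte time stamp field
                       ssrc & 0xFFFFFFFF)


def read_rtp_timestamp(header: bytes) -> int:
    """Extract the 32-bit time stamp from bytes 4..7 of the header."""
    return struct.unpack("!I", header[4:8])[0]


# Hypothetical values: dynamic payload type 96, arbitrary SSRC.
header = pack_rtp_header(payload_type=96, sequence=1,
                         timestamp=3003, ssrc=0x12345678)
print(len(header), read_rtp_timestamp(header))  # 12 3003
```

On the receiving side, reading bytes 4 to 7 in network byte order is enough to recover the VTS or ATS value without parsing the payload.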

The packet generation unit 27 (FIG. 2) then generates video packet data of a predetermined number of bytes based on the video packets and the video time stamps VTS, and audio packet data of a predetermined number of bytes based on the audio packets and the audio time stamps ATS. It multiplexes these to generate multiplexed data MXD1, which it sends to a packet data storage unit 30.

When a predetermined amount of the multiplexed data MXD1 has accumulated, the packet data storage unit 30 transmits the multiplexed data MXD1 to the first content receiving apparatus 3 via the Internet 5 using RTP/TCP (Real-time Transport Protocol/Transmission Control Protocol).

(3) Module Configuration of Streaming Decoder in First Content Receiving Apparatus
As shown in FIG. 4, the streaming decoder 9 of the first content receiving apparatus 3 temporarily accumulates the multiplexed data MXD1, transmitted from the content providing apparatus 2 by RTP/TCP, in an input packet storage unit 31 and then sends it to a packet division unit 32.

The input packet storage unit 31 sends the multiplexed data MXD1 to the packet division unit 32 once a predetermined number of packets of the multiplexed data MXD1 arriving via the Internet 5 have accumulated. This allows the packet division unit 32 in the subsequent stage to process the multiplexed data MXD1 continuously without interruption.
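One way to picture the behavior of the input packet storage unit 31 is a queue that withholds packets until a threshold has been reached. This is only an interpretive sketch: the class name, the threshold value, and the choice to stay "primed" after the threshold is first reached are assumptions, since the text specifies only that data is forwarded once a predetermined number of packets has accumulated.

```python
from collections import deque
from typing import Optional


class InputPacketStore:
    """Hold packets back until `threshold` packets have accumulated."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self._queue: deque = deque()
        self._primed = False      # becomes True once the threshold is reached

    def push(self, packet: bytes) -> None:
        self._queue.append(packet)
        if len(self._queue) >= self.threshold:
            self._primed = True

    def pop(self) -> Optional[bytes]:
        """Return the oldest packet, or None while still accumulating."""
        if not self._primed or not self._queue:
            return None
        return self._queue.popleft()


store = InputPacketStore(threshold=3)   # hypothetical threshold
store.push(b"p1")
early = store.pop()                      # nothing released yet
store.push(b"p2")
store.push(b"p3")
first = store.pop()                      # oldest packet is released
```

Holding back the first packets gives the downstream stage a cushion against network jitter, at the cost of a small start-up delay.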

The packet division unit 32 divides the multiplexed data MXD1 into video packet data VP1 and audio packet data AP1. It sends the audio packet data AP1 to an audio decoder 35 in units of audio frames via an input audio buffer 33, which is a ring buffer, and sends the video packet data VP1 to a video decoder 36 in units of frames via an input video buffer 34, which is also a ring buffer.

The input audio buffer 33 and the input video buffer 34 accumulate data until the audio decoder 35 and the video decoder 36 in the subsequent stage can continuously decode one audio frame of audio packet data AP1 and one video frame of video packet data VP1. They therefore have enough capacity to supply at least one audio frame and one video frame of data to the audio decoder 35 and the video decoder 36 instantly at any time.

The packet division unit 32 can recognize the video time stamps VTS and the audio time stamps ATS by analyzing the video header information of the video packet data VP1 and the audio header information of the audio packet data AP1, and sends the video time stamps VTS and audio time stamps ATS to a timing control circuit 37A of a renderer 37.

The audio decoder 35 restores the audio frames AF1 as they were before compression encoding by decoding the audio packet data AP1 in units of audio frames, and sends them sequentially to the renderer 37.

The video decoder 36 restores the video frames VF1 as they were before compression encoding by decoding the video packet data VP1 in units of video frames, and sends them sequentially to the renderer 37.

In the streaming decoder 9, content metadata MD is supplied from the Web browser 15 to a system controller 50. The system controller 50, serving as content determination means, determines from the metadata MD whether the content consists of audio and video, of video only, or of audio only, and sends the content type determination result CH to the renderer 37.
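The three-way determination made by the system controller 50 can be sketched as follows. The metadata format is not described in the text, so the dictionary keys `has_audio` and `has_video`, the enum, and the function name are all hypothetical.

```python
from enum import Enum


class ContentType(Enum):
    AUDIO_VIDEO = "audio+video"
    VIDEO_ONLY = "video"
    AUDIO_ONLY = "audio"


def classify_content(metadata: dict) -> ContentType:
    """Derive the content type result CH from (hypothetical) metadata flags."""
    has_audio = bool(metadata.get("has_audio"))
    has_video = bool(metadata.get("has_video"))
    if has_audio and has_video:
        return ContentType.AUDIO_VIDEO
    if has_video:
        return ContentType.VIDEO_ONLY
    if has_audio:
        return ContentType.AUDIO_ONLY
    raise ValueError("metadata describes neither audio nor video")


ch = classify_content({"has_audio": True, "has_video": True})
```

The renderer needs this result up front because, as the later sections explain, the content type decides which time stamp ends up presetting the system time clock.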

The renderer 37 temporarily stores the audio frames AF1 in an output audio buffer 38, which is a ring buffer, and likewise temporarily stores the video frames VF1 in an output video buffer 39, which is also a ring buffer.

To lip-sync the video of the video frames VF1 and the audio of the audio frames AF1 to be output to the monitor 10, the renderer 37 uses the timing control circuit 37A to adjust the final output timing based on the content type determination result CH from the system controller 50 and on the audio time stamps ATS and video time stamps VTS. It then sequentially outputs the video frames VF1 and audio frames AF1 from the output video buffer 39 and the output audio buffer 38 at that output timing.

(4) Lip-Sync Adjustment Processing on the Decoder Side in Pre-Encoded Streaming
(4-1) Method of Adjusting Output Timing of Video Frames and Audio Frames in Pre-Encoded Streaming
As shown in FIG. 5, the timing control circuit 37A of the renderer 37 temporarily stores the video time stamps VTS (VTS1, VTS2, VTS3, ..., VTSn) and audio time stamps ATS (ATS1, ATS2, ATS3, ..., ATSn) sent from the packet division unit 32 in buffers 42 and 43, respectively, and then sends them to a comparator circuit 46.

The timing control circuit 37A also sends only the first video time stamp VTS1 and the first audio time stamp ATS1 of the content to subtractor circuits 44 and 45, respectively.

The subtractor circuits 44 and 45 pull back the values of the first video time stamp VTS1 and audio time stamp ATS1 by a predetermined time, and send the results to an STC circuit 41 as a preset video time stamp VTSp and a preset audio time stamp ATSp.

The STC circuit 41 presets the value of the system time clock stc according to a preset sequence fixed in the order of the preset video time stamp VTSp followed by the preset audio time stamp ATSp. That is, the value of the system time clock stc is adjusted (replaced) first with the preset video time stamp VTSp and then with the preset audio time stamp ATSp.

Because the STC circuit 41 presets the value of the system time clock stc using the preset video time stamp VTSp and preset audio time stamp ATSp, which are the first video time stamp VTS1 and audio time stamp ATS1 pulled back by a predetermined time, the preset value of the system time clock stc supplied from the STC circuit 41 to the comparator circuit 46 indicates a time earlier than the video time stamp VTS1 and audio time stamp ATS1 when these first time stamps reach the comparator circuit 46 via the buffers 42 and 43.

As a result, in the comparator circuit 46 of the timing control circuit 37A, the preset value of the system time clock stc is never already past the first video time stamp VTS1 and audio time stamp ATS1, so the video frame Vf1 and audio frame Af1 corresponding to the first video time stamp VTS1 and audio time stamp ATS1 can also be output reliably.
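The pull-back performed by the subtractor circuits 44 and 45 can be sketched as a modular subtraction on 32-bit time stamps. The pull-back amount of 9000 ticks (100 ms at 90 kHz) is purely an illustrative assumption, as the text says only "a predetermined time"; the function name is also hypothetical.

```python
TS_MODULUS = 1 << 32          # time stamps are 32-bit values


def preset_time_stamp(first_ts: int, pullback_ticks: int) -> int:
    """Pull the first time stamp back by `pullback_ticks`, wrapping at 32 bits."""
    return (first_ts - pullback_ticks) % TS_MODULUS


PULLBACK = 9000               # assumption: 100 ms expressed in 90 kHz ticks

vts1 = 45_000                 # hypothetical first video time stamp
vtsp = preset_time_stamp(vts1, PULLBACK)

# After the preset, stc starts PULLBACK ticks before VTS1, so the first
# frame's time stamp still lies in the future when it reaches the comparator.
lead = (vts1 - vtsp) % TS_MODULUS
print(vtsp, lead)             # 36000 9000
```

The modulo keeps the result valid even when the first time stamp is smaller than the pull-back amount, in which case the preset value wraps to just below 2^32.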

実際上、図６（Ａ）及び（Ｂ）に示すように、コンテンツの種類がオーディオ及びビデオからなるものである場合、システムタイムクロックｓｔｃの値をプリセット用ビデオタイムスタンプＶＴＳｐ、プリセット用オーディオタイムスタンプＡＴＳｐの順番で決められたプリセットシーケンスに従ってプリセットすると、プリセット用ビデオタイムスタンプＶＴＳｐでシステムタイムクロックｓｔｃの値をプリセットした後に必ずプリセット用オーディオタイムスタンプＡＴＳｐで先程のプリセット値が更新されることを意味する。 In practice, as shown in FIGS. 6A and 6B, when the content type is composed of audio and video, the value of the system time clock stc is set to the preset video time stamp VTSp, and the preset audio time stamp. If presetting is performed according to a preset sequence determined in the order of ATSp, it means that the preset value is always updated with the audio time stamp ATSp for presetting after the value of the system time clock stc is preset with the video time stamp VTSp for presetting. .

At this time, the comparator circuit 46 compares the video time stamp VTS against the system time clock stc whose preset value has been updated with the preset audio time stamp ATSp, and thereby calculates the time difference between the preset value of the system time clock stc and the video time stamp VTS attached by the content providing apparatus 2 on the encoder side.

一方、コンテンツの種類がオーディオだけからなるものである場合にはプリセット用ビデオタイムスタンプＶＴＳｐがタイミングコントロール回路３７Ａに送られてくることはないので、システムタイムクロックｓｔｃの値をプリセット用ビデオタイムスタンプＶＴＳｐ、プリセット用オーディオタイムスタンプＡＴＳｐの順番で決められたプリセットシーケンスに従えば、当然プリセット用オーディオタイムスタンプＡＴＳｐでシステムタイムクロックｓｔｃの値がプリセットされることを意味する。 On the other hand, if the content type is composed only of audio, the preset video time stamp VTSp is not sent to the timing control circuit 37A, so the value of the system time clock stc is set as the preset video time stamp VTSp. According to the preset sequence determined in the order of the preset audio time stamp ATSp, it naturally means that the value of the system time clock stc is preset with the preset audio time stamp ATSp.

同様に、コンテンツの種類がビデオだけからなるものである場合にはプリセット用オーディオタイムスタンプＡＴＳｐがタイミングコントロール回路３７Ａに送られてくることはないので、システムタイムクロックｓｔｃの値をプリセット用ビデオタイムスタンプＶＴＳｐ、プリセット用オーディオタイムスタンプＡＴＳｐの順番で決められたプリセットシーケンスに従えば、当然プリセット用ビデオタイムスタンプＶＴＳｐでシステムタイムクロックｓｔｃの値がプリセットされることを意味する。 Similarly, when the content type is composed only of video, the preset audio time stamp ATSp is not sent to the timing control circuit 37A, so the value of the system time clock stc is set as the preset video time stamp. This means that if the preset sequence determined in the order of VTSp and preset audio time stamp ATSp is followed, the value of the system time clock stc is preset with the preset video time stamp VTSp.

This applies only when the content consists of audio alone or of video alone. In these cases there is no particular need to adjust the lip sync between video and audio, so it suffices to output the audio frame AF1 when the value of the system time clock stc preset with the preset audio time stamp ATSp matches the audio time stamp ATS, and to output the video frame VF1 when the value of the system time clock stc preset with the preset video time stamp VTSp matches the video time stamp VTS.
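The preset sequence for the three content types can be sketched as follows. The function name is hypothetical, and `None` is used here to stand for a preset time stamp that never arrives.

```python
from typing import Optional


def run_preset_sequence(vtsp: Optional[int], atsp: Optional[int]) -> Optional[int]:
    """Apply the fixed order VTSp -> ATSp and return the final value of stc.

    None models a preset time stamp that is never delivered
    (audio-only or video-only content).
    """
    stc = None
    if vtsp is not None:
        stc = vtsp  # first preset with the video value
    if atsp is not None:
        stc = atsp  # the audio value, when present, always overrides it
    return stc


audio_and_video = run_preset_sequence(vtsp=81000, atsp=81500)
audio_only = run_preset_sequence(vtsp=None, atsp=81500)
video_only = run_preset_sequence(vtsp=81000, atsp=None)
```

With both streams present the audio value wins, which is what makes the audio frames the reference for lip-sync adjustment.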

In practice, when the content consists of audio and video, for example, the timing control circuit 37A of the renderer 37 operates as shown in FIG. 7. At the time points Ta1, Ta2, Ta3, ... at which the audio frames AF1 (Af1, Af2, Af3, ...) decoded by the audio decoder 35 are sequentially output to the monitor 10, it presets the value of the system time clock stc, supplied via the crystal oscillator circuit 40 (FIG. 4) and the STC circuit 41, in the order of the preset video time stamp VTSp and then the preset audio time stamp ATSp, so that the value of the system time clock stc finally matches the preset audio time stamps ATSp1, ATSp2, ATSp3, ....

The reason is that interruptions or skips in the sound during playback are highly noticeable to the user. The timing control circuit 37A of the renderer 37 therefore needs to use the audio frames AF1 (Af1, Af2, Af3, ...) as the reference for the lip-sync adjustment processing and to adjust the output timing of the video frames VF1 (Vf1, Vf2, Vf3, ...) to match the output of the audio frames AF1 (Af1, Af2, Af3, ...).

Once the output timing (time points Ta1, Ta2, Ta3, ...) of the audio frames AF1 (Af1, Af2, Af3, ...) is determined, the timing control circuit 37A of the renderer 37 uses the comparator circuit 46 to compare, at each of the time points Tv1, Tv2, Tv3, ... at which the video frames VF1 (Vf1, Vf2, Vf3, ...) are output at a frame frequency of 30 [Hz] based on the system time clock stc, the count value of the preset system time clock stc with the video time stamps VTS (VTS1, VTS2, VTS3, ...) attached to the video frames VF1 (Vf1, Vf2, Vf3, ...).

When the count value of the preset system time clock stc matches the video time stamp VTS (VTS1, VTS2, VTS3, ...), the comparator circuit 46 causes the video frame VF1 (Vf1, Vf2, Vf3, ...) to be output from the output video buffer 39 to the monitor 10.

The comparator circuit 46 compares the count value of the preset system time clock stc with the video time stamp VTS (VTS1, VTS2, VTS3, ...) sent from the buffer 42. If the resulting difference value D1 (time difference) between the count value of the preset system time clock stc and the video time stamp VTS (VTS1, VTS2, VTS3, ...) is equal to or less than a threshold value TH representing a predetermined time, the user cannot perceive any mismatch between video and audio. In that case the timing control circuit 37A simply outputs the video frame VF1 (Vf1, Vf2, Vf3, ...) to the monitor 10 as it is when the count value of the preset system time clock stc matches the video time stamp VTS (VTS1, VTS2, VTS3, ...).

Otherwise, for example at the time point Tv2, if the difference value D1 between the count value of the preset system time clock stc and the video time stamp VTS2 is larger than the predetermined threshold value TH and the video lags the audio, the video has failed to keep up with the audio because of the deviation between the clock frequency on the encoder side and the clock frequency on the decoder side. In this case the timing control circuit 37A of the renderer 37 skips, without decoding it, a video frame Vf3 (not shown) corresponding to, for example, a B picture in the GOP (Group Of Pictures), and outputs the next video frame Vf4.

この場合、レンダラー３７は出力ビデオバッファ３９に格納されている「Ｐ」ピクチャについては、ビデオデコーダ３６で次のピクチャをデコードする際の参照フレームとなるためスキップせず、次のピクチャを生成する際の参照フレームとならない非参照フレームである「Ｂ」ピクチャをスキップすることにより、画質劣化を未然に防ぎながらリップシンクさせるようになされている。 In this case, the renderer 37 does not skip the “P” picture stored in the output video buffer 39 because it is a reference frame when the video decoder 36 decodes the next picture. By skipping the “B” picture, which is a non-reference frame that does not become the reference frame, lip sync is performed while preventing image quality deterioration.
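A minimal sketch of the skip-candidate selection: only non-reference "B" pictures are eligible, while "I" and "P" pictures are left untouched. Representing the buffered pictures as a list of type letters is an assumption made for illustration; the function name is hypothetical.

```python
from typing import List, Optional


def find_skippable_picture(picture_types: List[str]) -> Optional[int]:
    """Return the index of the first "B" picture waiting in the buffer.

    "I" and "P" pictures serve as reference frames for later pictures and
    are never chosen. None means there is nothing that can safely be skipped.
    """
    for index, picture_type in enumerate(picture_types):
        if picture_type == "B":
            return index
    return None


with_b = find_skippable_picture(["I", "P", "B", "P"])
without_b = find_skippable_picture(["I", "P", "P"])
```

When the function returns `None`, a different catch-up method is needed, since skipping an "I" or "P" picture would damage the pictures predicted from it.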

However, if no "B" picture to be skipped exists in the output video buffer 39 and the buffer holds only "I" pictures and "P" pictures, the renderer 37 cannot skip a "B" picture, and the video can no longer be made to catch up with the audio.

Therefore, when no "B" picture to be skipped exists in the output video buffer 39, the renderer 37 exploits the fact that, as shown in FIG. 8, the monitor output timing of the monitor 10 is, for example, 60 [Hz] while the picture refresh timing of the video frame VF1 to be output from the output video buffer 39 is 30 [Hz], and shortens the picture refresh timing.

Specifically, when the difference value D1 between the count value of the system time clock stc preset with the preset audio time stamp ATSp and the video time stamp VTS exceeds 16.666... [msec], that is, when the monitor output timing lags the audio output timing by one frame or more, the renderer 37 changes the picture refresh timing from 30 [Hz] to 60 [Hz] and outputs the next, (N+1)-th, picture instead of skipping one video frame VF1.

In other words, for "I" pictures and "P" pictures, whose skipping would degrade image quality, the renderer 37 shortens the picture refresh interval from 1/30 second to 1/60 second, so that the video can catch up with the audio without the image quality degradation that skipping an "I" picture or a "P" picture would cause.
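The arithmetic behind the shortened interval can be sketched as follows. Each picture shown for 1/60 second instead of 1/30 second recovers 1/60 second of lag. The 90 kHz tick unit and the idea of counting how many pictures must be shortened are assumptions added for illustration; the description does not state how many pictures are output at the shortened interval.

```python
TICKS_PER_SECOND = 90000                       # assumed time stamp unit
NORMAL_INTERVAL = TICKS_PER_SECOND // 30       # 3000 ticks = 1/30 s
SHORTENED_INTERVAL = TICKS_PER_SECOND // 60    # 1500 ticks = 1/60 s
RECOVERED_PER_PICTURE = NORMAL_INTERVAL - SHORTENED_INTERVAL


def shortened_pictures_needed(lag_ticks: int) -> int:
    """Number of pictures to show at the shortened interval to absorb the lag."""
    if lag_ticks <= 0:
        return 0
    # Ceiling division without floating point.
    return -(-lag_ticks // RECOVERED_PER_PICTURE)


one_field_late = shortened_pictures_needed(1500)
three_fields_late = shortened_pictures_needed(4500)
```

Under these assumptions a lag of exactly one 60 [Hz] period is absorbed by a single shortened picture.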

Conversely, if at the time point Tv2 the difference value D1 between the count value of the preset system time clock stc and, for example, the video time stamp VTS2 is larger than the predetermined threshold value TH and the audio lags the video, the audio has failed to keep up with the video because of the deviation between the clock frequency on the encoder side and the clock frequency on the decoder side. In this case the timing control circuit 37A of the renderer 37 repeatedly outputs the video frame Vf2 that is currently being output.

一方、レンダラー３７のタイミングコントロール回路３７Ａでは、例えばコンテンツの種類がビデオだけからなるものである場合、ビデオデコーダ３６でデコードした後のビデオフレームＶＦ１（Ｖｆ１、Ｖｆ２、Ｖｆ３、……）をモニタ１０へ順次出力する時点Ｔｖ１、Ｔｖ２、Ｔｖ３、……、のタイミングでは、プリセット用ビデオタイムスタンプＶＴＳｐでプリセットされたシステムタイムクロックｓｔｃのカウント値とビデオタイムスタンプＶＴＳが一致したタイミングでビデオフレームＶＦ１（Ｖｆ１、Ｖｆ２、Ｖｆ３、……）をモニタ１０に出力すればよい。 On the other hand, in the timing control circuit 37A of the renderer 37, for example, when the content type is composed only of video, the video frame VF1 (Vf1, Vf2, Vf3,...) Decoded by the video decoder 36 is sent to the monitor 10. At the timing of the time points Tv1, Tv2, Tv3,..., Which are sequentially output, the video frame VF1 (Vf1, Vf1,. Vf2, Vf3, ...) may be output to the monitor 10.

同様に、レンダラー３７のタイミングコントロール回路３７Ａでは、例えばコンテンツの種類がオーディオだけからなるものである場合、オーディオデコーダ３５でデコードした後のオーディオフレームＡＦ１（Ａｆ１、Ａｆ２、Ａｆ３、……）をモニタ１０へ順次出力する時点Ｔａ１、Ｔａ２、Ｔａ３、……、のタイミングでは、プリセット用オーディオタイムスタンプＡＴＳｐでプリセットされたシステムタイムクロックｓｔｃのカウント値とオーディオタイムスタンプＡＴＳが一致したタイミングでオーディオフレームＡＦ１（Ａｆ１、Ａｆ２、Ａｆ３、……）をモニタ１０のスピーカから出力すればよい。 Similarly, the timing control circuit 37A of the renderer 37 monitors the audio frame AF1 (Af1, Af2, Af3,...) That has been decoded by the audio decoder 35, for example, when the content type is only audio. At the time points Ta1, Ta2, Ta3,... That are sequentially output to the audio frame AF1 (Af1) at the timing when the count value of the system time clock stc preset by the preset audio time stamp ATSp matches the audio time stamp ATS. , Af2, Af3,...) May be output from the speaker of the monitor 10.

(4-2) Lip-sync adjustment processing procedure in pre-encoded streaming
The output timing adjustment method described above, in which the timing control circuit 37A of the renderer 37 in the streaming decoder 9 lip-syncs video and audio by adjusting the output timing of the video frames VF1 (Vf1, Vf2, Vf3, ...) with the audio frames AF1 (Af1, Af2, Af3, ...) as the reference, is summarized in the flowchart of FIG. 9. The timing control circuit 37A of the renderer 37 enters the routine RT1 at its start step and proceeds to step SP1.

ステップＳＰ１においてレンダラー３７は、システムタイムクロックｓｔｃの値をプリセット用ビデオタイムスタンプＶＴＳｐ、プリセット用オーディオタイムスタンプＡＴＳｐの順番で決められたプリセットシーケンスに従ってプリセットし、次のステップＳＰ２へ移る。 In step SP1, the renderer 37 presets the value of the system time clock stc in accordance with a preset sequence determined in the order of the preset video time stamp VTSp and the preset audio time stamp ATSp, and proceeds to the next step SP2.

ここでレンダラー３７は、コンテンツの種類がオーディオ及びビデオでなるものであるときにはプリセット用ビデオタイムスタンプＶＴＳｐでシステムタイムクロックｓｔｃの値をプリセットした後に必ずプリセット用オーディオタイムスタンプＡＴＳｐで先程のプリセット値を更新し、次のステップＳＰ２へ移る。 Here, the renderer 37 always updates the preset value with the preset audio time stamp ATSp after presetting the value of the system time clock stc with the preset video time stamp VTSp when the content type is audio and video. Then, the process proceeds to the next step SP2.

この場合、オーディオフレームＡＦ１（Ａｆ１、Ａｆ２、Ａｆ３、……）をモニタ１０へ出力する時点Ｔａ１、Ｔａ２、Ｔａ３、……のタイミングで（図７）、システムタイムクロックｓｔｃの値とプリセット用オーディオタイムスタンプＡＴＳｐ（ＡＴＳｐ１、ＡＴＳｐ２、ＡＴＳｐ３、……）とが一致することになる。 In this case, at the time points Ta1, Ta2, Ta3,... When the audio frame AF1 (Af1, Af2, Af3,...) Is output to the monitor 10 (FIG. 7), the value of the system time clock stc and the preset audio time. The stamp ATSp (ATSp1, ATSp2, ATSp3,...) Matches.

When the content consists of video only, no preset audio time stamp ATSp exists, so the renderer 37 presets the value of the system time clock stc with the preset video time stamp VTSp and proceeds to the next step SP2 once a predetermined time has elapsed.

さらにレンダラー３７は、コンテンツの種類がオーディオだけからなるものである場合には、プリセット用ビデオタイムスタンプＶＴＳｐは存在しないので、プリセット用ビデオタイムスタンプＶＴＳｐを待つことなくプリセット用オーディオタイムスタンプＡＴＳｐが到達した時点でシステムタイムクロックｓｔｃの値をプリセットした後に次のステップＳＰ２へ移る。 Furthermore, when the content type is composed only of audio, the renderer 37 does not have the preset video time stamp VTSp, so that the preset audio time stamp ATSp arrives without waiting for the preset video time stamp VTSp. After presetting the value of the system time clock stc at the time, the process proceeds to the next step SP2.

ステップＳＰ２においてレンダラー３７は、システムコントローラ５０から供給されるコンテンツ種類判別結果ＣＨに基づいて当該コンテンツがビデオのみでなるものか否かを判定し、肯定結果が得られると次のステップＳＰ３へ移る。 In step SP2, the renderer 37 determines whether or not the content is only a video based on the content type determination result CH supplied from the system controller 50. If a positive result is obtained, the renderer 37 proceeds to the next step SP3.

ステップＳＰ３においてレンダラー３７は、当該コンテンツがビデオのみでなるため、プリセット用ビデオタイムスタンプＶＴＳｐでプリセットしたシステムタイムクロックｓｔｃのカウント値とビデオタイムスタンプＶＴＳとが一致したときにビデオフレームＶＦ１（Ｖｆ１、Ｖｆ２、Ｖｆ３、……）をモニタ１０へ出力し、次のステップＳＰ１２へ移って処理を終了する。 In step SP3, the renderer 37 makes the video frame VF1 (Vf1, Vf2) when the count value of the system time clock stc preset by the preset video time stamp VTSp matches the video time stamp VTS because the content is only video. , Vf3,...) Are output to the monitor 10, and the process proceeds to the next step SP12 to end the process.

これに対してステップＳＰ２で否定結果が得られると、このことはコンテンツの種類がビデオのみでなるものではなく、オーディオ及びビデオでなるものか、オーディオのみでなるものかの何れかであることを表しており、このときレンダラー３７は次のステップＳＰ４へ移る。 On the other hand, if a negative result is obtained in step SP2, this means that the type of content is not only video, but is either audio and video or only audio. At this time, the renderer 37 proceeds to the next step SP4.

ステップＳＰ４においてレンダラー３７は、コンテンツ種類判別結果ＣＨに基づいて当該コンテンツがオーディオのみでなるものか否かを判定し、肯定結果が得られると次のステップＳＰ３へ移る。 In step SP4, the renderer 37 determines whether or not the content is only audio based on the content type determination result CH. If a positive result is obtained, the renderer 37 proceeds to the next step SP3.

In step SP3, since the content consists of audio only, the renderer 37 outputs the audio frame AF1 (Af1, Af2, Af3, ...) from the speaker of the monitor 10 when the count value of the system time clock stc preset with the preset audio time stamp ATSp matches the audio time stamp ATS, and then proceeds to step SP12 to end the processing.

これに対してステップＳＰ４で否定結果が得られると、このことはコンテンツの種類がオーディオ及びビデオからなるものであることを表しており、このときレンダラー３７は次のステップＳＰ５へ移る。 On the other hand, if a negative result is obtained in step SP4, this indicates that the content type is composed of audio and video. At this time, the renderer 37 proceeds to the next step SP5.

In step SP5, since the content consists of audio and video, the renderer 37 calculates the difference value D1 (= stc - VTS) between the count value of the system time clock stc, which has finally been preset with the preset audio time stamp ATSp, and the time stamp VTS (VTS1, VTS2, VTS3, ...) of the video frame VF1 (Vf1, Vf2, Vf3, ...) to be output at the time points Tv1, Tv2, Tv3, ..., and proceeds to the next step SP6.

In step SP6, the renderer 37 determines whether the difference value D1 (absolute value) calculated in step SP5 is larger than the predetermined threshold value TH. A negative result here indicates that the difference value D1 is no more than a time (for example, 100 [msec]) within which a user watching and listening cannot perceive any deviation between the video and the audio, and the renderer 37 then proceeds to step SP3.

In step SP3, since the time difference is too small for the video and audio to be perceived as out of sync, the renderer 37 outputs the video frame VF1 to the monitor 10 as it is, and in principle also outputs the audio frame AF1 to the monitor 10 as it is, and then proceeds to step SP12 to end the processing.

これに対してステップＳＰ６で肯定結果が得られると、このことは差分値Ｄ１が所定の閾値ＴＨよりも大きい、すなわち映像及び音声を見て聞いたユーザにとって当該映像と当該音声との間にずれが生じていると判断し得る程度であることを表しており、このときレンダラー３７は次のステップＳＰ７へ移る。 On the other hand, if an affirmative result is obtained in step SP6, this means that the difference value D1 is larger than the predetermined threshold value TH, that is, the difference between the video and the audio for the user who has seen and listened to the video and the audio. In this case, the renderer 37 proceeds to the next step SP7.

ステップＳＰ７においてレンダラー３７は、映像が音声よりも遅れているか否かをオーディオタイムスタンプＡＴＳ及びビデオタイムスタンプＶＴＳに基づいて判定し、否定結果が得られると次のステップＳＰ８へ移る。 In step SP7, the renderer 37 determines whether the video is behind the audio based on the audio time stamp ATS and the video time stamp VTS. If a negative result is obtained, the renderer 37 proceeds to the next step SP8.

In step SP8, since the video is ahead of the audio, the renderer 37 repeatedly outputs the video frame VF1 constituting the picture currently being output so that the audio catches up with the video, and then proceeds to step SP12 to end the processing.

これに対してステップＳＰ７で肯定結果が得られると、このことは映像が音声よりも遅れていることを表しており、このときレンダラー３７は次のステップＳＰ９へ移り、出力ビデオバッファ３９にスキップ対象の「Ｂ」ピクチャが存在するか否かを判定し、肯定結果が得られると次のステップＳＰ１０へ移る。 On the other hand, if an affirmative result is obtained in step SP7, this indicates that the video is delayed from the audio. At this time, the renderer 37 moves to the next step SP9 and skips to the output video buffer 39. It is determined whether or not a “B” picture exists, and if a positive result is obtained, the process proceeds to the next step SP10.

In step SP10, the renderer 37 skips the B picture (in this case, the video frame Vf3) without decoding it, thereby recovering the delay of the video relative to the audio and achieving lip sync, and then proceeds to step SP12 to end the processing.

一方、ステップＳＰ９で否定結果が得られると、このことは出力ビデオバッファ３９にスキップ対象の「Ｂ」ピクチャが存在せず、「Ｂ」ピクチャをスキップすることができないことを表しており、このときレンダラー３７は次のステップＳＰ１１へ移る。 On the other hand, if a negative result is obtained in step SP9, this indicates that the “B” picture to be skipped does not exist in the output video buffer 39, and the “B” picture cannot be skipped. The renderer 37 proceeds to the next step SP11.

In step SP11, as shown in FIG. 8, the renderer 37 exploits the fact that the monitor output timing of the monitor 10 is 60 [Hz] while the picture refresh timing of the video frame VF1 is 30 [Hz], and shortens the picture refresh timing to match the monitor output timing of the monitor 10. The video thereby catches up with the audio without the image quality degradation that skipping a picture would cause, and the renderer 37 proceeds to step SP12 to end the processing.
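The branches of routine RT1 (steps SP2 through SP11) can be condensed into a single decision function. This is a minimal sketch under stated assumptions, not the circuit itself: time stamps are taken to be 90 kHz ticks, the threshold TH is the 100 [msec] example value, and "video lags audio" is inferred from a positive D1 = stc - VTS, whereas step SP7 of the description makes that decision from the audio and video time stamps. All names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    OUTPUT_AS_IS = "SP3"
    REPEAT_CURRENT_FRAME = "SP8"
    SKIP_B_PICTURE = "SP10"
    SHORTEN_REFRESH = "SP11"


TH_TICKS = 9000  # 100 ms at an assumed 90 kHz time stamp unit


def decide(has_video: bool, has_audio: bool, stc: int, vts: int,
           b_picture_available: bool, th: int = TH_TICKS) -> Action:
    """Choose the output action for one video frame."""
    # SP2 / SP4: single-stream content needs no lip-sync adjustment.
    if not (has_video and has_audio):
        return Action.OUTPUT_AS_IS
    d1 = stc - vts                      # SP5
    if abs(d1) <= th:                   # SP6: deviation is imperceptible
        return Action.OUTPUT_AS_IS
    video_lags_audio = d1 > 0           # SP7 (assumed sign convention)
    if not video_lags_audio:
        return Action.REPEAT_CURRENT_FRAME       # SP8
    if b_picture_available:             # SP9
        return Action.SKIP_B_PICTURE             # SP10
    return Action.SHORTEN_REFRESH                # SP11


late_with_b = decide(True, True, stc=100000, vts=80000, b_picture_available=True)
late_without_b = decide(True, True, stc=100000, vts=80000, b_picture_available=False)
```

Audio output is never altered in any branch, which reflects the choice of the audio frames as the timing reference.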

(5) Circuit configuration of the real-time streaming encoder in the first content receiving apparatus
The first content receiving apparatus 3 (FIG. 1) can also act as a content provider: it encodes, in real time with the real-time streaming encoder 11, externally supplied content such as terrestrial digital, BS/CS digital or terrestrial analog broadcasts, or content from a DVD, VideoCD or an ordinary video camera, and then relays it by wireless transmission to the second content receiving apparatus 4.

その第１のコンテンツ受信装置３におけるリアルタイムストリーミングエンコーダ１１の回路構成について図１０を用いて説明する。リアルタイムストリーミングエンコーダ１１は、外部から供給されたコンテンツを構成するビデオ信号ＶＳ２及びオーディオ信号ＡＳ２をビデオ入力部５１及びオーディオ入力部５３を介してディジタル変換し、これをビデオデータＶＤ２及びオーディオデータＡＤ２としてビデオエンコーダ５２及びオーディオエンコーダ５４へ送出する。 The circuit configuration of the real-time streaming encoder 11 in the first content receiving device 3 will be described with reference to FIG. The real-time streaming encoder 11 digitally converts the video signal VS2 and the audio signal AS2 constituting the content supplied from the outside through the video input unit 51 and the audio input unit 53, and converts this into video data VD2 and audio data AD2. The data is sent to the encoder 52 and the audio encoder 54.

ビデオエンコーダ５２は、ビデオデータＶＤ２を例えばＭＰＥＧ1/2/4の規格に準拠した所定の圧縮符号化方法あるいは種々の圧縮符号化方式で圧縮符号化し、その結果得られるビデオエレメンタリストリームＶＥＳ２をパケット生成部５６及びビデオフレームカウンタ５７へ送出する。 The video encoder 52 compresses and encodes the video data VD2 using a predetermined compression encoding method or various compression encoding methods compliant with, for example, the MPEG1 / 2/4 standard, and generates the resulting video elementary stream VES2 as a packet. The data is sent to the unit 56 and the video frame counter 57.

The video frame counter 57 counts the video elementary stream VES2 in units of the frame frequency (29.97 [Hz], 30 [Hz], 59.94 [Hz] or 60 [Hz]), converts the count value into a value in units of 90 [kHz] based on the reference clock, and sends it to the packet generation unit 56 in 32-bit representation as the video time stamp VTS (VTS1, VTS2, VTS3, ...) for each video frame.
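The conversion from a frame count to a 90 [kHz] time stamp can be sketched as below. Treating 29.97 [Hz] and 59.94 [Hz] as the exact ratios 30000/1001 and 60000/1001 is an assumption (the usual NTSC-derived rates), as is wrapping the result to 32 bits by simple modulo; the function name is hypothetical.

```python
from fractions import Fraction

CLOCK_HZ = 90000
NTSC_29_97 = Fraction(30000, 1001)   # assumed exact value of "29.97 Hz"
NTSC_59_94 = Fraction(60000, 1001)   # assumed exact value of "59.94 Hz"


def video_timestamp(frame_count: int, frame_rate: Fraction) -> int:
    """Convert a video frame count to a 32-bit time stamp in 90 kHz ticks."""
    ticks = Fraction(frame_count) * CLOCK_HZ / frame_rate
    return int(ticks) % (1 << 32)    # truncate, then wrap to 32 bits


vts_after_one_ntsc_frame = video_timestamp(1, NTSC_29_97)
vts_after_one_second_30hz = video_timestamp(30, Fraction(30))
```

Using exact fractions avoids the rounding drift that would accumulate if 29.97 were handled as a floating-point value.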

オーディオエンコーダ５４は、オーディオデータＡＤ２をＭＰＥＧ1/2/4オーディオの規格に準拠した所定の圧縮符号化方法あるいは種々の圧縮符号化方式で圧縮符号化し、その結果得られるオーディオエレメンタリストリームＡＥＳ２をパケット生成部５６及びオーディオフレームカウンタ５８へ送出する。 The audio encoder 54 compresses and encodes the audio data AD2 by a predetermined compression encoding method or various compression encoding methods compliant with the MPEG1 / 2/4 audio standard, and generates a packet of the resulting audio elementary stream AES2 The data is sent to the unit 56 and the audio frame counter 58.

Like the video frame counter 57, the audio frame counter 58 converts the count value of the audio frames into a value in units of 90 [kHz] based on the common reference clock, expresses it in 32 bits as the audio time stamp ATS (ATS1, ATS2, ATS3, ...), and sends it to the packet generation unit 56.
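The same conversion for audio depends on the duration of one audio frame, which the description does not state. The sketch below assumes a fixed number of samples per frame and a fixed sampling rate; 1152 samples at 48 kHz is used purely as an illustrative value, and the function name is hypothetical.

```python
from fractions import Fraction

CLOCK_HZ = 90000


def audio_timestamp(frame_count: int, samples_per_frame: int,
                    sample_rate_hz: int) -> int:
    """Convert an audio frame count to a 32-bit time stamp in 90 kHz ticks.

    samples_per_frame and sample_rate_hz are assumptions supplied by the
    caller; they are not given in the description.
    """
    seconds = Fraction(frame_count * samples_per_frame, sample_rate_hz)
    return int(seconds * CLOCK_HZ) % (1 << 32)


ats_after_one_frame = audio_timestamp(1, samples_per_frame=1152, sample_rate_hz=48000)
```

Because both counters convert to the same 90 [kHz] unit, the resulting audio and video time stamps can be compared directly on the decoder side.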

パケット生成部５６では、ビデオエレメンタリストリームＶＥＳ２を所定データサイズのパケットに分割し、それぞれのパケットにビデオヘッダ情報を付加することによりビデオパケットを生成すると共に、オーディオエレメンタリストリームＡＥＳ２を所定データサイズのパケットに分割し、それぞれのパケットにオーディオヘッダ情報を付加することによりオーディオパケットを生成する。 The packet generation unit 56 divides the video elementary stream VES2 into packets of a predetermined data size, generates video packets by adding video header information to each packet, and generates the audio elementary stream AES2 of a predetermined data size. An audio packet is generated by dividing the packet and adding audio header information to each packet.

Here, as shown in FIG. 11, the control packet added in front of the RTCP (Real Time Control Protocol) packet consists of an IP (Internet Protocol) header for host-to-host communication in the Internet layer, a UDP (User Datagram Protocol) header for user datagram data transfer, an RTCP (Real Time Control Protocol) packet sender report for real-time data transfer control, and an RTCP packet. Snapshot information of the system time clock value on the encoder side is written as a PCR (Program Clock Reference) value into the 4-byte RTP time stamp field within the sender information of the RTCP packet sender report, and is sent from the PCR circuit 61 for clock recovery on the decoder side.
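How a clock snapshot might be carried in the 4-byte RTP time stamp field of a sender report can be sketched with Python's `struct` module. The field layout follows the generic RTCP sender report of RFC 3550 with no reception report blocks; whether the apparatus uses exactly this layout is an assumption, and the function names and sample values are hypothetical.

```python
import struct

# Header (V/P/RC, PT, length), SSRC, NTP MSW, NTP LSW,
# RTP time stamp, sender's packet count, sender's octet count.
SR_FORMAT = "!BBHIIIIII"
SR_PACKET_TYPE = 200
RTP_TIMESTAMP_OFFSET = 16  # 4-byte header + SSRC + 8-byte NTP time stamp


def build_sender_report(ssrc: int, ntp_msw: int, ntp_lsw: int,
                        stc_snapshot: int, packet_count: int,
                        octet_count: int) -> bytes:
    """Pack a sender report whose RTP time stamp field holds the STC snapshot."""
    first_byte = 2 << 6  # version 2, no padding, zero report blocks
    length_field = struct.calcsize(SR_FORMAT) // 4 - 1  # 32-bit words minus one
    return struct.pack(SR_FORMAT, first_byte, SR_PACKET_TYPE, length_field,
                       ssrc, ntp_msw, ntp_lsw,
                       stc_snapshot & 0xFFFFFFFF, packet_count, octet_count)


def read_clock_snapshot(packet: bytes) -> int:
    """Extract the 32-bit RTP time stamp field used here as the PCR value."""
    return struct.unpack_from("!I", packet, RTP_TIMESTAMP_OFFSET)[0]


packet = build_sender_report(0x1234, 0, 0, 123456789, 0, 0)
recovered = read_clock_snapshot(packet)
```

On the receiving side, the recovered value would feed the clock recovery path rather than being used as a media time stamp.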

そしてパケット生成部５６では、ビデオパケット及びビデオタイムスタンプＶＴＳに基づいて所定バイト数からなる映像パケットデータを生成すると共に、オーディオパケット及びオーディオタイムスタンプＡＴＳに基づいて所定バイト数からなる音声パケットデータを生成し、これらを多重化することにより多重化データＭＸＤ２を生成した後パケットデータ蓄積部５９へ送出する。 The packet generator 56 generates video packet data having a predetermined number of bytes based on the video packet and the video time stamp VTS, and generates audio packet data having a predetermined number of bytes based on the audio packet and the audio time stamp ATS. Then, multiplexed data MXD2 is generated by multiplexing these, and then transmitted to the packet data storage unit 59.

パケットデータ蓄積部５９は、多重化データＭＸＤ２を所定量蓄積すると、当該多重化データＭＸＤ２を無線ＬＡＮ６を介してＲＴＰ／ＴＣＰで第２のコンテンツ受信装置４へ送信するようになされている。 When the packet data storage unit 59 stores a predetermined amount of the multiplexed data MXD2, the packet data storage unit 59 transmits the multiplexed data MXD2 to the second content receiving device 4 via the wireless LAN 6 by RTP / TCP.

ところでリアルタイムストリーミングエンコーダ１１は、ビデオ入力部５１でディジタル変換したビデオデータＶＤ２をＰＬＬ(Phase-Locked Loop)回路５５にも供給する。ＰＬＬ回路５５は、ビデオデータＶＤ２に基づいて当該ビデオデータＶＤ２のクロック周波数にＳＴＣ回路６０を同期させると共に、ビデオエンコーダ５２、オーディオ入力部５３及びオーディオエンコーダ５４についてもビデオデータＶＤ２のクロック周波数と同期させるようになされている。 Incidentally, the real-time streaming encoder 11 also supplies the video data VD2 digitally converted by the video input unit 51 to a PLL (Phase-Locked Loop) circuit 55. The PLL circuit 55 synchronizes the STC circuit 60 with the clock frequency of the video data VD2 based on the video data VD2, and also synchronizes the video encoder 52, the audio input unit 53, and the audio encoder 54 with the clock frequency of the video data VD2. It is made like that.

これによりリアルタイムストリーミングエンコーダ１１は、ＰＬＬ回路５５を介してビデオデータＶＤ２に対する圧縮符号化処理とオーディオデータＡＤ２に対する圧縮符号化処理とをビデオデータＶＤ２のクロック周波数と同期したタイミングで実行し得ると共に、ＰＣＲ(Program Clock Reference)回路６１を介してビデオデータＶＤ２のクロック周波数に同期したクロックリファレンスｐｃｒを第２のコンテンツ受信装置４におけるリアルタイムストリーミングデコーダ１２へ送信し得るようになされている。 As a result, the real-time streaming encoder 11 can execute the compression encoding process for the video data VD2 and the compression encoding process for the audio data AD2 via the PLL circuit 55 at timing synchronized with the clock frequency of the video data VD2. A clock reference pcr synchronized with the clock frequency of the video data VD2 can be transmitted to the real-time streaming decoder 12 in the second content receiver 4 via the (Program Clock Reference) circuit 61.

At this time, the PCR circuit 61 transmits the clock reference pcr to the real-time streaming decoder 12 of the second content receiving apparatus 4 over UDP (User Datagram Protocol), which sits in the layer below the RTP protocol and is used where real-time performance is required. This ensures high speed and makes it possible to support live streaming, which demands real-time performance.

(6) Circuit configuration of the real-time streaming decoder in the second content receiving apparatus
As shown in FIG. 12, the real-time streaming decoder 12 in the second content receiving apparatus 4 temporarily stores the multiplexed data MXD2 transmitted from the real-time streaming encoder 11 of the first content receiving apparatus 3 in the input packet storage unit 71, and then sends it to the packet division unit 72.

パケット分割部７２は、多重化データＭＸＤ２を映像パケットデータＶＰ２と音声パケットデータＡＰ２に分割し、当該音声パケットデータＡＰ２をリングバッファでなる入力オーディオバッファ７３を介してオーディオフレーム単位でオーディオデコーダ７４へ送出すると共に、映像パケットデータＶＰ２をリングバッファでなる入力ビデオバッファ７５を介してフレーム単位でビデオデコーダ７６へ送出するようになされている。 The packet division unit 72 divides the multiplexed data MXD2 into video packet data VP2 and audio packet data AP2, and sends the audio packet data AP2 to the audio decoder 74 in units of audio frames via the input audio buffer 73 formed of a ring buffer. At the same time, the video packet data VP2 is sent to the video decoder 76 in units of frames via the input video buffer 75 which is a ring buffer.

Here too, the input audio buffer 73 and the input video buffer 75 accumulate the audio packet data AP2 and the video packet data VP2 until the downstream audio decoder 74 and video decoder 76 can continuously decode one audio frame's worth and one video frame's worth of data, respectively. A data capacity of at least one audio frame and one video frame is therefore sufficient.
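The accumulate-until-decodable behaviour of a ring buffer can be sketched as follows. A fixed frame size in bytes is an assumption made to keep the example short (compressed frames normally vary in size), and the class and method names are hypothetical.

```python
from typing import Optional


class InputRingBuffer:
    """Byte ring buffer that releases data one whole frame at a time."""

    def __init__(self, capacity: int) -> None:
        self._data = bytearray(capacity)
        self._capacity = capacity
        self._read_pos = 0
        self._size = 0

    def write(self, packet: bytes) -> bool:
        """Append packet data; return False if it does not fit."""
        if len(packet) > self._capacity - self._size:
            return False
        for byte in packet:
            self._data[(self._read_pos + self._size) % self._capacity] = byte
            self._size += 1
        return True

    def read_frame(self, frame_size: int) -> Optional[bytes]:
        """Return one frame once enough data has accumulated, else None."""
        if self._size < frame_size:
            return None
        frame = bytes(self._data[(self._read_pos + i) % self._capacity]
                      for i in range(frame_size))
        self._read_pos = (self._read_pos + frame_size) % self._capacity
        self._size -= frame_size
        return frame


buffer = InputRingBuffer(capacity=8)
buffer.write(b"abc")
too_early = buffer.read_frame(4)   # only 3 bytes so far
buffer.write(b"d")
first_frame = buffer.read_frame(4)
```

The decoder side polls `read_frame` and simply waits while it returns `None`, so a capacity of one frame is the minimum that allows decoding to proceed.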

Note that the packet division unit 72 can recognize the audio time stamp ATS and the video time stamp VTS by analyzing the video header information of the video packet data VP2 and the audio header information of the audio packet data AP2, and sends the audio time stamp ATS and the video time stamp VTS to the renderer 77.

オーディオデコーダ７４は、音声パケットデータＡＰ２をオーディオフレーム単位でデコードすることにより圧縮符号化前のオーディオフレームＡＦ２を復元し、順次レンダラー７７へ送出する。 The audio decoder 74 restores the audio frame AF2 before compression encoding by decoding the audio packet data AP2 in units of audio frames, and sequentially sends them to the renderer 77.

ビデオデコーダ７６は、映像パケットデータＶＰ２をビデオフレーム単位でデコードすることにより圧縮符号化前のビデオフレームＶＦ２を復元し、順次レンダラー７７へ送出する。 The video decoder 76 restores the video frame VF2 before compression encoding by decoding the video packet data VP2 in units of video frames, and sequentially sends them to the renderer 77.

レンダラー７７は、オーディオフレームＡＦ２をリングバッファでなる出力オーディオバッファ７８へ一時的に格納し、また同様にビデオフレームＶＦ２をリングバッファでなる出力ビデオバッファ７９に一時的に格納する。 The renderer 77 temporarily stores the audio frame AF2 in the output audio buffer 78 that is a ring buffer, and similarly temporarily stores the video frame VF2 in the output video buffer 79 that is a ring buffer.

そしてレンダラー７７は、モニタ１３へ出力するビデオフレームＶＦ２の映像とオーディオフレームＡＦ２の音声とをリップシンクさせるべくオーディオタイムスタンプＡＴＳ及びビデオタイムスタンプＶＴＳに基づいて最終的な出力タイミングを調整した後、その出力タイミングで出力オーディオバッファ７８及び出力ビデオバッファ７９からオーディオフレームＡＦ２及びビデオフレームＶＦ２をモニタ１３へ順次出力するようになされている。 Then, the renderer 77 adjusts the final output timing based on the audio time stamp ATS and the video time stamp VTS so as to lip-sync the video of the video frame VF2 output to the monitor 13 and the audio of the audio frame AF2. The audio frame AF2 and the video frame VF2 are sequentially output from the output audio buffer 78 and the output video buffer 79 to the monitor 13 at the output timing.

ところでリアルタイムストリーミングデコーダ１２は、第１のコンテンツ受信装置３におけるリアルタイムストリーミングエンコーダ１１のＰＣＲ回路６１からＵＤＰで送信されるクロックリファレンスｐｃｒを受信して減算回路８１に入力する。 Incidentally, the real-time streaming decoder 12 receives the clock reference pcr transmitted by the UDP from the PCR circuit 61 of the real-time streaming encoder 11 in the first content receiving device 3 and inputs it to the subtraction circuit 81.

The subtraction circuit 81 calculates the difference between the clock reference pcr and the system time clock stc supplied from the STC circuit 84. This difference is fed back to the subtraction circuit 81 through the filter 82, the voltage controlled crystal oscillator circuit 83, and the STC circuit 84 in sequence, thereby forming a PLL (Phase Locked Loop). The system time clock stc gradually converges to the clock reference pcr of the real-time streaming encoder 11, and finally the system time clock stc, synchronized with the real-time streaming encoder 11 by the clock reference pcr, is supplied to the renderer 77.
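
The feedback loop described above can be illustrated with a minimal software sketch in Python. It is a simulation under stated assumptions, not the disclosed circuit: the 90 kHz nominal frequency, the loop gains `kp` and `ki`, the proportional-integral form of the filter, and all function and variable names are hypothetical choices made for illustration. The subtraction step stands in for the subtraction circuit 81, the filter for the filter 82, the steered frequency for the voltage controlled crystal oscillator circuit 83, and the counter for the STC circuit 84.

```python
def run_pll(pcr_samples, dt, nominal_hz=90_000.0, kp=0.5, ki=0.1):
    """Track encoder clock references with a simple PI-controlled loop.

    pcr_samples: clock reference values received every dt seconds (in ticks).
    All parameters are illustrative assumptions, not values from the text.
    """
    stc = pcr_samples[0]   # STC counter, preset from the first clock reference
    integ = 0.0            # integrator state of the loop filter
    freq = nominal_hz      # current oscillator frequency in ticks per second
    errors = []
    for pcr in pcr_samples[1:]:
        stc += freq * dt           # STC counts oscillator ticks for one interval
        err = pcr - stc            # subtraction: clock reference minus stc
        errors.append(err)
        integ += ki * err          # loop filter: accumulate the error
        # Oscillator: steer the frequency so the error is driven toward zero.
        freq = nominal_hz + (kp * err + integ) / dt
    return stc, freq, errors
```

With these assumed gains, an encoder clock running about 100 ppm faster than the decoder's nominal frequency is tracked within a few dozen clock references, after which the residual difference between pcr and stc is negligible.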

As a result, the renderer 77 can adjust the output timing of the video frame VF2 and the audio frame AF2 with reference to the system time clock stc, which is synchronized with the clock frequency used by the real-time streaming encoder 11 in the first content receiving device 3 when it compression-encodes the video data VD2 and the audio data AD2 and counts the video time stamp VTS and the audio time stamp ATS.

In practice, the renderer 77 temporarily stores the audio frame AF2 in the output audio buffer 78, which is a ring buffer, and temporarily stores the video frame VF2 in the output video buffer 79, which is also a ring buffer. To output the video and audio in a lip-synced state, it adjusts the output timing based on the audio time stamp ATS, the video time stamp VTS, and the system time clock stc, which is synchronized with the encoder side by the clock reference pcr supplied from the PCR circuit 61 of the real-time streaming encoder 11.

(7) Lip Sync Adjustment Processing on the Decoder Side in Live Streaming
(7-1) Output Timing Adjustment Method for Video Frames and Audio Frames in Live Streaming
As shown in FIG. 13, in this case the renderer 77 locks the clock frequency of the system time clock stc, by means of the PLL, to the value of the clock reference pcr supplied at a predetermined cycle from the PCR circuit 61 of the real-time streaming encoder 11. It then controls the output of the audio frame AF2 and the video frame VF2 in accordance with the audio time stamp ATS and the video time stamp VTS, through the monitor 13 synchronized on the basis of the system time clock stc.

That is, with the clock frequency of the system time clock stc synchronized to the value of the clock reference pcr, the renderer 77 sequentially outputs the audio frames AF2 (Af1, Af2, Af3, ...) to the monitor 13 in accordance with the system time clock stc and the audio time stamps ATS (ATS1, ATS2, ATS3, ...).

Here, since the value of the clock reference pcr and the clock frequency of the system time clock stc maintain the synchronous relationship described above, no difference value D2V arises between the count value of the system time clock stc and the video time stamps VTS (VTS1, VTS2, VTS3, ...); for example, at the time point Tv1 there is no difference between the count value of the system time clock stc and the video time stamp VTS1.

However, the clock reference pcr supplied from the PCR circuit 61 of the real-time streaming encoder 11 is transmitted by UDP, which is used where real-time performance is required. Because UDP prioritizes speed and performs no retransmission control, the clock reference pcr may fail to reach the real-time streaming decoder 12 of the second content receiving device 4, or may arrive containing erroneous data.

In such a case, the synchronization, through the PLL, between the value of the clock reference pcr supplied at a predetermined cycle from the PCR circuit 61 of the real-time streaming encoder 11 and the clock frequency of the system time clock stc may be lost. Even then, the renderer 77 according to the present invention can ensure lip sync.

本発明では、システムタイムクロックｓｔｃとオーディオタイムスタンプＡＴＳそしてビデオタイムスタンプＶＴＳとの間にずれが生じた場合、やはりリップシンクを取る方法として、オーディオ出力の連続性を優先させるようになされている。 In the present invention, when a deviation occurs between the system time clock stc, the audio time stamp ATS, and the video time stamp VTS, the continuity of the audio output is given priority as a method of taking a lip sync.

レンダラー７７は、オーディオフレームＡＦ２の出力タイミングＴａ２でのシステムタイムクロックｓｔｃのカウント値とオーディオタイムスタンプＡＴＳ２とを比較し、その差分値Ｄ２Ａを記憶する。一方、レンダラー７７はビデオフレームＶＦ２の出力タイミングＴｖ２でのシステムタイムクロックｓｔｃのカウント値とビデオタイムスタンプＶＴＳ２とを比較し、その差分値Ｄ２Ｖを記憶する。 The renderer 77 compares the count value of the system time clock stc at the output timing Ta2 of the audio frame AF2 with the audio time stamp ATS2, and stores the difference value D2A. On the other hand, the renderer 77 compares the count value of the system time clock stc at the output timing Tv2 of the video frame VF2 with the video time stamp VTS2, and stores the difference value D2V.

このとき、クロックリファレンスｐｃｒが第２のコンテンツ受信装置４のリアルタイムストリーミングデコーダ１２へ確実に到達し、クロックリファレンスｐｃｒの値と当該リアルタイムストリーミングデコーダ１２のシステムタイムクロックｓｔｃのクロック周波数とがＰＬＬを介して完全に一致し、モニタ１３を含んでデコーダ側がシステムタイムクロックｓｔｃに同期していれば差分値Ｄ２Ｖ、Ｄ２Ａは「０」となる。 At this time, the clock reference pcr surely reaches the real-time streaming decoder 12 of the second content receiving device 4, and the value of the clock reference pcr and the clock frequency of the system time clock stc of the real-time streaming decoder 12 are transmitted via the PLL. If the values coincide completely and the decoder side including the monitor 13 is synchronized with the system time clock stc, the difference values D2V and D2A are “0”.

この差分Ｄ２Ａが正値であればオーディオフレームＡＦ２は早いと判断され、負値であればオーディオフレームＡＦ２は遅れていると判断される。同様に、差分Ｄ２Ｖが正値であればビデオフレームＶＦ２は早いと判断され、負値であればビデオフレームＶＦ２は遅れていると判断される。 If the difference D2A is a positive value, the audio frame AF2 is determined to be early, and if the difference D2A is a negative value, the audio frame AF2 is determined to be delayed. Similarly, if the difference D2V is a positive value, it is determined that the video frame VF2 is early, and if the difference D2V is a negative value, it is determined that the video frame VF2 is delayed.

ここでレンダラー７７は、オーディオフレームＡＦ２が早くても遅れていても、オーディオ出力の連続性を維持させることを優先させ、オーディオフレームＡＦ２に対するビデオフレームＶＦ２の出力を相対的に次のように制御する。 Here, the renderer 77 gives priority to maintaining the continuity of the audio output regardless of whether the audio frame AF2 is early or late, and controls the output of the video frame VF2 relative to the audio frame AF2 as follows. .

For example, at the timing of the time point Tv2, when D2V-D2A is larger than the threshold TH and the difference value D2V is larger than the difference value D2A, the video has not caught up with the audio. The renderer 77 therefore skips, without decoding, a video frame Vf3 (not shown) corresponding to, for example, a B picture constituting the GOP, and outputs the next video frame Vf4.

In this case, the renderer 77 does not skip a “P” picture stored in the output video buffer 79, because it serves as a reference frame when the video decoder 76 decodes the next picture. Instead it skips a “B” picture, which is a non-reference frame that is not used as a reference when generating the next picture, thereby achieving lip sync while preventing image quality degradation.

In contrast, when the gap between D2V and D2A exceeds the threshold TH and the difference value D2A is larger than the difference value D2V (that is, D2A-D2V is larger than TH, the condition tested in step SP25), the audio has not caught up with the video, so the renderer 77 repeatedly outputs the video frame Vf2 currently being output.

When the gap between the difference values D2V and D2A is smaller than the threshold TH in either direction (neither D2V-D2A nor D2A-D2V exceeds TH), the gap of the video relative to the audio is judged to be within the allowable range, and the renderer 77 outputs the video frame VF2 to the monitor 13 as it is.

Incidentally, if no “B” picture to be skipped exists in the output video buffer 79 and it holds only “I” pictures and “P” pictures, the renderer 77 cannot skip a “B” picture, and so cannot make the video catch up with the audio.

そこでレンダラー７７では、第１のコンテンツ受信装置３におけるストリーミングデコーダ９のレンダラー３７と同様に、スキップすべき「Ｂ」ピクチャが存在しないときには、モニタ１３のモニタ出力タイミングが例えば６０[Hz]であり、出力ビデオバッファ７９から出力すべきビデオフレームＶＦ２のピクチャリフレッシュタイミングが３０[Hz]であることを利用し、当該ピクチャリフレッシュタイミングを短縮するようになされている。 Therefore, in the renderer 77, similarly to the renderer 37 of the streaming decoder 9 in the first content receiving device 3, when there is no “B” picture to be skipped, the monitor output timing of the monitor 13 is 60 [Hz], for example. Using the fact that the picture refresh timing of the video frame VF2 to be output from the output video buffer 79 is 30 [Hz], the picture refresh timing is shortened.

Specifically, when the difference value between the system time clock stc synchronized with the clock reference pcr and the video time stamp VTS exceeds 16.666... [msec], that is, when the monitor output timing is delayed by one frame or more relative to the audio output timing, the renderer 77 changes the picture refresh timing from 30 [Hz] to 60 [Hz] to shorten the display interval, instead of skipping one video frame.

つまりレンダラー７７は、当該スキップによる画質劣化の影響を受ける「Ｉ」ピクチャや「Ｐ」ピクチャについてはピクチャリフレッシュ間隔を１／３０秒から１／６０秒に短縮することにより、「Ｉ」ピクチャや「Ｐ」ピクチャをスキップすることによる画質劣化を生じさせることなく映像を音声に追い付かせることができるようになされている。 In other words, the renderer 77 shortens the picture refresh interval from 1/30 seconds to 1/60 seconds for the “I” picture and “P” picture that are affected by the image quality degradation due to the skip, so that the “I” picture and “ It is possible to make the video catch up with the audio without causing image quality degradation due to skipping the “P” picture.
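
The effect of shortening the picture refresh interval can be checked with simple arithmetic. The sketch below is an illustration in Python; the function name and the assumption that each shortened picture recovers exactly the difference between the two intervals (1/30 s minus 1/60 s) are not stated in the text and are made here only for the calculation.

```python
import math
from fractions import Fraction

MONITOR_HZ = 60   # monitor output timing given in the text
PICTURE_HZ = 30   # normal picture refresh timing given in the text

def pictures_to_catch_up(lag_msec):
    """Count the pictures that must be shown for 1/60 s instead of 1/30 s
    before a video lag of lag_msec is recovered (hypothetical helper)."""
    normal = Fraction(1000, PICTURE_HZ)      # 33.33... msec per picture
    shortened = Fraction(1000, MONITOR_HZ)   # 16.66... msec per picture
    gain = normal - shortened                # msec recovered per shortened picture
    return math.ceil(Fraction(lag_msec) / gain)
```

Under this assumption, a lag equal to the example threshold of 100 msec would take six shortened pictures to recover, since each one gains 16.66... msec.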

(7-2) Lip Sync Adjustment Processing Procedure in Live Streaming
The output timing adjustment method described above, in which the renderer 77 of the real-time streaming decoder 12 lip-syncs video and audio during live streaming playback by adjusting the output timing of the video frame VF2 with the audio frame AF2 as a reference, is summarized in the flowchart of FIG. 14. The renderer 77 of the real-time streaming decoder 12 enters the routine RT2 at its start step and proceeds to the next step SP21.

In step SP21, the renderer 77 of the real-time streaming decoder 12 in the second content receiving device 4 receives the clock reference pcr from the PCR circuit 61 of the real-time streaming encoder 11 in the first content receiving device 3, and proceeds to the next step SP22.

In step SP22, the renderer 77 synchronizes the clock reference pcr and the system time clock stc by means of the PLL formed by the subtraction circuit 81, the filter 82, the voltage controlled crystal oscillator circuit 83, and the STC circuit 84. From then on, it uses the system time clock stc synchronized with the clock reference pcr as the reference for adjusting the output timing, and proceeds to the next step SP23.

ステップＳＰ２３においてレンダラー７７は、時点Ｔｖ１、Ｔｖ２、Ｔｖ３、……のタイミイグにおけるシステムタイムクロックｓｔｃのカウント値とビデオタイムスタンプＶＴＳとの差分値Ｄ２Ｖを算出し、また時点Ｔａ１、Ｔａ２、Ｔａ３、……のタイミングにおけるシステムタイムクロックｓｔｃのカウント値とオーディオタイムスタンプＡＴＳとの差分値Ｄ２Ａを算出し、次のステップＳＰ２４へ移る。 In step SP23, the renderer 77 calculates a difference value D2V between the count value of the system time clock stc and the video time stamp VTS at the timings Tv1, Tv2, Tv3,..., And the time points Ta1, Ta2, Ta3,. The difference value D2A between the count value of the system time clock stc and the audio time stamp ATS at the timing is calculated, and the process proceeds to the next step SP24.

In step SP24, when D2V-D2A, calculated from the difference values D2V and D2A obtained in step SP23, is smaller than the threshold TH (for example, 100 [msec]), the renderer 77 obtains a negative result and proceeds to the next step SP25.

ステップＳＰ２５においてレンダラー７７は、Ｄ２Ａ−Ｄ２Ｖが閾値ＴＨ（例えば１００[msec]）よりも大きい場合、肯定結果を得て映像が音声に対して進んでいると判断し、次のステップＳＰ２６へ移る。 In step SP25, when D2A-D2V is larger than the threshold value TH (for example, 100 [msec]), the renderer 77 obtains a positive result and determines that the video is proceeding with respect to the sound, and proceeds to the next step SP26.

ステップＳＰ２６においてレンダラー７７は、映像の方が音声よりも進んでいるので、音声が映像に追いつくように現在出力中のピクチャを構成するビデオフレームＶＦ２をリピートして出力した後、次のステップＳＰ３１へ移って処理を終了する。 In step SP26, the renderer 77 repeats and outputs the video frame VF2 constituting the picture currently being output so that the audio catches up with the video because the video is ahead of the audio, and then proceeds to the next step SP31. The process is terminated.

これに対してステップＳＰ２５でＤ２Ａ−Ｄ２Ｖが閾値ＴＨを越えていないのであれば、否定結果を得て、音声と映像との間にずれが生じているとは感じない程度であると判断し、次のステップＳＰ２７へ移る。 On the other hand, if D2A-D2V does not exceed the threshold value TH in step SP25, a negative result is obtained, and it is determined that there is no feeling that there is a deviation between the audio and the video, Control goes to the next step SP27.

In step SP27, since the time difference is too small for any deviation between video and audio to be perceived, the renderer 77 outputs the video frame VF2 to the monitor 13 as it is in accordance with the video time stamp VTS, based on the system time clock stc synchronized with the clock reference pcr, and proceeds to the next step SP31 to end the processing.

Note that, to maintain the continuity of the sound, the renderer 77 outputs the audio to the monitor 13 as it is in accordance with the audio time stamp ATS, based on the system time clock stc synchronized with the clock reference pcr, in all of the above cases.

これに対してステップＳＰ２４で肯定結果が得られると、このことはＤ２Ｖ−Ｄ２Ａが閾値ＴＨ（例えば１００[msec]）よりも大きいこと、すなわち音声に対して映像が遅れていることを表しており、このときレンダラー７７は次のステップＳＰ２８へ移る。 On the other hand, if a positive result is obtained in step SP24, this indicates that D2V-D2A is larger than a threshold value TH (for example, 100 [msec]), that is, the video is delayed with respect to the sound. At this time, the renderer 77 proceeds to the next step SP28.

In step SP28, the renderer 77 determines whether a “B” picture exists in the output video buffer 79. If a positive result is obtained, it proceeds to the next step SP29; if a negative result is obtained, it proceeds to the next step SP30.

In step SP29, having determined that the video is delayed relative to the audio and confirmed that a “B” picture exists in the output video buffer 79, the renderer 77 skips the B picture (video frame Vf3) without decoding it. This recovers the delay of the video relative to the audio and achieves lip sync, and the renderer 77 proceeds to the next step SP31 to end the processing.

On the other hand, in step SP30, the renderer 77 makes use of the fact that the monitor output timing of the monitor 13 is 60 [Hz] while the picture refresh timing of the video frame VF2 is 30 [Hz], and shortens the picture refresh timing to match the monitor output timing of the monitor 13. This lets the video catch up with the audio without the image quality degradation that skipping a picture would cause, and the renderer 77 proceeds to the next step SP31 to end the processing.
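
The branch structure of steps SP24 to SP30 can be summarized in a short Python sketch. The function name, the string labels for the actions, and the representation of the output video buffer 79 as a list of picture-type letters are hypothetical; the comparisons follow the flowchart as described above, with the difference values D2V and D2A and the threshold TH taken as given in msec.

```python
def lip_sync_step(d2v, d2a, buffered_picture_types, th=100):
    """Return (step, action) for one pass through steps SP24 to SP30.

    d2v, d2a: difference values D2V and D2A from step SP23, in msec.
    buffered_picture_types: e.g. ["I", "B", "P"], an illustrative stand-in
    for the pictures held in the output video buffer.
    th: threshold TH in msec (100 is the example value from the text).
    """
    if d2v - d2a > th:                       # SP24: video lags the audio
        if "B" in buffered_picture_types:    # SP28: is a B picture buffered?
            return "SP29", "skip_b_picture"  # skip it without decoding
        return "SP30", "shorten_refresh"     # 30 Hz -> 60 Hz refresh timing
    if d2a - d2v > th:                       # SP25: video leads the audio
        return "SP26", "repeat_current_frame"
    return "SP27", "output_as_is"            # gap within the allowable range
```

Audio output is deliberately absent from the sketch: in every branch the audio frames are output according to the audio time stamp ATS, and only the handling of the video frame changes.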

In this way, the renderer 77 of the real-time streaming decoder 12 in the second content receiving device 4 realizes live streaming playback by synchronizing the clock reference pcr of the real-time streaming encoder 11 in the first content receiving device 3 with the system time clock stc of the real-time streaming decoder 12. Moreover, even when the clock reference pcr fails to arrive because it is sent by UDP, which emphasizes real-time performance and performs no retransmission control, the renderer 77 executes the lip sync adjustment processing in accordance with the deviation of the audio time stamp ATS and the video time stamp VTS from the system time clock stc, so that lip sync can be reliably achieved during live streaming playback.

(8) Operation and Effect
In the above configuration, when the content consists of audio and video, the streaming decoder 9 of the first content receiving device 3 presets the value of the system time clock stc with the preset video time stamp VTSp and then always presets it again with the preset audio time stamp ATSp, so that the value of the system time clock stc ultimately always matches the preset audio time stamp ATSp (ATSp1, ATSp2, ATSp3, ...).

ストリーミングデコーダ９のレンダラー３７は、プリセット用オーディオタイムスタンプＡＴＳｐでプリセットしたシステムタイムクロックｓｔｃのカウント値と、ビデオフレームＶＦ１（Ｖｆ１、Ｖｆ２、Ｖｆ３、……）に付されたビデオタイムスタンプＶＴＳ（ＶＴＳ１、ＶＴＳ２、ＶＴＳ３、……）との差分値Ｄ１を算出することにより、当該ビデオタイムスタンプＶＴＳを付したエンコーダ側のクロック周波数とデコーダ側システムタイムクロックｓｔｃのクロック周波数とのずれによって生じる時間差を認識することができる。 The renderer 37 of the streaming decoder 9 includes the count value of the system time clock stc preset by the preset audio time stamp ATSp and the video time stamp VTS (VTS1, VTS1,...) Attached to the video frame VF1 (Vf1, Vf2, Vf3,...). By calculating a difference value D1 from VTS2, VTS3,..., A time difference caused by a difference between the clock frequency of the encoder side with the video time stamp VTS and the clock frequency of the decoder system time clock stc is recognized. be able to.

ストリーミングデコーダ９のレンダラー３７は、その差分値Ｄ１に応じてビデオフレームＶＦ１の現ピクチャをリピートして出力したり、又は非参照フレームのＢピクチャをデコードせずにスキップして出力したり、或いはピクチャリフレッシュタイミングを短縮して出力することにより、モニタ１０へ出力する音声を途切れさせることなく当該音声の連続性を保ったまま、その音声に対する映像の出力タイミングを調整することができる。 The renderer 37 of the streaming decoder 9 repeats and outputs the current picture of the video frame VF1 according to the difference value D1, or skips and outputs the B picture of the non-reference frame without decoding, or By shortening and outputting the refresh timing, it is possible to adjust the output timing of the video for the sound while maintaining the continuity of the sound without interrupting the sound output to the monitor 10.

Of course, when the difference value D1 is equal to or less than the threshold TH and the lip sync deviation is too small for the user to perceive, the renderer 37 can output to the monitor 10 in accordance with the video time stamps VTS (VTS1, VTS2, VTS3, ...) without performing repeat output, skip playback, or shortening of the picture refresh interval, so the continuity of the video is naturally maintained in this case as well.

さらに第２のコンテンツ受信装置４におけるリアルタイムストリーミングデコーダ１２のレンダラー７７は、第１のコンテンツ受信装置３におけるリアルタイムストリーミングエンコーダ１１のＰＣＲ回路６１から供給されるクロックリファレンスｐｃｒとデコーダ側のシステムタイムクロックｓｔｃとを同期させた上で、オーディオタイムスタンプＡＴＳ及びビデオタイムスタンプＶＴＳに従ってオーディオフレームＡＦ２及びビデオフレームＶＦ２をモニタ１３へ出力することができるので、リアルタイム性を保持したままライブストリーミング再生を実現することができる。 Furthermore, the renderer 77 of the real-time streaming decoder 12 in the second content receiver 4 includes a clock reference pcr supplied from the PCR circuit 61 of the real-time streaming encoder 11 in the first content receiver 3 and a system time clock stc on the decoder side. Since the audio frame AF2 and the video frame VF2 can be output to the monitor 13 in accordance with the audio time stamp ATS and the video time stamp VTS, live streaming reproduction can be realized while maintaining real-time characteristics. .

その上、第２のコンテンツ受信装置４におけるリアルタイムストリーミングデコーダ１２のレンダラー７７は、第１のコンテンツ受信装置３におけるリアルタイムストリーミングエンコーダ１１のＰＣＲ回路６１から供給されるクロックリファレンスｐｃｒがＵＤＰで再送制御されずに到達しないために、当該クロックリファレンスｐｃｒとシステムタイムクロックｓｔｃとの同期が外れたとしても、システムタイムクロックｓｔｃとビデオタイムスタンプＶＴＳとの差分値Ｄ２Ｖ、システムタイムクロックｓｔｃとオーディオタイムスタンプＡＴＳとの差分値Ｄ２Ａを算出し、当該差分値Ｄ２ＶとＤ２Ａとのギャップに応じてビデオフレームＶＦ２の出力タイミングを調整することにより、モニタ１３へ出力する音声を途切れさせることなく連続性を保ったまま、その音声に対する映像の出力タイミングを調整することができる。 In addition, the renderer 77 of the real-time streaming decoder 12 in the second content receiver 4 does not retransmit the clock reference pcr supplied from the PCR circuit 61 of the real-time streaming encoder 11 in the first content receiver 3 using UDP. Therefore, even if the clock reference pcr and the system time clock stc are out of synchronization, the difference value D2V between the system time clock stc and the video time stamp VTS, the system time clock stc and the audio time stamp ATS By calculating the difference value D2A and adjusting the output timing of the video frame VF2 according to the gap between the difference values D2V and D2A, the sound output to the monitor 13 is not interrupted. While maintaining the continuity, it is possible to adjust the output timing of the video for the audio.

Further, the renderer 37 of the streaming decoder 9 in the first content receiving device 3 presets the system time clock stc using the preset video time stamp VTSp and the preset audio time stamp ATSp in accordance with a preset sequence fixed in that order. Consequently, when the content consists of audio only, the system time clock stc can be preset with the preset audio time stamp ATSp, and when the content consists of video only, the system time clock stc can be preset with the preset video time stamp VTSp. Content consisting of audio and video, of audio only, or of video only can therefore all be handled.

That is, the content does not necessarily consist of both audio and video. Even when the content consists of video only and no preset audio time stamp ATSp exists, or consists of audio only and no preset video time stamp VTSp exists, the renderer 37 of the streaming decoder 9 can still handle the output of the video frame VF1 or the audio frame AF1, and can output to the monitor 10 at the optimum timing for the type of content.
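
The fixed preset sequence can be sketched as follows in Python. This is a minimal illustration, not the disclosed circuit: the function name and the use of `None` to represent a time stamp that is absent from the content are assumptions made here.

```python
def preset_stc(vtsp=None, atsp=None):
    """Apply the preset sequence (video first, then audio) to the stc value.

    vtsp: preset video time stamp VTSp, or None if the content has no video.
    atsp: preset audio time stamp ATSp, or None if the content has no audio.
    Returns the resulting stc value, or None if neither stamp exists.
    """
    stc = None
    for stamp in (vtsp, atsp):   # fixed order: VTSp, then ATSp
        if stamp is not None:
            stc = stamp          # a later preset overwrites an earlier one
    return stc
```

Because the audio preset is applied last whenever it exists, content with both audio and video ends with stc equal to ATSp, while video-only or audio-only content ends with whichever time stamp is present.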

さらにストリーミングデコーダ９のレンダラー３７では、プリセット後のシステムタイムクロックｓｔｃのカウント値とビデオタイムスタンプＶＴＳ２との差分値Ｄ１が所定の閾値ＴＨよりも大きく、かつ映像が音声よりも遅れている場合で、出力ビデオバッファ３９にＢピクチャが存在するときには画質劣化の影響のない当該Ｂピクチャをデコードすることなくスキップし、出力ビデオバッファ３９にＢピクチャが存在しないときにはモニタ１０のモニタ出力タイミングに合わせてビデオフレームＶＦ１のピクチャリフレッシュタイミングを短縮することによりピクチャスキップによる画質劣化を生じさせずに映像を音声に追い付かせることができる。 Further, in the renderer 37 of the streaming decoder 9, when the difference value D1 between the count value of the system time clock stc after the preset and the video time stamp VTS2 is larger than a predetermined threshold value TH and the video is delayed from the audio, When there is a B picture in the output video buffer 39, the B picture that is not affected by image quality degradation is skipped without decoding, and when there is no B picture in the output video buffer 39, a video frame is synchronized with the monitor output timing of the monitor 10. By shortening the picture refresh timing of VF1, it is possible to catch up video with audio without causing image quality degradation due to picture skip.

以上の構成によれば、第１のコンテンツ受信装置３におけるストリーミングデコーダ９のレンダラー３７及び第２のコンテンツ受信装置４におけるリアルタイムストリーミングデコーダ１２のレンダラー７７は、オーディオフレームＡＦ１、ＡＦ２の出力タイミングを基準としてビデオフレームＶＦ１、ＶＦ２の出力タイミングを調整することができるので、音声の連続性を保ったまま視聴者であるユーザに違和感を感じさせることなくリップシンクさせることができる。 According to the above configuration, the renderer 37 of the streaming decoder 9 in the first content receiver 3 and the renderer 77 of the real-time streaming decoder 12 in the second content receiver 4 are based on the output timings of the audio frames AF1 and AF2. Since the output timing of the video frames VF1 and VF2 can be adjusted, it is possible to lip-sync without making the viewer user feel uncomfortable while maintaining the continuity of the audio.

(9) Other Embodiments
In the above-described embodiment, the case has been described where the lip sync is adjusted in accordance with the difference value D1, or the difference values D2V and D2A, with the audio frames AF1 and AF2 as a reference, thereby absorbing the deviation between the clock frequency on the encoder side and the clock frequency on the decoder side. However, the present invention is not limited to this, and may also absorb subtle deviations between the encoder-side clock frequency and the decoder-side clock frequency caused by clock jitter, network jitter, and the like.

また上述の実施の形態においては、コンテンツ提供装置２と第１のコンテンツ受信装置３との間でインターネット５を介して接続し、プリエンコーデッドストリーミングを実現するようにした場合について述べたが、本発明はこれに限らず、コンテンツ提供装置２と第２のコンテンツ受信装置４との間でインターネット５を介して接続し、プリエンコーデッドストリーミングを実現するようにしたり、コンテンツ提供装置２から第１のコンテンツ受信装置３を介して第２のコンテンツ受信装置４へコンテンツを提供することによりプリエンコーデッドストリーミングを実現するようにしても良い。 In the above-described embodiment, the case where the content providing apparatus 2 and the first content receiving apparatus 3 are connected via the Internet 5 to realize pre-encoded streaming has been described. The invention is not limited to this, and the content providing apparatus 2 and the second content receiving apparatus 4 are connected via the Internet 5 so as to realize pre-encoded streaming, Pre-encoded streaming may be realized by providing content to the second content receiving device 4 via the content receiving device 3.

Furthermore, in the above-described embodiment, the case has been described where live streaming is performed between the first content receiving device 3 and the second content receiving device 4. However, the present invention is not limited to this, and live streaming may be performed between the content providing device 2 and the first content receiving device 3, or between the content providing device 2 and the second content receiving device 4.

In this case, live streaming can be realized by transmitting the clock reference pcr from the streaming server 8 of the content providing apparatus 2 to the streaming decoder 9 of the first content receiving apparatus 3, and synchronizing the system time clock stc with the clock reference pcr in the streaming decoder 9.
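
As an illustration of what synchronizing stc with pcr can involve, the following is a minimal software sketch of a subtract, filter, and adjust loop. It is not the circuit of the embodiment: the function name `track_reference_clock`, the gains `kp` and `ki`, the 90 kHz nominal rate, and the fixed arrival interval are all assumptions made for the example.

```python
def track_reference_clock(pcr_values, interval_s, nominal_hz, kp=0.5, ki=0.1):
    """Follow a sequence of received pcr values with a local counter.

    Hypothetical model: pcr values arrive every `interval_s` seconds.
    Returns (final_stc, final_freq_hz, last_error).
    """
    stc = float(pcr_values[0])   # preset the local counter to the first pcr
    freq = float(nominal_hz)     # local oscillator starts at its nominal rate
    error = 0.0
    for pcr in pcr_values[1:]:
        stc += freq * interval_s          # local counter free-runs between samples
        error = pcr - stc                 # subtraction: reference minus local count
        stc += kp * error                 # proportional term corrects the phase
        freq += ki * error / interval_s   # integral term trims the frequency
    return stc, freq, error


# Encoder clock runs 100 ppm fast relative to a nominal 90 kHz decoder clock.
pcr = [n * 9000.9 for n in range(300)]
stc, freq, err = track_reference_clock(pcr, interval_s=0.1, nominal_hz=90_000)
print(round(freq, 3), abs(err) < 0.01)
```

With these gains the loop settles on the encoder's rate after a few dozen samples; real network jitter would call for a much slower filter than the one assumed here.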

Further, in the above-described embodiment, the case has been described in which the subtractor circuits 44 and 45 pull back the values of the first video time stamp VTS1 and the first audio time stamp ATS1 by a predetermined time. However, the present invention is not limited to this. If, at the time the first video time stamp VTS1 and audio time stamp ATS1 reach the comparator circuit 46 via the buffers 42 and 43, the value of the preset system time clock stc supplied from the STC circuit 41 to the comparator circuit 46 has not already passed the video time stamp VTS1 and the audio time stamp ATS1 because of the delay in the buffers 42 and 43 or the like, the subtractor circuits 44 and 45 need not pull back the values of the first video time stamp VTS1 and audio time stamp ATS1 by the predetermined time.
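
The condition described above can be written as a simple comparison. The sketch below assumes that all quantities are integer clock ticks and that the buffer delay is known in advance; the function names and the example numbers are hypothetical.

```python
def pulled_back(first_ts, amount):
    """Value of the first time stamp after pulling it back by `amount` ticks."""
    return first_ts - amount


def pullback_needed(first_ts, stc_preset, buffer_delay):
    """True if stc would already be past `first_ts` when it reaches the comparator.

    stc is assumed to start counting from `stc_preset` and to advance by
    `buffer_delay` ticks while the time stamp travels through the buffer.
    """
    stc_at_arrival = stc_preset + buffer_delay
    return stc_at_arrival > first_ts


# Presetting stc with the time stamp itself: stc overtakes it during the delay.
print(pullback_needed(9000, stc_preset=9000, buffer_delay=300))   # True
# Presetting stc with a pulled-back value leaves enough headroom.
print(pullback_needed(9000, stc_preset=pulled_back(9000, 3000), buffer_delay=300))  # False
```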

Further, in the above-described embodiment, the case has been described in which, before the content type is determined and regardless of that type, the system time clock stc is preset according to a preset sequence fixed in the order of the preset video time stamp VTSp and then the preset audio time stamp ATSp. However, the present invention is not limited to this. The content type may be determined first: when the content consists of audio and video, the system time clock stc is preset with the preset video time stamp VTSp and the preset audio time stamp ATSp; when the content consists of video only, the system time clock stc is preset with the preset video time stamp VTSp; and when the content consists of audio only, the system time clock stc is preset with the preset audio time stamp ATSp.
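
The type-first variant described above amounts to a three-way selection. The following sketch is illustrative only: the `ContentType` enum and both function names are hypothetical, and it assumes that a later preset simply overwrites an earlier one, so that audio-and-video content ends up preset with ATSp.

```python
from enum import Enum


class ContentType(Enum):
    AUDIO_AND_VIDEO = "audio+video"
    VIDEO_ONLY = "video"
    AUDIO_ONLY = "audio"


def preset_sequence(content_type, vts_p, ats_p):
    """Values loaded into stc, in order, for the given content type."""
    if content_type is ContentType.AUDIO_AND_VIDEO:
        return [vts_p, ats_p]   # video preset first, then audio preset
    if content_type is ContentType.VIDEO_ONLY:
        return [vts_p]
    return [ats_p]              # audio only


def preset_stc(content_type, vts_p, ats_p):
    """Final stc value, assuming the last preset in the sequence wins."""
    return preset_sequence(content_type, vts_p, ats_p)[-1]


print(preset_stc(ContentType.AUDIO_AND_VIDEO, vts_p=1000, ats_p=1200))  # 1200
print(preset_stc(ContentType.VIDEO_ONLY, vts_p=1000, ats_p=1200))       # 1000
```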

Further, in the above-described embodiment, the case has been described in which, when no B picture exists in the output video buffers 39 and 79, the picture refresh rate of the video frames VF1 and VF2 is changed from 30 [Hz] to 60 [Hz], shortening the refresh interval, in accordance with the monitor output rate of the monitors 10 and 13. However, the present invention is not limited to this, and the picture refresh rate of the video frames VF1 and VF2 may be changed from 30 [Hz] to 60 [Hz] regardless of whether a B picture is present. Even in this case, the renderers 37 and 77 can recover the delay of the video with respect to the audio and achieve lip sync.
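
To give a feel for how quickly the shortened refresh interval recovers a video delay, the arithmetic can be sketched as follows. This is an illustration under the assumption that each frame shown for 1/60 s instead of 1/30 s gains back 1/60 s; the function name and the millisecond input are hypothetical.

```python
from fractions import Fraction
from math import ceil


def frames_to_catch_up(delay_ms, normal_hz=30, fast_hz=60):
    """Number of frames that must be shown at `fast_hz` to recover `delay_ms`."""
    # Time gained per frame: normal display period minus shortened period.
    gain_per_frame = Fraction(1, normal_hz) - Fraction(1, fast_hz)
    return ceil(Fraction(delay_ms, 1000) / gain_per_frame)


print(frames_to_catch_up(100))  # 6
print(frames_to_catch_up(33))   # 2
```

Exact fractions are used so that a delay that is a whole number of frame gains is not rounded up by floating-point error.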

Further, in the above-described embodiment, the case has been described in which a B picture is skipped at output. However, the present invention is not limited to this, and a P picture located immediately before an I picture may be skipped instead.

This is because a P picture located immediately before an I picture is not referred to when the next I picture is generated. Skipping it therefore causes no problem in generating the next I picture, and no image quality degradation occurs.
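
The two skip candidates discussed above can be expressed as a small predicate. The sketch assumes that the picture types are available as a list of the letters "I", "P", and "B" in stream order; the function name is hypothetical.

```python
def is_skippable(picture_types, index):
    """True if the picture at `index` is one of the skip candidates.

    Candidates: a B picture (as in the embodiment), or a P picture that is
    immediately followed by an I picture.
    """
    kind = picture_types[index]
    if kind == "B":
        return True
    if kind == "P":
        nxt = index + 1
        # The last picture has no known successor, so it is not skipped.
        return nxt < len(picture_types) and picture_types[nxt] == "I"
    return False


stream = list("IPPPIPP")
print([i for i in range(len(stream)) if is_skippable(stream, i)])  # [3]
```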

Further, in the above-described embodiment, the case has been described in which the video frame Vf3 is skipped without being decoded, so that it is not output to the monitor 10. However, the present invention is not limited to this, and the video frame Vf3 may be decoded and then skipped at the stage of output from the output video buffer 39.

Further, in the above-described embodiment, the case has been described in which all audio frames are output to the monitors 10 and 13 without omission, because the audio frames AF1 and AF2 serve as the reference for adjusting the lip sync. However, the present invention is not limited to this. For example, when there is an audio frame corresponding to a silent portion, that audio frame may be skipped at output.
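
A minimal sketch of skipping silent audio frames is shown below. The text does not say how silence is detected, so the peak-amplitude threshold, the PCM sample representation, and the `max_skips` limit are all assumptions made for illustration.

```python
def is_silent(frame, threshold=0):
    """True if every PCM sample in the frame is within `threshold` of zero."""
    return all(abs(sample) <= threshold for sample in frame)


def frames_to_output(frames, max_skips, threshold=0):
    """Drop up to `max_skips` silent frames and keep everything else in order."""
    output = []
    skipped = 0
    for frame in frames:
        if skipped < max_skips and is_silent(frame, threshold):
            skipped += 1        # silent frame: skipping it is inaudible
            continue
        output.append(frame)
    return output


frames = [[0, 0, 0], [120, -80, 40], [0, 1, 0], [0, 0, 0]]
print(frames_to_output(frames, max_skips=1))
# [[120, -80, 40], [0, 1, 0], [0, 0, 0]]
```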

Further, in the above-described embodiment, the case has been described in which the content receiving apparatus of the present invention is constituted by the audio decoders 35 and 74 and the video decoders 36 and 76 as decoding means; the input audio buffers 33 and 73, the output audio buffers 38 and 78, the input video buffers 34 and 75, and the output video buffers 39 and 79 as storage means; and the renderers 37 and 77 as calculating means and timing adjustment means. However, the present invention is not limited to this, and the content receiving apparatus may be formed with various other circuit configurations.

The content receiving apparatus, the video/audio output timing control method, and the content providing system of the present invention can be applied, for example, to downloading moving-image content with sound from a server and displaying it.

FIG. 1 is a schematic block diagram showing the overall configuration of a content providing system that represents the streaming system as a whole.
FIG. 2 is a schematic block diagram showing the circuit configuration of the content providing apparatus.
FIG. 3 is a schematic diagram showing the structure of the time stamp (TCP protocol) in an audio packet and a video packet.
FIG. 4 is a schematic block diagram showing the module configuration of the streaming decoder in the first content receiving apparatus.
FIG. 5 is a schematic block diagram showing the configuration of the timing control circuit.
FIG. 6 is a schematic diagram showing the time stamps compared with the STC after presetting.
FIG. 7 is a schematic diagram used to explain the output timing of video frames and audio frames in pre-encoded streaming.
FIG. 8 is a schematic diagram used to explain the video frame output control processing in the case of I pictures and P pictures.
FIG. 9 is a flowchart showing the lip sync adjustment processing procedure in pre-encoded streaming.
FIG. 10 is a schematic block diagram showing the circuit configuration of the real-time streaming encoder in the first content receiving apparatus.
FIG. 11 is a schematic diagram showing the structure of the PCR (UDP protocol) in a control packet.
FIG. 12 is a schematic block diagram showing the circuit configuration of the real-time streaming decoder in the second content receiving apparatus.
FIG. 13 is a schematic diagram used to explain the output timing of video frames and audio frames in live streaming.
FIG. 14 is a schematic flowchart showing the lip sync adjustment processing procedure in live streaming.

Explanation of symbols

1 ... content providing system; 2 ... content providing apparatus; 3 ... first content receiving apparatus; 4 ... second content receiving apparatus; 5 ... Internet; 7 ... encoder; 8 ... streaming server; 9 ... streaming decoder; 10, 13 ... monitor; 11 ... real-time streaming encoder; 12 ... real-time streaming decoder; 14 ... Web server; 15 ... Web browser; 21, 51 ... video input unit; 22, 52 ... video encoder; 23 ... video ES storage unit; 24, 53 ... audio input unit; 25, 54 ... audio encoder; 26 ... audio ES storage unit; 28, 57 ... video frame counter; 29, 58 ... audio frame counter; 27, 56 ... packet generation unit; 30, 59 ... packet data storage unit; 31, 71 ... input packet storage unit; 32, 72 ... packet division unit; 33, 73 ... input audio buffer; 34, 75 ... input video buffer; 35, 74 ... audio decoder; 36, 76 ... video decoder; 37, 77 ... renderer; 38, 78 ... output audio buffer; 39, 79 ... output video buffer; 40 ... crystal oscillator circuit; 81 ... subtraction circuit; 82 ... filter; 83 ... voltage-controlled crystal oscillator; 41, 60, 84 ... STC circuit; 61 ... PCR circuit.

Claims

1. A content receiving apparatus comprising:
decoding means for receiving, from a content providing apparatus on the encoder side, a plurality of encoded video frames to which video time stamps based on a reference clock on the encoder side are sequentially attached and a plurality of encoded audio frames to which audio time stamps based on the reference clock are sequentially attached, and for decoding them;
storage means for storing a plurality of video frames and a plurality of audio frames obtained as a result of decoding the encoded video frames and the encoded audio frames by the decoding means;
calculating means for calculating a time difference caused by a deviation between the clock frequency of the reference clock on the encoder side and the clock frequency of the system time clock on the decoder side; and
timing adjustment means for adjusting, according to the time difference, the video frame output timing at which the plurality of video frames are sequentially output frame by frame, with reference to the audio frame output timing at which the plurality of audio frames are sequentially output frame by frame.

2. The content receiving apparatus according to claim 1, wherein the timing adjustment means outputs the video frames according to the video time stamps, based on the system time clock on the decoder side, when the time difference is shorter than a predetermined time.

3. The content receiving apparatus according to claim 1, further comprising receiving means for receiving the reference clock on the encoder side transmitted from the content providing apparatus by UDP (User Datagram Protocol), for which real-time performance is required,
wherein the calculating means synchronizes the reference clock on the encoder side with the system time clock on the decoder side, and then calculates the time difference caused by the deviation between the clock frequency of the reference clock on the encoder side and the clock frequency of the system time clock on the decoder side.

4. The content receiving apparatus according to claim 1, further comprising preset means for presetting the system time clock on the decoder side using the video time stamp or the audio time stamp, in accordance with a preset sequence determined in the order of the video time stamp and then the audio time stamp,
wherein the calculating means calculates the time difference caused by the deviation between the clock frequency of the reference clock on the encoder side and the clock frequency of the system time clock after the presetting.

5. The content receiving apparatus according to claim 1, further comprising:
preset means for presetting the system time clock on the decoder side using the audio time stamp when the type of the content is audio only, by following a preset sequence determined in the order of the video time stamp and then the audio time stamp; and
audio output means for outputting the audio of the audio frames based on the system time clock after the presetting and the audio time stamp.

6. The content receiving apparatus according to claim 1, further comprising:
preset means for presetting the system time clock on the decoder side using the video time stamp when the type of the content is video only, by following a preset sequence determined in the order of the video time stamp and then the audio time stamp; and
video output means for outputting the video of the video frames based on the system time clock after the presetting and the video time stamp.

7. A video/audio output timing control method comprising:
a decoding step of causing decoding means to receive, from a content providing apparatus on the encoder side, a plurality of encoded video frames to which video time stamps based on a reference clock on the encoder side are sequentially attached and a plurality of encoded audio frames to which audio time stamps based on the reference clock are sequentially attached, and to decode them;
a storage step of storing a plurality of video frames and a plurality of audio frames obtained as a result of decoding the encoded video frames and the encoded audio frames in the decoding step;
a difference calculating step of causing calculating means to calculate a time difference caused by a deviation between the clock frequency of the reference clock on the encoder side and the clock frequency of the system time clock on the decoder side; and
a timing adjustment step of adjusting, according to the time difference, the video frame output timing at which the plurality of video frames are sequentially output frame by frame, with reference to the audio frame output timing at which the plurality of audio frames are sequentially output frame by frame.

8. A content providing system having a content providing apparatus and a content receiving apparatus, wherein
the content providing apparatus comprises:
encoding means for generating a plurality of encoded video frames to which video time stamps based on a reference clock on the encoder side are attached, and a plurality of encoded audio frames to which audio time stamps based on the reference clock are attached; and
transmitting means for sequentially transmitting the plurality of encoded video frames and the plurality of encoded audio frames to the content receiving apparatus, and
the content receiving apparatus comprises:
decoding means for receiving, from the content providing apparatus on the encoder side, the plurality of encoded video frames to which the video time stamps are sequentially attached and the plurality of encoded audio frames to which the audio time stamps are sequentially attached, and for decoding them;
storage means for storing a plurality of video frames and a plurality of audio frames obtained as a result of decoding the encoded video frames and the encoded audio frames by the decoding means;
calculating means for calculating a time difference caused by a deviation between the clock frequency of the reference clock on the encoder side and the clock frequency of the system time clock on the decoder side; and
timing adjustment means for adjusting, according to the time difference, the video frame output timing at which the plurality of video frames are sequentially output frame by frame, with reference to the audio frame output timing at which the plurality of audio frames are sequentially output frame by frame.