JP2024116702A

JP2024116702A - Scene description editing device and program

Info

Publication number: JP2024116702A
Application number: JP2023022460A
Authority: JP
Inventors: 秀一青木; Shuichi Aoki
Original assignee: Nippon Hoso Kyokai NHK
Current assignee: Japan Broadcasting Corp
Priority date: 2023-02-16
Filing date: 2023-02-16
Publication date: 2024-08-28

Abstract

PROBLEM TO BE SOLVED: To provide a scene description editing device capable of appropriately editing a scene description that defines specific content in a case where the rendering function of a media processing device is used by a user terminal.

SOLUTION: A scene description editing device comprises: an output unit which transmits, to a media processing device, viewpoint information and a scene description defining the configuration of content having viewpoint freedom; and an acquisition unit which acquires specific content including at least the content, the specific content having been generated by the media processing device based on the scene description and the viewpoint information.

SELECTED DRAWING: Figure 21

Description

The present invention relates to a scene description editing device.

Conventionally, mechanisms for transmitting content such as 360° video and 3D objects have been proposed (for example, Non-Patent Document 1). Known mechanisms include 3DoF+ (Degrees of Freedom), which involves viewpoint movement within the range in which a seated user moves their head, and 6DoF, which involves viewpoint movement within the range in which the user moves freely. In such mechanisms, the positional relationship between the 360° video and the 3D objects is indicated by a scene description.

3GPP TR 26.928 V16.1.0, December 2020

Against this background, a case can be considered in which specific content including content having viewpoint freedom is generated by a media processing device and the generated specific content is then transmitted from the media processing device to a user terminal. Similarly, a case can be considered in which a scene description defining the specific content is edited by a scene description editing device.

As a result of intensive study, the inventors found that, in the above cases, the rendering function of the media processing device needs to be shared between the user terminal and the scene description editing device.

The present invention has been made to solve the above problem, and aims to provide a scene description editing device that makes it possible to appropriately edit a scene description defining specific content when the rendering function of a media processing device is used by a user terminal.

In summary, the disclosure is a scene description editing device including: an output unit that outputs, to a media processing device, viewpoint information and a scene description defining the configuration of content having viewpoint freedom; and an acquisition unit that acquires specific content including at least the content, the specific content having been generated by the media processing device based on the scene description and the viewpoint information.

According to the present invention, it is possible to provide a scene description editing device that makes it possible to appropriately edit a scene description defining specific content when the rendering function of a media processing device is used by a user terminal.

FIG. 1 is a diagram showing a transmission system 10 according to an embodiment.
FIG. 2 is a block diagram showing a media processing device 200 and a user terminal 300 according to the embodiment.
FIG. 3 is a diagram for explaining second content according to the embodiment.
FIG. 4 is a diagram showing a method for viewing specific content according to the embodiment.
FIG. 5 is a diagram for explaining Operation Example 1.
FIG. 6 is a diagram for explaining Operation Example 2.
FIG. 7 is a diagram for explaining Operation Example 2.
FIG. 8 is a diagram for explaining Operation Example 3.
FIG. 9 is a diagram for explaining Operation Example 3.
FIG. 10 is a diagram for explaining Operation Example 3.
FIG. 11 is a diagram for explaining Operation Example 3.
FIG. 12 is a diagram for explaining Operation Example 4.
FIG. 13 is a diagram for explaining Operation Example 4.
FIG. 14 is a diagram for explaining Operation Example 4.
FIG. 15 is a diagram for explaining Operation Example 5.
FIG. 16 is a diagram for explaining Operation Example 5.
FIG. 17 is a diagram for explaining Operation Example 5.
FIG. 18 is a diagram for explaining Operation Example 5.
FIG. 19 is a diagram for explaining a first method according to Modification 1.
FIG. 20 is a diagram for explaining a second method according to Modification 1.
FIG. 21 is a block diagram showing the media processing device 200 and a scene description editing device 600 according to Modification 2.
FIG. 22 is a diagram showing an example of a UI (User Interface) of the scene description editing device 600 according to Modification 2.

Next, an embodiment of the present invention will be described. In the following description of the drawings, identical or similar parts are denoted by identical or similar reference numerals. Note, however, that the drawings are schematic and that the ratios of dimensions and the like may differ from the actual ones.

Therefore, specific dimensions and the like should be determined with reference to the following description. The drawings naturally also include parts whose dimensional relationships and ratios differ from one drawing to another.

[Summary of the Disclosure]
The scene description editing device according to the summary of the disclosure includes: an output unit that outputs, to a media processing device, viewpoint information and a scene description defining the configuration of content having viewpoint freedom; and an acquisition unit that acquires specific content including at least the content, the specific content having been generated by the media processing device based on the scene description and the viewpoint information.

In the summary of the disclosure, the scene description editing device acquires specific content generated by the media processing device based on the scene description and the viewpoint information. With this configuration, assuming a case in which the specific content is generated (rendered) by the media processing device and then transmitted from the media processing device to a user terminal, the scene description editing device can acquire the specific content from the media processing device by the same mechanism as the user terminal. The scene description can therefore be edited appropriately while checking the specific content that is expected to be displayed on the user terminal.

Note that the specific content generated by the media processing device is generated based on the viewpoint information, so the user terminal can handle the video included in the specific content as 2D video.

[Embodiment]
(Transmission System)
A transmission system according to the embodiment is described below. FIG. 1 is a diagram showing a transmission system 10 according to the embodiment. As shown in FIG. 1, the digital wireless transmission system includes a transmitting device 100, a media processing device 200, and a user terminal 300.

In the embodiment, the transmitting device 100 transmits, to the media processing device 200, first content that does not have viewpoint freedom and second content that has viewpoint freedom. The transmitting device 100 further transmits, to the media processing device 200, first control information associated with the first content and second control information associated with the second content.

The first content may include at least one of 2D video and audio. The first content and the first control information may be transmitted by a first method. The first method may be a method conforming to ISO/IEC 23008-1 (hereinafter, MMT (MPEG Media Transport)). The following illustrates a case in which the first method is MMTP (MMT Protocol), which conforms to MMT. The first control information may be referred to as MMT-SI (Signaling Information).

The second content may include 360° video and 3D objects. The second content and the second control information may be transmitted by a second method, or may be transmitted by a protocol such as HTTP (Hyper Text Transfer Protocol). The second content may conform to 3DoF+ (Degrees of Freedom), which involves viewpoint movement within the range in which a seated user moves their head, or to 6DoF, which involves viewpoint movement within the range in which the user moves freely. Because the second content has viewpoint freedom, it may include two or more 360° videos and two or more 3D objects at the same time (frame). The second control information may be referred to as a scene description.

Here, the second control information may be transmitted by the first method described above. That is, the second control information may be transmitted by the same first method (for example, MMTP) as the first control information. Alternatively, the second control information may be transmitted by a protocol such as HTTP.

The transmission from the transmitting device 100 to the media processing device 200 is not particularly limited, and may use satellite broadcasting, the Internet, or a mobile communication network.

Although not particularly limited, the transmission system may be a digital wireless transmission system. The digital wireless transmission system may be a system used for 4K and 8K satellite broadcasting.

The media processing device 200 generates specific content including at least the second content described above, based on viewpoint information received from the user terminal 300, and transmits the generated specific content to the user terminal 300. Although not particularly limited, the specific content may be transmitted over the Internet or over a mobile communication network.

The user terminal 300 may be a smartphone, a tablet terminal, a head-mounted display, or the like. As shown in FIG. 1, two or more user terminals 300 may be provided. In other words, two or more user terminals 300 may request the media processing device 200 to generate specific content. Each user terminal 300 may transmit its own, separate viewpoint information to the media processing device 200.

(Media Processing Device and User Terminal)
The media processing device and the user terminal according to the embodiment are described below. FIG. 2 is a block diagram showing the media processing device 200 and the user terminal 300 according to the embodiment.

First, the media processing device 200 has a reception unit 210, a renderer 220, and an encoding processing unit 230.

The reception unit 210 accepts viewpoint information. In the embodiment, the reception unit 210 constitutes a receiving unit that receives the viewpoint information from the user terminal 300. The viewpoint information includes an information element indicating the viewpoint position of the user of the user terminal 300 and an information element indicating the gaze direction of that user.

The renderer 220 generates, based on the viewpoint information, specific content including at least the second content. Because the specific content is generated based on the viewpoint information, it may include one 360° video and one 3D object at a given time (frame). The following illustrates a case in which the specific content includes the first content in addition to the second content.

First, as shown in FIG. 2, the renderer 220 generates, based on the first control information (MMT-SI), first content including 2D video and audio as part of the specific content. Viewpoint information is not needed to generate the first content.

Specifically, the renderer 220 acquires the 2D video, the audio, and the MMT-SI in the form of MMTP packets into which they have been packetized.

For example, the MMTP packets are stored in IP (Internet Protocol) packets. The IP packets may be transmitted using UDP (User Datagram Protocol) or TCP (Transmission Control Protocol).

Here, the first content is processed in units delimited by a fixed time width (hereinafter, MPU: Media Processing Unit). An MPU includes one or more access units. An access unit may also be handled as an MFU (Media Fragment Unit). An MFU for 2D video may be referred to as a NAL (Network Abstraction Layer) unit, and an MFU for audio may be referred to as an MHAS (MPEG-H 3D Audio Stream) packet.

The MMT-SI includes a PA (Package Access) message, and the PA message includes an MPT (MMT Package Table) indicating a list of the first content. The MMT-SI further includes an MPU timestamp descriptor indicating the presentation time of the first content. The MPU timestamp descriptor may indicate the presentation time of the MPU, that is, the time of the access unit presented first in the MPU.

The MPU timestamp descriptor may be generated using UTC (Coordinated Universal Time) as the reference time. TAI (International Atomic Time) or a time provided by GPS (Global Positioning System) may also be used as the reference time. The reference time may be a time provided by an NTP (Network Time Protocol) server or by a PTP (Precision Time Protocol) server.
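As an illustration of handling such reference times, the following sketch converts between a UTC time and a 64-bit NTP-format timestamp (32-bit seconds since 1900-01-01 plus a 32-bit binary fraction). That the MPU timestamp descriptor carries its value in this format is an assumption made here for illustration, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_OFFSET = 2_208_988_800
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def utc_to_ntp64(t: datetime) -> int:
    """Pack a timezone-aware datetime into a 64-bit NTP-format timestamp."""
    delta = t - UNIX_EPOCH
    seconds = delta.days * 86_400 + delta.seconds + NTP_UNIX_OFFSET
    fraction = (delta.microseconds << 32) // 1_000_000
    # The seconds field wraps every 2**32 seconds (NTP "era").
    return ((seconds & 0xFFFFFFFF) << 32) | fraction


def ntp64_to_utc(ts: int) -> datetime:
    """Unpack a 64-bit NTP-format timestamp, assuming NTP era 0 (before 2036)."""
    seconds = ts >> 32
    fraction = ts & 0xFFFFFFFF
    # Round to the nearest microsecond.
    microseconds = (fraction * 1_000_000 + (1 << 31)) >> 32
    return UNIX_EPOCH + timedelta(
        seconds=seconds - NTP_UNIX_OFFSET, microseconds=microseconds
    )
```

Integer arithmetic is used throughout so that a microsecond-resolution time survives a round trip without floating-point drift.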

Second, as shown in FIG. 2, the renderer 220 generates, based on the second control information (scene description), second content including 360° video and 3D objects as part of the specific content. The viewpoint information is used to generate the second content.

Specifically, the renderer 220 may acquire the scene description in the form of MMTP packets into which the scene description has been packetized. The method of acquiring the 360° video and the 3D objects is not particularly limited.

The 360° video may have been converted to 2D video by a projection such as ERP (Equirectangular Projection) or cube mapping. Metadata indicating the type of projection applied to the 360° video may be attached. The 3D objects may be encoded in a mesh format, for which ISO/IEC 14496-16 "Animation framework extension (AFX)" may be used. The 3D objects may also be encoded in a point cloud format, for which ISO/IEC 23090-5 "Video-based Point Cloud Compression" may be used.
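For reference, the ERP mapping mentioned above can be sketched as follows: a viewing direction given as yaw and pitch is mapped linearly to a position in the 2D image. The angle convention (yaw 0 at the image centre and increasing to the right, pitch positive upward) and the function name are assumptions for illustration; actual content may use a different convention.

```python
import math
from typing import Tuple


def erp_pixel(yaw: float, pitch: float, width: int, height: int) -> Tuple[float, float]:
    """Map a viewing direction (radians) to continuous ERP image coordinates.

    Assumed convention: yaw in [-pi, pi), 0 at the image centre;
    pitch in [-pi/2, pi/2], positive upward (top row of the image).
    """
    u = (yaw / (2.0 * math.pi) + 0.5) * width   # longitude -> column
    v = (0.5 - pitch / math.pi) * height        # latitude  -> row
    return u, v
```

With this convention, looking straight ahead in a 3840 x 1920 ERP frame lands at the image centre, and looking straight up lands on the top row.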

Here, the second content is packaged into one file per unit delimited by a fixed time width. The fixed time width may be 500 ms. For example, when the frame rate is 60 fps (frames per second), one file contains 30 frames.
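The frame count per file follows directly from the time width and the frame rate (0.5 s x 60 fps = 30 frames). A minimal sketch of that arithmetic is shown below; the function name is hypothetical, and rejecting time widths that do not hold a whole number of frames is a design choice made here, not something the description requires.

```python
from fractions import Fraction


def frames_per_file(segment_ms: int, fps: int) -> int:
    """Number of frames in one file of the given duration at the given frame rate."""
    frames = Fraction(segment_ms, 1000) * fps  # exact rational arithmetic
    if frames.denominator != 1:
        raise ValueError("time width does not contain a whole number of frames")
    return int(frames)
```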

A scene description is generated for each file and includes, for each frame, information identifying the 360° video and the 3D objects. For example, the scene description includes an information element indicating the name of a 3D object in the frame (object_name), an information element indicating the frame number (frame_number), an information element indicating the position of the 3D object in the frame (translation_object), an information element indicating the rotation of the 3D object in the frame (rotation_object), and an information element indicating the size of the 3D object in the frame (scale_object).
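The information elements listed above could be held in a structure such as the following sketch. Only the element names (object_name, frame_number, translation_object, rotation_object, scale_object) come from the description; the JSON layout, the top-level "frames" key, the class and function names, and the representation of the rotation as a quaternion are all assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class ObjectFrameEntry:
    object_name: str
    frame_number: int
    translation_object: Tuple[float, float, float]        # position (x, y, z)
    rotation_object: Tuple[float, float, float, float]    # assumed quaternion (x, y, z, w)
    scale_object: Tuple[float, float, float]              # size per axis


def build_scene_description(entries: List[ObjectFrameEntry]) -> str:
    """Serialize the per-frame entries of one file, ordered by frame and object name."""
    ordered = sorted(entries, key=lambda e: (e.frame_number, e.object_name))
    return json.dumps({"frames": [asdict(e) for e in ordered]}, indent=2)


# Usage: one 3D object that moves between frame 0 and frame 1.
entries = [
    ObjectFrameEntry("chair", 1, (0.5, 0.0, -2.0), (0.0, 0.0, 0.0, 1.0), (1.0, 1.0, 1.0)),
    ObjectFrameEntry("chair", 0, (0.0, 0.0, -2.0), (0.0, 0.0, 0.0, 1.0), (1.0, 1.0, 1.0)),
]
scene_json = build_scene_description(entries)
```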

Third, the renderer 220 outputs the specific content including the first content and the second content to the encoding processing unit 230. The renderer 220 may output the presentation time of the specific content to the encoding processing unit 230 together with the specific content.

Here, the presentation time of the specific content may be corrected based on the delay time between the media processing device 200 and the user terminal 300. Specifically, the renderer 220 may calculate the presentation time (T' = T + ΔT) of the specific content provided from the media processing device 200 to the user terminal 300, based on the presentation time (T) of the specific content provided from the transmitting device 100 to the media processing device 200 and on the delay time (ΔT). The delay time (ΔT) may be a value predetermined in the media processing device 200, or may differ for each user terminal 300.
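The correction T' = T + ΔT can be sketched as follows, covering both a preset delay and a per-terminal delay. The function name, the terminal identifiers, and the 200 ms default are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Hypothetical preset delay used when no per-terminal value is known.
DEFAULT_DELAY = timedelta(milliseconds=200)


def corrected_presentation_time(
    t: datetime,
    terminal_id: str,
    per_terminal_delay: Optional[Dict[str, timedelta]] = None,
    default_delay: timedelta = DEFAULT_DELAY,
) -> datetime:
    """Return T' = T + delta-T for the given user terminal."""
    delay = (per_terminal_delay or {}).get(terminal_id, default_delay)
    return t + delay
```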

Fourth, the renderer 220 may constitute a transmission unit that transmits, to the user terminal 300, the viewpoint information used to generate the specific content. The viewpoint information used to generate the specific content may instead be transmitted to the user terminal 300 from the encoding processing unit 230.

For example, the transmission method for the viewpoint information and the specific content may be MMTP or HTTP. When MMTP is used as the transmission method for the specific content, the viewpoint information may be stored as metadata in OMAF (Omnidirectional Media Format) defined in ISO/IEC 23090-2.

The encoding processing unit 230 encodes the specific content generated by the renderer 220. In the embodiment, the encoding processing unit 230 may be an example of a transmission unit that transmits the specific content to the user terminal 300.

The encoding processing unit 230 may further encode the presentation time of the specific content. The encoding processing unit 230 may transmit an information element indicating the presentation time to the user terminal 300 together with the specific content.

Here, any compression encoding method may be used by the encoding processing unit 230. For example, the compression encoding method may be HEVC (High Efficiency Video Coding) or VVC (Versatile Video Coding).

As described above, the second content included in the specific content is generated based on the viewpoint information, so the video included in the specific content can be handled as 2D video that does not have viewpoint freedom.

For example, the transmission control method used to start and end viewing of the specific content may include RTSP (Real Time Streaming Protocol). The transmission method may be MMTP or HTTP. When MMTP is used as the transmission method, the specific content may be stored in OMAF defined in ISO/IEC 23090-2.

As shown in FIG. 2, the user terminal 300 has a detection unit 310, a decoding processing unit 320, and a renderer 330.

The detection unit 310 detects the user's viewpoint position and gaze direction. The detection unit 310 may include an acceleration sensor and may include a GPS (Global Positioning System) sensor. The detection unit 310 may include a user I/F operated manually by the user (for example, a touch sensor, a keyboard, a mouse, or a controller). The detection unit 310 may transmit the viewpoint information (viewpoint position and gaze direction) to the media processing device 200. The detection unit 310 may output the viewpoint information (viewport) to the renderer 330.

The decoding processing unit 320 decodes the specific content received from the media processing device 200. The decoding processing unit 320 may decode the presentation time received from the media processing device 200. The decoding processing unit 320 may output the specific content and the presentation time to the renderer 330.

The renderer 330 outputs the specific content decoded by the decoding processing unit 320. The renderer 330 may output the specific content based on the presentation time decoded by the decoding processing unit 320. For example, the renderer 330 may output the video content included in the specific content to a display and the audio content included in the specific content to a speaker.

Here, the renderer 330 may generate specific content whose viewpoint position and gaze direction have been corrected, based on the difference between the viewpoint information received from the media processing device 200 and the viewpoint information input from the detection unit 310.
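The description does not specify how this correction is performed. One possible approach, sketched below purely as an assumption, handles only the gaze direction: if the received frame covers a slightly larger range than the display, the terminal can shift the window it crops from that frame in proportion to the angular difference. The linear pixels-per-degree model and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Gaze:
    yaw_deg: float    # positive to the right (assumed)
    pitch_deg: float  # positive upward (assumed)


def crop_origin(
    rendered: Gaze, current: Gaze,
    frame_w: int, frame_h: int,
    view_w: int, view_h: int,
    hfov_deg: float, vfov_deg: float,
) -> Tuple[int, int]:
    """Return (left, top) of the view_w x view_h window inside the received frame."""
    px_per_deg_x = view_w / hfov_deg
    px_per_deg_y = view_h / vfov_deg
    # Wrap the yaw difference into [-180, 180) so 179 -> -179 counts as +2 degrees.
    dyaw = (current.yaw_deg - rendered.yaw_deg + 180.0) % 360.0 - 180.0
    dpitch = current.pitch_deg - rendered.pitch_deg
    left = (frame_w - view_w) / 2 + dyaw * px_per_deg_x
    top = (frame_h - view_h) / 2 - dpitch * px_per_deg_y  # looking up moves the window up
    # Clamp so the window never leaves the rendered margin.
    left = min(max(left, 0), frame_w - view_w)
    top = min(max(top, 0), frame_h - view_h)
    return round(left), round(top)
```

A difference that exceeds the rendered margin is clamped here; a real terminal would more likely wait for a frame rendered with newer viewpoint information.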

(Second Content)
The second content according to the embodiment is described below. Here, the second content at t=0, t=1, and t=2 is described. The time intervals between t=0, t=1, and t=2 are not particularly limited.

For example, as shown in FIG. 3, at t=0 the 360° video may be displayed without any 3D object being displayed. The 360° video may be regarded as the background video for the 3D object. At t=1, a 3D object may be displayed superimposed on the 360° video. Further, at t=2, the position and rotation of the 3D object superimposed on the 360° video may be changed.

The scene description described above includes, for each of t=0, t=1, and t=2, information elements indicating the position, rotation, and size of the 3D object, so that the 3D object can be superimposed appropriately on the 360° video.

(Viewing Method)
A viewing method according to the embodiment is described below. Here, viewing of specific content including the first content and the second content is illustrated.

As shown in FIG. 4, in step S11, the user terminal 300 transmits an RTSP SETUP to the media processing device 200. The RTSP SETUP is a message indicating that viewing of the specific content is to start.

Here, the RTSP SETUP includes the IP address of the user terminal 300, a listening port number, identification information of the content (content ID), and the like. The RTSP SETUP may include capability information of the user terminal 300 for viewing the specific content. The capability information may include a frame rate, a display resolution, and the like. The display resolution may include a field of view (FoV). The capability information may include information elements indicating the encoding methods and compression methods supported by the user terminal 300.

Although a case in which the capability information of the user terminal 300 is notified directly to the media processing device 200 is illustrated here, the embodiment is not limited to this. The capability information of the user terminal 300 may be notified to the transmitting device 100 and then notified from the transmitting device 100 to the media processing device 200.

In step S12, the media processing device 200 transmits a response to the RTSP SETUP. Here, an ACK indicating that the RTSP SETUP has been accepted is transmitted as the response.

In step S21, the user terminal 300 transmits initial viewpoint information to the media processing device 200. The initial viewpoint information may be transmitted in the MMT-SI format.

In step S22, the media processing device 200 generates initial specific content based on the initial viewpoint information (rendering process). For example, the media processing device 200 generates the second content to be included in the initial specific content based on the initial viewpoint information and the scene description.

ここで、メディア処理装置200は、ユーザ端末300の表示解像度よりも広い範囲をビューポートとして初期特定コンテンツを生成してもよい。例えば、表示解像度よりも広い範囲は、水平方向において表示解像度+20%、垂直方向において表示解像度+20%の範囲であってもよい。 Here, the media processing device 200 may generate the initial specific content using a range wider than the display resolution of the user terminal 300 as the viewport. For example, the range wider than the display resolution may be a range of the display resolution + 20% in the horizontal direction and the display resolution + 20% in the vertical direction.
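As a worked example of the +20% margin, the sketch below computes the size of the rendered viewport from the display resolution. It assumes that "display resolution + 20%" means the total width and height are each enlarged by 20%; the function name `render_viewport_size` is hypothetical.

```python
from typing import Tuple


def render_viewport_size(display_width: int, display_height: int,
                         margin: float = 0.20) -> Tuple[int, int]:
    """Return the viewport size in pixels, enlarged by `margin` in each direction."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    return (round(display_width * (1 + margin)),
            round(display_height * (1 + margin)))


print(render_viewport_size(1920, 1080))  # (2304, 1296)
```

Rendering a wider area than the display gives the user terminal some room to correct the view locally when the viewpoint changes before the next frame arrives.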

メディア処理装置200は、初期特定コンテンツに圧縮符号化方式を適用する。特に限定されるものではないが、圧縮符号化方式は、HEVCであってもよく、VVCであってもよい。 The media processing device 200 applies a compression encoding method to the initial specific content. Although not particularly limited, the compression encoding method may be HEVC or VVC.

ステップS23において、メディア処理装置200は、初期視点情報に対応する初期特定コンテンツをユーザ端末300に送信する。メディア処理装置200は、初期特定コンテンツの提示時刻をユーザ端末300に送信する。上述したように、ユーザ端末300に提供される提示時刻（T’）は、遅延時間（ΔT）に基づいて定められてもよい。 In step S23, the media processing device 200 transmits the initial specific content corresponding to the initial viewpoint information to the user terminal 300. The media processing device 200 transmits the presentation time of the initial specific content to the user terminal 300. As described above, the presentation time (T') provided to the user terminal 300 may be determined based on the delay time (ΔT).

なお、遅延時間（ΔT）としてユーザ端末300毎に異なる値を用いる場合には、上述したRTSP SETUPにRTSP SETUPの送信時刻を含めることによって、メディア処理装置200側で特定することが可能である。 When a different delay time (ΔT) is used for each user terminal 300, the media processing device 200 can determine the delay time if the above-mentioned RTSP SETUP carries its own transmission time.
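One way the media processing device 200 could derive a per-terminal ΔT from the transmission time carried in the RTSP SETUP is sketched below. This is an assumption for illustration: it treats ΔT as the one-way delay (reception time minus transmission time) and presumes that both clocks are synchronized. The function name and the dictionary keyed by a terminal identifier are hypothetical.

```python
from typing import Dict


def estimate_delay(setup_sent_at: float, setup_received_at: float) -> float:
    """Estimate the delay in seconds from the SETUP transmission and reception times."""
    delay = setup_received_at - setup_sent_at
    if delay < 0:
        # A negative value suggests the two clocks are not synchronized.
        raise ValueError("reception time precedes transmission time")
    return delay


# Keep one estimate per user terminal (keyed by a hypothetical terminal identifier).
delay_by_terminal: Dict[str, float] = {}
delay_by_terminal["terminal-A"] = estimate_delay(100.000, 100.050)
delay_by_terminal["terminal-B"] = estimate_delay(100.000, 100.120)
```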

ユーザ端末300は、提示時刻（T’）に基づいて特定コンテンツを出力する。ユーザ端末300は、メディア処理装置200から受信する視点情報と検出部310から入力される視点情報との差異に基づいて、視点位置及び視線方向が修正された特定コンテンツを生成してもよい。 The user terminal 300 outputs the specific content based on the presentation time (T'). The user terminal 300 may generate specific content whose viewpoint position and line-of-sight direction are corrected based on the difference between the viewpoint information received from the media processing device 200 and the viewpoint information input from the detection unit 310.

ステップS31において、ユーザ端末300は、視点情報をメディア処理装置200に送信する。視点情報は、MMT-SIの形式で送信されてもよい。ここで、ユーザ端末300は、所定周期（例えば、500ms）で視点情報を送信してもよく、視点位置及び視線方向の少なくともいずれか1つの変更に応じて視点情報を送信してもよい。 In step S31, the user terminal 300 transmits viewpoint information to the media processing device 200. The viewpoint information may be transmitted in the format of MMT-SI. Here, the user terminal 300 may transmit the viewpoint information at a predetermined period (e.g., 500 ms), or may transmit the viewpoint information in response to a change in at least one of the viewpoint position and the line of sight direction.
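The two transmission triggers in step S31 can be sketched as follows. The text presents the periodic trigger and the on-change trigger as alternatives; this sketch combines both in one class purely for illustration. The class `ViewpointSender`, the `Pose` tuple layout, and the method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (x, y, z, yaw, pitch, roll): viewpoint position and line-of-sight direction
Pose = Tuple[float, float, float, float, float, float]


@dataclass
class ViewpointSender:
    period_s: float = 0.5                  # predetermined period (e.g., 500 ms)
    last_sent_at: Optional[float] = None
    last_pose: Optional[Pose] = None

    def should_send(self, now: float, pose: Pose) -> bool:
        """Return True if viewpoint information should be transmitted now."""
        if self.last_sent_at is None:
            return True                    # nothing has been sent yet
        if pose != self.last_pose:
            return True                    # position or direction changed
        return now - self.last_sent_at >= self.period_s

    def mark_sent(self, now: float, pose: Pose) -> None:
        """Record the time and pose of the latest transmission."""
        self.last_sent_at = now
        self.last_pose = pose
```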

ステップS32において、メディア処理装置200は、ステップS31で受信する視点情報に基づいて特定コンテンツを生成する（レンダリング処理）。 In step S32, the media processing device 200 generates specific content based on the viewpoint information received in step S31 (rendering process).

ステップS33において、メディア処理装置200は、ステップS31で受信する視点情報に対応する特定コンテンツをユーザ端末300に送信する。 In step S33, the media processing device 200 transmits to the user terminal 300 the specific content that corresponds to the viewpoint information received in step S31.

ステップS31～ステップS33の処理は、初期視点情報に代えてステップS31で受信する視点情報を用いる点を除いて、ステップS21～ステップS23の処理と同様である。従って、ステップS31～ステップS33の処理の詳細については省略する。ステップS31～ステップS33の処理は、所定周期で繰り返されてもよく、ユーザの視点位置又は視線方向の変更毎に繰り返されてもよい。 The processing of steps S31 to S33 is similar to the processing of steps S21 to S23, except that the viewpoint information received in step S31 is used instead of the initial viewpoint information. Therefore, details of the processing of steps S31 to S33 are omitted. The processing of steps S31 to S33 may be repeated at a predetermined interval, or may be repeated every time the user's viewpoint position or line of sight changes.

ステップS41において、ユーザ端末300は、RTSP TEARDOWNをメディア処理装置200に送信する。RTSP TEARDOWNは、特定コンテンツの視聴を終了する旨のメッセージである。 In step S41, the user terminal 300 transmits an RTSP TEARDOWN to the media processing device 200. The RTSP TEARDOWN is a message indicating that viewing of the specific content is to end.

ステップS42において、メディア処理装置200は、RTSP TEARDOWNに対する応答を送信する。ここでは、RTSP TEARDOWNを受け付けた旨を示すACKが応答として送信される。 In step S42, the media processing device 200 transmits a response to the RTSP TEARDOWN. Here, an ACK is transmitted as the response, indicating that the RTSP TEARDOWN has been accepted.

図４では、ステップS11及びステップS12がRTSPベースで実行されるケースについて例示したが、実施形態はこれに限定されるものではない。ステップS11及びステップS12は、MMTPベースで実行されてもよく、HTTPベースで実行されてもよい。 In FIG. 4, a case where steps S11 and S12 are performed on an RTSP basis is illustrated, but the embodiment is not limited to this. Steps S11 and S12 may be performed on an MMTP basis or on an HTTP basis.

同様に、ステップS41及びステップS42がRTSPベースで実行されるケースについて例示したが、実施形態はこれに限定されるものではない。ステップS41及びステップS42は、MMTPベースで実行されてもよく、HTTPベースで実行されてもよい。 Similarly, although the example shows a case where steps S41 and S42 are performed on an RTSP basis, the embodiment is not limited to this. Steps S41 and S42 may be performed on an MMTP basis or on an HTTP basis.

図４では、ステップS31～ステップS33がMMTPベースで実行されるケースについて例示したが、実施形態はこれに限定されるものではない。ステップS31～ステップS33は、他の方式（例えば、HTTP）ベースで実行されてもよい。 In FIG. 4, a case where steps S31 to S33 are executed based on MMTP is illustrated, but the embodiment is not limited to this. Steps S31 to S33 may be executed based on another method (e.g., HTTP).

同様に、ステップS21～ステップS23がMMTPベースで実行されるケースについて例示したが、実施形態はこれに限定されるものではない。ステップS21～ステップS23は、他の方式（例えば、HTTP）ベースで実行されてもよい。 Similarly, although a case where steps S21 to S23 are executed on an MMTP basis is illustrated, the embodiment is not limited to this. Steps S21 to S23 may be executed on the basis of another method (e.g., HTTP).

（動作例1） (Operation Example 1)
上述した実施形態は、以下に示す動作例1を含んでもよい。動作例1では、メディア処理装置200は、特定コンテンツに付加されるシーケンス番号と対応付けて、特定コンテンツの生成で用いた視点情報をユーザ端末300に送信する。 The above-described embodiment may include the following Operation Example 1. In Operation Example 1, the media processing device 200 transmits to the user terminal 300 the viewpoint information used in generating the specific content, in association with a sequence number added to the specific content.

具体的には、メディア処理装置200（レンダラ220）は、上述した実施形態と同様に、第2制御情報（シーン記述）に基づいて、特定コンテンツの一部として、360°映像及び3Dオブジェクトを含む第2コンテンツを生成する。第2コンテンツの生成において、ユーザ端末300から受信する視点情報が用いられる。 Specifically, the media processing device 200 (renderer 220) generates second content including a 360° video and a 3D object as part of a specific content based on the second control information (scene description), as in the above-described embodiment. In generating the second content, viewpoint information received from the user terminal 300 is used.

動作例1では、レンダラ220は、特定コンテンツ（ここでは、第2コンテンツ）の生成で用いた視点情報をシーケンス番号と対応付ける。レンダラ220は、特定コンテンツに付加されるシーケンス番号と対応付けて、特定コンテンツの生成で用いた視点情報をユーザ端末300に送信する。視点情報は、VP（View Port）メッセージに格納されてもよい。VPメッセージは、ISO/IEC 23008-1、ARIB STD-B60、ARIB TR-B39などで規定されたMMT-SIの形式を有してもよい。VPメッセージは、フレーム毎に送信されてもよい。VPメッセージは、MMTPに関するメッセージ（MMT-SI）としてユーザ端末300に送信されてもよい。 In operation example 1, the renderer 220 associates the viewpoint information used in generating the specific content (here, the second content) with a sequence number. The renderer 220 transmits the viewpoint information used in generating the specific content to the user terminal 300 in association with the sequence number added to the specific content. The viewpoint information may be stored in a VP (View Port) message. The VP message may have the format of MMT-SI defined in ISO/IEC 23008-1, ARIB STD-B60, ARIB TR-B39, etc. The VP message may be transmitted for each frame. The VP message may be transmitted to the user terminal 300 as a message related to MMTP (MMT-SI).

特に限定されるものではないが、VPメッセージは、図５に示すデータ構造を有してもよい。図５に示すように、VPメッセージは、message_id、version、length、fov、viewpoint_pos_x、viewpoint_pos_y、viewpoint_pos_z、viewpoint_yaw、viewpoint_pitch、viewpoint_roll、viewport_width、viewport_height、mpu_sequence_number_flag、mpu_sequence_numberなどを含んでもよい。 Although not limited to this, the VP message may have the data structure shown in FIG. 5. As shown in FIG. 5, the VP message may include message_id, version, length, fov, viewpoint_pos_x, viewpoint_pos_y, viewpoint_pos_z, viewpoint_yaw, viewpoint_pitch, viewpoint_roll, viewport_width, viewport_height, mpu_sequence_number_flag, mpu_sequence_number, etc.

message_idは、VPメッセージを示す識別情報である。message_idは、0x0204であってもよい。 The message_id is identification information that indicates the VP message. The message_id may be 0x0204.

versionは、MMTPプロトコルのバージョンを示す情報である。versionは、0x00であってもよい。 The version is information that indicates the version of the MMTP protocol. The version may be 0x00.

lengthは、VPメッセージの長さを示す情報である。 length is information indicating the length of the VP message.

fovは、視野角（field of view）を示す情報である。 fov is information that indicates the field of view.

viewpoint_pos_xは、視点位置のx座標を示す情報である。viewpoint_pos_xは、特定コンテンツの生成で用いた視点情報の一例である。 viewpoint_pos_x is information that indicates the x coordinate of the viewpoint position. viewpoint_pos_x is an example of viewpoint information used in generating specific content.

viewpoint_pos_yは、視点位置のy座標を示す情報である。viewpoint_pos_yは、特定コンテンツの生成で用いた視点情報の一例である。 viewpoint_pos_y is information that indicates the y coordinate of the viewpoint position. viewpoint_pos_y is an example of viewpoint information used in generating specific content.

viewpoint_pos_zは、視点位置のz座標を示す情報である。viewpoint_pos_zは、特定コンテンツの生成で用いた視点情報の一例である。 viewpoint_pos_z is information that indicates the z coordinate of the viewpoint position. viewpoint_pos_z is an example of viewpoint information used in generating specific content.

viewpoint_yawは、視点位置のヨーを示す情報である。viewpoint_yawは、特定コンテンツの生成で用いた視点情報の一例である。 viewpoint_yaw is information that indicates the yaw of the viewpoint position. viewpoint_yaw is an example of viewpoint information used in generating specific content.

viewpoint_pitchは、視点位置のピッチを示す情報である。viewpoint_pitchは、特定コンテンツの生成で用いた視点情報の一例である。 viewpoint_pitch is information that indicates the pitch of the viewpoint position. viewpoint_pitch is an example of viewpoint information used in generating specific content.

viewpoint_rollは、視点位置のロールを示す情報である。viewpoint_rollは、特定コンテンツの生成で用いた視点情報の一例である。 viewpoint_roll is information that indicates the roll of the viewpoint position. viewpoint_roll is an example of viewpoint information used in generating specific content.

viewport_widthは、表示領域（特定コンテンツ）の幅を示す情報である。 viewport_width is information that indicates the width of the display area (specific content).

viewport_heightは、表示領域（特定コンテンツ）の高さを示す情報である。 viewport_height is information that indicates the height of the display area (specific content).

mpu_sequence_number_flagは、mpu_sequence_numberのフィールドが存在するか否かを示す情報である。例えば、mpu_sequence_number_flagが1である場合に、mpu_sequence_numberのフィールドが存在し、mpu_sequence_number_flagが0である場合に、mpu_sequence_numberのフィールドが存在しなくてもよい。 The mpu_sequence_number_flag is information indicating whether or not the mpu_sequence_number field exists. For example, if the mpu_sequence_number_flag is 1, the mpu_sequence_number field exists, and if the mpu_sequence_number_flag is 0, the mpu_sequence_number field does not have to exist.

mpu_sequence_numberは、VPメッセージが示す特定コンテンツに対応する映像のMPUシーケンス番号である。mpu_sequence_numberは、特定コンテンツに付加されるシーケンス番号の一例である。 The mpu_sequence_number is the MPU sequence number of the video corresponding to the specific content indicated by the VP message. The mpu_sequence_number is an example of a sequence number added to specific content.

ここで、viewpoint_pos_x、viewpoint_pos_y、viewpoint_pos_zは、シーン記述によって構成される3次元空間におけるユーザの視点位置を示す情報要素の一例である。viewpoint_yaw、viewpoint_pitch、viewpoint_rollは、シーン記述によって構成される3次元空間におけるユーザの視線方向を示す情報要素の一例である。viewport_width、viewport_heightは、特定コンテンツに含まれる映像の画素数を示す情報要素の一例である。 Here, viewpoint_pos_x, viewpoint_pos_y, and viewpoint_pos_z are examples of information elements that indicate the user's viewpoint position in the three-dimensional space configured by the scene description. viewpoint_yaw, viewpoint_pitch, and viewpoint_roll are examples of information elements that indicate the user's line of sight in the three-dimensional space configured by the scene description. viewport_width and viewport_height are examples of information elements that indicate the number of pixels of an image included in specific content.
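The field list above can be illustrated with a small serializer. FIG. 5 defines the actual bit widths, which are not reproduced in this text, so the widths used here are assumptions: a 16-bit message_id, an 8-bit version, a 16-bit length counting the bytes that follow it, 32-bit floats for fov, position and angles, 16-bit viewport sizes, an 8-bit flag, and a 32-bit mpu_sequence_number. The class `VPMessage` is a hypothetical sketch, not the normative MMT-SI syntax.

```python
import struct
from dataclasses import dataclass
from typing import Optional

HEADER = struct.Struct(">HBH")   # message_id, version, length (assumed widths)
BODY = struct.Struct(">7f2HB")   # fov, pos x/y/z, yaw, pitch, roll, width, height, flag
SEQ = struct.Struct(">I")        # mpu_sequence_number (present only if flag == 1)


@dataclass
class VPMessage:
    fov: float
    viewpoint_pos_x: float
    viewpoint_pos_y: float
    viewpoint_pos_z: float
    viewpoint_yaw: float
    viewpoint_pitch: float
    viewpoint_roll: float
    viewport_width: int
    viewport_height: int
    mpu_sequence_number: Optional[int] = None
    message_id: int = 0x0204
    version: int = 0x00

    def pack(self) -> bytes:
        """Serialize the message; the sequence number is written only when set."""
        flag = 0 if self.mpu_sequence_number is None else 1
        body = BODY.pack(
            self.fov, self.viewpoint_pos_x, self.viewpoint_pos_y,
            self.viewpoint_pos_z, self.viewpoint_yaw, self.viewpoint_pitch,
            self.viewpoint_roll, self.viewport_width, self.viewport_height, flag)
        if flag:
            body += SEQ.pack(self.mpu_sequence_number)
        return HEADER.pack(self.message_id, self.version, len(body)) + body

    @classmethod
    def unpack(cls, data: bytes) -> "VPMessage":
        """Parse bytes produced by pack(), honoring mpu_sequence_number_flag."""
        message_id, version, _length = HEADER.unpack_from(data, 0)
        *values, flag = BODY.unpack_from(data, HEADER.size)
        seq = None
        if flag:
            (seq,) = SEQ.unpack_from(data, HEADER.size + BODY.size)
        return cls(*values, mpu_sequence_number=seq,
                   message_id=message_id, version=version)
```

For example, `VPMessage.unpack(msg.pack())` returns a message equal to `msg` when the floating-point values are exactly representable as 32-bit floats.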

このような前提下において、メディア処理装置200及びユーザ端末300は、以下に示す動作を実行してもよい。 Under these conditions, the media processing device 200 and the user terminal 300 may perform the following operations.

第1に、メディア処理装置200（レンダラ220）は、特定コンテンツの生成に用いた視点位置に基づいて、viewpoint_pos_x、viewpoint_pos_y、viewpoint_pos_zを特定してもよい。レンダラ220は、特定コンテンツの生成に用いた視線方向に基づいて、viewpoint_yaw、viewpoint_pitch、viewpoint_rollを特定してもよい。レンダラ220は、特定コンテンツの横方向の画素数に基づいてviewport_widthを特定し、特定コンテンツの縦方向の画素数に基づいてviewport_heightを特定してもよい。 First, the media processing device 200 (renderer 220) may determine viewpoint_pos_x, viewpoint_pos_y, and viewpoint_pos_z based on the viewpoint position used to generate the specific content. The renderer 220 may determine viewpoint_yaw, viewpoint_pitch, and viewpoint_roll based on the line of sight direction used to generate the specific content. The renderer 220 may determine viewport_width based on the number of pixels in the horizontal direction of the specific content, and may determine viewport_height based on the number of pixels in the vertical direction of the specific content.

メディア処理装置200（符号化処理部230）は、レンダラ220によって生成された特定コンテンツの圧縮符号化を実行し、特定コンテンツを送信してもよい。ここで、符号化処理部230は、図５に示すVPメッセージをユーザ端末300に送信してもよい。すなわち、符号化処理部230は、特定コンテンツに付加されるシーケンス番号と対応付けて、特定コンテンツの生成で用いた視点情報をユーザ端末300に送信してもよい。 The media processing device 200 (encoding processing unit 230) may perform compression encoding of the specific content generated by the renderer 220, and transmit the specific content. Here, the encoding processing unit 230 may transmit the VP message shown in FIG. 5 to the user terminal 300. In other words, the encoding processing unit 230 may transmit the viewpoint information used in generating the specific content to the user terminal 300, in association with the sequence number added to the specific content.

第2に、ユーザ端末300（復号処理部320）は、特定コンテンツに付加されるシーケンス番号と対応付けて、特定コンテンツの生成で用いた視点情報をメディア処理装置200から受信する受信部を構成してもよい。すなわち、復号処理部320は、図５に示すVPメッセージ（MMT-SI）をメディア処理装置200から受信してもよい。 Secondly, the user terminal 300 (decoding processing unit 320) may constitute a receiving unit that receives viewpoint information used in generating a specific content from the media processing device 200 in association with a sequence number added to the specific content. That is, the decoding processing unit 320 may receive the VP message (MMT-SI) shown in FIG. 5 from the media processing device 200.

ユーザ端末300（レンダラ330）は、シーケンス番号（mpu_sequence_number）に基づいて、復号された特定コンテンツを構成する映像とVPメッセージに含まれる視点情報とを対応付けてもよい。レンダラ330は、検出部310によって検出される視点情報とVPメッセージに含まれる視点情報との差異に基づいて、VPメッセージに含まれる情報（viewport_width及びviewport_height）によって定義される表示領域から、検出部310によって検出される視点情報によって特定される範囲の映像を特定してもよい。レンダラ330は、特定された映像を表示してもよい。 The user terminal 300 (renderer 330) may associate the video constituting the decoded specific content with the viewpoint information included in the VP message based on the sequence number (mpu_sequence_number). The renderer 330 may identify a video in a range identified by the viewpoint information detected by the detection unit 310 from the display area defined by the information included in the VP message (viewport_width and viewport_height) based on the difference between the viewpoint information detected by the detection unit 310 and the viewpoint information included in the VP message. The renderer 330 may display the identified video.
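The correction performed by the renderer 330 can be sketched as a crop of the received frame. This is a simplified illustration under several assumptions that the text does not state: only yaw and pitch differences are handled, the angle-to-pixel mapping is linear (`px_per_degree`), a positive yaw difference moves the crop window to the right, and a positive pitch difference moves it up. The function `crop_origin` and the dictionary `vp_by_sequence` are hypothetical.

```python
from typing import Dict, Tuple


def crop_origin(rendered_yaw: float, rendered_pitch: float,
                detected_yaw: float, detected_pitch: float,
                viewport_width: int, viewport_height: int,
                display_width: int, display_height: int,
                px_per_degree: float) -> Tuple[int, int]:
    """Return the top-left pixel of the display-sized window inside the received frame."""
    # Start from a centered window, then shift it by the viewpoint difference.
    left = (viewport_width - display_width) / 2
    top = (viewport_height - display_height) / 2
    left += (detected_yaw - rendered_yaw) * px_per_degree
    top -= (detected_pitch - rendered_pitch) * px_per_degree
    # Keep the window inside the area defined by viewport_width / viewport_height.
    left = min(max(left, 0), viewport_width - display_width)
    top = min(max(top, 0), viewport_height - display_height)
    return round(left), round(top)


# VP messages keyed by mpu_sequence_number, so a decoded frame can be matched
# with the viewpoint information that was used to render it.
vp_by_sequence: Dict[int, dict] = {
    42: {"viewpoint_yaw": 0.0, "viewpoint_pitch": 0.0,
         "viewport_width": 2304, "viewport_height": 1296},
}
vp = vp_by_sequence[42]
origin = crop_origin(vp["viewpoint_yaw"], vp["viewpoint_pitch"], 5.0, 0.0,
                     vp["viewport_width"], vp["viewport_height"],
                     1920, 1080, px_per_degree=10.0)
print(origin)  # (242, 108)
```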

（動作例2） (Operation Example 2)
上述した実施形態は、以下に示す動作例2を含んでもよい。動作例2では、図６に示すように、視点情報をフィードバックする第1ユーザ端末400及び視点情報をフィードバックしない第2ユーザ端末500が混在するケースが想定される。第1ユーザ端末400は、ヘッドマウントディスプレイなどの端末であってもよい。第1ユーザ端末は、上述したユーザ端末300と同様の機能を有していてもよい。第2ユーザ端末500は、ボリュメトリックディスプレイなどの端末であってもよい。 The above-described embodiment may include the following Operation Example 2. Operation Example 2 assumes, as shown in FIG. 6, a case in which a first user terminal 400 that feeds back viewpoint information and a second user terminal 500 that does not feed back viewpoint information coexist. The first user terminal 400 may be a terminal such as a head-mounted display. The first user terminal 400 may have the same functions as the above-described user terminal 300. The second user terminal 500 may be a terminal such as a volumetric display.

具体的には、動作例2では、図６に示すように、メディア処理装置200（レンダラ220）は、コンテンツの構成及び推奨ビューポート情報を送信装置100から受信する受信部を構成してもよい。コンテンツの構成は、2D映像、音声、360°映像、3Dオブジェクトを含むと考えてもよい。コンテンツの構成は、MMT-SI及びシーン記述を含むと考えてもよい。推奨ビューポート情報は、特定視点情報の一例であると考えてもよい。推奨ビューポート情報は、特定コンテンツによって構成される3次元空間（シーン記述で構築される3次元空間）において、どの位置、方向及び画角で映像を見るのかを定義（推奨）する情報であってもよい。推奨ビューポート情報は、特定コンテンツによって構成される3次元空間における視点位置を示す情報要素及び特定コンテンツによって構成される3次元空間における視線方向を示す情報要素の少なくともいずれか1つを示す情報要素を含んでもよい。推奨ビューポート情報は、主として、第2ユーザ端末500で用いられる視点情報であると考えてもよい。 Specifically, in Operation Example 2, as shown in FIG. 6, the media processing device 200 (renderer 220) may constitute a receiving unit that receives the content configuration and the recommended viewport information from the transmitting device 100. The content configuration may be considered to include 2D video, audio, 360° video, and 3D objects. The content configuration may be considered to include MMT-SI and a scene description. The recommended viewport information may be considered to be an example of specific viewpoint information. The recommended viewport information may be information that defines (recommends) the position, direction, and angle of view from which the video is viewed in the three-dimensional space constituted by the specific content (the three-dimensional space constructed by the scene description). The recommended viewport information may include at least one of an information element indicating the viewpoint position in the three-dimensional space constituted by the specific content and an information element indicating the line-of-sight direction in that three-dimensional space. The recommended viewport information may be considered to be viewpoint information used primarily by the second user terminal 500.

以下において、明示的に記載しない限りにおいて、特定コンテンツの生成で用いる視点情報は、送信装置100から受信する特定視点情報（推奨ビューポート情報）を含んでもよく、ユーザ端末300から受信する視点情報を含んでもよい。 In the following, unless explicitly stated otherwise, the viewpoint information used in generating specific content may include specific viewpoint information (recommended viewport information) received from the transmitting device 100, or may include viewpoint information received from the user terminal 300.

特に限定されるものではないが、推奨ビューポート情報は、図７に示す態様でシーン記述に含まれてもよい。図７に示すように、推奨ビューポート情報は、camera_orientation、frame_number、translation、yfovを含んでもよい。 Although not limited thereto, the recommended viewport information may be included in the scene description in the manner shown in FIG. 7. As shown in FIG. 7, the recommended viewport information may include camera_orientation, frame_number, translation, and yfov.

camera_orientationは、シーン記述で構築される3次元空間において映像を見る方向を示す情報である。camera_orientationは、視線方向と同義であると考えてもよい。 camera_orientation is information that indicates the direction in which the video is viewed in the three-dimensional space constructed by the scene description. camera_orientation may be considered synonymous with the line-of-sight direction.

frame_numberは、camera_orientation、translation及びyfovが適用される映像のフレーム番号を示す情報である。camera_orientationは、特定視点情報の一例であると考えてもよい。 frame_number is information that indicates the frame number of the video to which camera_orientation, translation, and yfov are applied. camera_orientation may be considered an example of specific viewpoint information.

translationは、シーン記述で構築される3次元空間において映像を見る位置を示す情報である。translationは、視点位置と同義であると考えてもよい。translationは、特定視点情報の一例であると考えてもよい。例えば、図７では、フレーム番号が0である場合に、視点位置が[0,0,-50]であり、フレーム番号が2505である場合に、視点位置が[0,0,-75]に移動するケースが例示されている。 translation is information that indicates the position from which the video is viewed in the three-dimensional space constructed by the scene description. translation may be considered synonymous with the viewpoint position. translation may be considered an example of specific viewpoint information. For example, FIG. 7 illustrates a case in which the viewpoint position is [0,0,-50] when the frame number is 0 and moves to [0,0,-75] when the frame number is 2505.

yfovは、シーン記述で構築される3次元空間において映像を見る画角を示す情報である。 yfov is information that indicates the angle of view from which the image is viewed in the three-dimensional space constructed by the scene description.

特に限定されるものではないが、camera_orientation及びtranslationは、コンテンツの制作者が付与してもよい。或いは、コンテンツがカメラによって撮像されるケースを想定した場合に、カメラに設けられるGPS及びセンサによってcamera_orientation及びtranslationが自動的に付与されてもよい。 Although not limited thereto, camera_orientation and translation may be assigned by the creator of the content. Alternatively, in the case where the content is captured by a camera, camera_orientation and translation may be assigned automatically by the GPS and sensors installed in the camera.

ここで、translationは、特定コンテンツによって構成される3次元空間における視点位置を示す情報要素の一例である。camera_orientationは、特定コンテンツによって構成される3次元空間における視線方向を示す情報要素の一例である。 Here, translation is an example of an information element that indicates the viewpoint position in the three-dimensional space formed by a specific content. camera_orientation is an example of an information element that indicates the line of sight direction in the three-dimensional space formed by a specific content.
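The frame-indexed entries of the recommended viewport information can be read as keyframes. The sketch below looks up the entry that applies to a given frame, assuming that an entry stays in effect until the next frame_number is reached (the text does not say whether values are held or interpolated). Only the translation values quoted from FIG. 7 are used; camera_orientation and yfov are omitted because their values are not given in the text. The list layout and the function name are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List

# Entries sorted by frame_number; translation values are those quoted from FIG. 7.
KEYFRAMES: List[Dict] = [
    {"frame_number": 0, "translation": [0, 0, -50]},
    {"frame_number": 2505, "translation": [0, 0, -75]},
]


def recommended_viewport(keyframes: List[Dict], frame: int) -> Dict:
    """Return the latest entry whose frame_number is not greater than `frame`."""
    numbers = [k["frame_number"] for k in keyframes]
    index = bisect_right(numbers, frame) - 1
    if index < 0:
        raise ValueError("no recommended viewport defined for this frame yet")
    return keyframes[index]


print(recommended_viewport(KEYFRAMES, 2504)["translation"])  # [0, 0, -50]
print(recommended_viewport(KEYFRAMES, 2505)["translation"])  # [0, 0, -75]
```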

第1に、メディア処理装置200（レンダラ220）は、特定視点情報（推奨ビューポート情報）に基づいて特定コンテンツを生成してもよい。メディア処理装置200（符号化処理部230）は、レンダラ220によって生成された特定コンテンツの圧縮符号化を実行し、特定コンテンツを送信してもよい。ここで、符号化処理部230は、特定視点情報に基づいて生成された特定コンテンツを第1ユーザ端末400に送信してもよく、特定視点情報に基づいて生成された特定コンテンツを第2ユーザ端末500に送信してもよい。 First, the media processing device 200 (renderer 220) may generate specific content based on specific viewpoint information (recommended viewport information). The media processing device 200 (encoding processing unit 230) may perform compression encoding of the specific content generated by the renderer 220 and transmit the specific content. Here, the encoding processing unit 230 may transmit the specific content generated based on the specific viewpoint information to the first user terminal 400, or may transmit the specific content generated based on the specific viewpoint information to the second user terminal 500.

第2に、メディア処理装置200（受付部210）は、ユーザ端末が視点情報をフィードバックする第1ユーザ端末400である場合に、第1ユーザ端末400から視点情報を受信してもよい。メディア処理装置200（レンダラ220）は、第1ユーザ端末400から受信する視点情報に基づいて特定コンテンツを生成してもよい。このようなケースにおいて、メディア処理装置200（受付部210）は、リセット信号を第1ユーザ端末400から受信してもよい。メディア処理装置200（レンダラ220）は、リセット信号に応じて、特定視点情報（推奨ビューポート情報）に基づいて特定コンテンツを生成してもよい。 Second, the media processing device 200 (reception unit 210) may receive viewpoint information from the first user terminal 400 when the user terminal is the first user terminal 400 that feeds back viewpoint information. The media processing device 200 (renderer 220) may generate specific content based on the viewpoint information received from the first user terminal 400. In such a case, the media processing device 200 (reception unit 210) may receive a reset signal from the first user terminal 400. In response to the reset signal, the media processing device 200 (renderer 220) may generate specific content based on the specific viewpoint information (recommended viewport information).

このようなケースにおいて、第1ユーザ端末400は、ユーザ端末300と同様の構成を有していてもよい。但し、第1ユーザ端末400（検出部310）は、リセット信号を検出する機能を有してもよい。検出部310は、リセット信号を入力するユーザ操作を検出してもよい。検出部310は、リセット信号をメディア処理装置200に送信してもよい。 In such a case, the first user terminal 400 may have a configuration similar to that of the user terminal 300. However, the first user terminal 400 (detection unit 310) may have a function for detecting a reset signal. The detection unit 310 may detect a user operation to input a reset signal. The detection unit 310 may transmit the reset signal to the media processing device 200.

第3に、メディア処理装置200（レンダラ220）は、ユーザ端末が視点情報をフィードバックしない第2ユーザ端末500である場合に、特定視点情報（推奨ビューポート情報）に基づいて特定コンテンツを生成してもよい。 Thirdly, the media processing device 200 (renderer 220) may generate specific content based on specific viewpoint information (recommended viewport information) when the user terminal is a second user terminal 500 that does not feed back viewpoint information.

このようなケースにおいて、第2ユーザ端末500は、視点情報を検出する検出部310を有していなくてもよい。第2ユーザ端末500は、レンダラ330を有していなくてもよい。第2ユーザ端末500は、検出部310及びレンダラ330を有していない点を除いて、ユーザ端末300と同様の構成を有してもよい。 In such a case, the second user terminal 500 may not have a detection unit 310 that detects viewpoint information. The second user terminal 500 may not have a renderer 330. The second user terminal 500 may have the same configuration as the user terminal 300, except that it does not have the detection unit 310 and the renderer 330.
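The three cases above amount to choosing which viewpoint the renderer 220 uses. A minimal sketch follows; the function name and parameters are hypothetical, and falling back to the recommended viewport when no feedback has arrived yet is an assumption made for illustration.

```python
from typing import Optional, TypeVar

V = TypeVar("V")


def select_viewpoint(feeds_back: bool,
                     feedback: Optional[V],
                     recommended: V,
                     reset_requested: bool = False) -> V:
    """Choose the viewpoint information used to generate the specific content.

    feeds_back      -- True for a first user terminal 400, False for a second user terminal 500
    feedback        -- viewpoint information received from the terminal, if any
    recommended     -- specific viewpoint information (recommended viewport information)
    reset_requested -- True when a reset signal was received from the terminal
    """
    if not feeds_back:
        return recommended   # second user terminal: no feedback available
    if reset_requested or feedback is None:
        return recommended   # reset signal, or nothing fed back yet (assumption)
    return feedback          # first user terminal: follow the user's viewpoint
```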

（動作例3） (Operation Example 3)
上述した実施形態は、以下に示す動作例3を含んでもよい。動作例3では、メディア処理装置200（例えば、後述する選択部260）は、特定コンテンツに含まれる3Dオブジェクトに関するストリームの品質情報として、視点情報に基づいた3Dオブジェクトの向きによって品質が異なる2以上のストリームの各々に関する品質情報を送信装置100から受信する受信部を構成してもよい。 The above-described embodiment may include the following Operation Example 3. In Operation Example 3, the media processing device 200 (e.g., the selection unit 260 described later) may constitute a receiving unit that receives, from the transmitting device 100, quality information on each of two or more streams whose quality differs depending on the orientation of the 3D object based on the viewpoint information, as quality information on the streams related to a 3D object included in the specific content.

具体的には、動作例3では、図８に示すように、メディア処理装置200は、図２に示す構成に加えて、選択部260を有する。選択部260は、シーン記述及び3Dオブジェクトを送信装置100から受信する。選択部260は、2以上のストリームの中から選択されたストリーム（3Dオブジェクト）をレンダラ220に入力する。選択部260は、選択されたストリームの送信を送信装置100に要求してもよい。なお、メディア処理装置200が複数のユーザ端末300に特定コンテンツを送信するケースを想定した場合には、選択部260は、複数のユーザ端末300の各々で必要とされるストリームの送信を送信装置100に要求してもよく、全てのストリームの送信を送信装置100に要求してもよい。 Specifically, in operation example 3, as shown in FIG. 8, the media processing device 200 has a selection unit 260 in addition to the configuration shown in FIG. 2. The selection unit 260 receives a scene description and a 3D object from the transmitting device 100. The selection unit 260 inputs a stream (3D object) selected from two or more streams to the renderer 220. The selection unit 260 may request the transmitting device 100 to transmit the selected stream. Note that, in a case in which the media processing device 200 transmits specific content to multiple user terminals 300, the selection unit 260 may request the transmitting device 100 to transmit a stream required by each of the multiple user terminals 300, or may request the transmitting device 100 to transmit all streams.

ここで、選択部260は、3Dオブジェクトに関するストリームの品質情報として、視点情報に基づいた3Dオブジェクトの向きによって品質が異なる2以上のストリームの各々に関する品質情報を送信装置100から受信する。 Here, the selection unit 260 receives, from the transmission device 100, quality information regarding each of two or more streams whose quality differs depending on the orientation of the 3D object based on the viewpoint information, as quality information of the stream regarding the 3D object.

品質情報は、3Dオブジェクトに関するバウンディングボックスを構成する各面の相対品質を示す情報であってもよい。バウンディングボックスは、3Dオブジェクトを射影する3次元の矩形によって表されてもよい。例えば、図９に示すように、バウンディングボックスは、頂点A～頂点Hによって定義されてもよい。このようなケースにおいて、バウンディングボックスの各面は、頂点A,B,F,Eで表される面#1、頂点B,C,G,Fで表される面#2、頂点A,B,C,Dで表される面#3、頂点E,F,G,Hで表される面#4、頂点A,D,H,Eで表される面#5、頂点D,C,G,Hで表される面#6を含む。 The quality information may be information indicating the relative quality of each face of a bounding box for the 3D object. The bounding box may be represented by a three-dimensional box (rectangular cuboid) onto which the 3D object is projected. For example, as shown in FIG. 9, the bounding box may be defined by vertices A to H. In such a case, the faces of the bounding box include face #1 represented by vertices A, B, F, E, face #2 represented by vertices B, C, G, F, face #3 represented by vertices A, B, C, D, face #4 represented by vertices E, F, G, H, face #5 represented by vertices A, D, H, E, and face #6 represented by vertices D, C, G, H.
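The face definitions can be written down as a small table. The sketch below encodes the six faces exactly as listed above and shows that each vertex belongs to exactly three of them; the names `FACES` and `faces_sharing` are hypothetical.

```python
from typing import Dict, List, Tuple

# Face number -> vertices, as listed for the bounding box of FIG. 9.
FACES: Dict[int, Tuple[str, str, str, str]] = {
    1: ("A", "B", "F", "E"),
    2: ("B", "C", "G", "F"),
    3: ("A", "B", "C", "D"),
    4: ("E", "F", "G", "H"),
    5: ("A", "D", "H", "E"),
    6: ("D", "C", "G", "H"),
}


def faces_sharing(vertex: str) -> List[int]:
    """Return the numbers of the faces that contain the given vertex."""
    return sorted(face for face, vertices in FACES.items() if vertex in vertices)


print(faces_sharing("B"))  # [1, 2, 3]
print(faces_sharing("H"))  # [4, 5, 6]
```

Vertices B and H are diagonally opposite, so the faces meeting at B (#1 to #3) and the faces meeting at H (#4 to #6) together cover the whole box.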

このようなケースにおいて、シーン記述によって構築される3次元空間において3Dオブジェクトを見るケースを想定すると、3つの面について主として観察されると想定される。言い換えると、残りの3つの面についてはあまり観察されないと想定される。 In such a case, when the 3D object is viewed in the three-dimensional space constructed by the scene description, three of the faces are assumed to be mainly observed. In other words, the remaining three faces are assumed to be rarely observed.

動作例3では、特定コンテンツに含まれる3Dオブジェクトに関するストリームとして、視点情報に基づいた3Dオブジェクトの向きによって品質が異なる2以上のストリームが準備される。 In operation example 3, two or more streams are prepared as streams related to 3D objects included in specific content, with the quality differing depending on the orientation of the 3D object based on viewpoint information.

特に限定されるものではないが、品質情報は、図１０に示す態様でシーン記述に含まれてもよい。図１０では、3Dオブジェクトの向きが異なるストリームとして、6つのストリームが例示されている。品質情報は、”quality” [#1,#2,#3,#4,#5,#6]の形式で表されてもよい。なお、[ ]内において、#1～#6は、面#1～面#6の品質インデックスを意味している。品質インデックスは、1～9の範囲の値を取り得てもよい。品質インデックスの値が大きいほど、品質が高いことを意味してもよい。例えば、”id”=”1”で識別されるストリームでは、#1,#2,#3の品質（”8”）が高く、#4,#5,#6の品質（”3”）が低い。”id”=”2”で識別されるストリームでは、#1,#2,#3の品質（”3”）が低く、#4,#5,#6の品質（”8”）が高い。 Although not limited to this, the quality information may be included in the scene description in the manner shown in FIG. 10. In FIG. 10, six streams are illustrated as streams with different orientations of 3D objects. The quality information may be expressed in the form of "quality" [#1,#2,#3,#4,#5,#6]. In the [ ], #1 to #6 refer to the quality indexes of surfaces #1 to #6. The quality index may take values in the range of 1 to 9. The higher the quality index value, the higher the quality. For example, in a stream identified by "id"="1", the quality ("8") of #1, #2, and #3 is high, and the quality ("3") of #4, #5, and #6 is low. In a stream identified by "id"="2", the quality ("3") of #1, #2, and #3 is low, and the quality ("8") of #4, #5, and #6 is high.

このような前提下において、メディア処理装置200は、以下に示す動作を実行してもよい。以下においては、2以上のストリームの中から選択されたストリーム（3Dオブジェクト）の選択について主として説明する。 Under these assumptions, the media processing device 200 may perform the following operations. The following mainly describes how a stream (3D object) is selected from among the two or more streams.

モード1では、図１１の上段に示すように、メディア処理装置200（選択部260）は、ユーザの視点位置に最も近い頂点（例えば、頂点B）を特定した上で、最も近い頂点を有する3つの面（例えば、頂点A,B,F,Eで表される面#1、頂点B,C,G,Fで表される面#2、頂点A,B,C,Dで表される面#3）を特定してもよい。選択部260は、特定された3つの面の品質インデックスの総和が最大となるストリーム（図１０に示す例では、”id”=”１”で識別されるストリーム）を選択してもよい。 In mode 1, as shown in the upper part of FIG. 11, the media processing device 200 (selection unit 260) may identify the vertex closest to the user's viewpoint (for example, vertex B), and then identify the three faces that have the closest vertex (for example, face #1 represented by vertices A, B, F, and E, face #2 represented by vertices B, C, G, and F, and face #3 represented by vertices A, B, C, and D). The selection unit 260 may select the stream that maximizes the sum of the quality indexes of the three identified faces (in the example shown in FIG. 10, the stream identified by "id"="1").

なお、モード1では、視点情報に基づいてユーザの視点位置に最も近い頂点が特定されることから、選択部260は、視点情報及び品質情報に基づいて、2以上のストリームの中からユーザ端末300に送信すべきストリームを選択すると考えてもよい。 In addition, in mode 1, since the vertex closest to the user's viewpoint position is identified based on the viewpoint information, it can be considered that the selection unit 260 selects the stream to be transmitted to the user terminal 300 from among two or more streams based on the viewpoint information and quality information.
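Mode 1 can be sketched end to end as follows. The vertex coordinates are hypothetical (a unit cube chosen to be consistent with the face definitions of FIG. 9), and only the two streams whose quality indexes are described in the text ("id"="1" and "id"="2") are included. All function names are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]

# Hypothetical coordinates consistent with the face definitions of FIG. 9.
VERTICES: Dict[str, Point] = {
    "A": (0, 0, 0), "B": (1, 0, 0), "C": (1, 1, 0), "D": (0, 1, 0),
    "E": (0, 0, 1), "F": (1, 0, 1), "G": (1, 1, 1), "H": (0, 1, 1),
}
FACES: Dict[int, str] = {1: "ABFE", 2: "BCGF", 3: "ABCD", 4: "EFGH", 5: "ADHE", 6: "DCGH"}

# "quality" [#1,#2,#3,#4,#5,#6] for the two streams described in the text.
STREAMS: Dict[str, List[int]] = {"1": [8, 8, 8, 3, 3, 3], "2": [3, 3, 3, 8, 8, 8]}


def nearest_vertex(viewpoint: Point) -> str:
    """Vertex of the bounding box closest to the user's viewpoint position."""
    return min(VERTICES, key=lambda v: math.dist(VERTICES[v], viewpoint))


def faces_at(vertex: str) -> List[int]:
    """The three faces that share the given vertex."""
    return [face for face, vertices in FACES.items() if vertex in vertices]


def quality_sum(quality: Sequence[int], faces: Sequence[int]) -> int:
    """Sum of the quality indexes of the given faces (faces are numbered from 1)."""
    return sum(quality[face - 1] for face in faces)


def select_stream_mode1(viewpoint: Point, streams: Dict[str, List[int]] = STREAMS) -> str:
    """Pick the stream whose three faces nearest the viewpoint have the largest quality sum."""
    faces = faces_at(nearest_vertex(viewpoint))
    return max(streams, key=lambda sid: quality_sum(streams[sid], faces))


print(select_stream_mode1((3, -2, -1)))  # 1  (nearest vertex B, faces #1 to #3)
print(select_stream_mode1((-2, 3, 2)))   # 2  (nearest vertex H, faces #4 to #6)
```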

モード2では、3Dオブジェクトの縮小表示又は拡大表示が実行されるケースで適用されるモードであってもよい。例えば、図１１の中段に示すように、メディア処理装置200（選択部260）は、ユーザの視点位置に最も近い頂点（例えば、頂点B）を特定した上で、最も近い頂点を有する3つの面（例えば、頂点A,B,F,Eで表される面#1、頂点B,C,G,Fで表される面#2、頂点A,B,C,Dで表される面#3）を特定してもよい。3Dオブジェクトの縮小表示が実行されるケースでは、3Dオブジェクトの画素が間引かれる。従って、選択部260は、特定された3つの面の品質インデックスの総和が最小となるストリーム（図１０に示す例では、”id”=”2”で識別されるストリーム）を選択してもよい。一方で、3Dオブジェクトの拡大表示が実行されるケースでは、3Dオブジェクトの画素が補間される。従って、選択部260は、特定された3つの面の品質インデックスの総和が最大となるストリーム（図１０に示す例では、”id”=”1”で識別されるストリーム）を選択してもよい。 Mode 2 may be a mode applied in cases where a reduced or enlarged display of a 3D object is performed. For example, as shown in the middle of FIG. 11, the media processing device 200 (selection unit 260) may identify the vertex (e.g., vertex B) closest to the user's viewpoint position, and then identify three faces having the closest vertex (e.g., face #1 represented by vertices A, B, F, and E, face #2 represented by vertices B, C, G, and F, and face #3 represented by vertices A, B, C, and D). In cases where a reduced display of a 3D object is performed, pixels of the 3D object are thinned out. Therefore, the selection unit 260 may select a stream (in the example shown in FIG. 10, the stream identified by "id"="2") that minimizes the sum of the quality indexes of the three identified faces. On the other hand, in cases where an enlarged display of a 3D object is performed, pixels of the 3D object are interpolated. Therefore, the selection unit 260 may select a stream (in the example shown in FIG. 10, the stream identified by "id"="1") that maximizes the sum of the quality indexes of the three identified faces.

なお、モード2では、視点情報に基づいてユーザの視点位置に最も近い頂点が特定されることから、選択部260は、視点情報及び品質情報に基づいて、2以上のストリームの中からユーザ端末300に送信すべきストリームを選択すると考えてもよい。 In addition, in mode 2, since the vertex closest to the user's viewpoint position is identified based on the viewpoint information, it can be considered that the selection unit 260 selects the stream to be transmitted to the user terminal 300 from among two or more streams based on the viewpoint information and quality information.
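
The Mode 2 selection rule can be illustrated with a minimal Python sketch. It assumes the quality information is available as a mapping from stream id to per-face quality indexes; the name `select_stream_mode2` and the index values are hypothetical and are not taken from FIG. 10.

```python
from typing import Dict, Iterable

# Hypothetical quality information: stream id -> {face number -> quality index}.
# The values are illustrative only; they are not the values of FIG. 10.
QUALITY_INFO: Dict[str, Dict[int, int]] = {
    "1": {1: 3, 2: 3, 3: 3, 4: 1, 5: 1, 6: 1},
    "2": {1: 1, 2: 1, 3: 1, 4: 3, 5: 3, 6: 3},
}


def select_stream_mode2(
    quality_info: Dict[str, Dict[int, int]],
    identified_faces: Iterable[int],
    enlarged: bool,
) -> str:
    """Return the id of the stream chosen for the three identified faces.

    Enlarged display: pixels are interpolated, so pick the largest sum.
    Reduced display: pixels are thinned out, so pick the smallest sum.
    """
    faces = list(identified_faces)

    def face_sum(stream_id: str) -> int:
        # Sum of the quality indexes of the faces sharing the closest vertex.
        return sum(quality_info[stream_id][face] for face in faces)

    chooser = max if enlarged else min
    return chooser(quality_info, key=face_sum)
```

With these illustrative values, faces #1 to #3 yield stream "1" for an enlarged display and stream "2" for a reduced display, which mirrors the example in the text.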

モード3では、ユーザの視線方向において、2つの3Dオブジェクト（3Dオブジェクト#1及び3Dオブジェクト#2）が重なるケースで適用されるモードであってもよい。ここでは、3Dオブジェクト#1に関するストリームの選択について説明する。例えば、図１１の下段に示すように、メディア処理装置200（選択部260）は、ユーザの視点位置に最も近い頂点（例えば、頂点B）を特定した上で、最も近い頂点を有する3つの面（例えば、頂点A,B,F,Eで表される面#1、頂点B,C,G,Fで表される面#2、頂点A,B,C,Dで表される面#3）を特定してもよい。ここで、ユーザの視点位置に最も近い頂点（例えば、頂点B）とユーザの視点位置とを結ぶ線分上において3Dオブジェクト#2が重なっており、特定された3つの面が3Dオブジェクト#2によって遮られる。従って、選択部260は、特定された3つの面の品質インデックスの総和が最小となるストリーム（図１０に示す例では、”id”=”2”で識別されるストリーム）を選択してもよい。 Mode 3 may be applied when two 3D objects (3D object #1 and 3D object #2) overlap in the user's line of sight. Here, the selection of the stream related to 3D object #1 will be described. For example, as shown in the lower part of FIG. 11, the media processing device 200 (selection unit 260) may identify the vertex (e.g., vertex B) closest to the user's viewpoint position, and then identify three faces having the closest vertex (e.g., face #1 represented by vertices A, B, F, and E, face #2 represented by vertices B, C, G, and F, and face #3 represented by vertices A, B, C, and D). Here, 3D object #2 overlaps on the line segment connecting the vertex (e.g., vertex B) closest to the user's viewpoint position and the user's viewpoint position, and the identified three faces are blocked by 3D object #2. Therefore, the selection unit 260 may select the stream (in the example shown in FIG. 10, the stream identified by "id"="2") that has the smallest sum of the quality indexes of the identified three faces.

なお、モード3では、視点情報に基づいてユーザの視点位置に最も近い頂点が特定されることから、選択部260は、視点情報及び品質情報に基づいて、2以上のストリームの中からユーザ端末300に送信すべきストリームを選択すると考えてもよい。さらに、モード3では、視点情報及び3Dオブジェクトの配置情報に基づいて2つの3Dオブジェクトの重なりが特定されることから、選択部260は、視点情報、品質情報及び配置情報に基づいて、2以上のストリームの中からユーザ端末300に送信すべきストリームを選択すると考えてもよい。3Dオブジェクトの配置情報（例えば、図１０に示す”rotation_object”、”scale_object”、”translation_object）は、シーン記述に含まれてもよい。3Dオブジェクトの配置情報は、図１０に示す”link_area”であると考えてもよい。すなわち、選択部260は、3次元空間における3Dオブジェクトの配置情報を送信装置100から受信してもよい。 In addition, in mode 3, since the vertex closest to the user's viewpoint position is identified based on the viewpoint information, the selection unit 260 may be considered to select the stream to be transmitted to the user terminal 300 from among two or more streams based on the viewpoint information and quality information. Furthermore, in mode 3, since the overlap of two 3D objects is identified based on the viewpoint information and the arrangement information of the 3D objects, the selection unit 260 may be considered to select the stream to be transmitted to the user terminal 300 from among two or more streams based on the viewpoint information, quality information, and arrangement information. The arrangement information of the 3D objects (for example, "rotation_object", "scale_object", and "translation_object" shown in FIG. 10) may be included in the scene description. The arrangement information of the 3D objects may be considered to be "link_area" shown in FIG. 10. That is, the selection unit 260 may receive the arrangement information of the 3D objects in the three-dimensional space from the transmission device 100.
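
One possible way to detect the overlap used in Mode 3 is sketched below. It assumes, purely for illustration, that 3D object #2 is approximated by an axis-aligned bounding box derived from its arrangement information; the source does not specify how the overlap is computed, and `segment_hits_box` is a hypothetical name.

```python
from typing import Sequence


def segment_hits_box(
    viewpoint: Sequence[float],
    vertex: Sequence[float],
    box_min: Sequence[float],
    box_max: Sequence[float],
) -> bool:
    """Return True if the segment viewpoint->vertex intersects the box.

    Slab method: clip the segment parameter t in [0, 1] against each axis.
    """
    t_enter, t_exit = 0.0, 1.0
    for start, end, lo, hi in zip(viewpoint, vertex, box_min, box_max):
        delta = end - start
        if delta == 0.0:
            # Segment is parallel to this slab; it must already lie inside it.
            if start < lo or start > hi:
                return False
            continue
        t0, t1 = (lo - start) / delta, (hi - start) / delta
        if t0 > t1:
            t0, t1 = t1, t0
        t_enter = max(t_enter, t0)
        t_exit = min(t_exit, t1)
        if t_enter > t_exit:
            return False
    return True
```

When this test returns True for the segment between the viewpoint and the closest vertex of 3D object #1, the three identified faces are treated as blocked, and the stream with the smallest sum of quality indexes would be chosen as described above.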

なお、図１０に示す品質情報では、各ストリームにおいて6つの面の品質インデックスの総和が等しい。しかしながら、実施形態はこれに限定されるものではない。6つの面の品質インデックスの総和は、2以上のストリーム間で異なっていてもよい。 Note that in the quality information shown in FIG. 10, the sum of the quality indexes of the six faces is equal in each stream. However, the embodiment is not limited to this. The sum of the quality indexes of the six faces may differ between two or more streams.

（動作例4） (Operation Example 4)
上述した実施形態は、以下に示す動作例4を含んでもよい。ここでは、動作例4は、動作例3に加えて、以下に示す動作を含む。動作例4では、メディア処理装置200（例えば、後述する選択部270）は、特定コンテンツに含まれる2以上のオブジェクトの各々に関する重要度情報を送信装置100から受信する受信部を構成してもよい。 The above-described embodiment may include the following Operation Example 4. Here, Operation Example 4 includes the operations described below in addition to Operation Example 3. In Operation Example 4, the media processing device 200 (e.g., the selection unit 270 described below) may constitute a receiving unit that receives, from the transmitting device 100, importance information related to each of two or more objects included in the specific content.

具体的には、動作例4では、図１２に示すように、メディア処理装置200は、図２に示す構成に加えて、選択部270を有する。選択部270は、シーン記述、3Dオブジェクト及び360°映像を送信装置100から受信する。選択部270は、2以上のストリームの中から選択されたストリーム（3Dオブジェクト）をレンダラ220に入力する。 Specifically, in operation example 4, as shown in FIG. 12, the media processing device 200 has a selection unit 270 in addition to the configuration shown in FIG. 2. The selection unit 270 receives a scene description, a 3D object, and a 360° video from the transmission device 100. The selection unit 270 inputs a stream (3D object) selected from two or more streams to the renderer 220.

ここで、選択部270は、2以上のオブジェクトの各々に関する重要度情報を送信装置100から受信する。オブジェクトは、3Dオブジェクト及び360°映像を含んでもよい。 Here, the selection unit 270 receives importance information regarding each of the two or more objects from the transmission device 100. The objects may include 3D objects and 360° video.

例えば、重要度情報は、2以上のオブジェクト間の相対的な重要度を示す情報であってもよい。例えば、図１３に示すように、シーン記述によって構築される3次元空間において、オブジェクトA（背景）、オブジェクトB（人）及びオブジェクトC（犬）が存在するケースについて考える。オブジェクトA（背景）は、360°映像の一例であり、オブジェクトB（人）及びオブジェクトC（犬）は、3Dオブジェクトの一例である。このようなケースにおいて、重要度情報は、オブジェクトA（背景）、オブジェクトB（人）及びオブジェクトC（犬）の各々の間の相対的な重要度を示す情報であってもよい。 For example, the importance information may be information indicating the relative importance among two or more objects. For example, as shown in FIG. 13, consider a case in which object A (background), object B (person), and object C (dog) exist in a three-dimensional space constructed by a scene description. Object A (background) is an example of a 360° video, and object B (person) and object C (dog) are examples of 3D objects. In such a case, the importance information may be information indicating the relative importance among object A (background), object B (person), and object C (dog).

特に限定されるものではないが、重要度情報は、図１４に示す態様でシーン記述に含まれてもよい。図１４では、重要度情報は、weightで表されてもよい。weightは、1～9の範囲の値を取り得てもよい。weightの値が大きいほど、重要度が高いことを意味してもよい。図１４では、”object_id”=”0”で識別されるオブジェクトA（背景）のweight（”9”）が最も高く、”object_id”=”1”で識別されるオブジェクトB（人）のweight（”3”）が最も低く、”object_id”=”2”で識別されるオブジェクトC（犬）のweight（”8”）がオブジェクトB（人）のweightよりも高くオブジェクトA（背景）のweightよりも低いケースが例示されている。 Although not particularly limited, the importance information may be included in the scene description in the manner shown in FIG. 14. In FIG. 14, the importance information may be represented by weight. Weight may take a value ranging from 1 to 9. A larger weight value may indicate a higher importance. FIG. 14 illustrates a case in which object A (background) identified by "object_id"="0" has the highest weight ("9"), object B (person) identified by "object_id"="1" has the lowest weight ("3"), and object C (dog) identified by "object_id"="2" has a weight ("8") higher than the weight of object B (person) and lower than the weight of object A (background).

このような前提下において、メディア処理装置200は、以下に示す動作を実行してもよい。以下においては、2以上のストリームの中から選択されたストリーム（3Dオブジェクト）の選択について主として説明する。 Under these assumptions, the media processing device 200 may execute the following operations. The following mainly describes how a stream (3D object) is selected from among two or more streams.

第1に、メディア処理装置200（選択部270）は、重要度が最も高い3Dオブジェクトについて、品質が最も高いストリームを選択する。ストリームの選択方法は、動作例3と同様であってもよい。例えば、選択部270は、オブジェクトC（犬）の重要度がオブジェクトB（人）の重要度よりも大きいため、オブジェクトC（犬）について、ユーザの視点位置に最も近い頂点を有する3つの面の品質インデックスの総和が最大となるストリームを選択する。 First, the media processing device 200 (selection unit 270) selects the stream with the highest quality for the 3D object with the highest importance. The stream selection method may be the same as in operation example 3. For example, because the importance of object C (dog) is greater than the importance of object B (person), the selection unit 270 selects the stream for object C (dog) that maximizes the sum of the quality indexes of the three faces that have the vertex closest to the user's viewpoint.

第2に、メディア処理装置200（選択部270）は、重要度が最も高い3Dオブジェクト以外の3Dオブジェクトについて、品質が最も低いストリームを選択する。続いて、選択部270は、重要度が高い3Dオブジェクトから順に、特定条件が満たされる範囲内において、品質が最も低いストリームを品質が高いストリームに置き換える。特定条件は、送信装置100からメディア処理装置200への回線の帯域が閾値以下である第1条件を含んでもよく、メディア処理装置200の処理負荷が閾値以下である第2条件を含んでもよい。特定条件は、第1条件及び第2条件の組み合わせによって定義されてもよい。例えば、選択部270は、オブジェクトB（人）の重要度がオブジェクトC（犬）の重要度よりも小さいため、オブジェクトB（人）について、特定条件が満たされる範囲内において、品質が高いストリームを選択する。 Secondly, the media processing device 200 (selection unit 270) selects the lowest quality stream for 3D objects other than the 3D object with the highest importance. Next, the selection unit 270 replaces the lowest quality stream with a higher quality stream within the range in which the specific condition is satisfied, starting from the 3D object with the highest importance. The specific condition may include a first condition that the bandwidth of the line from the transmission device 100 to the media processing device 200 is equal to or lower than a threshold, and may include a second condition that the processing load of the media processing device 200 is equal to or lower than a threshold. The specific condition may be defined by a combination of the first and second conditions. For example, because the importance of object B (person) is lower than the importance of object C (dog), the selection unit 270 selects a high quality stream for object B (person) within the range in which the specific condition is satisfied.
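
The two-step selection above can be sketched as follows. This is a minimal illustration under stated assumptions: each stream carries a single scalar quality and a bitrate, and the specific condition is modelled as a total-bitrate budget. The `Stream` class, the `budget` parameter, and `select_by_importance` are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Stream:
    stream_id: str
    quality: int     # hypothetical scalar quality; larger is better
    bitrate: float   # hypothetical cost used to model the specific condition


def select_by_importance(
    weights: Dict[str, int],
    streams: Dict[str, List[Stream]],
    budget: float,
) -> Dict[str, Stream]:
    """Pick one stream per 3D object, favouring objects with larger weight."""
    order = sorted(weights, key=lambda obj: weights[obj], reverse=True)
    top, rest = order[0], order[1:]

    # Step 1: the most important object gets its highest-quality stream.
    chosen = {top: max(streams[top], key=lambda s: s.quality)}
    # Step 2: every other object starts from its lowest-quality stream.
    for obj in rest:
        chosen[obj] = min(streams[obj], key=lambda s: s.quality)

    # Upgrade in descending order of weight while the budget still holds.
    for obj in rest:
        used_by_others = sum(s.bitrate for s in chosen.values()) - chosen[obj].bitrate
        spare = budget - used_by_others
        affordable = [s for s in streams[obj] if s.bitrate <= spare]
        if affordable:
            best = max(affordable, key=lambda s: s.quality)
            if best.quality > chosen[obj].quality:
                chosen[obj] = best
    return chosen
```

Using the weights from FIG. 14 for the two 3D objects (object C with weight 8, object B with weight 3), object C always receives its best stream, and object B is upgraded only when the assumed budget leaves room for it.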

上述したように、メディア処理装置200（選択部270）は、視点情報、品質情報及び重要度情報に基づいて、2以上のストリームの中からユーザ端末300に送信すべきストリームを選択すると考えてもよい。メディア処理装置200（選択部270）は、視点情報、品質情報、配置情報及び重要度情報に基づいて、2以上のストリームの中からユーザ端末300に送信すべきストリームを選択すると考えてもよい。 As described above, the media processing device 200 (selection unit 270) may be considered to select a stream to be transmitted to the user terminal 300 from among two or more streams based on viewpoint information, quality information, and importance information. The media processing device 200 (selection unit 270) may be considered to select a stream to be transmitted to the user terminal 300 from among two or more streams based on viewpoint information, quality information, placement information, and importance information.

なお、動作例4では、360°映像について、1つのストリームが存在するケースについて例示した。しかしながら、実施形態はこれに限定されるものではない。360°映像についても、品質が異なる2以上のストリームが存在してもよい。 Note that in operation example 4, a case where one stream exists for 360° video has been exemplified. However, the embodiment is not limited to this. For 360° video, two or more streams of different qualities may also exist.

動作例4では、3Dオブジェクトについて、視点情報に基づいた3Dオブジェクトの向きによって品質が異なる2以上のストリームが存在するケースについて例示した。しかしながら、実施形態はこれに限定されるものではない。3Dオブジェクトについて、3Dオブジェクトの向きによらずに、品質が異なる2以上のストリームが存在してもよい。 In the fourth operational example, a case has been exemplified in which, for a 3D object, there are two or more streams whose quality differs depending on the orientation of the 3D object based on viewpoint information. However, the embodiment is not limited to this. For a 3D object, there may be two or more streams whose quality differs regardless of the orientation of the 3D object.

（動作例5） (Operation Example 5)
上述した実施形態は、以下に示す動作例5を含んでもよい。動作例5では、メディア処理装置200（例えば、レンダラ220）は、特定コンテンツによって構成される3次元空間（シーン記述によって構築される3次元空間）においてユーザの視点位置の移動範囲を定義する情報要素を送信装置100から受信する受信部を構成してもよい。 The above-described embodiment may include the following Operation Example 5. In Operation Example 5, the media processing device 200 (e.g., the renderer 220) may constitute a receiving unit that receives, from the transmitting device 100, an information element that defines a movement range of a user's viewpoint position in a three-dimensional space configured by specific content (a three-dimensional space constructed by a scene description).

第1に、動作例5では、情報要素は、特定コンテンツに含まれる3Dオブジェクトの内側へのユーザの視点位置の移動を制限する情報要素（以下、第1情報要素）を含んでもよい。例えば、図１５に示すように、シーン記述によって構築される3次元空間に3Dオブジェクトが配置されるケースにおいて、3Dオブジェクトの内側への視点位置の移動が制限されてもよい。但し、3Dオブジェクトが建築物であるケース、3Dオブジェクトの内側に別のシーンが存在するケースなどにおいては、3Dオブジェクトの内側への視点位置の移動が許容されてもよい。 First, in operation example 5, the information element may include an information element (hereinafter, a first information element) that prevents the user's viewpoint position from moving into the interior of a 3D object included in specific content. For example, as shown in FIG. 15, in a case where a 3D object is placed in a three-dimensional space constructed by a scene description, movement of the viewpoint position into the interior of the 3D object may be restricted. However, in a case where the 3D object is a building, or where another scene exists inside the 3D object, movement of the viewpoint position into the interior of the 3D object may be permitted.

第2に、動作例5では、情報要素は、3次元空間の外側へのユーザの視点位置の移動を制限する情報要素（以下、第2情報要素）を含んでもよい。例えば、図１６に示すように、3次元空間は、直方体及び回転楕円体の組合せで定義されてもよい。3次元空間を定義する直方体の数は2以上であってもよく、3次元空間を定義する回転楕円体の数は2以上であってもよい。但し、3次元空間の外側へのユーザの視点位置の移動が許容されるケースがあってもよい。 Secondly, in operation example 5, the information element may include an information element (hereinafter, second information element) that restricts movement of the user's viewpoint position outside the three-dimensional space. For example, as shown in FIG. 16, the three-dimensional space may be defined by a combination of a rectangular parallelepiped and a spheroid. The number of rectangular parallelepipeds that define the three-dimensional space may be two or more, and the number of spheroids that define the three-dimensional space may be two or more. However, there may be cases in which movement of the user's viewpoint position outside the three-dimensional space is permitted.

特に限定されるものではないが、第1情報要素は、図１７に示す態様でシーン記述に含まれてもよい。図１７では、第1情報要素は、viewing_inside_object_flagで表されてもよい。viewing_inside_object_flagは、3Dオブジェクト毎に設定されてもよい。例えば、viewing_inside_object_flagが”0”である場合に、3Dオブジェクト内への視点位置の移動が制限され、viewing_inside_object_flagが”1”である場合に、3Dオブジェクト内への視点位置の移動が許容されてもよい。 Although not particularly limited, the first information element may be included in the scene description in the manner shown in FIG. 17. In FIG. 17, the first information element may be represented by viewing_inside_object_flag. The viewing_inside_object_flag may be set for each 3D object. For example, when viewing_inside_object_flag is "0", movement of the viewpoint position into the 3D object may be restricted, and when viewing_inside_object_flag is "1", movement of the viewpoint position into the 3D object may be permitted.

特に限定されるものではないが、第2情報要素は、図１８に示す態様でシーン記述に含まれてもよい。図１８では、第2情報要素は、3次元空間を定義する直方体を定義する情報要素（cuboid_center_x, cuboid_center_y, cuboid_center_z, cuboid_size_x, cuboid_size_y, cuboid_size_z）を含んでもよい。cuboid_center_x, cuboid_center_y, cuboid_center_zは、直方体の中心位置を示す情報要素であり、cuboid_size_x, cuboid_size_y, cuboid_size_zは、直方体のサイズを示す情報要素である。第2情報要素は、3次元空間を定義する回転楕円体を定義する情報要素（spheroid_center_x, spheroid_center_y, spheroid_center_z, spheroid_size_x, spheroid_size_y, spheroid_size_z）を含んでもよい。spheroid_center_x, spheroid_center_y, spheroid_center_zは、回転楕円体の中心位置を示す情報要素であり、spheroid_size_x, spheroid_size_y, spheroid_size_zは、回転楕円体のサイズを示す情報要素である。なお、cuboid_enableは、直方体によって3次元空間を定義するか否かを示す情報要素であり、spheroid_enableは、回転楕円体によって3次元空間を定義するか否かを示す情報要素であってもよい。図１８では、2つの直方体及び2つの回転楕円体によって3次元空間が定義されるケースが例示されている。 Although not particularly limited, the second information element may be included in the scene description in the manner shown in FIG. 18. In FIG. 18, the second information element may include information elements (cuboid_center_x, cuboid_center_y, cuboid_center_z, cuboid_size_x, cuboid_size_y, cuboid_size_z) that define a rectangular parallelepiped that defines a three-dimensional space. The cuboid_center_x, cuboid_center_y, and cuboid_center_z are information elements that indicate the center position of the rectangular parallelepiped, and the cuboid_size_x, cuboid_size_y, and cuboid_size_z are information elements that indicate the size of the rectangular parallelepiped. The second information element may include information elements (spheroid_center_x, spheroid_center_y, spheroid_center_z, spheroid_size_x, spheroid_size_y, spheroid_size_z) that define a spheroid that defines a three-dimensional space. spheroid_center_x, spheroid_center_y, and spheroid_center_z are information elements that indicate the center position of a spheroid, and spheroid_size_x, spheroid_size_y, and spheroid_size_z are information elements that indicate the size of the spheroid. 
Note that cuboid_enable is an information element that indicates whether or not a three-dimensional space is defined by a rectangular parallelepiped, and spheroid_enable may be an information element that indicates whether or not a three-dimensional space is defined by a spheroid. FIG. 18 illustrates an example in which a three-dimensional space is defined by two rectangular parallelepipeds and two spheroids.
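
As an illustration of how the second information element might be evaluated, the sketch below tests whether a viewpoint lies inside a movement range built from cuboids and spheroids. It assumes that the range is the union of the enabled shapes, that the size fields give full extents along each axis, and that the shapes are axis-aligned; none of these details are stated in the source, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple, Union

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Cuboid:
    center: Vec3   # cuboid_center_x, cuboid_center_y, cuboid_center_z
    size: Vec3     # cuboid_size_x, cuboid_size_y, cuboid_size_z (assumed full edge lengths)

    def contains(self, point: Vec3) -> bool:
        return all(abs(p - c) <= s / 2 for p, c, s in zip(point, self.center, self.size))


@dataclass(frozen=True)
class Spheroid:
    center: Vec3   # spheroid_center_x, spheroid_center_y, spheroid_center_z
    size: Vec3     # spheroid_size_x, spheroid_size_y, spheroid_size_z (assumed full axis lengths)

    def contains(self, point: Vec3) -> bool:
        # Standard ellipsoid test: sum of squared normalised offsets <= 1.
        return sum(((p - c) / (s / 2)) ** 2
                   for p, c, s in zip(point, self.center, self.size)) <= 1.0


def in_movement_range(point: Vec3, shapes: Iterable[Union[Cuboid, Spheroid]]) -> bool:
    """Return True if the viewpoint lies inside at least one enabled shape."""
    return any(shape.contains(point) for shape in shapes)
```

A shape whose cuboid_enable or spheroid_enable indicates that it is not used would simply be left out of the `shapes` list under this reading.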

このような前提下において、メディア処理装置200（レンダラ220）は、以下に示す動作を実行してもよい。 Under these assumptions, the media processing device 200 (renderer 220) may perform the operations described below.

第1に、レンダラ220は、ユーザの視点位置が移動範囲外に移動する場合において、ユーザの視点位置の軌跡と移動範囲の境界との交点を視点位置として、特定コンテンツを生成してもよい。すなわち、レンダラ220は、視点位置が移動範囲外に移動しようとした時点の位置（境界位置）で視点位置を固定してもよい。 First, when the user's viewpoint position moves outside the movement range, the renderer 220 may generate specific content by using the intersection point of the trajectory of the user's viewpoint position and the boundary of the movement range as the viewpoint position. In other words, the renderer 220 may fix the viewpoint position at the position (boundary position) at the time when the viewpoint position is about to move outside the movement range.

第2に、レンダラ220は、ユーザの視点位置が移動範囲外に移動する場合において、視点位置の移動が制限されている旨をユーザに通知してもよい。例えば、レンダラ220は、「ここから先は移動できません」などのメッセージを表示してもよい。 Second, when the user's viewpoint position moves outside the movement range, the renderer 220 may notify the user that movement of the viewpoint position is restricted. For example, the renderer 220 may display a message such as "You cannot move beyond this point."
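
The two renderer behaviours above can be sketched together. The example below assumes the previous viewpoint is inside the movement range and approximates the intersection of the trajectory with the boundary by bisection along the straight segment between two viewpoint samples. The `inside` predicate and the name `clamp_to_boundary` are hypothetical; the source does not prescribe how the intersection is computed.

```python
from typing import Callable, Tuple

Vec3 = Tuple[float, float, float]


def clamp_to_boundary(
    previous: Vec3,
    requested: Vec3,
    inside: Callable[[Vec3], bool],
    iterations: int = 32,
) -> Tuple[Vec3, bool]:
    """Return (viewpoint to render from, whether the user should be notified).

    Assumes `previous` is inside the movement range.
    """
    if inside(requested):
        return requested, False

    def lerp(t: float) -> Vec3:
        return tuple(a + t * (b - a) for a, b in zip(previous, requested))

    # Invariant: lerp(low) is inside the range, lerp(high) is outside.
    low, high = 0.0, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2
        if inside(lerp(mid)):
            low = mid
        else:
            high = mid
    # The viewpoint is fixed at the last inside position next to the boundary.
    return lerp(low), True
```

When the returned flag is True, the caller could display a message such as "You cannot move beyond this point."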

（作用及び効果） (Action and Effects)
実施形態では、メディア処理装置200は、視点情報に基づいて特定コンテンツを生成した上で、特定コンテンツをユーザ端末300に送信する。このような構成によれば、視点の自由度を有する第2コンテンツを含む特定コンテンツをユーザ端末300側で生成する必要がなく、ユーザ端末300は、メディア処理装置200に対して視点情報を提供すれば、特定コンテンツを提示することができる。従って、メディア処理装置200とユーザ端末300との間の遅延が生じるものの、ユーザ端末300の処理負荷を軽減することができる。 In the embodiment, the media processing device 200 generates specific content based on viewpoint information and transmits the specific content to the user terminal 300. With this configuration, there is no need for the user terminal 300 to generate specific content including second content with viewpoint freedom, and the user terminal 300 can present the specific content by simply providing viewpoint information to the media processing device 200. Therefore, although a delay occurs between the media processing device 200 and the user terminal 300, the processing load on the user terminal 300 can be reduced.

動作例1では、メディア処理装置200は、特定コンテンツに付加されるシーケンス番号と対応付けて、特定コンテンツの生成で用いた視点情報をユーザ端末300に送信する。このような構成によれば、ユーザ端末300は、特定コンテンツの生成で用いた視点情報及びシーケンス番号を把握することができるため、メディア処理装置200で特定コンテンツを生成する際に用いる視点情報がユーザ端末300で特定コンテンツを表示する際に用いる視点情報と異なるケースを想定した場合であっても、特定コンテンツを適切に表示することができる。 In operation example 1, the media processing device 200 transmits the viewpoint information used in generating the specific content to the user terminal 300 in association with the sequence number added to the specific content. With this configuration, the user terminal 300 can grasp the viewpoint information and sequence number used in generating the specific content, and therefore can appropriately display the specific content even in cases where the viewpoint information used in generating the specific content by the media processing device 200 differs from the viewpoint information used in displaying the specific content on the user terminal 300.

動作例2では、メディア処理装置200は、コンテンツの構成及び特定視点情報（推奨ビューポート情報）を送信装置100から受信する。このような構成によれば、メディア処理装置200は、特定視点情報に基づいて特定コンテンツを生成することができ、視点情報をフィードバックする第1ユーザ端末400及び視点情報をフィードバックしない第2ユーザ端末500が混在するケースを想定した場合であっても、特定コンテンツを適切に表示することができる。 In operation example 2, the media processing device 200 receives a content configuration and specific viewpoint information (recommended viewport information) from the transmitting device 100. With this configuration, the media processing device 200 can generate specific content based on the specific viewpoint information, and can appropriately display the specific content even in a case where a first user terminal 400 that feeds back viewpoint information and a second user terminal 500 that does not feed back viewpoint information are mixed.

動作例2では、メディア処理装置200は、ユーザ端末が第1ユーザ端末400である場合であっても、リセット信号に応じて、特定視点情報に基づいて特定コンテンツを生成する。このような構成によれば、シーン記述によって構築される3次元空間において視点位置及び視線方向が第1ユーザ端末400において不明となるケース（3次元空間において迷子になるケース）を想定した場合であっても、リセット信号によって特定視点情報に基づいた特定コンテンツに復帰することができる。 In operation example 2, the media processing device 200 generates specific content based on specific viewpoint information in response to a reset signal, even when the user terminal is the first user terminal 400. With this configuration, even in a case where the viewpoint position and line-of-sight direction in the three-dimensional space constructed by the scene description become unknown at the first user terminal 400 (a case where the user becomes lost in the three-dimensional space), the reset signal makes it possible to return to specific content based on the specific viewpoint information.

動作例3では、メディア処理装置200は、特定コンテンツに含まれる3Dオブジェクトに関するストリームの品質情報として、視点情報に基づいた3Dオブジェクトの向きによって品質が異なる2以上のストリームの各々に関する品質情報を送信装置100から受信する。このような構成によれば、3Dオブジェクトに関するバウンディングボックスを構成する各面の品質が均一でなくてもよいという新たな知見に基づいて、伝送トラフィックを抑制しながらも、3Dオブジェクトを適切に表示することができる。 In operation example 3, the media processing device 200 receives, from the transmitting device 100, quality information about each of two or more streams whose quality varies depending on the orientation of the 3D object based on the viewpoint information, as quality information about a stream related to a 3D object included in specific content. With this configuration, based on the new knowledge that the quality of each face constituting a bounding box for a 3D object does not have to be uniform, it is possible to appropriately display a 3D object while suppressing transmission traffic.

動作例4では、メディア処理装置200は、特定コンテンツに含まれる2以上のオブジェクトの各々に関する重要度情報を送信装置100から受信する。このような構成によれば、360°映像及び3Dオブジェクトなどのオブジェクト毎の重要度を設定する仕組みを導入することによって、伝送トラフィックを抑制しながらも、特定コンテンツに含まれる各オブジェクトを適切に表示することができる。 In operation example 4, the media processing device 200 receives importance information for each of two or more objects included in the specific content from the transmitting device 100. With this configuration, by introducing a mechanism for setting the importance of each object, such as a 360° video image or a 3D object, it is possible to appropriately display each object included in the specific content while suppressing transmission traffic.

動作例5では、メディア処理装置200は、特定コンテンツによって構成される3次元空間においてユーザの視点位置の移動範囲を定義する情報要素を送信装置100から受信する。このような構成によれば、ユーザ端末300に表示される特定コンテンツの破綻を生じることなく、視点の自由度を有するコンテンツを含む特定コンテンツを適切に表示することができる。 In operation example 5, the media processing device 200 receives from the transmission device 100 an information element that defines the range of movement of the user's viewpoint position in the three-dimensional space configured by the specific content. With this configuration, it is possible to appropriately display the specific content, including content with freedom of viewpoint, without causing a breakdown in the specific content displayed on the user terminal 300.

［変更例1］ [Modification 1]
以下において、実施形態の変更例1について説明する。以下においては、実施形態に対する相違点について主として説明する。 Modification 1 of the embodiment will be described below. Differences from the embodiment will be mainly described below.

変更例1では、特定コンテンツが第1コンテンツ及び第2コンテンツの双方を含む場合において、第1コンテンツと第2コンテンツとの同期を取る方法について説明する。 In the first modification, a method for synchronizing a first content with a second content when a specific content includes both the first content and the second content is described.

なお、以下において、同期とは、第1コンテンツ（例えば、MPU）と第2コンテンツ（ファイル）との提示時刻が適切に揃うことを意味する。従って、同期は、2D映像と3Dオブジェクトとの提示時刻が揃うことを含んでもよく、音声と3Dオブジェクトとの提示時刻が揃うことを含んでもよい。同様に、同期は、2D映像と360°映像との提示時刻が揃うことを含んでもよく、音声と360°映像との提示時刻が揃うことを含んでもよい。 Note that in the following, synchronization means that the presentation times of the first content (e.g., MPU) and the second content (file) are properly aligned. Therefore, synchronization may include the presentation times of the 2D video and the 3D object being aligned, and may also include the presentation times of the audio and the 3D object being aligned. Similarly, synchronization may include the presentation times of the 2D video and the 360° video being aligned, and may also include the presentation times of the audio and the 360° video being aligned.

第1方法では、メディア処理装置200が第1制御情報（MMT-SI）に基づいて、第1コンテンツと第2コンテンツとの同期を取るケースについて説明する。メディア処理装置200は、MMT-SIをエントリーポイントとして、シーン記述（第2コンテンツ）の有無を確認した上で、シーン記述が存在する場合には、MPUタイムスタンプ記述子を流用して、第1コンテンツ及び第2コンテンツを含む特定コンテンツの提示時刻を特定する。 In the first method, a case will be described in which the media processing device 200 synchronizes the first content and the second content based on the first control information (MMT-SI). Using the MMT-SI as an entry point, the media processing device 200 checks whether or not a scene description (second content) is present, and if a scene description is present, it uses the MPU timestamp descriptor to identify the presentation time of specific content including the first content and the second content.

具体的には、図１９に示すように、2D映像及び音声は、MPUタイムスタンプ記述子（図１９では、単にtimestamp）に基づいて提示されるため、2D映像及び音声の同期が取れる。 Specifically, as shown in FIG. 19, 2D video and audio are presented based on the MPU timestamp descriptor (simply "timestamp" in FIG. 19), so that the 2D video and audio can be synchronized.

一方で、シーン記述に含まれる最初のフレームの提示時刻は、MMT-SIに含まれるMPUタイムスタンプ記述子を参照することによって特定される。シーン記述に含まれる２番目以降フレームの提示時刻は、シーン記述に含まれるフレーム番号及び第2コンテンツのフレームレートによって特定することが可能である。例えば、フレームレートが30fpsであるケースを考えると、n番目のフレームの提示時刻は、MPUタイムスタンプ記述子によって特定される時刻に1/30×nを加算することによって特定される。但し、シーン記述に含まれる最初のフレームのフレーム番号は”0”である。 On the other hand, the presentation time of the first frame included in the scene description is identified by referencing the MPU timestamp descriptor included in MMT-SI. The presentation times of the second and subsequent frames included in the scene description can be identified by the frame number included in the scene description and the frame rate of the second content. For example, if the frame rate is 30 fps, the presentation time of the nth frame is identified by adding 1/30 x n to the time identified by the MPU timestamp descriptor. However, the frame number of the first frame included in the scene description is "0".
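
The frame timing rule above amounts to a single addition, shown here as a minimal Python sketch. The function name `presentation_time` and the use of `datetime` values are assumptions made for illustration; the source only specifies that 1/30 × n is added to the time of the first frame when the frame rate is 30 fps.

```python
from datetime import datetime, timedelta, timezone
from fractions import Fraction


def presentation_time(
    first_frame_time: datetime,
    frame_number: int,
    frame_rate: Fraction = Fraction(30),
) -> datetime:
    """Presentation time of frame n, where the first frame has number 0."""
    # Offset in seconds is n / frame_rate (1/30 x n at 30 fps).
    offset_seconds = Fraction(frame_number) / frame_rate
    return first_frame_time + timedelta(seconds=float(offset_seconds))


# Example: the first frame is presented at the time given by the timestamp.
t0 = datetime(2023, 2, 16, 0, 0, 0, tzinfo=timezone.utc)
half_second_later = presentation_time(t0, 15)
```

Using `Fraction` for the frame rate also allows non-integer rates such as 30000/1001 to be passed without changing the function.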

第1方法では、シーン記述に含まれる最初のフレームの提示時刻をシーン記述が含まないケースを例示したが、シーン記述は、シーン記述に含まれる最初のフレームの提示時刻を含んでもよい。 In the first method, a case is exemplified in which the scene description does not include the presentation time of the first frame included in the scene description, but the scene description may include the presentation time of the first frame included in the scene description.

第2方法では、メディア処理装置200が第2制御情報（シーン記述）に基づいて、第1コンテンツと第2コンテンツとの同期を取るケースについて説明する。メディア処理装置200は、シーン記述をエントリーポイントとして、MMT-SI（第1コンテンツ）の有無を確認した上で、MMT-SIが存在する場合には、シーン記述に含まれる提示時刻に基づいて、第1コンテンツ及び第2コンテンツを含む特定コンテンツの提示時刻を特定する。 In the second method, a case will be described in which the media processing device 200 synchronizes the first content and the second content based on the second control information (scene description). The media processing device 200 uses the scene description as an entry point to check whether or not MMT-SI (first content) is present, and if MMT-SI is present, identifies the presentation time of specific content, including the first content and the second content, based on the presentation time included in the scene description.

このようなケースにおいて、シーン記述は、第2コンテンツの提示時刻を示す絶対時刻情報を含む。絶対時刻情報は、シーン記述に含まれる最初のフレームの提示時刻であってもよい。 In such a case, the scene description includes absolute time information indicating the presentation time of the second content. The absolute time information may be the presentation time of the first frame included in the scene description.

例えば、絶対時刻情報は、UTCを基準時刻として生成されてもよい。基準時刻は、TAIが用いられてもよく、GPSから提供される時刻が用いられてもよい。基準時刻は、NTPサーバから提供される時刻であってもよく、PTPサーバから提供される時刻であってもよい。さらに、絶対時刻情報は、MPUタイムスタンプ記述子と同一基準時刻に基づいて生成されてもよい。 For example, the absolute time information may be generated using UTC as the reference time. The reference time may be TAI or may be the time provided by GPS. The reference time may be the time provided by an NTP server or may be the time provided by a PTP server. Furthermore, the absolute time information may be generated based on the same reference time as the MPU timestamp descriptor.

さらに、シーン記述は、第1コンテンツを特定するための参照情報を含む。参照情報は、第1コンテンツを構成するMPUを特定するための情報であってもよい。すなわち、参照情報は、シーン記述に含まれるオブジェクトとして第1コンテンツ（MPU）を扱うための情報である。 Furthermore, the scene description includes reference information for identifying the first content. The reference information may be information for identifying an MPU that constitutes the first content. In other words, the reference information is information for treating the first content (MPU) as an object included in the scene description.

具体的には、図２０に示すように、シーン記述に含まれる最初のフレームの提示時刻は、シーン記述に含まれる絶対時刻情報によって特定される。シーン記述に含まれる２番目以降フレームの提示時刻は、シーン記述に含まれるフレーム番号及び第2コンテンツのフレームレートによって特定することが可能である。例えば、フレームレートが30fpsであるケースを考えると、n番目のフレームの提示時刻は、絶対時刻情報によって特定される時刻に1/30×nを加算することによって特定される。但し、シーン記述に含まれる最初のフレームのフレーム番号は”0”である。 Specifically, as shown in FIG. 20, the presentation time of the first frame included in the scene description is specified by the absolute time information included in the scene description. The presentation times of the second and subsequent frames included in the scene description can be specified by the frame number included in the scene description and the frame rate of the second content. For example, if the frame rate is 30 fps, the presentation time of the nth frame is specified by adding 1/30×n to the time specified by the absolute time information. Note that the frame number of the first frame included in the scene description is "0".

一方で、2D映像及び音声は、MPUタイムスタンプ記述子（図２０では、単にtimestamp）に基づいて提示されるため、2D映像及び音声の同期が取れる。ここで、上述した参照情報がシーン記述に含まれるため、メディア処理装置200は、シーン記述に含まれる参照情報に基づいて、第2コンテンツとともに提示すべき第1コンテンツの有無を確認することができる。 On the other hand, 2D video and audio are presented based on the MPU timestamp descriptor (simply "timestamp" in FIG. 20), so that the 2D video and audio can be synchronized. Here, because the above-mentioned reference information is included in the scene description, the media processing device 200 can check whether or not there is a first content that should be presented together with the second content, based on the reference information included in the scene description.

第2方法では、2D映像と音声との同期がMMT-SIに含まれるMPUタイムスタンプ記述子に基づいて取られているが、変更例1では、2D映像と音声との同期についても、シーン記述に含まれる情報要素（絶対時刻情報及び参照情報）に基づいて取られてもよい。このようなケースにおいて、少なくとも、MMT-SIに含まれるMPUタイムスタンプ記述子については省略されてもよい。さらに、MMT-SIそのものが省略されてもよい。 In the second method, synchronization between 2D video and audio is achieved based on the MPU timestamp descriptor included in MMT-SI, but in modification example 1, synchronization between 2D video and audio may also be achieved based on information elements included in the scene description (absolute time information and reference information). In such a case, at least the MPU timestamp descriptor included in MMT-SI may be omitted. Furthermore, MMT-SI itself may be omitted.

なお、MMT-SIに含まれるMPUタイムスタンプ記述子の基準時刻（以下、第1基準時刻）とシーン記述に含まれる絶対時刻情報の基準時刻（第2基準時刻）とが異なる場合には、第1制御情報（MMT-SI）及び第2制御情報（シーン記述）の少なくともいずれか1つは、第1基準時刻と第2基準時刻との変換情報を含んでもよい。例えば、MMT-SIは、第1基準時刻（例えば、UTC）で表されたMPUタイムスタンプ記述子に加えて、第2基準時刻（例えば、UTC以外の基準時刻）で表されたMPUタイムスタンプ記述子を含んでもよい。シーン記述は、第2基準時刻（例えば、UTC以外の基準時刻）で表された絶対時刻情報に加えて、第1基準時刻（例えば、UTC）で表された絶対時刻情報を含んでもよい。 When the reference time of the MPU timestamp descriptor included in the MMT-SI (hereinafter, the first reference time) differs from the reference time of the absolute time information included in the scene description (the second reference time), at least one of the first control information (MMT-SI) and the second control information (scene description) may include conversion information between the first reference time and the second reference time. For example, the MMT-SI may include an MPU timestamp descriptor expressed in the second reference time (e.g., a reference time other than UTC) in addition to the MPU timestamp descriptor expressed in the first reference time (e.g., UTC). The scene description may include absolute time information expressed in the first reference time (e.g., UTC) in addition to the absolute time information expressed in the second reference time (e.g., a reference time other than UTC).

The MPU timestamp descriptor included in the MMT-SI may be referred to as first absolute time information, and the absolute time information included in the scene description may be referred to as second absolute time information.

[Modification 2]
Modification 2 of the embodiment is described below, focusing mainly on the differences from the embodiment. Specifically, Modification 2 mainly describes a scene description editing device used to edit a scene description that defines specific content.

(Media processing device and scene description editing device)
The media processing device and the scene description editing device according to Modification 2 are described below. FIG. 21 is a block diagram showing the media processing device 200 and the scene description editing device 600 according to Modification 2.

The reception unit 210 receives the viewpoint information. In Modification 2, the reception unit 210 may receive the viewpoint information from the scene description editing device 600.

The renderer 220 generates, based on the viewpoint information, specific content that includes at least the second content. The following describes an example in which the specific content includes the first content in addition to the second content.

First, the renderer 220 generates, based on the first control information (MMT-SI), first content including 2D video and audio as part of the specific content. Viewpoint information is not required to generate the first content.

In the embodiment described above, the media processing device 200 receives the first control information (MMT-SI), the 2D video, and the audio from the transmitting device 100. In Modification 2, the media processing device 200 receives the first control information (MMT-SI), the 2D video, and the audio from the scene description editing device 600.

Second, the renderer 220 generates, based on the second control information (scene description), second content including 360° video and 3D objects as part of the specific content. The viewpoint information is used to generate the second content.

In the embodiment described above, the media processing device 200 receives the second control information (scene description), the 360° video, and the 3D objects from the transmitting device 100. In Modification 2, the media processing device 200 receives the second control information (scene description), the 360° video, and the 3D objects from the scene description editing device 600.

The encoding processing unit 230 encodes the specific content generated by the renderer 220. In Modification 2, the encoding processing unit 230 may transmit the specific content to the scene description editing device 600.

Second, the scene description editing device 600 includes a detection unit 610, a decoding processing unit 620, a renderer 630, an editing unit 640, and a database 650.

Like the detection unit 310 described above, the detection unit 610 detects the user's viewpoint position and line-of-sight direction. The detection unit 610 may include an acceleration sensor and may include a GPS sensor. The detection unit 610 may include a user I/F (e.g., a touch sensor, a keyboard, a mouse, or a controller) through which the user enters input manually. The detection unit 610 may transmit the viewpoint information (viewpoint position and line-of-sight direction) to the media processing device 200. The detection unit 610 may output the viewpoint information (viewport) to the renderer 630.

Like the decoding processing unit 320 described above, the decoding processing unit 620 decodes the specific content received from the media processing device 200. The decoding processing unit 620 may also decode the presentation time received from the media processing device 200. The decoding processing unit 620 may output the specific content to the renderer 630 and may output the presentation time to the renderer 630.

Like the renderer 330 described above, the renderer 630 outputs the specific content decoded by the decoding processing unit 620. The renderer 630 may output the specific content based on the presentation time decoded by the decoding processing unit 620. For example, the renderer 630 may output the video content included in the specific content to a display and output the audio content included in the specific content to a speaker.
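One way to picture output based on the presentation time is a queue that releases decoded frames once their presentation time has been reached. The sketch below is only an illustration under that assumption; the `PresentationQueue` class and its method names are hypothetical and do not come from the disclosure.

```python
import heapq
from typing import Any, List, Tuple


class PresentationQueue:
    """Holds decoded frames and releases them in presentation-time order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Any]] = []
        self._counter = 0  # tie-breaker so frames with equal times keep arrival order

    def push(self, presentation_time: float, frame: Any) -> None:
        heapq.heappush(self._heap, (presentation_time, self._counter, frame))
        self._counter += 1

    def pop_due(self, now: float) -> List[Any]:
        # Return every frame whose presentation time is at or before `now`.
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due


queue = PresentationQueue()
queue.push(2.0, "frame-b")
queue.push(1.0, "frame-a")
queue.push(3.0, "frame-c")
```

Calling `queue.pop_due(now)` on each display refresh would hand the due video frames to the display and the due audio blocks to the speaker.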

In Modification 2, the specific content output from the renderer 630 is used for editing the scene description and the like, and may therefore be regarded as a preview of the specific content.

The editing unit 640 is used to edit the scene description that defines the configuration of content having viewpoint freedom (the second content). The editing unit 640 may transmit the scene description to the media processing device 200. When the scene description is updated, the editing unit 640 may transmit the updated scene description to the media processing device 200. Updating the scene description may be read as editing the scene description. The editing unit 640 may include a user I/F (e.g., a touch sensor, a keyboard, a mouse, or a controller) for editing the scene description.

As described above, the scene description is generated per file and includes, for each frame, information that identifies the 360° video and the 3D objects. For example, the scene description includes an information element indicating the name of a 3D object in the frame (object_name), an information element indicating the frame number (frame_number), an information element indicating the position of the 3D object in the frame (translation_object), an information element indicating the rotation of the 3D object in the frame (rotation_object), and an information element indicating the size of the 3D object in the frame (scale_object).
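The per-frame information elements listed above could be modeled roughly as follows. This is a sketch under stated assumptions: the element names come from the text, but the JSON container, the `frames` key, the `ObjectFrameEntry` class, and the quaternion form of `rotation_object` are assumptions for illustration, since the disclosure does not fix a serialization here.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class ObjectFrameEntry:
    object_name: str                                    # name of the 3D object
    frame_number: int                                   # frame this entry applies to
    translation_object: Tuple[float, float, float]      # position (x, y, z)
    rotation_object: Tuple[float, float, float, float]  # rotation, assumed quaternion (x, y, z, w)
    scale_object: Tuple[float, float, float]            # size along each axis


def to_scene_description(entries: List[ObjectFrameEntry]) -> str:
    """Serialize entries to JSON, ordered by frame number and then object name."""
    ordered = sorted(entries, key=lambda e: (e.frame_number, e.object_name))
    return json.dumps({"frames": [asdict(e) for e in ordered]}, indent=2)


entries = [
    ObjectFrameEntry("chair", 1, (0.5, 0.0, -2.0), (0.0, 0.0, 0.0, 1.0), (1.0, 1.0, 1.0)),
    ObjectFrameEntry("chair", 0, (0.0, 0.0, -2.0), (0.0, 0.0, 0.0, 1.0), (1.0, 1.0, 1.0)),
]
scene_json = to_scene_description(entries)
```

Keeping one entry per object per frame mirrors the text's description of frame-by-frame information, at the cost of repeating values that do not change between frames.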

The database 650 stores media such as 3D objects, 2D video, audio, and 360° video. The database 650 may transmit the required media files to the media processing device 200 in response to a transmission request or a read request from the media processing device 200.

(UI of the scene description editing device)
An example of the UI (User Interface) of the scene description editing device according to Modification 2 is described below. FIG. 22 is a diagram showing an example of the UI of the scene description editing device 600 according to Modification 2. The UI of the scene description editing device 600 may be regarded as an editing screen for editing the scene description.

As shown in FIG. 22, the editing screen (UI) may include a media list, a timeline, a preview, a scene description, and the like.

The media list is an area that displays a list of media, such as 3D objects, 2D video, audio, and 360° video, specified by the creator of the specific content (hereinafter simply "the creator").

The timeline is an area in which the elements that make up a scene, selected from the media list by the creator's operation, are arranged. The timeline includes an indicator (such as a line segment) that marks the time selected by the creator on the timeline, and the scene at the time marked by the indicator may be displayed in the preview as 2D video of the specific content.
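The relationship between the timeline and its indicator can be sketched as a lookup of the elements that are active at the indicated time. The `TimelineClip` structure, the half-open time interval, and the media names below are assumptions for illustration and are not defined in the disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimelineClip:
    media_name: str   # entry chosen from the media list
    start: float      # seconds from the start of the timeline
    duration: float   # seconds


def clips_at(timeline: List[TimelineClip], indicator_time: float) -> List[str]:
    """Return the media names whose clips cover the time marked by the indicator."""
    return [
        clip.media_name
        for clip in timeline
        if clip.start <= indicator_time < clip.start + clip.duration
    ]


timeline = [
    TimelineClip("360_video_stadium", start=0.0, duration=60.0),
    TimelineClip("3d_object_mascot", start=10.0, duration=20.0),
    TimelineClip("audio_commentary", start=0.0, duration=60.0),
]
```

The result of `clips_at` identifies which elements would have to be rendered for the preview at the indicated time.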

The preview is an area in which the specific content is displayed according to the viewpoint information (viewpoint position and viewpoint direction) specified by the creator's operation. The viewpoint information may be detected by the detection unit 610 described above. The creator may edit the scene description while checking the specific content displayed in the preview. The preview may be displayed on a head-mounted display used by the creator. Two or more previews may also be provided so that the same scene at the same time is displayed from different viewpoints, for example looking down from above or looking from east to west. Displaying the same scene from several different viewpoints makes it easy to check the arrangement of objects in three-dimensional space while editing the scene description.

The scene description is an area in which the scene description being edited by the creator is displayed. The scene description may include the specific viewpoint information (recommended viewport information) described in operation example 2 (see, for example, FIG. 7). The scene description may include the quality information and the placement information described in operation example 3 (see, for example, FIG. 10). The scene description may include the importance information described in operation example 4 (see, for example, FIG. 14). The scene description may include the information element that defines the movement range of the user's viewpoint position described in operation example 5 (see, for example, FIG. 17 and FIG. 18).

(Operation example)
An editing operation using the scene description editing device 600 is described below.

First, when the creator starts editing the scene description, the scene description editing device 600 may notify the media processing device 200, by means of RTSP SETUP, that editing of the scene description is starting. The RTSP SETUP may include the IP address of the scene description editing device 600, a listening port number, a request to operate in edit mode, and the like.
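The notification described above might look like the following RTSP SETUP request. This is only a sketch: the `Transport` header with `destination` and `client_port` follows ordinary RTSP/1.0 usage, whereas the `X-Operation-Mode` header, the example URL, and the function name are hypothetical, because the disclosure does not say how the edit-mode request is carried.

```python
def build_edit_setup_request(media_url: str, editor_ip: str, listen_port: int, cseq: int = 1) -> str:
    """Build an RTSP SETUP request announcing where the editor listens for the preview."""
    lines = [
        f"SETUP {media_url} RTSP/1.0",
        f"CSeq: {cseq}",
        # RTP over UDP, unicast, delivered to the editor's address and port pair.
        f"Transport: RTP/AVP;unicast;destination={editor_ip};"
        f"client_port={listen_port}-{listen_port + 1}",
        # Hypothetical extension header asking the media processing device for edit mode.
        "X-Operation-Mode: edit",
    ]
    return "\r\n".join(lines) + "\r\n\r\n"


request = build_edit_setup_request(
    "rtsp://media-processor.example/preview",  # placeholder URL
    editor_ip="192.0.2.10",                     # documentation-range address
    listen_port=5004,
)
```

A real exchange would continue with the server's response and a subsequent PLAY request, which are outside the scope of this sketch.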

Second, the viewpoint information (viewpoint position and viewpoint direction) can be changed by the creator's operation. When the creator changes the viewpoint information, the changed viewpoint information is transmitted from the scene description editing device 600 to the media processing device 200. This mechanism is the same as the mechanism by which the user terminal 300 transmits viewpoint information to the media processing device 200.

Third, parameters such as the position, size, and orientation of elements such as 3D objects can be changed by the creator's operation (editing). When the creator changes a parameter, the change is reflected in the scene description. The changed scene description is transmitted from the scene description editing device 600 to the media processing device 200 and rendered by the renderer 220 of the media processing device 200, and the rendered specific content is transmitted from the media processing device 200 to the scene description editing device 600. In other words, the specific content displayed in the preview is updated in response to the parameter change. The mechanism by which the specific content based on the scene description is transmitted from the media processing device 200 to the scene description editing device 600 is the same as the mechanism by which the specific content is transmitted from the media processing device 200 to the user terminal 300.
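The edit, render, and preview round trip described above can be sketched as follows. The `SceneEditor` class and the `fake_renderer` stub are hypothetical; the stub stands in for the media processing device 200 and returns a text summary rather than encoded video, so the sketch shows only the control flow, not an actual rendering or transport.

```python
import copy
from typing import Callable, Dict


class SceneEditor:
    """Keeps a scene description and refreshes the preview after every edit."""

    def __init__(self, scene: Dict[str, dict], render_remote: Callable[[dict, dict], str]):
        self.scene = scene
        self.viewpoint = {"position": (0.0, 0.0, 0.0), "direction": (0.0, 0.0, -1.0)}
        self._render_remote = render_remote  # stands in for the media processing device
        self.preview = self._render_remote(copy.deepcopy(self.scene), self.viewpoint)

    def set_parameter(self, object_name: str, parameter: str, value) -> None:
        # 1. Reflect the changed parameter in the scene description.
        self.scene[object_name][parameter] = value
        # 2. Send the updated description and receive the re-rendered specific content.
        self.preview = self._render_remote(copy.deepcopy(self.scene), self.viewpoint)


def fake_renderer(scene: dict, viewpoint: dict) -> str:
    """Stub renderer: returns a text summary instead of encoded video."""
    parts = [f"{name}@{obj['translation_object']}" for name, obj in sorted(scene.items())]
    return f"view from {viewpoint['position']}: " + ", ".join(parts)


editor = SceneEditor({"mascot": {"translation_object": (0.0, 0.0, -2.0)}}, fake_renderer)
before = editor.preview
editor.set_parameter("mascot", "translation_object", (1.0, 0.0, -2.0))
after = editor.preview
```

Because the renderer is passed in as a callable, the same editor logic would apply whether the rendering happens on a remote device or in the same process.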

Fourth, when the creator selects the icon 710 included in the editing screen (UI) ("Recommended view" in FIG. 22), the specific content corresponding to the specific viewpoint information (recommended viewport information) described in operation example 2 may be displayed in the preview.

Fifth, when the creator selects the icon 720 included in the editing screen (UI) ("Record recommended view" in FIG. 22), the viewpoint information corresponding to the specific content displayed in the preview may be recorded as the specific viewpoint information (recommended viewport information) described in operation example 2. The specific viewpoint information (recommended viewport information) recorded via the icon 720 may be reflected in the scene description. For example, the viewpoint information corresponding to the specific content displayed in the preview may be controlled through a head-mounted display, and the viewpoint information so controlled may be recorded as the specific viewpoint information (recommended viewport information) when the icon 720 is selected. In other words, the specific viewpoint information is generated in response to operation of the head-mounted display used by the creator of the specific content.
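The behavior of the two icons can be sketched as a pair of handlers on an editing session. The `EditorSession` class, its field names, and the rule of returning the latest recorded entry at or before the requested time are assumptions for illustration; the disclosure does not define how recorded viewports are stored or looked up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class EditorSession:
    current_time: float = 0.0                     # time marked on the timeline
    viewpoint_position: Vec3 = (0.0, 0.0, 0.0)    # e.g. reported by a head-mounted display
    viewpoint_direction: Vec3 = (0.0, 0.0, -1.0)
    recommended_viewports: List[Dict] = field(default_factory=list)

    def record_recommended_view(self) -> Dict:
        """Handler for icon 720: store the current viewpoint as a recommended viewport."""
        entry = {
            "time": self.current_time,
            "position": self.viewpoint_position,
            "direction": self.viewpoint_direction,
        }
        self.recommended_viewports.append(entry)
        return entry

    def recommended_view_at(self, time: float) -> Optional[Dict]:
        """Handler for icon 710: latest recorded entry at or before `time`, if any."""
        candidates = [v for v in self.recommended_viewports if v["time"] <= time]
        return max(candidates, key=lambda v: v["time"]) if candidates else None


session = EditorSession()
session.current_time = 12.0
session.viewpoint_position = (1.0, 1.6, 0.0)
session.record_recommended_view()
```

The recorded entries would then be written into the scene description as recommended viewport information.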

Sixth, the information element that defines the movement range of the user's viewpoint position described in operation example 5 (for example, "Viewing_space" shown in FIG. 18) may be specified by the creator's operation (such as a mouse operation). The specified information element defining the movement range ("Viewing_space") may be reflected in the scene description. Display of the information element defining the movement range ("Viewing_space") may be switchable between shown and hidden.
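How a movement range could constrain the viewpoint position is sketched below, assuming for simplicity that the range is an axis-aligned box. The `ViewingSpace` class and its `minimum`/`maximum` fields are hypothetical; the actual shape and syntax of "Viewing_space" are defined by the figures of the disclosure and may differ.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ViewingSpace:
    """Assumed axis-aligned box; the actual shape of Viewing_space may differ."""

    minimum: Vec3
    maximum: Vec3

    def contains(self, position: Vec3) -> bool:
        # True when every coordinate lies within the allowed range.
        return all(lo <= p <= hi for lo, p, hi in zip(self.minimum, position, self.maximum))

    def clamp(self, position: Vec3) -> Vec3:
        # Pull each coordinate back inside the allowed range.
        return tuple(min(max(p, lo), hi) for lo, p, hi in zip(self.minimum, position, self.maximum))


space = ViewingSpace(minimum=(-1.0, 0.0, -1.0), maximum=(1.0, 2.0, 1.0))
```

An editor could use `contains` to warn the creator when a recorded viewpoint falls outside the range, while a renderer could use `clamp` to keep a viewer inside it.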

In Modification 2, the scene description editing device 600 may be implemented separately from the media processing device 200 and connected to the media processing device 200 via a network. For example, the media processing device 200 may be implemented in the cloud as SaaS (Software as a Service). However, Modification 2 is not limited to this. For example, the scene description editing device 600 and the media processing device 200 may be implemented together as a single device (computer, hardware). In such a case, the encoding processing unit 230 and the decoding processing unit 620 described above may be omitted. In addition, "transmit" may be read as "output", and "receive" may be read as "acquire".

(Operation and effects)
In Modification 2, the scene description editing device 600 acquires the specific content generated by the media processing device 200 based on the scene description and the viewpoint information. With this configuration, assuming a case in which the specific content is generated (rendered) by the media processing device 200 and then transmitted from the media processing device 200 to the user terminal 300, the scene description editing device 600 can acquire the specific content from the media processing device 200 by the same mechanism as the user terminal 300. The scene description can therefore be edited appropriately while checking the specific content that is expected to be displayed on the user terminal 300.

[Other embodiments]
Although the present invention has been described by way of the above disclosure, the statements and drawings forming part of this disclosure should not be understood as limiting the invention. Various alternative embodiments, examples, and operating techniques will be apparent to those skilled in the art from this disclosure.

The above disclosure illustrates a case in which the specific content includes both the first content and the second content, but the disclosure is not limited to this. The specific content only needs to include at least the second content.

Although not specifically mentioned in the above disclosure, terms related to MMT may be interpreted based on the content specified in ISO/IEC 23008-1, ARIB STD-B60, ARIB TR-B39, and the like.

The above disclosure illustrates the MPU timestamp descriptor as the first absolute time information included in the MMT-SI. However, the disclosure is not limited to this. The first absolute time information included in the MMT-SI may be an MPU extended timestamp descriptor.

Although not specifically mentioned in the above disclosure, the media processing device 200 may request a part of the second content from the transmitting device 100 as necessary. Such a configuration saves the bandwidth needed to transmit the second content and suppresses an increase in the processing load on the media processing device 200.

The above disclosure illustrates MMTP as the transmission method of the first content. However, the disclosure is not limited to this. The transmission method of the first content may conform to ISO/IEC 23009-1 (hereinafter, MPEG-DASH (Dynamic Adaptive Streaming over HTTP)). In such a case, the first control information may be an MPD (Media Presentation Description). That is, in the above disclosure, MMT-SI may be read as MPD.

Although not specifically mentioned in the above disclosure, "acquire" may be read as "receive".

Although not particularly limited, operation example 2 may be expressed as follows. The transmitting device 100 includes a transmitting unit that transmits a configuration of content having viewpoint freedom, and the transmitting unit transmits specific viewpoint information used to generate specific content that includes at least the content. The receiving device includes a receiving unit that receives a configuration of content having viewpoint freedom, and the receiving unit receives specific viewpoint information used to generate specific content that includes at least the content. In such a case, the receiving device may be the media processing device 200 or the user terminal 300.

Although not particularly limited, operation example 3 may be expressed as follows. The transmitting device 100 includes a transmitting unit that transmits a configuration of content having viewpoint freedom. The transmitting unit transmits quality information of streams related to a three-dimensional object included in specific content that includes at least the content, and the quality information includes quality information for each of two or more streams whose quality differs depending on the orientation of the three-dimensional object. The receiving device includes a receiving unit that receives a configuration of content having viewpoint freedom. The receiving unit receives quality information of streams related to a three-dimensional object included in specific content that includes at least the content, and the quality information includes quality information for each of two or more streams whose quality differs depending on the orientation of the three-dimensional object. In such a case, the receiving device may be the media processing device 200 or the user terminal 300.

Although not particularly limited, operation example 4 may be expressed as follows. The transmitting device 100 includes a transmitting unit that transmits a configuration of content having viewpoint freedom, and the transmitting unit transmits importance information for each of two or more objects included in specific content that includes at least the content. The receiving device includes a receiving unit that receives a configuration of content having viewpoint freedom, and the receiving unit receives importance information for each of two or more objects included in specific content that includes at least the content. In such a case, the receiving device may be the media processing device 200 or the user terminal 300.

Although not particularly limited, operation example 5 may be expressed as follows. The transmitting device 100 includes a transmitting unit that transmits a configuration of content having viewpoint freedom, and the transmitting unit transmits an information element that defines the movement range of the user's viewpoint position in a three-dimensional space formed by specific content that includes at least the content. The receiving device includes a receiving unit that receives a configuration of content having viewpoint freedom, and the receiving unit receives an information element that defines the movement range of the user's viewpoint position in a three-dimensional space formed by specific content that includes at least the content. In such a case, the receiving device may be the media processing device 200 or the user terminal 300.

Although not specifically mentioned in the above disclosure, a program may be provided that causes a computer to execute the processes performed by the transmitting device 100, the media processing device 200, and the user terminal 300. The program may be recorded on a computer-readable medium. Using the computer-readable medium, the program can be installed on a computer. The computer-readable medium on which the program is recorded may be a non-transitory recording medium. The non-transitory recording medium is not particularly limited and may be, for example, a recording medium such as a CD-ROM or a DVD-ROM.

Alternatively, a chip may be provided that consists of a memory storing a program for executing the processes performed by the transmitting device 100, the media processing device 200, and the user terminal 300, and a processor that executes the program stored in the memory.

(Supplementary notes)
The above disclosure may be expressed as follows.

A first feature is a scene description editing device including: an output unit that outputs, to a media processing device, viewpoint information and a scene description that defines the configuration of content having viewpoint freedom; and an acquisition unit that acquires specific content that includes at least the content and that has been generated by the media processing device based on the scene description and the viewpoint information.

A second feature is the scene description editing device according to the first feature, in which the specific content generated by the media processing device is the specific content that is transmitted to a user terminal used by a viewer of the specific content.

A third feature is the scene description editing device according to the first or second feature, in which, when the scene description is updated, the output unit outputs the updated scene description to the media processing device.

A fourth feature is the scene description editing device according to at least one of the first to third features, in which the output unit outputs, to the media processing device, specific viewpoint information applied to the specific content.

A fifth feature is the scene description editing device according to at least one of the first to fourth features, in which the specific viewpoint information is generated in response to operation of a head-mounted display used by the creator of the specific content.

A sixth feature is the scene description editing device according to at least one of the first to fifth features, in which the output unit outputs, to the media processing device, an information element that defines the movement range of the user's viewpoint position in a three-dimensional space formed by the specific content.

10...transmission system, 100...transmitting device, 200...media processing device, 210...reception unit, 220...renderer, 230...encoding processing unit, 260...selection unit, 270...selection unit, 300...user terminal, 310...detection unit, 320...decoding processing unit, 330...renderer, 400...first user terminal, 500...second user terminal, 600...scene description editing device, 610...detection unit, 620...decoding processing unit, 630...renderer, 640...editing unit, 650...database

The present invention relates to a scene description editing device and a program.

The present invention has therefore been made to solve the problem described above, and aims to provide a scene description editing device and a program that make it possible to appropriately edit a scene description defining specific content when the rendering function of a media processing device is used by a user terminal.

According to the present invention, it is possible to provide a scene description editing device and a program that make it possible to appropriately edit a scene description defining specific content when the rendering function of a media processing device is used by a user terminal.

Although not specifically mentioned in the above disclosure, a program may be provided that causes a computer to execute each process performed by the transmission device 100, the media processing device 200, the user terminal 300, and the scene description editing device 600. The program may be recorded on a computer-readable medium, which can then be used to install the program on a computer. The computer-readable medium on which the program is recorded may be a non-transitory recording medium. The non-transitory recording medium is not particularly limited and may be, for example, a CD-ROM or a DVD-ROM.

Alternatively, a chip may be provided that comprises a memory storing a program for executing each process performed by the transmission device 100, the media processing device 200, the user terminal 300, and the scene description editing device 600, and a processor that executes the program stored in the memory.
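As one way to picture such a program for the scene description editing device 600, the following minimal sketch models the output unit and the acquisition unit as methods of a single class. It is an illustration under stated assumptions rather than the disclosed implementation: the class name `SceneDescriptionEditor`, the callable standing in for the media processing device 200, and the dictionary and byte-string data shapes are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical stand-in for the media processing device: it receives a scene
# description and viewpoint information and returns the rendered specific content.
Renderer = Callable[[dict, dict], bytes]


class SceneDescriptionEditor:
    def __init__(self, media_processor: Renderer) -> None:
        self._media_processor = media_processor
        self._scene: Optional[dict] = None
        self.specific_content: Optional[bytes] = None

    def output(self, scene: dict, viewpoint: dict) -> bytes:
        """Output unit and acquisition unit in one step: send the scene
        description and viewpoint information, then keep what comes back."""
        self._scene = dict(scene)  # remember the last scene description sent
        self.specific_content = self._media_processor(scene, viewpoint)
        return self.specific_content

    def update(self, scene: dict, viewpoint: dict) -> bool:
        """Send the scene description again only if it differs from the last
        one sent. Returns True when a new output took place."""
        if scene == self._scene:
            return False
        self.output(scene, viewpoint)
        return True
```

With this structure the creator sees the same specific content that the media processing device would generate for a viewer, because the editing device obtains its preview from the media processing device instead of rendering locally.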

Claims

A scene description editing device comprising:
an output unit that outputs, to a media processing device, a scene description and viewpoint information that define a configuration of content having viewpoint freedom; and
an acquisition unit that acquires specific content including at least the content, the specific content being generated by the media processing device based on the scene description and the viewpoint information.

The scene description editing device according to claim 1, wherein the specific content generated by the media processing device is the specific content to be transmitted to a user terminal used by a viewer of the specific content.

The scene description editing device according to claim 1, wherein the output unit outputs the updated scene description to the media processing device when the scene description is updated.

The scene description editing device according to claim 1, wherein the output unit outputs specific viewpoint information applied to the specific content to the media processing device.

The scene description editing device according to claim 4, wherein the specific viewpoint information is generated in response to an operation of a head-mounted display used by the creator of the specific content.

The scene description editing device according to claim 1, wherein the output unit outputs to the media processing device information elements that define a range of movement of a user's viewpoint position in a three-dimensional space formed by the specific content.