
WO2024161992A1 - Information processing device, information processing method, and program - Google Patents

Information processing device, information processing method, and program Download PDF

Info

Publication number
WO2024161992A1
WO2024161992A1 (PCT/JP2024/001050)
Authority
WO
WIPO (PCT)
Prior art keywords
playback
user
processing device
processing
audio
Prior art date
Application number
PCT/JP2024/001050
Other languages
French (fr)
Japanese (ja)
Inventor
彬人 中井
亨 中川
越 沖本
Original Assignee
ソニーグループ株式会社
Priority date
Filing date
Publication date
Application filed by ソニーグループ株式会社
Publication of WO2024161992A1 publication Critical patent/WO2024161992A1/en


Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04S — STEREOPHONIC SYSTEMS
    • H04S 7/00 — Indicating arrangements; Control arrangements, e.g. balance control

Definitions

  • This technology relates to an information processing device, an information processing method, and a program, and in particular to an information processing device, an information processing method, and a program that enable automatic switching between multiple types of playback processes for audio playback, including 3D audio playback.
  • Patent Document 1 discloses a 3D sound reproduction technology that applies a head-related transfer function (HRTF) to an audio signal to reproduce the original sound field in a reproduction environment such as a concert hall or movie theater in a reproduction sound field that differs in time or space from the original sound field.
  • This technology was developed in light of these circumstances, and makes it possible to automatically switch between multiple types of playback processing for audio playback, including 3D audio playback.
  • The information processing device or program of the present technology is an information processing device having a playback signal generation unit that applies, to an input audio signal, the playback process corresponding to the user's state from among multiple types of playback processes, including 3D audio playback processing and non-3D audio playback processing, and thereby generates an audio signal for playback, or a program for causing a computer to function as such an information processing device.
  • The information processing method of the present technology is an information processing method in which an information processing device having a playback signal generation unit applies, to an input audio signal, the playback process corresponding to the user's state from among multiple types of playback processes, including 3D audio playback processing and non-3D audio playback processing, and generates an audio signal for playback.
  • In the information processing device, information processing method, and program of the present technology, the playback process corresponding to the user's state is executed on the input audio signal from among multiple types of playback processes, including 3D audio playback processing and non-3D audio playback processing, and an audio signal for playback is generated.
  • FIG. 1 is a block diagram showing an example configuration of a playback processing device according to an embodiment to which the present technology is applied.
  • FIG. 2 is a diagram for explaining a first embodiment of playback mode switching of the playback processing device.
  • FIG. 3 is a diagram for explaining a second embodiment of playback mode switching of the playback processing device.
  • FIG. 4 is a flowchart showing an example of a processing procedure of the second embodiment of playback mode switching of the playback processing device.
  • FIG. 5 is a diagram showing an example of the configuration of an audio production system to which third and fourth embodiments of playback mode switching are applied.
  • FIG. 6 is a diagram showing a measurement flow in the measurement environment.
  • FIG. 7 is a block diagram showing an example of the configuration of a playback processing device to which the third embodiment of playback mode switching is applied.
  • FIG. 8 is a flowchart showing an example of a processing procedure of the third embodiment of playback mode switching of the playback processing device.
  • FIG. 9 is a block diagram showing an example of the configuration of a playback processing device to which the fourth embodiment of playback mode switching is applied.
  • FIG. 10 is a diagram illustrating an example of operation information that is displayed superimposed on an actual image of the console device when a user U operates the console device.
  • FIG. 11 is a flowchart showing an example of a processing procedure of the fourth embodiment of playback mode switching of the playback processing device.
  • FIG. 12 is a diagram illustrating an example of switching of the playback mode on an operation device connected to the playback processing device.
  • FIG. 13 is a block diagram showing an example of the configuration of an embodiment of a computer to which the present technology is applied.
  • FIG. 1 is a block diagram showing an example of the configuration of a playback processing device according to an embodiment to which the present technology is applied.
  • A playback processing device 11 is used for producing or editing the sound of content such as movies (hereinafter, the term "sound production" includes editing).
  • An audio playback device 12 such as headphones or speakers is connected to the playback processing device 11.
  • the playback processing device 11 has a playback signal generation unit 21, a sensing unit 22, a trigger generation unit 23, and multi-channel audio playback software 24.
  • the playback signal generating unit 21 may also have a function of generating multimedia information such as video and GUI related to the audio and supplying it to a video output device connected to the audio playback device 12.
  • video output devices include display devices such as monitors, VR goggles (HMD: Head Mounted Display), and AR goggles.
  • the sensing unit 22 detects (acquires) information indicating the state of the user for determining whether to switch the playback process in the playback signal generating unit 21.
  • Examples of information detected by the sensing unit 22 include an image of the user captured by a camera, the user's head posture detected by a head tracker, the user's line of sight detected by an eye tracker, user operations on a GUI (Graphical User Interface), and user operations on switches, buttons, etc.
  • the information (sensing information) acquired by the sensing unit 22 is supplied to the trigger generating unit 23.
  • the sensing unit 22 detects (acquires) any information necessary for determining whether to switch the playback process and indicating the state of an object related to the user as sensing information.
  • Sensing information that can be used appropriately includes biometric sensing information (head tracking, gaze direction, focal position, posture and position tracking), appearance information (person recognition, face recognition, headphone recognition using images obtained from a camera), positioning information using GPS or ultrasound, device input information (GUI information, equipment such as keyboards and controllers, headphone type information), etc.
  • the trigger generation unit 23 generates a trigger signal that instructs switching of the playback process based on the sensing information from the sensing unit 22, and supplies it to the playback signal generation unit 21. For example, when the sensing information matches a predetermined condition, the trigger generation unit 23 supplies the trigger signal to the playback signal generation unit 21.
  • the trigger generation unit 23 may be incorporated in the same hardware as the sensor that detects the sensing information, or may be incorporated in separate hardware or software that receives the sensing information, or may be included in the trigger receiving unit 31 of the playback signal generation unit 21.
  • the trigger signal corresponds to a signal that specifies the playback process to be executed by the playback signal generation unit 21 from among multiple types of playback processes that can be executed by the playback signal generation unit 21.
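  • As a way of picturing how sensing information could map to a trigger signal, here is a minimal, hypothetical sketch; the thresholds, field names, and the PlaybackMode enum are assumptions for illustration, not details from the publication.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class PlaybackMode(Enum):
    BINAURAL = auto()        # 3D audio playback processing
    MIXDOWN = auto()         # non-3D audio playback processing (2ch stereo)
    PASS_THROUGH = auto()


@dataclass
class SensingInfo:
    head_pitch_deg: float    # e.g. from a head tracker (positive = looking up)
    headphones_worn: bool    # e.g. from headphone recognition in a camera image


def generate_trigger(sensing: SensingInfo) -> Optional[PlaybackMode]:
    """Return the playback mode to switch to when a preset condition is met,
    or None when no switch is requested."""
    if sensing.head_pitch_deg < -20.0:      # looking down at the equipment at hand
        return PlaybackMode.MIXDOWN
    if sensing.head_pitch_deg > 10.0 and sensing.headphones_worn:
        return PlaybackMode.BINAURAL        # looking up while wearing headphones
    return None
```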
  • the multi-channel audio playback software 24 represents a processing unit that executes multi-channel audio playback software such as a DAW (Digital Audio Workstation), and generates (or edits) multi-channel (including 1-channel) audio signals.
  • The generated multi-channel audio signals are supplied to the playback signal generation unit 21.
  • the multi-channel audio playback software may be plug-in software that runs on a DAW, or may be separated from a DAW and run as a standalone application. In this case, the multi-channel audio signals are output from the DAW and input to the standalone application.
  • software routines other than DAWs may be used as the multi-channel audio playback software as long as they output multi-channel audio signals (such as data after rendering of object audio).
  • the playback signal generation unit 21 has a trigger receiving unit 31, a switching processing unit 32, a binaural processing unit 33, a 2ch mixdown/stereo playback processing unit 34, and a pass-through processing unit 35.
  • the trigger receiving unit 31 receives a trigger signal from the trigger generation unit 23 and determines whether or not a trigger signal has been supplied from the trigger generation unit 23.
  • the switching processing unit 32 switches between the binaural processing unit 33, the 2ch mixdown/stereo playback processing unit 34, and the pass-through processing unit 35, whichever processing unit performs playback processing on the multi-channel audio signals from the multi-channel audio playback software 24.
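  • To picture the role of the switching processing unit 32, the following is a minimal dispatch sketch; the audio is assumed to be a NumPy array of shape (channels, samples), the binaural branch is only a placeholder, and all function names are illustrative rather than taken from the publication.

```python
import numpy as np


def binaural_process(x: np.ndarray) -> np.ndarray:
    # Placeholder: a real implementation would convolve each channel with HRIRs.
    mono = x.mean(axis=0)
    return np.stack([mono, mono])


def mixdown_process(x: np.ndarray) -> np.ndarray:
    mono = x.mean(axis=0)                    # naive equal-weight 2ch mixdown
    return np.stack([mono, mono])


def apply_playback_process(mode: str, x: np.ndarray) -> np.ndarray:
    """Route the multi-channel signal to the processing selected by the trigger."""
    if mode == "binaural":
        return binaural_process(x)
    if mode == "mixdown":
        return mixdown_process(x)
    return x                                 # pass-through
```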
  • the binaural processing unit 33 performs binaural playback processing (binaural processing), which is one of the 3D sound playback methods.
  • the 3D sound playback method is an audio playback technology that reproduces input signals to both ears of a listener in an original sound field (including a virtual original sound field) such as a concert hall or movie theater, using headphones or speakers at the entrance of the listener's ear canal in a playback sound field that is different in time or space from the original sound field.
  • the binaural processing unit 33 performs filtering processing to convolve a head-related transfer function (HRTF) in order to reflect the transfer characteristics of a specific space (original sound field) in the audio signal from the multi-channel audio playback software 24 as binaural processing.
  • binaural playback is premised on playback using headphones, but the binaural processing unit 33 is not limited to binaural playback and also includes cases where any playback processing (3D audio playback processing) classified as a 3D sound playback method is performed.
  • Examples of playback processing classified as 3D sound playback methods include, in addition to binaural playback, transaural playback processing (transaural processing) that assumes playback using two speakers.
  • transaural processing includes processing to remove crosstalk, but the binaural processing unit 33 may also perform transaural processing.
  • the binaural processing unit 33 may perform processing of an appropriate 3D sound playback method depending on the type of audio playback device 12 connected to the playback processing device 11.
  • the transfer characteristics of sound that can be applied to the binaural processing (3D sound reproduction processing) of the binaural processing unit 33, or to 3D sound reproduction processing instead of binaural processing, include HRTF (Head-Related Transfer Function), HRIR (Head-Related Impulse Response), a combination of HRTF and RTF (Room transfer function), a combination of HRIR and RIR (Room Impulse Response), BRTF (Binaural Room Transfer Function), BRIR (Binaural Room Impulse Response), and any of the transfer functions or their impulse responses from the headphones to the eardrum (entrance of the ear canal), or a combination of these.
  • The 2ch mixdown/stereo playback processor 34 performs mixdown playback processing (mixdown processing) on the multi-channel audio signal from the multi-channel audio playback software 24 to generate an audio signal for 2ch stereo playback. Furthermore, if the audio signal from the multi-channel audio playback software 24 is a 1ch (monaural) audio signal, the 2ch mixdown/stereo playback processor 34 generates an audio signal for 2ch stereo playback from it. Note that in the following description, the 2ch mixdown/stereo playback processor 34 is assumed to perform only mixdown processing; the case where a monaural audio signal is supplied from the multi-channel audio playback software 24 is not considered.
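  • As one concrete form the mixdown processing could take (a conventional downmix, not one specified in the publication), the following sketch downmixes an assumed 5.1 channel layout to 2ch stereo with typical gains.

```python
import numpy as np


# Channel order assumed here: L, R, C, LFE, Ls, Rs (an assumption, not from the text).
def mixdown_5_1_to_stereo(x: np.ndarray) -> np.ndarray:
    """Downmix a (6, samples) 5.1 signal to 2ch stereo with conventional gains."""
    l, r, c, lfe, ls, rs = x
    g = 1.0 / np.sqrt(2.0)                   # -3 dB for center and surround channels
    left = l + g * c + g * ls
    right = r + g * c + g * rs
    return np.stack([left, right])           # LFE is commonly omitted from a 2ch downmix
```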
  • the pass-through processing unit 35 supplies the multi-channel audio signals from the multi-channel audio playback software 24 directly to the corresponding channels of the audio playback device 12 without performing mix-down processing or the like on the multi-channel audio signals from the multi-channel audio playback software 24. Audio signals from the multi-channel audio playback software 24 corresponding to channels that do not exist in the audio playback device 12 are not supplied from the pass-through processing unit 35 to the audio playback device 12. However, the pass-through processing unit 35 can also perform signal routing processing. In this case, the multi-channel audio signals from the multi-channel audio playback software 24 are each supplied to the specified channels of the audio playback device 12.
  • the playback signal generating unit 21 switches the playback process (acoustic process) of the multi-channel audio signal from the multi-channel audio playback software 24 between binaural processing in the binaural processing unit 33 and mixdown processing in the 2ch mixdown/stereo playback processing unit 34 based on a trigger signal from the trigger generating unit 23.
  • the playback process method executed by the playback signal generating unit 21 is referred to as the playback mode (or audio playback mode) of the playback processing device 11 or the playback signal generating unit 21, and the playback mode is switched between binaural processing and mixdown processing.
  • Switching the HRTF applied to the binaural processing in the binaural processing unit 33 or adjusting the applied HRTF (parameter) also corresponds to switching the playback mode.
  • the playback mode of the playback processing device 11 or the playback signal generating unit 21 is also simply referred to as the playback mode.
  • This technology can be applied to cases where multiple types of playback processes, including 3D sound playback processes such as binaural processing and non-3D sound playback processes such as 2ch mixdown processing and pass-through processing, are switched and executed according to the user's state, and there is no particular limit to the types and number of playback processes that can be switched and executed.
  • Fig. 2 is a diagram for explaining a first embodiment of switching of the playback mode of the playback processing device 11.
  • Fig. 2 illustrates various peripheral devices of an audio production system including the playback processing device 11, and a user U.
  • the user U is a producer who uses the playback processing device 11 to produce the audio of content such as a movie.
  • the console machine 41 is a device connected to the playback processing device 11 and inputs the user U's operations on the multi-channel audio playback software 24.
  • the console machine 41 may be, for example, an operating device such as a mixing console, a keyboard, or a mouse.
  • Monitors 42A and 42B are connected to the playback processing device 11, and work in conjunction with the multi-channel audio playback software 24 to display images such as GUI information and production screens to the user U.
  • the number of monitors 42A and 42B is not limited to two, and may be one or three or more.
  • the camera 43 is connected to the playback processing device 11, and mainly supplies images of the user U to the playback processing device 11 as images detected by the sensing unit 22.
  • the speaker 45 is connected to the playback processing device 11, and outputs the audio signal supplied from the playback processing device 11 as sound waves.
  • the headphones 44 are connected to the playback processing device 11 and are worn on the head of the user U.
  • the headphones 44 output the 2ch audio signals supplied from the playback processing device 11 as sound waves near the left and right ears (external ear inlets).
  • the speaker 45 outputs the audio signals supplied from the playback processing device 11 instead of the headphones 44 or in addition to the headphones 44.
  • a user U registers the following setting information a through d in advance regarding switching of playback modes (playback processes).
  • the setting information a sets whether or not automatic switching of the playback mode is enabled according to the setting information b to d.
  • the playback mode is switched according to the setting information b to d only when automatic switching of the playback mode is set to ON.
  • The setting information b specifies the playback mode (type of playback processing) that is initially set, and the playback mode that is set when the playback mode is not determined by c or d.
  • the setting information c is the condition (condition c) for setting the playback mode to mixdown processing (playback processing of c)
  • the setting information d is the condition (condition d) for setting the playback mode to binaural processing (playback processing of d).
  • For conditions c and d, a specific state of the user U is set that can be identified from the sensing information acquired by the sensing unit 22 of the playback processing device 11, and that is a state (including actions, operations, etc.) other than an operation performed solely for the purpose of switching the playback mode.
  • the specific position is set as the position of the gaze point that determines the condition c or d.
  • the position where the user U is gazing at may be specified based on the information of the head tracker and eye tracker acquired by the sensing unit 22, the captured image of the camera, and the like.
  • the specific direction is set as the facial direction that determines the condition c or d.
  • the facial direction of the user U may be specified based on the information of the head tracker and eye tracker acquired by the sensing unit 22, the captured image of the camera, and the like. For example, when two monitors 42A and 42B are used as shown in FIG. 2, the position of the gaze point or the facial direction of the user U on one monitor (e.g., monitor 42A) may be registered as the setting information of c in which the playback mode is set to mixdown processing.
  • the position of the gaze point or the facial direction of the user U on the other monitor may be registered as the setting information of d in which the playback mode is set to binaural processing.
  • When the user U selects a specific window from one or more windows displayed on the screen of the monitor 42A or 42B, it may be determined that condition c or d is satisfied; in this case, the specific window is set as the type of selected window that determines condition c or d.
  • the type of selected window selected by the user U can be identified from GUI information acquired by the sensing unit 22, information on user operation, and the like.
  • the direction of the user U's line of sight may be a specific direction (between multiple displays, outside the display screen, a specific direction in the room) that is registered as the setting condition of c or d, or the state of the headphones attached or detached (worn or removed) for the user U may be registered as the setting condition of c or d.
  • The setting information b to d may be preset rather than being set by the user.
  • The following modes can be adopted for switching the playback mode, for example (a minimal sketch of the first mode follows this list).
  • In a first mode, the playback process of c is set as the playback mode while the condition of c is satisfied, the playback process of d is set as the playback mode while the condition of d is satisfied, and when neither condition is satisfied, the playback mode is set to the default playback mode set in b.
  • In a second mode, once set, the playback process of c remains the playback mode until the condition of d is satisfied, and the playback process of d remains the playback mode until the condition of c is satisfied.
  • In a third mode, only one of the setting information of c and d is set. For example, when the setting information of c is set, the playback process of c is set as the playback mode while the condition of c is satisfied, and the playback process of d is set as the playback mode otherwise.
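  • The sketch below illustrates the first mode, assuming hypothetical condition callables for c and d; the function name and the gaze-based example conditions are assumptions for illustration, not details from the publication.

```python
from typing import Callable, Dict


def select_playback_mode(
    auto_switch_enabled: bool,               # setting information a
    default_mode: str,                       # setting information b
    condition_c: Callable[[Dict], bool],     # condition c -> mixdown processing
    condition_d: Callable[[Dict], bool],     # condition d -> binaural processing
    sensing: Dict,
    current_mode: str,
) -> str:
    """First mode: apply c or d while its condition holds, otherwise fall back to b."""
    if not auto_switch_enabled:
        return current_mode
    if condition_c(sensing):
        return "mixdown"
    if condition_d(sensing):
        return "binaural"
    return default_mode


# Example: condition c = gazing at monitor 42A, condition d = gazing at monitor 42B.
mode = select_playback_mode(
    auto_switch_enabled=True,
    default_mode="mixdown",
    condition_c=lambda s: s.get("gazed_monitor") == "monitor_A",
    condition_d=lambda s: s.get("gazed_monitor") == "monitor_B",
    sensing={"gazed_monitor": "monitor_B"},
    current_mode="mixdown",
)
```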
  • Since multiple types of playback processes related to audio playback, including 3D audio playback, are automatically switched based on preset conditions, the user does not need to manually switch between playback processes. For example, in audio production work it is easier for the producer to work while listening to audio that has not been subjected to 3D audio playback processing, whereas when checking the created audio it is easier to judge its quality by listening to audio that has been subjected to 3D audio playback processing, which reproduces the audio heard in the actual playback environment (original sound field). The producer often repeats the production work and the confirmation work many times, and manually switching the playback process (playback mode) each time is troublesome and inefficient.
  • In the present embodiment, the playback mode is automatically switched according to preset conditions. For example, if the state of the producer when performing audio production work is a state of facing down toward his or her hands (gazing at the production/editing equipment at hand), the playback mode is set to mixdown processing under the condition of that state. If the state of the producer when checking the production result is a state of looking up (not gazing at the production/editing equipment at hand), the playback mode is set to binaural processing under the condition of that state.
  • The playback mode may also be switched according to other states, such as the user's line of sight or the position of the gaze point, rather than the facial direction of the user U (the direction the front of the head faces).
  • To determine which range a facial direction or gaze position falls into, the orientations and positions that form the boundaries of each range are determined in advance, and the orientation or position to be judged is compared with those boundary orientations and positions (a minimal sketch of such a range test follows).
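  • A minimal sketch of such a range test; the boundary values and monitor names below are purely illustrative assumptions.

```python
def classify_head_yaw(yaw_deg: float) -> str:
    """Compare a head yaw angle against pre-registered range boundaries."""
    boundaries = {"monitor_A": (-60.0, -5.0), "monitor_B": (5.0, 60.0)}  # assumed values
    for name, (low, high) in boundaries.items():
        if low <= yaw_deg <= high:
            return name
    return "none"
```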
  • FIG. 3 is a diagram for explaining a second embodiment of the switching of the playback mode of the playback processing device 11.
  • the switching of the image presented to the user is performed in conjunction with the switching of the audio playback mode.
  • the user U is a producer at a place where the work of sound production is performed (production work place) and a listener who listens to the sound of the playback sound field.
  • the user U wears VR goggles (HMD) 51 with headphones.
  • the VR goggles 51 are connected to the playback processing device 11, and the playback processing device 11 supplies the VR goggles 51 with a 2ch audio signal to be played by the headphones of the VR goggles 51 and a video signal to be displayed on the VR goggles 51.
  • the video signal supplied from the playback processing device 11 to the VR goggles 51 is switched between a video signal of a CG (Computer Graphics) image generated by the playback processing device 11 as shown in A of FIG. 3 and a video signal of a real-life image captured by a camera (VR outside world camera) as shown in B of FIG. 3.
  • A in FIG. 3 is a CG image of a virtual space (CG space) that reproduces, with CG, the space of the original sound field of the sound produced using the multi-channel audio playback software 24, for example a CG image reproducing a movie theater screen from the viewpoint of a listener in a specified seat.
  • The image of A in FIG. 3 need not be a CG image; it may be a live-action image obtained by photographing the space of the original sound field.
  • B in FIG. 3 is a live-action image captured by the VR external camera of the VR goggles 51 in the front direction (or line of sight) of the head of the user U who produces sound with the playback processing device 11.
  • the live-action image shows, for example, the production equipment of the sound production system (peripheral equipment connected to the playback processing device 11, etc.) that is placed in the production work area, such as the console machine 41 and monitors 42A and 42B shown in FIG. 2.
  • B in FIG. 3 may not be a live-action image of the production work area, but may be a CG image that imitates the production work area.
  • the production work location simulated as a CG image is not limited to an actual production work location, but may be a virtual production work location, and when a virtual production work location is used, the production equipment (monitors, input devices, etc.) used in the audio production system may also be virtual equipment (equipment that does not actually exist).
  • the images displayed on these VR goggles 51 are automatically switched in conjunction with the switching of the audio playback mode.
  • The playback processing device 11 switches to a different playback mode when the user U looks up and when the user U looks toward his or her hands (down). Specifically, when the user U looks up, the playback mode is set to binaural processing, and when the user U looks toward his or her hands, the playback mode is set to mixdown processing. Here, looking up or toward the hands refers to the face (front of the head) or line of sight (point of gaze) of the user U being directed up or toward the hands (down), and this is determined based on the sensing information acquired by the sensing unit 22.
  • In conjunction with this, the image displayed on the VR goggles 51 is switched between images A and B in Fig. 3: when the playback mode is binaural processing, a CG image like that of Fig. 3A, which reproduces the original sound field space using CG, is displayed, and when the playback mode is mixdown processing, a live-action image like that of Fig. 3B, which shows the user U's hands (surroundings) at the production work location, is displayed.
  • multiple types of playback processes related to audio playback are automatically switched based on preset conditions, eliminating the need for the user to manually switch between playback processes.
  • When the user U looks down to perform audio production work or other tasks in the area where the peripheral devices are present, the user U can view the live-action video displayed on the VR goggles 51 and can easily operate the peripheral devices through that video.
  • the user U can listen to the mixdown-processed audio that has not been subjected to 3D audio playback processing through headphones, and can perform audio production work while listening to audio suitable for audio production.
  • When the user U looks up to check the production results or the like, the user U can view the CG video of the original sound field space displayed on the VR goggles 51 and can visually recognize that space.
  • At the same time, the user U can listen through headphones to the binaurally processed audio, that is, audio subjected to 3D audio playback processing.
  • The user U can thus check the created audio as it would be heard when played back in the original sound field environment. Therefore, the user U can properly judge the quality of the created audio while listening to it as it would sound in the original sound field and while visually grasping the space of the original sound field through the CG image.
  • the CG image of the space seen by the listener in the movie theater is displayed as the CG image of A in FIG. 3, which is displayed on the VR goggles 51 and presented to user U. This allows user U to visually grasp the spatial state of the original sound field, such as the listening position of the listener in the movie theater and the arrangement of the screen and speakers relative to the listening position.
  • User U can then listen to the sound output from the movie theater speakers and heard by the listener as the sound of the playback sound field that has been binaurally processed. Thus, user U can confirm whether the sound produced using the multi-channel sound playback software 24 is appropriate when listened to at the listening position in the movie theater recognized by the CG image. If the confirmed audio is not appropriate, the user U repeats the audio production (editing) work using the multi-channel audio playback software 24 until it is appropriate.
  • the audio and video viewed by the user U are automatically switched between mixdown processed audio and video (video from the production work site) suitable for the audio production work, and binaural processed audio and video (video from the movie theater) suitable for confirming the production results, thus dramatically improving work efficiency.
  • the video presented to user U can be automatically switched in conjunction with the automatic switching of the playback mode in the playback processing device 11, which reduces the effort required for user U to manually switch between audio production work and checking the production results, significantly improving work efficiency.
  • In step S1, the sensing unit 22 acquires sensing information indicating the state of the user.
  • In step S2, when the trigger generation unit 23 detects, based on the sensing information acquired in step S1 and the predetermined setting information, that a condition for switching from one of binaural processing and mixdown processing to the other is satisfied, it supplies a trigger signal indicating that to the playback signal generation unit 21.
  • When the trigger reception unit 31 of the playback signal generation unit 21 receives the trigger signal, it notifies the switching processing unit 32 of the switch to the playback processing indicated by the trigger signal.
  • The switching processing unit 32 determines the playback mode to be set based on the information from the trigger reception unit 31 and the current playback mode (playback processing).
  • In step S3, the switching processing unit 32 judges whether the playback mode to be set is binaural processing. If the judgment in step S3 is affirmative, the process proceeds to step S4; if negative, the process proceeds to step S7.
  • In step S4, the switching processing unit 32 enables binaural processing in the binaural processing unit 33.
  • The binaural processing unit 33 performs binaural processing on the multi-channel audio signal supplied from the multi-channel audio playback software 24.
  • In step S5, the binaural processing unit 33 generates an audio signal (2-channel audio signal) for playback on the audio playback device 12 based on the binaurally processed audio signal, and outputs it to the audio playback device 12.
  • In step S6, the playback signal generation unit 21 generates a CG image of the space of the original sound field, and outputs the video signal of the generated CG image to a display device, such as VR goggles, that the user views.
  • After step S6, the process of this flowchart ends. Note that the process of this flowchart is executed repeatedly.
  • In step S7, to which the process proceeds when the judgment in step S3 is negative, the switching processing unit 32 enables mixdown processing in the 2ch mixdown/stereo playback processing unit 34.
  • The 2ch mixdown/stereo playback processing unit 34 performs mixdown processing on the multi-channel audio signal supplied from the multi-channel audio playback software 24.
  • In step S8, the 2ch mixdown/stereo playback processing unit 34 outputs the 2ch audio signal after mixdown processing to the audio playback device 12 as the audio signal for playback on the audio playback device 12.
  • In step S9, the playback signal generating unit 21 acquires live-action video of the production work location (the space at hand) from the sensing unit 22 (camera) and outputs the video signal of the live-action video to a display device, such as VR goggles, that the user views (a hypothetical end-to-end sketch of steps S1 to S9 is given below).
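  • The following is a hypothetical, self-contained sketch of steps S1 to S9 for one block of audio; the pitch threshold, signal shapes, and helper names are assumptions, and the binaural step is reduced to a naive per-channel HRIR convolution rather than the publication's actual processing.

```python
import numpy as np


def binaural_process(x: np.ndarray, hrir_l: np.ndarray, hrir_r: np.ndarray) -> np.ndarray:
    """Convolve each input channel with its left/right HRIR and sum to 2ch."""
    n = x.shape[1]
    left = sum(np.convolve(ch, h)[:n] for ch, h in zip(x, hrir_l))
    right = sum(np.convolve(ch, h)[:n] for ch, h in zip(x, hrir_r))
    return np.stack([left, right])


def mixdown_process(x: np.ndarray) -> np.ndarray:
    """Equal-weight 2ch mixdown of a (channels, samples) signal."""
    mono = x.mean(axis=0)
    return np.stack([mono, mono])


def playback_cycle(sensing: dict, x: np.ndarray, hrir_l: np.ndarray, hrir_r: np.ndarray):
    """One pass of steps S1 to S9 for a block of multi-channel audio."""
    # S1-S2: derive the requested playback mode from the sensing information
    looking_up = sensing.get("head_pitch_deg", 0.0) > 10.0
    if looking_up:                                       # S3: binaural branch
        audio = binaural_process(x, hrir_l, hrir_r)      # S4-S5
        video = "cg_image_of_original_sound_field"       # S6
    else:                                                # S7: mixdown branch
        audio = mixdown_process(x)                       # S7-S8
        video = "live_action_image_of_work_space"        # S9
    return audio, video
```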
  • In the third and fourth embodiments of playback mode switching of the playback processing device 11, not only is the playback mode switched between binaural processing and mixdown processing based on sensing information indicating the state of the user, but the HRTF applied to the binaural processing is also switched or adjusted.
  • In the third and fourth embodiments, switching the playback mode to mixdown processing is not necessarily required. Therefore, in the explanation of the third and fourth embodiments, it is assumed that the playback processing device 11 only switches or adjusts the HRTF applied to the binaural processing as the playback mode switching.
  • The processing of the first or second embodiment may be combined so that the playback mode is also switched to mixdown processing.
  • Fig. 5 is a diagram showing an example of the configuration of an audio production system to which the third and fourth embodiments of playback mode switching are applied.
  • the measurement environment represents the measurement environment when the HRTF applied to the binaural processing of the playback processing device 11 is actually measured in advance.
  • the transfer characteristic of audio such as BRTF can be applied to the binaural processing instead of the HRTF, and in that case, the transfer characteristic applied to the binaural processing instead of the HRTF may be measured in the measurement environment.
  • the measurement environment is exemplified by a movie theater as an original sound field.
  • a movie theater as an original sound field is also called a dubbing stage, and is the space of the original sound field that is reproduced as the sound of the playback sound field in sound production.
  • the playback environment represents the playback environment in which the sound of the original sound field is reproduced as the sound of the playback sound field in the sound production location used in sound production.
  • the sound production location is a location different from the original sound field, such as a studio or the producer's home, but it may be the same location as the original sound field.
  • the measurement processing device 81 shown in the measurement environment acquires an HRTF corresponding to the acoustic characteristics of the original sound field, such as a movie theater, and generates a BRTF file (described later).
  • the measurement processing device 81 also acquires condition information indicating the conditions when the HRTF was measured, and stores the condition information in the BRTF file together with the HRTF.
  • the playback processing device 11 in the playback environment corresponds to the playback processing device 11 in FIG. 1, and the headphones 44 are one form of the audio playback device 12 in FIG. 1 connected to the playback processing device 11.
  • the headphones 44 may be headphones attached to the VR goggles 51 in FIG. 3, or may be other audio playback devices.
  • the playback processing device 11 acquires the BRTF file generated by the measurement processing device 81, and sets parameters to be used in binaural processing based on the data in the BRTF file.
  • the playback processing device 11 may be able to acquire the BRTF file via a network such as the Internet, or may be able to acquire the BRTF file by using a recording medium such as a flash memory.
  • Figure 6 shows the measurement flow in the measurement environment.
  • HRTF measurements are taken with the subject sitting in a designated seat in the cinema, with a microphone attached to their ear.
  • playback sound is output from the cinema's speaker 91, and the HRTF from the speaker 91 to the ear (e.g., ear canal position, eardrum position) is measured.
  • spatial shape data indicating the shape of the movie theater is acquired as condition information.
  • the width, height, and depth of the theater are recorded as spatial shape data as the smallest elements that indicate the shape of the theater.
  • information indicating more detailed shapes such as vertex information or point clouds, may also be recorded as spatial shape data.
  • position information of speaker 91 which is the measurement sound source (original sound source) used in measuring HRTF, is acquired as condition information. For example, coordinates indicating the position of speaker 91 in the movie theater and the position on the spatial shape data of the theater that corresponds to the origin of those coordinates are recorded as position information of speaker 91.
  • measurement position information indicating the subject's position (measurement position) when the HRTF is measured and measurement posture information indicating the posture (measurement posture) are acquired as condition information.
  • For example, coordinates indicating the subject's position in the movie theater, and the position on the spatial shape data of the theater that corresponds to the origin of those coordinates, are recorded as measurement position information.
  • the Euler angles of the subject's head are recorded as measurement posture information.
  • the measurement processing device 81 stores the HRTF and condition information measured in the above manner in a BRTF file.
  • the BRTF file stores group data consisting of the same type of data for each combination of positions A to C and postures 1 to 3, for example.
  • the group data for each combination includes spatial shape data, position information of the measurement sound source (original sound source), measurement position information, measurement posture information, transfer characteristic data from the headphones 44 to the ears, and HRTF measurement data measured with the subject sitting in a seat at each position in each measurement posture.
  • the spatial shape data, position information of the measurement sound source, and transfer characteristic data from the headphones 44 to the ears are common regardless of the combination of positions A to C and postures 1 to 3, so they may be stored in the BRTF file as data outside the group data.
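  • The BRTF file layout described above could be pictured with a data structure along the following lines; every field name here is an assumption made for illustration, not an actual file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class GroupData:
    measurement_position: Tuple[float, float, float]    # e.g. seat position A, B, or C
    measurement_posture: Tuple[float, float, float]     # head Euler angles (posture 1-3)
    hrtf_coefficients: List[List[float]]                # measured HRTF data per speaker/ear


@dataclass
class BRTFFile:
    spatial_shape: Dict[str, float]                     # e.g. width/height/depth of the theater
    source_positions: List[Tuple[float, float, float]]  # positions of the measurement speakers
    headphone_transfer: List[float]                     # headphones-to-ear transfer characteristic
    groups: List[GroupData] = field(default_factory=list)
```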
  • <Third embodiment of playback mode switching of the playback processing device 11> In the third embodiment of playback mode switching of the playback processing device 11, the HRTF applied to binaural processing is switched, and the audio playback mode is thereby switched, based on sensing information indicating the state of the user U.
  • In the third embodiment, the images presented to the user U through the VR goggles 51 are only CG images. Note that live-action images (such as images of peripheral devices at the production work site) may also be displayed depending on the state of the user U.
  • FIG. 7 is a block diagram showing an example configuration of a playback processing device 11 to which a third embodiment of playback mode switching is applied.
  • FIG. 7 shows blocks that are not shown in the block diagram of the playback processing device 11 in FIG. 1, but some of the blocks in FIG. 7 are subdivisions of the blocks shown in FIG. 1, and some of the blocks shown in FIG. 1 are omitted in FIG. 7.
  • the playback processing device 11 has a BRTF file acquisition unit 101, an audio control unit 102, and a display control unit 103. Note that the audio control unit 102 and the display control unit 103 are, with some exceptions, included in the playback signal generation unit 21 in FIG. 1.
  • The BRTF file acquisition unit 101 acquires a BRTF file generated by the measurement processing device 81 of FIG. 5.
  • the acquired BRTF file is preferably a file that stores measurement data measured using a producer (user U) who produces audio using the playback processing device 11 as the subject, but is not limited to this.
  • the BRTF file includes coefficient data, spatial information, and measurement posture information.
  • the coefficient data corresponds to the HRTF measurement data.
  • Binaural processing can be performed by convolution processing using an FIR (Finite Impulse Response) filter. At that time, the coefficient of the FIR filter is set based on the characteristics of the HRTF to be applied to the binaural processing.
  • the coefficient data of the BRTF file represents the HRTF measurement data as coefficient data of the FIR filter, and the process of calculating the coefficient of the FIR filter from the HRTF measurement data may be performed by the audio control unit 102 or the like after reading the HRTF measurement data from the BRTF file.
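  • As an illustration of how HRTF measurement data might be turned into FIR filter coefficients, the following sketch takes an inverse FFT of a measured spectrum and windows the result; the function name, tap count, and Hann fade-out are assumptions, not the publication's procedure.

```python
import numpy as np


def hrtf_to_fir(hrtf_spectrum: np.ndarray, n_taps: int = 512) -> np.ndarray:
    """Turn a single-ear HRTF (complex one-sided spectrum) into FIR coefficients:
    inverse FFT to an impulse response (HRIR), then truncate with a fade-out window."""
    hrir = np.fft.irfft(hrtf_spectrum)
    taps = hrir[: min(n_taps, len(hrir))]
    fade = np.hanning(2 * len(taps))[len(taps):]   # decaying half of a Hann window
    return taps * fade
```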
  • the spatial information includes spatial shape data, position information of the measurement sound source (original sound source), and measurement position information.
  • the coefficient data, spatial information, and measurement posture information each include a plurality of data measured at a plurality of measurement positions (positions A to C in FIG. 6) and a plurality of measurement postures (postures 1 to 3 in FIG. 6) that are linked (associated) with the measurement positions and measurement postures.
  • the reproduction processing device 11 acquires FIR filter coefficient data from the BRTF file as measurement data of HRTF that specifies the content of binaural processing, but the measurement data acquired to specify the content of binaural processing does not have to be FIR filter coefficient data. Since the content of binaural processing can be specified by acoustic characteristics (transfer characteristics) such as HRTF in the original sound field, the reproduction processing device 11 may acquire information on the acoustic characteristics in the original sound field.
  • the reproduction processing device 11 may theoretically calculate information on the acoustic characteristics in the original sound field based on the spatial shape, the position of the original sound source, the measurement position (listening position), etc., instead of acquiring information on the acoustic characteristics in the original sound field obtained by actual measurement from the BRTF file.
  • the audio control unit 102 includes the binaural processing unit 33 of FIG. 1.
  • the audio control unit 102 has a coefficient reading unit 111, a convolution processing unit 112, and an audio playback processing unit 113.
  • the coefficient reading unit 111 obtains information (playback posture information) on the posture (playback posture) of the user U, who is the producer, at the time of audio playback (current time) from the playback posture information acquisition unit 126, and reads coefficient data (HRTF measurement data) corresponding to the playback posture of the user U from the BRTF file.
  • the coefficient data corresponding to the playback posture is coefficient data corresponding to the HRTF measured in a measurement posture close to the playback posture.
  • the coefficient data obtained by the coefficient reading unit 111 is, for example, coefficient data measured at a measurement position specified in advance by the user U.
  • The BRTF file may include only data measured at one measurement position. In that case, the coefficient reading unit 111 reads, from among the coefficient data corresponding to the HRTFs measured at that measurement position, the coefficient data measured in the measurement posture closest to the playback posture (a lookup sketch follows).
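  • A minimal sketch of such a nearest-posture lookup; the function name and the use of Euclidean distance over Euler angles are assumptions for illustration.

```python
import numpy as np


def select_coefficients(playback_posture, groups):
    """groups: iterable of (measurement_posture, coefficient_data) pairs; return the
    coefficient data whose measurement posture is closest to the playback posture."""
    target = np.asarray(playback_posture, dtype=float)
    best = min(groups, key=lambda g: np.linalg.norm(np.asarray(g[0], dtype=float) - target))
    return best[1]
```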
  • the convolution processing unit 112 sets the coefficients of the FIR filter based on the coefficient data read by the coefficient reading unit 111.
  • the convolution processing unit 112 performs convolution processing using an FIR filter on the audio signal supplied from the multi-channel audio playback software 24 in FIG. 1. This performs binaural processing on the audio signal supplied from the multi-channel audio playback software 24, applying an HRTF according to the posture of the user U.
  • the audio playback processing unit 113 outputs the audio signal binaurally processed by the convolution processing unit 112 to the audio playback device 12 in FIG. 1.
  • the display control unit 103 generates CG images to be displayed on a display device such as VR goggles.
  • the display control unit 103 has a spatial information reading unit 121, a CG model acquisition unit 122, a CG data storage unit 123, a measurement position information reading unit 124, a CG space drawing unit 125, a playback attitude information acquisition unit 126, and an image drawing processing unit 127.
  • the spatial information reading unit 121 reads the spatial shape data contained in the spatial information and the position information of the measurement sound source (original sound source) from the BRTF file acquired by the BRTF file acquisition unit 101.
  • the CG model acquisition unit 122 acquires material data of a 3D model corresponding to objects (walls, floors, ceilings, speakers, screens, seats, etc.) present in the original sound field from the CG data storage unit 123 based on the spatial shape data and the positional information of the measured sound source (original sound source) from the spatial information reading unit 121, and generates a CG model that mimics the space of the original sound field in a virtual space (CG space).
  • the measurement position information reading unit 124 reads the measurement position information included in the spatial information from the BRTF file acquired by the BRTF file acquisition unit 101.
  • the CG space drawing unit 125 renders the CG space generated by the CG model acquisition unit 122 to generate a 2D CG image.
  • the position of the virtual camera (viewpoint) during rendering is set to a position in the CG space corresponding to the measurement position in the original sound field based on the measurement position information acquired by the measurement position information reading unit 124.
  • the attitude of the virtual camera (viewpoint) during rendering is set to a posture corresponding to the posture of the user U at the production work site based on the playback posture of the user U at the current time acquired by the playback posture information acquisition unit 126.
  • measurement position information specified by the user U in advance is read by the measurement position information reading unit 124, and the virtual camera during rendering is set to a position in the CG space corresponding to the measurement position information.
  • the measurement position information referenced as the position of the virtual camera in the CG space is the same as the measurement position information associated with the coefficient data acquired by the coefficient reading unit 111.
  • the CG space rendering unit 125 generates, by rendering, a 2D CG image captured by a virtual camera set in the CG space.
  • the playback posture information acquisition unit 126 acquires playback posture information of the user U at the time of audio playback by the playback processing device 11 (current time) based on the sensing information of the sensing unit 22 in FIG. 1.
  • the playback posture information of the user U is, for example, the posture of the user U's head.
  • the posture of the user U's head can be recognized from head tracker information acquired by the sensing unit 22.
  • the head tracker can detect the posture of the user U's head using an IMU (Inertial Measurement Unit) installed in the VR goggles.
  • When the sensing unit 22 acquires an image captured by a camera that captures the user U, the posture of the user U's head may be detected from that captured image.
  • the video rendering processing unit 127 generates a video signal for displaying the 2D CG video generated by the CG space rendering unit 125 on a display device such as VR goggles connected to the playback processing device 11, and outputs the signal to the display device.
  • With the above configuration, the user U, who is a producer performing audio production at the audio production work site, can listen, through binaurally processed audio, to the sound that would be heard when the audio produced by the multi-channel audio playback software 24 is played in the original sound field.
  • When the user U changes the playback posture, the HRTF is changed according to that posture, and the user U hears the sound that a listener in the original sound field would hear if he or she changed posture in the same way.
  • The user U is also presented with a CG image viewed from the listening position in the original sound field, and when the user U changes the playback posture, the space of the original sound field that the listener would see after the same change of posture is presented as a CG image. Therefore, the user U can perform audio production work and check the production results while experiencing realistic audio and CG images.
  • Since the playback process (playback mode) is switched automatically, the effort required for the user U to switch playback modes is reduced, and work efficiency is significantly improved.
  • Fig. 8 is a flow chart showing an example of a processing procedure of the third embodiment of the playback mode switching of the playback processing device 11.
  • In step S11, the BRTF file acquisition unit 101 acquires the BRTF file generated by the measurement processing device 81 of Fig. 5.
  • In step S12, the spatial information reading unit 121 reads the spatial shape data included in the spatial information and the position information of the measurement sound source (original sound source) from the BRTF file acquired in step S11.
  • In step S13, the measurement position information reading unit 124 reads the measurement position information from the BRTF file acquired in step S11.
  • In step S14, the playback posture information acquisition unit 126 acquires the playback posture information of the user U at the current time.
  • In step S15, the CG model acquisition unit 122 acquires material data of a 3D model corresponding to the objects (walls, floors, ceilings, speakers, screens, seats, etc.) that exist in the original sound field from the CG data storage unit 123 based on the spatial information (spatial shape data and position information of the measured sound source) read in step S12, and generates a CG model that imitates the space of the original sound field in a virtual space (CG space).
  • In step S16, the CG space rendering unit 125 sets the position and attitude of the virtual camera for rendering the CG space based on the measurement position information read in step S13 and the playback posture information acquired in step S14, and generates a 2D CG image by rendering.
  • In step S17, the image rendering processing unit 127 generates an image signal for displaying the CG image generated in step S16 on a display device connected to the playback processing device 11, and outputs it to the display device.
  • In step S18, the coefficient reading unit 111 reads coefficient data corresponding to the playback posture of the user U from the BRTF file acquired in step S11, based on the playback posture information acquired in step S14.
  • In step S19, the convolution processing unit 112 sets the coefficients of an FIR filter based on the coefficient data read in step S18, and performs convolution processing (binaural processing) using the FIR filter on the audio signal supplied from the multi-channel audio playback software 24 in FIG. 1.
  • In step S20, the audio playback processing unit 113 outputs the audio signal binaurally processed by the convolution processing unit 112 to the audio playback device 12 in FIG. 1.
  • In the third embodiment, only switching or adjustment of the HRTF applied to binaural processing is performed as the playback mode switching, but the playback mode switching may also include switching to mixdown processing.
  • In that case, when mixdown processing is selected, live-action footage or CG footage of the production work site may be displayed on the display device.
  • <Fourth embodiment of playback mode switching of the playback processing device 11> In the fourth embodiment of playback mode switching of the playback processing device 11, the HRTF (or BRTF) applied to binaural processing is adjusted, and the audio playback mode is thereby switched, based on sensing information indicating the state of the user U. Also, in the fourth embodiment, the image presented to the user U through the VR goggles 51 is switched between CG images and real-life images in conjunction with the switching of the playback mode, as in the second embodiment.
  • FIG 9 is a block diagram showing an example configuration of a playback processing device 11 to which the fourth embodiment of the playback mode switching is applied.
  • the playback processing device 11 in Figure 9 is common to the playback processing device 11 in Figure 7 in that it has a BRTF file acquisition unit 101, an audio control unit 102, and a display control unit 103.
  • the audio control unit 102 in Figure 9 has a coefficient reading unit 111, a convolution processing unit 112, an audio playback processing unit 113, a reverberation amount adjustment setting value reading unit 141, and a reverberation amount adjustment processing unit 142.
  • the display control unit 103 in FIG. 9 also includes a space information reading unit 121, a CG model acquisition unit 122, a CG data storage unit 123, a measurement position information reading unit 124, a playback attitude information acquisition unit 126, a video rendering processing unit 127, a CG space rendering/video output switching unit 131, a hand space video acquisition unit 132, and a user operation information acquisition unit 133.
  • the audio control unit 102 in Fig. 9 is common to the audio control unit 102 in Fig. 7 in that it has a coefficient reading unit 111, a convolution processing unit 112, and an audio playback processing unit 113.
  • the audio control unit 102 in Fig. 9 differs from the audio control unit 102 in Fig. 7 in that it newly has a reverberation amount adjustment setting value reading unit 141 and a reverberation amount adjustment processing unit 142.
  • The display control unit 103 in Fig. 9 is common to the display control unit 103 in Fig. 7 in that it has a spatial information reading unit 121, a CG model acquisition unit 122, a CG data storage unit 123, a measurement position information reading unit 124, a playback posture information acquisition unit 126, and a video rendering processing unit 127.
  • the display control unit 103 in FIG. 9 differs from the display control unit 103 in FIG. 7 in that it has a CG space rendering/video output switching unit 131 instead of the CG space rendering unit 125 in FIG. 7, and in that it newly has a hand space video acquisition unit 132 and a user operation information acquisition unit 133.
  • the reverberation adjustment setting value reading unit 141 obtains playback posture information of the user U, who is the producer at the time of audio playback (current time), from the playback posture information acquisition unit 126, and reads the reverberation adjustment setting value corresponding to the playback posture of the user U.
  • the reverberation adjustment setting value is a value that adjusts the RTF (room transfer function) or RIR (room impulse response) in binaural processing.
  • the reverberation adjustment setting value can be coefficient data (that generates reverberation) according to the RTF that is added to the coefficient data acquired from the BRTF file.
  • the reverberation adjustment setting value is set to a value that is predetermined according to the playback posture of the user U.
  • a first playback mode in which the sound heard in the original sound field environment is reproduced by binaural processing, and a second playback mode in which the sound heard in the environment of the sound production work place is reproduced by binaural processing are switched depending on the playback posture of the user U.
  • In the first playback mode, a setting value is used as the reverberation adjustment setting value such that coefficient data that generates a large reverberation is added to the coefficient data acquired from the BRTF file.
  • In the second playback mode, a setting value is used such that coefficient data that generates a small reverberation is added.
  • a reverberation adjustment setting value according to the playback posture of the user U may be used.
  • a mixdown process may be performed instead of binaural processing.
  • the reverberation adjustment setting value can be a value (gain) that adjusts the magnitude of coefficient data acquired from the BRTF file that is greatly influenced by RTF (has a large influence on reverberation).
  • the reverberation adjustment setting value is set to a value that is predetermined according to the playback posture of the user U. For example, as in the above, the first playback mode and the second playback mode are switched according to the playback posture of the user U.
  • In the first playback mode, a gain that generates a large reverberation is used as the reverberation adjustment setting value for the coefficient data acquired from the BRTF file.
  • In the second playback mode, a gain that generates a small reverberation is used.
  • a reverberation adjustment setting value according to the playback posture of the user U may be used.
  • a mixdown process may be performed instead of a binaural process.
  • the reverberation amount adjustment processing unit 142 adjusts the coefficient data based on the coefficient data acquired from the BRTF file by the coefficient reading unit 111 and the reverberation adjustment setting value read by the reverberation amount adjustment setting value reading unit 141, thereby adjusting the coefficient data so that reverberation characteristics according to the reverberation adjustment setting value are added, and generates coefficients of an FIR filter that take into account the reverberation characteristics according to the reverberation adjustment setting value (playback posture).
  • the generated coefficients are set as coefficients of the FIR filter in the convolution processing unit 112, and binaural processing is performed.
  • the coefficient data acquired by the coefficient reading unit 111 from the BRTF file may be coefficient data associated with a measurement posture corresponding to the playback posture of the user U, as in the third embodiment, or may be coefficient data associated with fixed measurement posture information regardless of the playback posture of the user U.
  • the reverberation adjustment setting value may be a value corresponding to the line of sight of the user U, rather than a value corresponding to the playback posture of the user U.
  • the first playback mode may be selected when the user U faces a direction other than toward the hands (down), and the second playback mode may be selected when the user U faces toward the hands (down).
  • When the user U faces toward the hands (down), the reverberation adjustment setting value may be a value that generates almost no reverberation so that sound production work is easy, and when the user U faces up, a value that generates the reverberation generated in the original sound field so that the production results can be properly confirmed.
  • the case where the user U faces up or toward the hands is not limited to the case where the front of the user U's head faces up or toward the hands, but may be the case where the line of sight of the user U faces up or toward the hands.
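As an illustration of how a reverberation adjustment setting value of the kind described above could be combined with the coefficient data read from the BRTF file, the following is a minimal Python/NumPy sketch. It is not the patent's implementation: the array layout, the helper names build_fir_coefficients and binaural_filter, and the idea of representing the setting value as a simple gain on an added reverberation tail are assumptions made only for illustration.

```python
import numpy as np

def build_fir_coefficients(brir: np.ndarray, reverb_tail: np.ndarray, reverb_gain: float) -> np.ndarray:
    """Add reverberation-generating coefficient data to the BRIR read from the BRTF file."""
    length = max(len(brir), len(reverb_tail))
    fir = np.zeros(length)
    fir[:len(brir)] += brir
    fir[:len(reverb_tail)] += reverb_gain * reverb_tail   # setting value scales the added reverberation
    return fir

def binaural_filter(audio: np.ndarray, fir: np.ndarray) -> np.ndarray:
    """Convolve one channel of audio with the FIR filter (binaural processing for one ear)."""
    return np.convolve(audio, fir)[:len(audio)]
```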
  • the CG space rendering/video output switching unit 131 like the CG space rendering unit 125 of the third embodiment, sets the position and orientation of the virtual camera based on the measurement position information acquired by the measurement position information reading unit 124 and the playback orientation information acquired by the playback orientation information acquisition unit 126, renders the CG space generated by the CG model acquisition unit 122, and generates a 2D CG image.
  • the generated CG image is supplied to the video rendering processing unit 127 (in the case of the first playback mode described above).
  • the CG space rendering/video output switching unit 131 when the CG space rendering/video output switching unit 131 detects that the user U has turned toward his/her hands based on the playback orientation information acquired by the playback orientation information acquisition unit 126, the CG space rendering/video output switching unit 131 switches from generating CG images to acquiring a live-action image of the user U's hand space acquired by the hand space image acquisition unit 132, and supplies it to the video rendering processing unit 127.
  • the hand space image acquisition unit 132 can acquire, for example, an image captured by an external camera installed in VR goggles or the like as a live-action image of the hand space.
  • the image rendering processing unit 127 generates an image signal for displaying the CG image or live-action image from the CG space rendering/image output switching unit 131 on a display device connected to the playback processing device 11, and outputs the signal to the display device.
  • When the user U operates an operation device such as the console device 41, the user operation information acquisition unit 133 acquires the operation content as user operation information.
  • When the CG space rendering/video output switching unit 131 supplies live-action video to the video rendering processing unit 127 for display on the display device, it superimposes information indicating the user's operation content (operation information) on the live-action video based on the user operation information from the user operation information acquisition unit 133.
  • As a result, the operation information is presented on the display device superimposed on the live-action video.
  • FIG. 10 is a diagram illustrating an example of operation information displayed superimposed on the live-action image of the console device 41 when the user U operates the console device 41.
  • In FIG. 10, the console device 41 is shown as a live-action image.
  • Operation information 161 is displayed superimposed on this.
  • the operation information 161 includes an enlarged view (CG image) of the operated part of the console device 41 and information on the numerical value (edited value) changed by the operation.
  • the operation information as shown in FIG. 10 may be superimposed on the CG image and presented to the user U.
  • the operation information may be only the numerical value changed by the operation, and the operation information is not limited to the form shown in FIG. 10.
  • the operation information may be displayed superimposed on the live-action image of the operation device.
  • Alternatively, a CG image may be used instead of a live-action image, with the operation information displayed superimposed on a CG image of the console device 41; in that case, instead of the console device 41, a flat surface, a box, a panel with unevenness, or the like that imitates the console device 41 can be placed at the user U's hands.
  • Furthermore, the sensation of operating the faders, knobs, etc. of the console device 41 can be reproduced by haptic reproduction technology. In this case, even though the console device 41 is not actually at the user U's hands, it can be made to appear to the user U as if it were.
  • the user U can enjoy the sensation of being in a studio or a workroom where sound production work is carried out, even if the sound production work place is not a studio or a workroom where sound equipment such as the actual console 41 is located.
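A rough sketch of superimposing operation information on a live-action frame is shown below, assuming OpenCV is available and that the frame is a NumPy image array; the function name, the panel layout, and the label/value strings are illustrative assumptions rather than part of the described device.

```python
import cv2
import numpy as np

def overlay_operation_info(frame: np.ndarray, label: str, value: str) -> np.ndarray:
    """Superimpose the operated parameter name and its edited value on a live-action frame."""
    out = frame.copy()
    cv2.rectangle(out, (20, 20), (420, 90), (0, 0, 0), -1)                 # background panel
    cv2.putText(out, f"{label}: {value}", (30, 65),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2, cv2.LINE_AA)
    return out
```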
  • the user U who is a producer performing audio production at the audio production work site, can listen to the binaurally processed audio that is the same as the audio that would be heard if the audio produced by the multi-channel audio playback software 24 were played in the original sound field.
  • the HRTF is adjusted to take into account the reverberation characteristics according to the playback posture, and the user U can listen to the audio that a listener in the original sound field would hear if they similarly changed their posture.
  • the user U is presented with automatic switching between the CG image viewed at the listening position in the original sound field and the live-action image of the space in front of him/her. Therefore, the image presented to the user U is automatically switched in conjunction with the automatic switching of the playback mode, so that the effort required for the user U to manually switch between the audio production work and the work of checking the production results can be reduced, and work efficiency is significantly improved.
  • In step S45, the CG space drawing/video output switching unit 131 judges whether the output image is a CG image or a live-action image (hand image) based on the playback posture information acquired in step S44.
  • If it is judged in step S45 that the output image is a live-action image, the process proceeds to step S49, and the CG space drawing/video output switching unit 131 acquires a live-action image of the hand space from the hand space image acquisition unit 132.
  • In step S50, the CG space rendering/video output switching unit 131 acquires user operation information from the user operation information acquisition unit 133.
  • In step S51, the video rendering processing unit 127 generates a video signal for displaying on the display device the live-action video acquired in step S49 or the live-action video on which the operation information has been superimposed in step S50, and outputs the video signal to the display device. The process proceeds from step S51 to step S52.
  • In step S52, the reverberation adjustment setting value reading unit 141 reads the reverberation adjustment setting value corresponding to the playback posture of user U based on the playback posture information acquired in step S44.
  • In step S53, the coefficient reading unit 111 reads coefficient data corresponding to the playback posture of user U from the BRTF file acquired in step S41 based on the playback posture information acquired in step S44. Note that the coefficient data read by the coefficient reading unit 111 may be coefficient data corresponding to a specific playback posture regardless of the playback posture of user U.
  • In step S54, the reverberation adjustment processing unit 142 adjusts the coefficient data based on the coefficient data acquired in step S53 and the reverberation adjustment setting value acquired in step S52, and generates coefficients of an FIR filter that take into account the reverberation amount according to the playback posture.
  • In step S55, the convolution processing unit 112 sets the coefficients generated in step S54 as the coefficients of an FIR filter, and performs convolution processing (binaural processing) using the FIR filter on the audio signal supplied from the multi-channel audio playback software 24 in FIG. 1.
  • In step S56, the audio playback processing unit 113 outputs the audio signal that has been binaurally processed by the convolution processing unit 112 to the audio playback device 12 in FIG. 1.
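The following sketch strings steps S44 to S56 together as one hypothetical processing iteration. It reuses the build_fir_coefficients helper sketched earlier, represents the playback posture as a single downward pitch angle, and keys the BRTF coefficient data and reverberation settings by a coarse 'front'/'hands' label; all of these representations are simplifying assumptions and not the device's actual data model.

```python
import numpy as np

def process_block(audio_block, pitch_deg, brtf_coeffs, reverb_settings,
                  cg_renderer, hand_camera, hands_threshold_deg=60.0):
    """One hypothetical iteration covering roughly steps S44-S56.

    audio_block     : (channels, samples) array from the multi-channel playback software.
    pitch_deg       : downward head pitch in degrees, standing in for the playback posture.
    brtf_coeffs     : {'front'/'hands': (left BRIR, right BRIR)} coefficient data.
    reverb_settings : {'front'/'hands': (reverb tail, gain)} reverberation adjustment values.
    cg_renderer     : callable(posture) -> CG frame of the original sound field.
    hand_camera     : callable() -> live-action frame of the hand space.
    """
    facing_hands = pitch_deg > hands_threshold_deg                     # S45: CG or hand image?
    frame = hand_camera() if facing_hands else cg_renderer(pitch_deg)  # S46-S49

    key = 'hands' if facing_hands else 'front'
    tail, gain = reverb_settings[key]                                  # S52: read setting value
    brir_l, brir_r = brtf_coeffs[key]                                  # S53: read coefficient data
    fir_l = build_fir_coefficients(brir_l, tail, gain)                 # S54 (helper sketched earlier)
    fir_r = build_fir_coefficients(brir_r, tail, gain)

    mono = audio_block.mean(axis=0)                                    # simplification: fold channels
    left = np.convolve(mono, fir_l)[:mono.shape[0]]                    # S55: binaural processing
    right = np.convolve(mono, fir_r)[:mono.shape[0]]
    return frame, np.stack([left, right])                              # S51 video / S56 audio output
```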
  • the switching of the playback mode in the playback processing device 11 may be performed in coordination with the operation of an arbitrary operation device connected to the playback processing device 11.
  • In FIG. 12, a controller 171 and a console device 172 are illustrated as examples of operation devices connected to the playback processing device 11.
  • the sensing unit 22 in Fig. 1 acquires the tilt angle of a joystick 171A of the controller 171.
  • the trigger generation unit 23 outputs a trigger signal for switching the playback mode when a specific tilt angle of the joystick 171A is detected.
  • the tilt angle of the joystick 171A is linked to the distance from the center of the listener's head in the original sound field to the original sound source.
  • When the distance corresponding to the tilt angle of the joystick 171A falls within one range, the trigger generation unit 23 sets the playback mode to mixdown processing, or outputs a trigger signal to change the HRTF applied to binaural processing.
  • When the distance falls within the other range, the trigger generation unit 23 sets the playback mode to binaural processing, or outputs a trigger signal to change the HRTF applied to binaural processing.
  • the joystick 171A can also be used to change the coordinate position of the original sound source (or sound image) in the original sound field. For example, it can be used to manipulate the distance of the original sound source relative to an origin at a predetermined position in the original sound field, or to manipulate the horizontal or elevation angle of the original sound source relative to the origin.
  • the trigger generation unit 23 may output a trigger signal according to the distance between the original sound source and the origin.
  • the position of the slider 172A of the console device 172 may be linked to the distance from the center of the listener's head in the original sound field to the original sound source, and the trigger generation unit 23 may output a trigger signal according to the position of the slider 172A.
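A minimal sketch of deriving a trigger from such an operation device is given below, assuming the joystick tilt or slider position is normalized to a 0-1 value linked to the source distance; the threshold, the mode labels, the mapping direction, and the function signature are illustrative assumptions.

```python
def trigger_from_control(value: float, threshold: float, current_mode: str):
    """Map a joystick tilt angle or fader/slider position to a playback mode.

    value        : normalized control value (0.0-1.0) linked to the distance from the
                   centre of the listener's head to the original sound source.
    threshold    : distance boundary at which the playback mode should change.
    current_mode : playback mode currently in effect ('binaural' or 'mixdown').
    Returns (new_mode, emit_trigger).
    """
    new_mode = 'mixdown' if value < threshold else 'binaural'
    return new_mode, new_mode != current_mode
```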
  • the above-mentioned series of processes can be executed by hardware or software.
  • When the series of processes is executed by software, the program constituting the software is installed in a computer.
  • the computer includes a computer built into dedicated hardware, and a general-purpose personal computer, for example, capable of executing various functions by installing various programs.
  • FIG. 13 is a block diagram showing an example of the hardware configuration of a computer that executes the above-mentioned series of processes using a program.
  • In the computer, a CPU (Central Processing Unit) 201, a ROM (Read Only Memory) 202, and a RAM (Random Access Memory) 203 are interconnected by a bus 204.
  • Further connected to the bus 204 is an input/output interface 205. Connected to the input/output interface 205 are an input unit 206, an output unit 207, a storage unit 208, a communication unit 209, and a drive 210.
  • the input unit 206 includes a keyboard, mouse, microphone, etc.
  • the output unit 207 includes a display, speaker, etc.
  • the storage unit 208 includes a hard disk, non-volatile memory, etc.
  • the communication unit 209 includes a network interface, etc.
  • the drive 210 drives removable media 211 such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory.
  • the CPU 201 loads a program stored in the storage unit 208, for example, into the RAM 203 via the input/output interface 205 and the bus 204, and executes the program, thereby performing the above-mentioned series of processes.
  • the program executed by the computer (CPU 201) can be provided by being recorded on removable media 211, such as package media, for example.
  • the program can also be provided via a wired or wireless transmission medium, such as a local area network, the Internet, or digital satellite broadcasting.
  • a program can be installed in the storage unit 208 via the input/output interface 205 by inserting the removable medium 211 into the drive 210.
  • the program can also be received by the communication unit 209 via a wired or wireless transmission medium and installed in the storage unit 208.
  • the program can be pre-installed in the ROM 202 or storage unit 208.
  • the program executed by the computer may be a program in which processing is performed chronologically in the order described in this specification, or a program in which processing is performed in parallel or at the required timing, such as when called.
  • the processing performed by a computer according to a program does not necessarily have to be performed in chronological order according to the order described in the flowchart.
  • the processing performed by a computer according to a program also includes processing that is executed in parallel or individually (for example, parallel processing or processing by objects).
  • the program may be processed by one computer (processor), or may be distributed among multiple computers. Furthermore, the program may be transferred to a remote computer for execution.
  • a system refers to a collection of multiple components (devices, modules (parts), etc.), regardless of whether all the components are in the same housing. Therefore, multiple devices housed in separate housings and connected via a network, and a single device in which multiple modules are housed in a single housing, are both systems.
  • the configuration described above as one device (or processing unit) may be divided and configured as multiple devices (or processing units).
  • the configurations described above as multiple devices (or processing units) may be combined and configured as one device (or processing unit).
  • configurations other than those described above may also be added to the configuration of each device (or processing unit).
  • part of the configuration of one device (or processing unit) may be included in the configuration of another device (or other processing unit).
  • this technology can be configured as cloud computing, in which a single function is shared and processed collaboratively by multiple devices via a network.
  • the above-mentioned program can be executed in any device.
  • In that case, it is sufficient that the device has the necessary functions (functional blocks, etc.) and is able to obtain the necessary information.
  • each step described in the above flowchart can be executed by one device, or can be shared and executed by multiple devices.
  • When one step includes multiple processes, the multiple processes included in that one step can be executed by one device, or can be shared and executed by multiple devices.
  • multiple processes included in one step can be executed as multiple step processes.
  • processes described as multiple steps can be executed collectively as one step.
  • processing of the steps that describe a program executed by a computer may be executed chronologically in the order described in this specification, or may be executed in parallel, or individually at the required timing, such as when a call is made. In other words, as long as no contradictions arise, the processing of each step may be executed in an order different from the order described above. Furthermore, the processing of the steps that describe this program may be executed in parallel with the processing of other programs, or may be executed in combination with the processing of other programs.
  • the present technology can also be configured as follows.
  • An information processing device having a playback signal generation unit that performs a playback process corresponding to a user's state from among multiple types of playback processes including 3D audio playback processing and non-3D audio playback processing on an input audio signal to generate an audio signal for playback.
  • the 3D sound reproduction process is a process for reflecting acoustic characteristics of a space in the input audio signal.
  • the non-3D sound reproduction process is a process for changing a number of channels of the input audio signal.
  • the information processing device executes a playback process selected based on sensing information indicating a state of the user.
  • the information processing device according to any one of (1) to (5), wherein the state of the user is an operation state of an operation member used for purposes other than switching the playback process executed by the playback signal generating section.
  • the 3D sound reproduction process is a process of convolving a transfer function corresponding to the acoustic characteristics with the input audio signal.
  • (8) The information processing device according to (7), wherein the 3D sound reproduction process is a process using an FIR filter.
  • the information processing device, wherein the transfer function is a head-related transfer function, a binaural transfer function, or a combination of a head-related transfer function and a room transfer function.
  • the information processing device, wherein the reproduction signal generation unit uses the acoustic characteristics actually measured in the space.
  • the reproduction signal generation unit uses the acoustic characteristic corresponding to the current posture of the user among the acoustic characteristics actually measured in a plurality of postures.
  • the information processing device according to any one of (7) to (11), wherein the reproduction signal generation unit switches the reproduction process to be executed by changing or adjusting the transfer function.
  • the information processing device according to any one of (2) and (7) to (12), further comprising a display control unit that outputs a CG image of the space having the acoustic characteristics reflected by the 3D audio reproduction process.
  • the display control unit outputs the CG image in conjunction with execution of the 3D sound reproduction process by the reproduction signal generation unit.
  • the CG image is an image captured by a virtual camera of a CG space that reproduces the space having the acoustic characteristics that the 3D audio playback processing reflects on the input audio signal.
  • the CG image is an image captured by changing the posture of the virtual camera in accordance with the posture of the user.
  • the display control unit outputs a live-action video in conjunction with execution of the non-3D sound reproduction process by the reproduction signal generation unit.
  • the live-action image is an image captured by a camera around the user.
  • An information processing method of an information processing device having a playback signal generation unit, the method including the playback signal generation unit executing a playback process corresponding to a user's state from among multiple types of playback processes including 3D sound playback processing and non-3D sound playback processing on an input audio signal, to generate an audio signal for playback.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

The present technology relates to an information processing device, an information processing method, and a program designed to enable automatic switching between multiple types of playback processing related to audio playback, including 3D sound playback. Out of multiple types of playback processing including 3D sound playback processing and non-3D sound playback processing, playback processing corresponding to the state of a user is executed on an inputted audio signal to generate an audio signal for playback.

Description

情報処理装置、情報処理方法、及び、プログラムInformation processing device, information processing method, and program
 本技術は、情報処理装置、情報処理方法、及び、プログラムに関し、特に、3D音響再生を含む音声再生に関する複数種の再生処理の切替えを自動的に行えるようにした情報処理装置、情報処理方法、及び、プログラムに関する。 This technology relates to an information processing device, an information processing method, and a program, and in particular to an information processing device, an information processing method, and a program that enable automatic switching between multiple types of playback processes for audio playback, including 3D audio playback.
 特許文献1には、音声信号に対して頭部伝達関数(HRTF:Head-Related Transfer Function)を適用することで、コンサートホールや映画館などの再生環境での原音場を、原音場とは時間又は空間が異なる再生音場で再現する3D音響再生の技術が開示されている。 Patent Document 1 discloses a 3D sound reproduction technology that applies a head-related transfer function (HRTF) to an audio signal to reproduce the original sound field in a reproduction environment such as a concert hall or movie theater in a reproduction sound field that differs in time or space from the original sound field.
特開2017-195581号公報JP 2017-195581 A
 音響制作や編集等で制作した音声を、実際の再生環境で聴取して確認したい場合や、再生環境に影響されていない状態で聴取したい場合などがあり、制作した音声に対して複数種の再生処理(フィルタ処理)を適用して聴取することがあるが、適用する再生処理を切り替える作業は音響制作の作業効率の低下を招く。 There are times when you want to listen to the audio produced during sound production or editing in an actual playback environment, or when you want to listen to it without being affected by the playback environment, and so on. In these cases, multiple types of playback processing (filter processing) may be applied to the produced audio before listening to it, but the task of switching the playback processing applied reduces the work efficiency of sound production.
 本技術はこのような状況に鑑みてなされたものであり、3D音響再生を含む音声再生に関する複数種の再生処理の切替えを自動的に行えるようにする。 This technology was developed in light of these circumstances, and makes it possible to automatically switch between multiple types of playback processing for audio playback, including 3D audio playback.
 本技術の情報処理装置、又は、プログラムは、入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する再生信号生成部を有する情報処理装置、又は、そのような情報処理装置として、コンピュータを機能させるためのプログラムである。 The information processing device or program of this technology is an information processing device having a playback signal generation unit that performs playback processing corresponding to the user's state from among multiple types of playback processing, including 3D audio playback processing and non-3D audio playback processing, on an input audio signal to generate an audio signal for playback, or a program for causing a computer to function as such an information processing device.
 本技術の情報処理方法は、再生信号生成部を有する情報処理装置の前記再生信号生成部が、入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する情報処理方法である。 The information processing method of the present technology is an information processing method in which an information processing device having a playback signal generation unit performs a playback process corresponding to the user's state from among multiple types of playback processes, including 3D sound playback processing and non-3D sound playback processing, on an input audio signal to generate an audio signal for playback.
 本技術の情報処理装置、情報処理方法、及び、プログラムにおいては、入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理が実行されて再生用の音声信号が生成される。 In the information processing device, information processing method, and program of the present technology, a playback process that corresponds to the user's state is executed from among multiple types of playback processes, including 3D sound playback processing and non-3D sound playback processing, for the input audio signal, and an audio signal for playback is generated.
FIG. 1 is a block diagram showing an example configuration of a playback processing device according to an embodiment to which the present technology is applied.
FIG. 2 is a diagram for explaining a first embodiment of switching of the playback mode of the playback processing device.
FIG. 3 is a diagram for explaining a second embodiment of switching of the playback mode of the playback processing device.
FIG. 4 is a flowchart showing an example of a processing procedure of the second embodiment of the playback process switching of the playback processing device.
FIG. 5 is a diagram showing an example of the configuration of an audio production system to which the third and fourth embodiments of the playback mode switching are applied.
FIG. 6 is a diagram showing a measurement flow in a measurement environment.
FIG. 7 is a block diagram showing an example of the configuration of a playback processing device to which the third embodiment of the playback mode switching is applied.
FIG. 8 is a flowchart showing an example of a processing procedure of the third embodiment of the playback process switching of the playback processing device.
FIG. 9 is a block diagram showing an example of the configuration of a playback processing device to which the fourth embodiment of the playback mode switching is applied.
FIG. 10 is a diagram illustrating an example of operation information displayed superimposed on a live-action image of the console device when the user U operates the console device.
FIG. 11 is a flowchart showing an example of a processing procedure of the fourth embodiment of the playback process switching of the playback processing device.
FIG. 12 is a diagram illustrating an example of switching of the playback mode with an operation device connected to the playback processing device.
FIG. 13 is a block diagram showing an example of the configuration of an embodiment of a computer to which the present technology is applied.
 以下、図面を参照しながら本技術の実施の形態について説明する。 Below, we will explain the implementation of this technology with reference to the drawings.
<<Reproduction Processing Device According to the Present Embodiment>>
FIG. 1 is a block diagram showing an example of the configuration of a playback processing device according to an embodiment to which the present technology is applied.
 図1において、本実施の形態に係る再生処理装置11は、映画などのコンテンツの音声の制作又は編集(以下、編集も含む意味で音響制作という)に用いられる。再生処理装置11には、ヘッドフォンやスピーカなどの音声再生装置12が接続される。再生処理装置11は、再生信号生成部21、センシング部22、トリガ生成部23、多ch(チャンネル)音声再生ソフト24を有する。 In FIG. 1, a playback processing device 11 according to this embodiment is used for producing or editing the sound of content such as movies (hereinafter, the term "sound production" includes editing). An audio playback device 12 such as headphones or speakers is connected to the playback processing device 11. The playback processing device 11 has a playback signal generation unit 21, a sensing unit 22, a trigger generation unit 23, and multi-channel audio playback software 24.
 再生信号生成部21は、多ch音声再生ソフト24からの多chの音声信号に対して、バイノーラル処理(3D音響再生処理)、2chミックスダウン/ステレオ再生処理(非3D音響再生処理)、及び、パススルー処理(非3D音響再生処理)のうちのいずれかの再生処理を切り替えて実行する。これらの再生処理の切り替えは、ユーザの状態に基づいて行われ、具体的には、トリガ生成部23からのトリガ信号に基づいて行われる。再生信号生成部21は、再生処理によって再生用の音声信号を生成して音声再生装置12に供給する。 The playback signal generation unit 21 switches between playback processes among binaural processing (3D audio playback processing), 2ch mixdown/stereo playback processing (non-3D audio playback processing), and pass-through processing (non-3D audio playback processing) for the multi-channel audio signals from the multi-channel audio playback software 24. These playback processes are switched based on the user's state, and more specifically, based on a trigger signal from the trigger generation unit 23. The playback signal generation unit 21 generates an audio signal for playback by the playback process and supplies it to the audio playback device 12.
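To make the switching behaviour of the playback signal generation unit concrete, here is a minimal Python/NumPy sketch of a dispatcher that applies binaural processing, a mixdown, or pass-through according to the most recent trigger. The class name, the mono fold-down before HRIR convolution, and the equal-gain mixdown are simplifying assumptions for illustration, not the actual implementation.

```python
import numpy as np

class PlaybackSignalGenerator:
    """Applies the playback process selected by the most recent trigger signal."""

    def __init__(self, default_mode='mixdown'):
        self.mode = default_mode

    def on_trigger(self, mode):
        # mode is 'binaural', 'mixdown' or 'passthrough'
        self.mode = mode

    def process(self, multich, hrir_left, hrir_right):
        """multich: (channels, samples) array from the multi-channel playback software."""
        if self.mode == 'binaural':                      # 3D audio playback processing
            mono = multich.mean(axis=0)                  # simplification: single HRIR pair
            n = mono.shape[0]
            left = np.convolve(mono, hrir_left)[:n]
            right = np.convolve(mono, hrir_right)[:n]
            return np.stack([left, right])
        if self.mode == 'mixdown':                       # non-3D: fold down to 2ch stereo
            mix = multich.mean(axis=0)
            return np.stack([mix, mix])
        return multich                                   # passthrough: channels unchanged
```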
 また、再生信号生成部21は、音声に関連する映像やGUI等のマルチメディア情報を生成して音声再生装置12の接続された映像出力装置に供給する機能を有していてもよい。映像出力装置としては、モニタ、VRゴーグル(HMD:Head Mounted Display)、ARゴーグル等の表示装置が該当する。センシング部22は、再生信号生成部21における再生処理の切替えを判定するためのユーザの状態を示す情報を検出(取得)する。センシング部22が検出する情報としては、例えば、カメラで撮影されたユーザ等の画像、ヘッドトラッカにより検出されたユーザの頭部の姿勢、アイトラッカにより検出されたユーザの視線方向、GUI(Graphical User Interface)に対するユーザ操作、スイッチやボタン等に対するユーザ操作等の情報が該当する。センシング部22で取得された情報(センシング情報)はトリガ生成部23に供給される。なお、センシング部22は、再生処理の切替えの判定に必要な情報であって、かつ、ユーザに関わる対象の状態の示す任意の情報をセンシング情報として検出(取得)することとする。センシング情報としては、生体センシング情報(ヘッドトラッキング、視線方向、焦点位置、姿勢および位置トラッキング)、外観情報(カメラから得た画像を用いた人物認識、顔認識、ヘッドフォン認識)、GPSや超音波を使用した測位情報、機器の入力情報(GUI情報、キーボードやコントローラーなどの機材、ヘッドフォン種別情報)等が適宜使用され得る。 The playback signal generating unit 21 may also have a function of generating multimedia information such as video and GUI related to the audio and supplying it to a video output device connected to the audio playback device 12. Examples of video output devices include display devices such as monitors, VR goggles (HMD: Head Mounted Display), and AR goggles. The sensing unit 22 detects (acquires) information indicating the state of the user for determining whether to switch the playback process in the playback signal generating unit 21. Examples of information detected by the sensing unit 22 include an image of the user captured by a camera, the user's head posture detected by a head tracker, the user's line of sight detected by an eye tracker, user operations on a GUI (Graphical User Interface), and user operations on switches, buttons, etc. The information (sensing information) acquired by the sensing unit 22 is supplied to the trigger generating unit 23. The sensing unit 22 detects (acquires) any information necessary for determining whether to switch the playback process and indicating the state of an object related to the user as sensing information. Sensing information that can be used appropriately includes biometric sensing information (head tracking, gaze direction, focal position, posture and position tracking), appearance information (person recognition, face recognition, headphone recognition using images obtained from a camera), positioning information using GPS or ultrasound, device input information (GUI information, equipment such as keyboards and controllers, headphone type information), etc.
 トリガ生成部23は、センシング部22からのセンシング情報に基づいて、再生処理の切替えを指示するトリガ信号を生成し、再生信号生成部21に供給する。トリガ生成部23は、例えば、センシング情報が事前に決められた条件に合致した場合に、トリガ信号を再生信号生成部21に供給する。トリガ生成部23は、センシング情報を検出するセンサと同一ハードウェアに組み込まれていてもよいし、センシング情報を受け取った別ハードウェア又はソフトウェアに組み込まれていてもよいし、再生信号生成部21のトリガ受信部31に含めてもよい。なお、トリガ信号は、再生信号生成部21において実行可能な複数種の再生処理のうち、再生信号生成部21が実行する再生処理を指定する信号に相当する。 The trigger generation unit 23 generates a trigger signal that instructs switching of the playback process based on the sensing information from the sensing unit 22, and supplies it to the playback signal generation unit 21. For example, when the sensing information matches a predetermined condition, the trigger generation unit 23 supplies the trigger signal to the playback signal generation unit 21. The trigger generation unit 23 may be incorporated in the same hardware as the sensor that detects the sensing information, or may be incorporated in separate hardware or software that receives the sensing information, or may be included in the trigger receiving unit 31 of the playback signal generation unit 21. The trigger signal corresponds to a signal that specifies the playback process to be executed by the playback signal generation unit 21 from among multiple types of playback processes that can be executed by the playback signal generation unit 21.
 多ch音声再生ソフト24は、DAW(Digital Audio Workstaion)等の多ch音声再生ソフトを実行する処理部を表し、多ch(1chの場合も含む)の音声信号を生成(又は編集)する。生成された多チャンネル(多ch)の音声信号は再生信号生成部21に供給される。なお、多ch音声再生ソフトはDAW等上で動作するプラグインソフトウェアであってもよいし、DAW等とは切り離し、スタンドアローンアプリケーションとして動作してもよい。この場合、DAW等から多chの音声信号を出力し、スタンドアローンアプリケーションに入力する。また、多ch音声再生ソフトとしては、多chの音声信号を出力するソフトウェアであればDAW以外のソフトウェア・ルーチンを用いてもよい(オブジェクトオーディオのレンダリング後データなど) The multi-channel audio playback software 24 represents a processing unit that executes multi-channel audio playback software such as a DAW (Digital Audio Workstation), and generates (or edits) multi-channel (including 1-channel) audio signals. The generated multi-channel (multi-channel) audio signals are supplied to the playback signal generation unit 21. The multi-channel audio playback software may be plug-in software that runs on a DAW, or may be separated from a DAW and run as a standalone application. In this case, the multi-channel audio signals are output from the DAW and input to the standalone application. In addition, software routines other than DAWs may be used as the multi-channel audio playback software as long as they output multi-channel audio signals (such as data after rendering of object audio).
 再生信号生成部21は、トリガ受信部31、切替処理部32、バイノーラル処理部33、2chミックスダウン/ステレオ再生処理部34、及び、パススルー処理部35を有する。トリガ受信部31は、トリガ生成部23からのトリガ信号を受信し、トリガ生成部23からトリガ信号が供給されたか否かを判定する。切替処理部32は、トリガ受信部31がトリガ信号を受信した際に、バイノーラル処理部33、2chミックスダウン/ステレオ再生処理部34、及び、パススルー処理部35のうち、多ch音声再生ソフト24からの多chの音声信号に対して再生処理を実行する処理部を切り替える。 The playback signal generation unit 21 has a trigger receiving unit 31, a switching processing unit 32, a binaural processing unit 33, a 2ch mixdown/stereo playback processing unit 34, and a pass-through processing unit 35. The trigger receiving unit 31 receives a trigger signal from the trigger generation unit 23 and determines whether or not a trigger signal has been supplied from the trigger generation unit 23. When the trigger receiving unit 31 receives a trigger signal, the switching processing unit 32 switches between the binaural processing unit 33, the 2ch mixdown/stereo playback processing unit 34, and the pass-through processing unit 35, whichever processing unit performs playback processing on the multi-channel audio signals from the multi-channel audio playback software 24.
 バイノーラル処理部33は、3D音響再生方式の1つであるバイノーラル再生の処理(バイノーラル処理)を行う。3D音響再生方式とは、コンサートホールや映画館などの原音場(仮想の原音場を含む)の聴取者の両耳の入力信号を、原音場とは時間又は空間が異なる再生音場の聴取者の外耳道入口にヘッドフォンやスピーカで再現する音声再生技術である。バイノーラル処理部33は、バイノーラル処理として多ch音声再生ソフト24からの音声信号に対して所定の空間(原音場)の伝達特性を反映させるために、頭部伝達関数(HRTF:Head-Related Transfer Function)を畳み込むフィルタ処理を行う。ところで、バイノーラル再生は、ヘッドフォンによる再生を前提とするが、バイノーラル処理部33は、バイノーラル再生に限らず、3D音響再生方式として分類される任意の再生処理(3D音響再生処理)を行う場合も含む。3D音響再生方式に分類される再生処理として、例えば、バイノーラル再生以外に、2個のスピーカによる再生を前提としたトランスオーラル再生の処理(トランスオーラル処理)がある。トランスオーラル処理では、バイノーラル処理に加えてクロストークを除去する処理等が含まれるが、バイノーラル処理部33がトランスオーラル処理を行う場合であってもよい。また、バイノーラル処理部33は、再生処理装置11に接続される音声再生装置12の種類に応じて適切な3D音響再生方式の処理を行うようにしてよい。 The binaural processing unit 33 performs binaural playback processing (binaural processing), which is one of the 3D sound playback methods. The 3D sound playback method is an audio playback technology that reproduces input signals to both ears of a listener in an original sound field (including a virtual original sound field) such as a concert hall or movie theater, using headphones or speakers at the entrance of the listener's ear canal in a playback sound field that is different in time or space from the original sound field. The binaural processing unit 33 performs filtering processing to convolve a head-related transfer function (HRTF) in order to reflect the transfer characteristics of a specific space (original sound field) in the audio signal from the multi-channel audio playback software 24 as binaural processing. Incidentally, binaural playback is premised on playback using headphones, but the binaural processing unit 33 is not limited to binaural playback and also includes cases where any playback processing (3D audio playback processing) classified as a 3D sound playback method is performed. Examples of playback processing classified as 3D sound playback methods include, in addition to binaural playback, transaural playback processing (transaural processing) that assumes playback using two speakers. In addition to binaural processing, transaural processing includes processing to remove crosstalk, but the binaural processing unit 33 may also perform transaural processing. In addition, the binaural processing unit 33 may perform processing of an appropriate 3D sound playback method depending on the type of audio playback device 12 connected to the playback processing device 11.
 また、バイノーラル処理部33のバイノーラル処理(3D音響再生処理)、又は、バイノーラル処理に代わる3D音響再生処理に適用され得る音声の伝達特性(伝達関数又はインパルス応答)としては、HRTF(Head-Related Transfer Function:頭部伝達関数)、HRIR(Head-Related Impulse Response:頭部インパルス応答)、HRTFとRTF(Room transfer function:室内伝達関数)との組合せ、HRIRとRIR(Room Impluse Response:室内インパルス応答)との組合せ、BRTF(Binaural Room Transfer Function:バイノーラル室内伝達関数)、BRIR(Binaural Room Impluse Response:バイノーラル室内インパルス応答)、及び、ヘッドフォンから鼓膜(外耳道入口)までの伝達関数又はそのインパルス応答のうちのいずれか、又は、これらの組合せがある。本技術の説明では、バイノーラル処理部33のバイノーラル処理(3D音響再生処理)としてHRTFが適用されることとして説明する。 In addition, the transfer characteristics of sound (transfer functions or impulse responses) that can be applied to the binaural processing (3D sound reproduction processing) of the binaural processing unit 33, or to 3D sound reproduction processing instead of binaural processing, include HRTF (Head-Related Transfer Function), HRIR (Head-Related Impulse Response), a combination of HRTF and RTF (Room transfer function), a combination of HRIR and RIR (Room Impulse Response), BRTF (Binaural Room Transfer Function), BRIR (Binaural Room Impulse Response), and any of the transfer functions or their impulse responses from the headphones to the eardrum (entrance of the ear canal), or a combination of these. In the explanation of this technology, it is assumed that HRTF is applied as the binaural processing (3D sound reproduction processing) of the binaural processing unit 33.
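As a slightly fuller sketch of the binaural (3D audio) processing itself, the following assumes each input channel is treated as a virtual speaker with its own left/right HRIR pair; the data layout and function name are assumptions, and the headphone-correction and room components mentioned above are omitted for brevity.

```python
import numpy as np

def binaural_render(multich, hrir_pairs):
    """Convolve each channel with its own (left, right) HRIR, one virtual speaker per channel.

    multich    : (channels, samples) array.
    hrir_pairs : list of (left HRIR, right HRIR) arrays, one pair per channel.
    """
    n = multich.shape[1]
    left = np.zeros(n)
    right = np.zeros(n)
    for channel, (h_l, h_r) in zip(multich, hrir_pairs):
        left += np.convolve(channel, h_l)[:n]
        right += np.convolve(channel, h_r)[:n]
    return np.stack([left, right])
```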
 2chミックスダウン/ステレオ再生処理部34は、多ch音声再生ソフト24からの多chの音声信号に対してミックスダウンの再生処理(ミックスダウン処理)を行い、2chのステレオ再生の音声信号を生成する。また、2chミックスダウン/ステレオ再生処理部34は、多ch音声再生ソフト24からの音声信号が1ch(モノラル)の音声信号の場合には、2chのステレオ再生の音声信号を生成する。なお、以下において、多ch音声再生ソフト24からのモノラルの音声信号が供給される場合については考慮せずに、2chミックスダウン/ステレオ再生処理部34は、ミックスダウン処理のみを行うこととする。 The 2ch mixdown/stereo playback processor 34 performs mixdown playback processing (mixdown processing) on the multi-channel audio signal from the multi-channel audio playback software 24 to generate an audio signal for 2ch stereo playback. Furthermore, if the audio signal from the multi-channel audio playback software 24 is a 1ch (monaural) audio signal, the 2ch mixdown/stereo playback processor 34 generates an audio signal for 2ch stereo playback. Note that in the following, the 2ch mixdown/stereo playback processor 34 will only perform mixdown processing, without considering the case where a mono audio signal is supplied from the multi-channel audio playback software 24.
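For the non-3D path, a conventional 5.1-to-stereo fold-down might look like the sketch below; the -3 dB coefficients follow a common downmix convention and the handling of the LFE channel is left as a choice, so this is an illustration rather than the unit's actual mixdown.

```python
import numpy as np

def downmix_51_to_stereo(fl, fr, c, lfe, sl, sr):
    """Fold 5.1 channels (FL, FR, C, LFE, SL, SR) down to 2-channel stereo."""
    g = 0.7071                       # about -3 dB for the centre and surround channels
    left = fl + g * c + g * sl
    right = fr + g * c + g * sr
    # The LFE channel is often dropped in a stereo fold-down; add it at reduced gain if wanted:
    # left = left + g * lfe; right = right + g * lfe
    return np.stack([left, right])
```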
 パススルー処理部35は、多ch音声再生ソフト24からの多chの音声信号に対して、ミックスダウン処理などを行わずに、多ch音声再生ソフト24からの多chの音声信号をそのまま音声再生装置12の対応するchに供給する。音声再生装置12に存在しないchに対応する多ch音声再生ソフト24からの音声信号は、パススルー処理部35から音声再生装置12に供給されない。ただし、パススルー処理部35は、信号ルーティングの処理も実行し得る。この場合に、多ch音声再生ソフト24からの多chの音声信号がそれぞれ音声再生装置12の指定されたchに供給される。 The pass-through processing unit 35 supplies the multi-channel audio signals from the multi-channel audio playback software 24 directly to the corresponding channels of the audio playback device 12 without performing mix-down processing or the like on the multi-channel audio signals from the multi-channel audio playback software 24. Audio signals from the multi-channel audio playback software 24 corresponding to channels that do not exist in the audio playback device 12 are not supplied from the pass-through processing unit 35 to the audio playback device 12. However, the pass-through processing unit 35 can also perform signal routing processing. In this case, the multi-channel audio signals from the multi-channel audio playback software 24 are each supplied to the specified channels of the audio playback device 12.
 なお、以下において、再生信号生成部21は、トリガ生成部23からのトリガ信号に基づいて、多ch音声再生ソフト24からの多chの音声信号に対する再生信号生成部21での再生処理(音響処理)が、バイノーラル処理部33でのバイノーラル処理と、2chミックスダウン/ステレオ再生処理部34でのミックスダウン処理とで切り替えられることとする。また、再生信号生成部21で実行される再生処理の方式を再生処理装置11又は再生信号生成部21の再生モード(又は音声再生モード)といい、再生モードはバイノーラル処理とミックスダウン処理とで切り替えられることとする。また、バイノーラル処理部33でのバイノーラル処理に適用されるHRTFが切り替えれる場合や適用されるHRTF(パラメータ)の調整が行われる場合も再生モードの切替えに該当するものとする。また、再生処理装置11又は再生信号生成部21の再生モードを単に再生モードとも称する。本技術は、バイノーラル処理のような3D音響再生処理と、2chミックスダウン処理やパススルー処理のような非3D音響再生処理とを含む複数種の再生処理をユーザの状態に応じて切り替えて実行する場合に適用することができ、切り替えて実行可能とする再生処理の種類や数は特に限定されない。 In the following, the playback signal generating unit 21 switches the playback process (acoustic process) of the multi-channel audio signal from the multi-channel audio playback software 24 between binaural processing in the binaural processing unit 33 and mixdown processing in the 2ch mixdown/stereo playback processing unit 34 based on a trigger signal from the trigger generating unit 23. The playback process method executed by the playback signal generating unit 21 is referred to as the playback mode (or audio playback mode) of the playback processing device 11 or the playback signal generating unit 21, and the playback mode is switched between binaural processing and mixdown processing. Switching the HRTF applied to the binaural processing in the binaural processing unit 33 or adjusting the applied HRTF (parameter) also corresponds to switching the playback mode. The playback mode of the playback processing device 11 or the playback signal generating unit 21 is also simply referred to as the playback mode. This technology can be applied to cases where multiple types of playback processes, including 3D sound playback processes such as binaural processing and non-3D sound playback processes such as 2ch mixdown processing and pass-through processing, are switched and executed according to the user's state, and there is no particular limit to the types and number of playback processes that can be switched and executed.
<First embodiment of switching of playback mode of playback processing device 11>
Fig. 2 is a diagram for explaining a first embodiment of switching of the playback mode of the playback processing device 11. Fig. 2 illustrates various peripheral devices of an audio production system including the playback processing device 11, and a user U. The user U is a producer who uses the playback processing device 11 to produce the audio of content such as a movie. The console machine 41 is a device connected to the playback processing device 11 and inputs the user U's operations on the multi-channel audio playback software 24. The console machine 41 may be, for example, an operating device such as a mixing console, a keyboard, or a mouse.
 モニタ42A及び42Bは、再生処理装置11に接続され、多ch音声再生ソフト24に連携してGUI情報や制作画面等の画像をユーザUに表示する。モニタ42A及び42Bは2台に限らず、1台又は3台以上であってよい。カメラ43は、再生処理装置11に接続され、主にユーザUを撮影した画像をセンシング部22が検出する画像として再生処理装置11に供給する。スピーカ45は、再生処理装置11に接続され、再生処理装置11から供給された音声信号を音波として出力する。 Monitors 42A and 42B are connected to the playback processing device 11, and work in conjunction with the multi-channel audio playback software 24 to display images such as GUI information and production screens to the user U. The number of monitors 42A and 42B is not limited to two, and may be one or three or more. The camera 43 is connected to the playback processing device 11, and mainly supplies images of the user U to the playback processing device 11 as images detected by the sensing unit 22. The speaker 45 is connected to the playback processing device 11, and outputs the audio signal supplied from the playback processing device 11 as sound waves.
 ヘッドフォン44は、再生処理装置11に接続され、ユーザUの頭部に装着される。ヘッドフォン44は、再生処理装置11から供給される2chの音声信号をそれぞれ左右の耳(外耳導入口)付近で音波として出力する。スピーカ45は、ヘッドフォン44の代わりに、又は、ヘッドフォン44と併せて再生処理装置11から供給された音声信号を出力する。 The headphones 44 are connected to the playback processing device 11 and are worn on the head of the user U. The headphones 44 output the 2ch audio signals supplied from the playback processing device 11 as sound waves near the left and right ears (external ear inlets). The speaker 45 outputs the audio signals supplied from the playback processing device 11 instead of the headphones 44 or in addition to the headphones 44.
 図2のような再生処理装置11を含む音響制作システムにおいて、ユーザUは再生モード(再生処理)の切替えに関して事前に以下のa乃至dの設定情報を登録しておく。 In an audio production system including a playback processing device 11 as shown in FIG. 2, a user U registers the following setting information a through d in advance regarding switching of playback modes (playback processes).
(Setting information for a to d)
a. Automatic switching of playback modes ON/OFF
b. Default playback mode
c. The state of the user when setting the playback mode to mixdown processing (position of the gaze point, face direction, type of selected window, etc.)
d. The user's state when setting the playback mode to binaural processing (position of gaze point, face direction, type of selected window, etc.)
 aの設定情報は、b乃至dの設定情報に従って再生モードの自動切替えを有効にするか否かを設定する。再生モードの自動切換えがONに設定された場合にのみb乃至dの設定情報に従って再生モードが切り替えられる。bの設定情報は、初期に設定される再生モード(再生処理の種類)、及び、c又はdによる再生モードの設定が行われていないときに設定される再生モード(再生処理の種類)である。cの設定情報は、再生モードをミックスダウン処理(cの再生処理)に設定する条件(cの条件)であり、dの設定情報は、再生モードをバイノーラル処理(dの再生処理)に設定する条件(dの条件)である。cの条件及びdの条件としては、再生処理装置11のセンシング部22が取得するセンシング情報から特定可能なユーザUの状態(動作、操作等も含む)であって、かつ、ユーザが再生モードの切替えのみを意図した操作等以外の特定の状態が設定される。例えば、ユーザUが特定の位置を注視した場合に、c又はdの条件が満たされたこととし、その特定の位置がc又はdの条件を決定する注視点の位置として設定される。ユーザUが注視している位置は、センシング部22が取得するヘッドトラッカ及びアイトラッカの情報やカメラの撮像画像等に基づいて特定され得る。または、ユーザが顔(頭部正面)を特定の向きに向けた場合に、c又はdの条件が満たされたこととし、その特定の向きがc又はdの条件を決定する顔の向きとして設定される。ユーザUの顔の向きは、センシング部22が取得するヘッドトラッカ及びアイトラッカの情報やカメラの撮像画像等に基づいて特定され得る。例えば、図2のように2つのモニタ42A及び42Bが使用されている場合に、ユーザUの注視点の位置又は顔の向きが、一方のモニタ(例えばモニタ42A)である状態を、再生モードがミックスダウン処理に設定されるcの設定情報として登録され得る。ユーザUの注視点の位置又は顔の向きが、他方のモニタ(例えばモニタ42B)である状態を、再生モードがバイノーラル処理に設定されるdの設定情報として登録され得る。また、ユーザUがモニタ42A又は42Bの画面上に表示された1又は複数のウインドウのうち、特定のウインドウを選択した場合に、c又はdの条件が満たされたこととし、その特定のウインドウがc又はdの条件を決定する選択ウインドウの種類として設定される。ユーザUが選択した選択ウインドウの種類は、センシング部22が取得するGUI情報やユーザ操作の情報等から特定され得る。これらの他に、ユーザUの視線方向が特定の方向(複数のディスプレイの間、ディスプレイの画面外、部屋の中の特定の方向)がc又はdの設定条件として登録される場合であってよいし、ユーザUに対するヘッドフォンの着脱状態(装着状態又は離脱状態)がc又はdの設定条件として登録される場合であってよい。また、b乃至dの設定状態は、ユーザが設定するのではなく、予め設定されている場合であってもよい。 The setting information a sets whether or not automatic switching of the playback mode is enabled according to the setting information b to d. The playback mode is switched according to the setting information b to d only when automatic switching of the playback mode is set to ON. The setting information b is the playback mode (type of playback processing) that is initially set, and the playback mode (type of playback processing) that is set when the playback mode is not set by c or d. The setting information c is the condition (condition c) for setting the playback mode to mixdown processing (playback processing of c), and the setting information d is the condition (condition d) for setting the playback mode to binaural processing (playback processing of d). As the condition c and the condition d, a specific state is set that is a state (including actions, operations, etc.) of the user U that can be specified from the sensing information acquired by the sensing unit 22 of the playback processing device 11, and is other than an operation, etc., that the user intends only to switch the playback mode. For example, when the user U gazes at a specific position, it is determined that the condition c or d is satisfied, and the specific position is set as the position of the gaze point that determines the condition c or d. The position where the user U is gazing at may be specified based on the information of the head tracker and eye tracker acquired by the sensing unit 22, the captured image of the camera, and the like. Alternatively, when the user turns his/her face (head front) in a specific direction, it is determined that the condition c or d is satisfied, and the specific direction is set as the facial direction that determines the condition c or d. The facial direction of the user U may be specified based on the information of the head tracker and eye tracker acquired by the sensing unit 22, the captured image of the camera, and the like. For example, when two monitors 42A and 42B are used as shown in FIG. 2, the position of the gaze point or the facial direction of the user U on one monitor (e.g., monitor 42A) may be registered as the setting information of c in which the playback mode is set to mixdown processing. 
The position of the gaze point or the facial direction of the user U on the other monitor (e.g., monitor 42B) may be registered as the setting information of d in which the playback mode is set to binaural processing. Furthermore, when the user U selects a specific window from one or more windows displayed on the screen of the monitor 42A or 42B, it is determined that the condition c or d is satisfied, and the specific window is set as the type of selected window that determines the condition c or d. The type of selected window selected by the user U can be identified from GUI information acquired by the sensing unit 22, information on user operation, and the like. In addition to these, the direction of the user U's line of sight may be a specific direction (between multiple displays, outside the display screen, a specific direction in the room) that is registered as the setting condition of c or d, or the state of the headphones attached or detached (worn or removed) for the user U may be registered as the setting condition of c or d. Furthermore, the setting states b to d may be preset, rather than being set by the user.
 ここで、再生モードの切替えは、例えば、次のような形態が採用され得る。第1の形態としては、cの再生処理は、cの条件が満たされているときに再生モードとして設定され、dの再生処理は、dの条件が満たされているときに再生モードとして設定されることとする。この場合、c及びdのいずれの条件も満たされていないときには再生モードが、bで設定されたデフォルトの再生モードに設定される。第2の形態としては、cの条件が満たされた後、dの条件が満たされるまではcの再生処理が再生モードとして設定され、dの条件が満たされた後、cの条件が満たされるまではdの再生処理が再生モードとして設定される。第3の形態としては、cとdとの設定情報のうちのいずれか一方のみが設定される。例えばcの設定情報が設定されたとする。このとき、cの条件が満たされているときにcの再生処理が再生モードとして設定され、cの条件が満たされていないときにはdの再生処理が再生モードとして設定される。 Here, the following modes can be adopted for switching the playback mode, for example. In a first mode, the playback process of c is set as the playback mode when the condition of c is satisfied, and the playback process of d is set as the playback mode when the condition of d is satisfied. In this case, when neither the condition of c nor the condition of d is satisfied, the playback mode is set to the default playback mode set in b. In a second mode, after the condition of c is satisfied, the playback process of c is set as the playback mode until the condition of d is satisfied, and after the condition of d is satisfied, the playback process of d is set as the playback mode until the condition of c is satisfied. In a third mode, only one of the setting information of c and d is set. For example, it is assumed that the setting information of c is set. In this case, when the condition of c is satisfied, the playback process of c is set as the playback mode, and when the condition of c is not satisfied, the playback process of d is set as the playback mode.
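A minimal sketch of evaluating the registered setting information a to d against the sensed user state (following the first form described above) could look like the following; the dictionary keys, the gaze-target labelling, and the mode strings are assumptions introduced only for illustration.

```python
def select_playback_mode(settings, gaze_target, current_mode):
    """Evaluate the registered setting information a-d against the sensed user state.

    settings    : e.g. {'auto_switch': True, 'default': 'binaural',
                        'mixdown_when': 'monitor_A', 'binaural_when': 'monitor_B'}
    gaze_target : label of what the user is looking at, derived from head/eye tracking
                  or camera images.
    """
    if not settings.get('auto_switch', False):            # setting a
        return current_mode
    if gaze_target == settings.get('mixdown_when'):       # condition c
        return 'mixdown'
    if gaze_target == settings.get('binaural_when'):      # condition d
        return 'binaural'
    return settings.get('default', current_mode)          # setting b (first form)
```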
 According to the first embodiment described above, multiple types of playback processes related to audio playback, including 3D audio playback, are automatically switched based on preset conditions, so the user does not need to manually switch between playback processes. For example, for audio production work, it is easier for the producer to work while listening to audio that has not been subjected to 3D audio playback processing. For checking the produced audio, it is easier to make an appropriate quality judgment by listening to audio that has been subjected to 3D audio playback processing, which reproduces the audio as it would be heard in the actual playback environment (the original sound field). The producer often repeats the audio production work and the confirmation work many times, and manually switching the playback process (playback mode) each time is cumbersome and inefficient. In the first embodiment, the playback mode is switched automatically according to preset conditions. Therefore, if the state of the producer while performing audio production work is, for example, a state of facing downward (toward the hands), that is, gazing at the production (editing) equipment at hand, the playback mode is set to mixdown processing on the condition that the producer is in that state. If the state of the producer while checking the production result is, for example, a state of facing upward (a state of not gazing at the production (editing) equipment at hand), the playback mode is set to binaural processing on the condition that the producer is in that state. As a result, the producer does not need to switch the playback mode manually, and appropriate playback processing is performed according to whether the producer is performing audio production work or checking the production result. Furthermore, when head tracker information (tracking data) is used as the sensing information for detecting the user's state, the tracking data can also be used for the absolute positions of object audio, so the two uses can be combined. Note that the playback mode may be switched according to other states, such as the user's line-of-sight direction or the position of the point of gaze, instead of the orientation of the user U (the direction the head faces forward). In addition, when the playback mode is switched according to which range the orientation or position of the determination target, such as the orientation of the front of the head (face), the line-of-sight direction, or the position of the point of gaze, falls into, the orientations and positions that form the boundaries of the respective ranges are determined in advance. The orientation or position of the determination target is compared with the orientations and positions forming the boundaries of those ranges to determine which range it falls into. In the description of the present technology, the orientations and positions forming the range boundaries are not mentioned further.
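 As one concrete way to realize the range comparison described above, the pitch angle of the head can be compared with a predetermined boundary angle. The following is a minimal sketch under that assumption; the boundary value and the use of a pitch angle are illustrative and are not specified by this description.

```python
LOOK_DOWN_PITCH_DEG = -20.0  # hypothetical boundary between "facing forward/up" and "facing the hands (down)"

def select_playback_process(head_pitch_deg: float) -> str:
    """Return the playback process for the current head orientation.

    head_pitch_deg: pitch of the head-front direction, negative when looking down.
    """
    if head_pitch_deg <= LOOK_DOWN_PITCH_DEG:
        # Looking at the editing equipment at hand: non-3D playback is easier to work with.
        return "mixdown"
    # Looking up / away from the equipment: reproduce the original sound field.
    return "binaural"
```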
<Second embodiment of playback mode switching of the playback processing device 11>
 FIG. 3 is a diagram for explaining a second embodiment of playback mode switching of the playback processing device 11. In the second embodiment, the video presented to the user is switched in conjunction with the switching of the audio playback mode. In A and B of FIG. 3, the user U is a producer at a place where audio production work is performed (the production work place) and is also a listener who listens to the sound of the playback sound field. The user U wears VR goggles (HMD) 51 with headphones. The VR goggles 51 are connected to the playback processing device 11, and the playback processing device 11 supplies the VR goggles 51 with a 2-channel audio signal to be played back by the headphones of the VR goggles 51 and a video signal to be displayed on the VR goggles 51. The video signal supplied from the playback processing device 11 to the VR goggles 51 is switched between a video signal of a CG (Computer Graphics) image generated by the playback processing device 11 as shown in A of FIG. 3 and a video signal of a live-action image captured by a camera (VR outward-facing camera) as shown in B of FIG. 3.
 A of FIG. 3 is a CG image of a virtual space (CG space) in which the space of the original sound field of the audio produced using the multi-channel audio playback software 24 is reproduced by CG; for example, a CG image that reproduces a movie theater screen from the viewpoint of a listener seated at a predetermined position. Note that A of FIG. 3 may be a live-action image obtained by photographing the space of the original sound field instead of a CG image. B of FIG. 3 is a live-action image captured by the VR outward-facing camera of the VR goggles 51 in the frontal direction of the head (or the line-of-sight direction) of the user U who performs audio production with the playback processing device 11. The live-action image shows, for example, the production equipment of the audio production system placed at the production work place (peripheral equipment connected to the playback processing device 11, etc.), such as the console machine 41 and the monitors 42A and 42B shown in FIG. 2. Note that B of FIG. 3 may be a CG image imitating the production work place instead of a live-action image of the production work place. The production work place imitated as a CG image is not limited to the actual production work place and may be a virtual production work place; in that case, the production equipment used in the audio production system (monitors, input devices, etc.) may also be virtual equipment (equipment that does not actually exist).
 The images displayed on the VR goggles 51 are switched automatically in conjunction with the switching of the audio playback mode. For example, suppose that the playback processing device 11 is switched to different playback modes when the user U faces upward and when the user U faces his/her hands (downward). Specifically, when the user U faces upward, the playback mode is set to binaural processing, and when the user U faces his/her hands, the playback mode is set to mixdown processing. Note that the user U facing upward or toward the hands means that the user U's face (front of the head) or line of sight (point of gaze) is directed upward or toward the hands (downward), and whether the user U is facing upward or toward the hands is determined based on the sensing information acquired by the sensing unit 22.
 In conjunction with such switching of the playback mode of the playback processing device 11, the image displayed on the VR goggles 51 is switched between the images of A and B of FIG. 3. Specifically, when the user U faces upward, a CG image such as A of FIG. 3, in which the space of the original sound field is reproduced by CG, is displayed on the VR goggles 51. When the user U faces his/her hands, a live-action image such as B of FIG. 3, obtained by photographing the area at the hands of (around) the user U at the production work place, is displayed.
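 The linkage between the audio playback mode and the presented video described above can be expressed as a single decision that drives both outputs. The following is a minimal sketch of that coupling, assuming hypothetical helper objects for mixdown/binaural rendering, CG rendering, and the outward-facing camera; it is an illustration, not the actual device software.

```python
def update_presentation(user_facing_down: bool, multich_audio, renderer, camera):
    """Switch audio processing and displayed video together, as in the second embodiment."""
    if user_facing_down:
        # Production work: non-3D audio plus a live-action view of the equipment at hand.
        audio_out = renderer.mixdown_to_2ch(multich_audio)      # hypothetical API
        video_out = camera.capture_outward_view()               # hypothetical API
    else:
        # Checking the result: binaural audio plus a CG view of the original sound field.
        audio_out = renderer.binauralize(multich_audio)         # hypothetical API
        video_out = renderer.render_original_field_cg()         # hypothetical API
    return audio_out, video_out
```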
 According to the second embodiment, as in the first embodiment, multiple types of playback processes related to audio playback, including 3D audio playback, are automatically switched based on preset conditions, so the user does not need to manually switch between playback processes. In addition, when the user U faces downward, where the peripheral devices at hand are located, in order to perform audio production work or the like, the user U can view the live-action image of the area at hand displayed on the VR goggles 51 and can easily operate the peripheral devices and so on through the live-action image. At this time, the user U can listen through the headphones to mixdown-processed audio that has not been subjected to 3D audio playback processing, and can perform the production work while listening to audio suitable for audio production. On the other hand, when the user U faces upward in order to check the production result or the like, the user U can view the CG image of the space of the original sound field displayed on the VR goggles 51 and can visually recognize the space of the original sound field. At this time, the user U can listen through the headphones to binaurally processed audio, which is 3D audio playback processing. That is, the user U can check the produced audio by means of the audio that would be heard when it is played back in the environment of the original sound field. Therefore, the user U can appropriately judge the quality of the produced audio while listening, via the binaurally processed audio, to the audio as it would sound when played back in the original sound field environment, and while visually grasping the space of the original sound field through the CG image.
 For example, assume a listener watching a movie in a seat at a predetermined position in a movie theater, which is the original sound field, and assume that the audio heard by that listener in the movie theater is reproduced by binaural processing as the audio of the playback sound field heard by the user U, the producer at the production work place. In this case, a CG image of the space as seen by the assumed listener in the movie theater is displayed as the CG image of A of FIG. 3 presented to the user U on the VR goggles 51. This allows the user U to visually grasp the state of the space of the original sound field, such as the listener's viewing position in the movie theater and the arrangement of the screen and speakers relative to that viewing position. The user U can then listen to the audio that is output from the movie theater speakers and heard by the listener, as the binaurally processed audio of the playback sound field. Therefore, the user U can confirm whether the audio produced using the multi-channel audio playback software 24 is appropriate when heard at the viewing position in the movie theater recognized through the CG image. If the confirmed audio is not appropriate, the user U repeats the audio production (editing) work using the multi-channel audio playback software 24 until it becomes appropriate. When such audio production work and confirmation of the production result are repeated, the audio and video presented to the user U are automatically switched between mixdown-processed audio and video suitable for the audio production work (video of the production work place) and binaurally processed audio and video suitable for confirming the production result (video of the movie theater), so work efficiency is dramatically improved.
 As described above, the video presented to the user U is also switched automatically in conjunction with the automatic switching of the playback mode in the playback processing device 11, so the effort of manually switching between them for the audio production work and the work of checking the production result is reduced, and work efficiency is significantly improved.
<Processing procedure of the second embodiment of playback mode switching of the playback processing device 11>
 FIG. 4 is a flowchart showing an example of the processing procedure of the second embodiment of playback processing switching of the playback processing device 11. In step S1, the sensing unit 22 acquires sensing information indicating the state of the user. In step S2, when the trigger generation unit 23 detects, based on the sensing information acquired in step S1 and setting information determined in advance, that a condition for switching from one of binaural processing and mixdown processing to the other has been satisfied, it supplies a trigger signal indicating this to the playback signal generation unit 21. When the trigger reception unit 31 of the playback signal generation unit 21 receives the trigger signal, it supplies the switching to the playback process indicated by the trigger signal to the switching processing unit 32. The switching processing unit 32 determines the playback mode to be set based on the information from the trigger reception unit 31 and the current playback mode (playback process).
 In step S3, the switching processing unit 32 determines whether the playback mode to be set is binaural processing. If the determination in step S3 is affirmative, the process proceeds to step S4; if negative, the process proceeds to step S7. In step S4, the switching processing unit 32 enables binaural processing in the binaural processing unit 33. The binaural processing unit 33 performs binaural processing on the multi-channel audio signal supplied from the multi-channel audio playback software 24. In step S5, the binaural processing unit 33 generates an audio signal for playback on the audio playback device 12 (a 2-channel audio signal) based on the audio signal after binaural processing, and outputs it to the audio playback device 12. In step S6, the playback signal generation unit 21 generates a CG image of the space of the original sound field and outputs the video signal of the generated CG image to a display device, such as VR goggles, viewed by the user. After step S6, the processing of this flowchart ends. Note that the processing of this flowchart is executed repeatedly.
 In step S7, which is reached when the determination in step S3 is negative, the switching processing unit 32 enables mixdown processing in the 2ch mixdown/stereo playback processing unit 34. The 2ch mixdown/stereo playback processing unit 34 performs mixdown processing on the multi-channel audio signal supplied from the multi-channel audio playback software 24. In step S8, the 2ch mixdown/stereo playback processing unit 34 outputs the 2-channel audio signal after the mixdown processing to the audio playback device 12 as the audio signal for playback on the audio playback device 12. In step S9, the playback signal generation unit 21 acquires a live-action image of the production work place (the space at hand) from the sensing unit 22 (camera) and outputs the video signal of the live-action image to a display device, such as VR goggles, viewed by the user. After step S9, the processing of this flowchart ends. Note that the first embodiment of playback processing switching of the playback processing device 11 differs from the second embodiment in that the processing of steps S6 and S9 is not performed.
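 Steps S1 to S9 above form one pass of a repeating loop. The following is a minimal sketch of that loop, assuming hypothetical objects standing in for the sensing unit, trigger generation unit, processing units, display, and audio device; it mirrors the flowchart rather than the actual product code.

```python
def playback_switching_pass(sensing, trigger_gen, binaural, mixdown,
                            display, audio_dev, multich_signal):
    # S1: acquire sensing information indicating the user's state.
    state = sensing.acquire()                              # hypothetical API
    # S2/S3: decide which playback process the preset conditions call for.
    mode = trigger_gen.decide_mode(state)                  # hypothetical API, "binaural" or "mixdown"
    if mode == "binaural":
        # S4-S5: binaural processing and output of the 2ch playback signal.
        audio_dev.play(binaural.process(multich_signal))
        # S6: CG image of the original sound field space.
        display.show(display.render_original_field_cg())
    else:
        # S7-S8: mixdown processing and output of the 2ch playback signal.
        audio_dev.play(mixdown.process(multich_signal))
        # S9: live-action image of the production work place.
        display.show(sensing.capture_camera_image())
```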
<Third and fourth embodiments of playback mode switching of the playback processing device 11>
 In the third and fourth embodiments of playback mode switching of the playback processing device 11, based on sensing information indicating the state of the user, not only is the playback mode switched between binaural processing and mixdown processing, but the HRTF applied to the binaural processing is also switched or adjusted. In the third and fourth embodiments, switching to mixdown processing as a playback mode is not necessarily a requirement. Therefore, in the description of the third and fourth embodiments, it is assumed that the playback processing device 11 performs only the switching or adjustment of the HRTF applied to binaural processing as the playback mode switching. However, in the third and fourth embodiments, the processing of the first or second embodiment may be combined so that the playback mode can also be switched to mixdown processing.
<Audio production system to which the third and fourth embodiments are applied>
 FIG. 5 is a diagram showing a configuration example of an audio production system to which the third and fourth embodiments of playback mode switching are applied. In FIG. 5, the measurement environment represents the environment in which the HRTF applied to the binaural processing of the playback processing device 11 is acquired in advance by actual measurement. As described above, an audio transfer characteristic such as a BRTF can be applied to the binaural processing instead of an HRTF, and in that case the transfer characteristic to be applied to the binaural processing may be measured in the measurement environment instead of the HRTF.
 In the measurement environment, a movie theater is illustrated as an example of the original sound field. The movie theater as the original sound field is also called a dubbing stage or the like, and is the space of the original sound field that is reproduced as the audio of the playback sound field in audio production. The playback environment represents the environment in which the audio of the original sound field is reproduced as the audio of the playback sound field at the audio production place used for audio production. The audio production place is a place different from the original sound field, such as a studio or the producer's home, but it may be the same place as the original sound field. The measurement processing device 81 shown in the measurement environment acquires an HRTF corresponding to the acoustic characteristics of the original sound field, such as a movie theater, and generates a BRTF file (described later). The measurement processing device 81 also acquires condition information indicating the conditions at the time of measuring the HRTF, and stores the condition information in the BRTF file together with the HRTF.
 The playback processing device 11 in the playback environment corresponds to the playback processing device 11 in FIG. 1, and the headphones 44 are one form of the audio playback device 12 in FIG. 1 connected to the playback processing device 11. The headphones 44 may be the headphones attached to the VR goggles 51 in FIG. 3, or may be another audio playback device. The playback processing device 11 acquires the BRTF file generated by the measurement processing device 81 and sets the parameters used in binaural processing based on the data in the BRTF file. The BRTF file may be made obtainable by the playback processing device 11 via a network such as the Internet, or via a recording medium such as a flash memory.
 FIG. 6 is a diagram showing the flow of measurement in the measurement environment. The HRTF measurement is performed with the person being measured sitting in a predetermined seat in the movie theater and with microphones attached to the ear holes. In this state, sound is played back from a speaker 91 of the movie theater, and the HRTF from the speaker 91 to the ear (for example, the ear hole position or the eardrum position) is measured.
 For example, as shown in balloon #1 of FIG. 6, assume that HRTF measurement is performed with the person being measured sitting in the seat at position A in each of postures 1 to 3. Also, as shown in balloon #2, assume that HRTF measurement is performed with the person being measured sitting in the seat at position B in each of postures 1 to 3. Further, assume that HRTF measurement is performed with the person being measured sitting in the seat at position C in each of postures 1 to 3.
 As shown in balloon #4, spatial shape data indicating the shape of the movie theater is acquired as condition information. For example, the width, height, and depth of the movie theater are recorded as the spatial shape data, as the minimum elements indicating the shape of the theater. Note that information indicating a more detailed shape, such as vertex information or a point cloud, may be recorded as the spatial shape data.
 As shown in balloon #5, position information of the speaker 91, which is the measurement sound source (original sound source) used for the HRTF measurement, is acquired as condition information. For example, coordinates indicating the position of the speaker 91 in the movie theater, and the position on the spatial shape data of the movie theater corresponding to the origin of those coordinates, are recorded as the position information of the speaker 91.
 As shown in balloon #6, measurement position information indicating the position (measurement position) of the person being measured at the time of the HRTF measurement and measurement posture information indicating the posture (measurement posture) are acquired as condition information. For example, coordinates indicating the position of the person being measured in the movie theater, and the position on the spatial shape data of the movie theater corresponding to the origin of those coordinates, are recorded as the measurement position information. For example, the Euler angles of the head of the person being measured are recorded as the measurement posture information.
 The measurement processing device 81 stores the HRTFs and condition information measured as described above in a BRTF file. In the BRTF file, for example, group data consisting of the same types of data is stored for each combination of the positions A to C and the postures 1 to 3. The group data for each combination includes the spatial shape data, the position information of the measurement sound source (original sound source), the measurement position information, the measurement posture information, the transfer characteristic data from the headphones 44 to the ears, and the HRTF measurement data measured with the person being measured sitting in the seat at each position in each measurement posture. However, since the spatial shape data, the position information of the measurement sound source, and the transfer characteristic data from the headphones 44 to the ears are common regardless of the combination of the positions A to C and the postures 1 to 3, they may be stored in the BRTF file as data outside the group data.
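 The contents of the BRTF file described above can be pictured as a small set of records keyed by measurement position and measurement posture. The following is a minimal sketch using Python dataclasses; the field names and types are assumptions chosen for illustration and are not the file format actually used.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class GroupData:
    measurement_position: Tuple[float, float, float]        # coordinates in the theater
    measurement_posture_euler: Tuple[float, float, float]   # head Euler angles
    hrtf_coefficients: List[float]                          # FIR coefficients derived from the measured HRTF

@dataclass
class BRTFFile:
    space_shape: Tuple[float, float, float]                 # width, height, depth of the theater
    source_positions: List[Tuple[float, float, float]]      # positions of the measurement speakers
    headphone_to_ear_response: List[float]                  # transfer characteristic from the headphones 44 to the ears
    groups: Dict[Tuple[str, str], GroupData]                # keyed by (position label, posture label), e.g. ("A", "1")
```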
<Third embodiment of playback mode switching of the playback processing device 11>
 In the third embodiment of playback mode switching of the playback processing device 11, the HRTF applied to the binaural processing is switched based on sensing information indicating the state of the user U, and the audio playback mode is thereby switched. In the third embodiment, it is assumed that the video presented to the user U on the VR goggles 51 is CG video only. Note that a live-action image (such as an image of the area at hand showing the peripheral devices at the production work place) may be displayed depending on the state of the user U.
 FIG. 7 is a block diagram showing a configuration example of the playback processing device 11 to which the third embodiment of playback mode switching is applied. FIG. 7 shows blocks that do not appear in the block diagram of the playback processing device 11 in FIG. 1, but some of the blocks in FIG. 7 are subdivided representations of the blocks shown in FIG. 1, and some of the blocks shown in FIG. 1 are omitted in FIG. 7. In FIG. 7, the playback processing device 11 includes a BRTF file acquisition unit 101, an audio control unit 102, and a display control unit 103. Note that the audio control unit 102 and the display control unit 103 are, with some exceptions, included in the playback signal generation unit 21 in FIG. 1.
 The BRTF file acquisition unit 101 acquires the BRTF file generated by the measurement processing device 81 of FIG. 5. The BRTF file to be acquired is desirably a file storing measurement data measured with the producer (user U) who performs audio production using the playback processing device 11 as the person being measured, but is not limited to this. The BRTF file contains coefficient data, spatial information, and measurement posture information. The coefficient data corresponds to the HRTF measurement data. Binaural processing can be performed by convolution processing using an FIR (Finite Impulse Response) filter. In that case, the coefficients of the FIR filter are set based on the characteristics of the HRTF applied to the binaural processing. The coefficient data in the BRTF file represents the HRTF measurement data as FIR filter coefficient data; alternatively, the process of calculating the FIR filter coefficients from the HRTF measurement data may be performed by the audio control unit 102 or the like after the HRTF measurement data is read from the BRTF file. The spatial information includes the spatial shape data, the position information of the measurement sound source (original sound source), and the measurement position information. In the coefficient data, the spatial information, and the measurement posture information, a plurality of data measured at a plurality of measurement positions (positions A to C in FIG. 6) and in a plurality of measurement postures (postures 1 to 3 in FIG. 6) are recorded in association with the measurement positions and measurement postures, as described with reference to FIG. 6. Note that although the playback processing device 11 acquires FIR filter coefficient data from the BRTF file as the HRTF measurement data specifying the content of the binaural processing, the measurement data acquired to specify the content of the binaural processing does not have to be FIR filter coefficient data. Since the content of the binaural processing can be specified by acoustic characteristics (transfer characteristics) such as the HRTF in the original sound field, the playback processing device 11 may acquire information on the acoustic characteristics in the original sound field. Furthermore, instead of acquiring information on the acoustic characteristics in the original sound field obtained by actual measurement from the BRTF file, the playback processing device 11 may theoretically calculate the information on the acoustic characteristics in the original sound field based on the spatial shape, the position of the original sound source, the measurement position (listening position), and the like.
 The audio control unit 102 includes the binaural processing unit 33 of FIG. 1. The audio control unit 102 includes a coefficient reading unit 111, a convolution processing unit 112, and an audio playback processing unit 113. The coefficient reading unit 111 acquires information (playback posture information) on the posture (playback posture) of the user U, who is the producer, at the time of audio playback (the current time) from the playback posture information acquisition unit 126, and reads the coefficient data (HRTF measurement data) corresponding to the playback posture of the user U from the BRTF file. The coefficient data corresponding to the playback posture is the coefficient data corresponding to the HRTF measured in a measurement posture close to the playback posture. Note that when the coefficient data in the BRTF file includes a plurality of coefficient data measured at a plurality of measurement positions, the coefficient data acquired by the coefficient reading unit 111 is, for example, the coefficient data measured at the measurement position specified in advance by the user U. In the third embodiment, the BRTF file may also contain only data measured at a single measurement position. In that case, among the coefficient data corresponding to the HRTFs measured at that measurement position, the coefficient data measured in the measurement posture close to the playback posture is read by the coefficient reading unit 111.
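 Selecting the coefficient data whose measurement posture is closest to the current playback posture can be realized with a simple nearest-neighbor search over the stored postures. The following is a minimal sketch under that assumption; it reuses the hypothetical BRTFFile/GroupData layout sketched above, and the angular distance metric is an illustrative simplification.

```python
def angular_distance(a, b):
    # Sum of absolute yaw/pitch/roll differences; a simple stand-in for a proper rotation distance.
    return sum(abs(x - y) for x, y in zip(a, b))

def select_coefficients(brtf, position_label, playback_posture_euler):
    """Pick the FIR coefficients measured in the posture closest to the current playback posture."""
    candidates = [g for (pos, _), g in brtf.groups.items() if pos == position_label]
    best = min(candidates,
               key=lambda g: angular_distance(g.measurement_posture_euler, playback_posture_euler))
    return best.hrtf_coefficients
```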
 The convolution processing unit 112 sets the coefficients of the FIR filter from the coefficient data read by the coefficient reading unit 111. The convolution processing unit 112 performs convolution processing using the FIR filter on the audio signal supplied from the multi-channel audio playback software 24 of FIG. 1. As a result, binaural processing in which an HRTF corresponding to the posture of the user U is applied is performed on the audio signal supplied from the multi-channel audio playback software 24. The audio playback processing unit 113 outputs the audio signal binaurally processed by the convolution processing unit 112 to the audio playback device 12 of FIG. 1.
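 The convolution itself is an ordinary FIR filtering of each input channel with left-ear and right-ear impulse responses, followed by summation into a 2-channel signal. The following is a minimal NumPy sketch of that operation for one block of audio; it assumes per-channel left/right coefficient arrays and omits details such as block-wise overlap handling.

```python
import numpy as np

def binauralize_block(channels, left_firs, right_firs):
    """channels: list of 1-D numpy arrays (one per speaker channel).
    left_firs / right_firs: FIR coefficients per channel (from the HRTF measurement data).
    Returns a (2, N) array holding the left-ear and right-ear signals."""
    n = len(channels[0])
    left = np.zeros(n)
    right = np.zeros(n)
    for x, hl, hr in zip(channels, left_firs, right_firs):
        # Convolve each channel with its ear-specific impulse response and accumulate.
        left += np.convolve(x, hl)[:n]
        right += np.convolve(x, hr)[:n]
    return np.stack([left, right])
```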
 The display control unit 103 generates the CG video to be displayed on a display device such as VR goggles. The display control unit 103 includes a spatial information reading unit 121, a CG model acquisition unit 122, a CG data storage unit 123, a measurement position information reading unit 124, a CG space drawing unit 125, a playback posture information acquisition unit 126, and a video drawing processing unit 127. The spatial information reading unit 121 reads the spatial shape data and the position information of the measurement sound source (original sound source) included in the spatial information from the BRTF file acquired by the BRTF file acquisition unit 101.
 Based on the spatial shape data and the position information of the measurement sound source (original sound source) read by the spatial information reading unit 121, the CG model acquisition unit 122 acquires, from the CG data storage unit 123, material data of 3D models corresponding to the objects existing in the original sound field (walls, floor, ceiling, speakers, screen, seats, etc.), and generates a CG model imitating the space of the original sound field in a virtual space (CG space).
 The measurement position information reading unit 124 reads the measurement position information included in the spatial information from the BRTF file acquired by the BRTF file acquisition unit 101. The CG space drawing unit 125 renders the CG space generated by the CG model acquisition unit 122 and generates a 2D CG image. The position of the virtual camera (viewpoint) for rendering is set, based on the measurement position information acquired by the measurement position information reading unit 124, to the position in the CG space corresponding to the measurement position in the original sound field. The posture of the virtual camera (viewpoint) for rendering is set, based on the current playback posture of the user U acquired by the playback posture information acquisition unit 126, to a posture corresponding to the posture of the user U at the production work place. Note that when a plurality of pieces of measurement position information are included as the spatial information of the BRTF file, for example, the measurement position information specified in advance by the user U is read by the measurement position information reading unit 124, and the virtual camera for rendering is set at the position in the CG space corresponding to that measurement position information. The measurement position information referred to as the position of the virtual camera in the CG space is the same as the measurement position information associated with the coefficient data acquired by the coefficient reading unit 111. The CG space drawing unit 125 generates, by rendering, a 2D CG image captured by the virtual camera set in the CG space.
 The playback posture information acquisition unit 126 acquires the playback posture information of the user U at the time of audio playback by the playback processing device 11 (the current time), based on the sensing information of the sensing unit 22 of FIG. 1. The playback posture information of the user U is, for example, the posture of the user U's head. The posture of the user U's head can be recognized from head tracker information acquired by the sensing unit 22. When the user U is wearing VR goggles (HMD), the head tracker can detect the posture of the user U's head with an IMU (Inertial Measurement Unit) installed in the VR goggles. When the sensing unit 22 acquires images captured by a camera that photographs the user U, the posture of the user U's head may instead be detected from those captured images.
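 When the head posture is delivered by an IMU as a quaternion, it can be converted into the Euler-angle form used for the measurement posture information. A minimal sketch of one common conversion is shown below; the quaternion input and the Z-Y-X axis convention are assumptions and would have to match the convention used when recording the measurement postures.

```python
import math

def quaternion_to_euler(w, x, y, z):
    """Convert a unit quaternion to (yaw, pitch, roll) in degrees, Z-Y-X convention."""
    yaw = math.atan2(2 * (w * z + x * y), 1 - 2 * (y * y + z * z))
    pitch = math.asin(max(-1.0, min(1.0, 2 * (w * y - z * x))))
    roll = math.atan2(2 * (w * x + y * z), 1 - 2 * (x * x + y * y))
    return tuple(math.degrees(a) for a in (yaw, pitch, roll))
```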
 The video drawing processing unit 127 generates a video signal for displaying the 2D CG image generated by the CG space drawing unit 125 on a display device, such as VR goggles, connected to the playback processing device 11, and outputs it to the display device.
 According to the third embodiment of playback mode switching of the playback processing device 11, the user U, who is the producer performing audio production at the audio production work place, can listen, via binaurally processed audio, to the audio that would be heard if the audio produced with the multi-channel audio playback software 24 were played back in the original sound field. When the user U changes the playback posture, the HRTF is changed according to the playback posture, so the user U can hear the audio that a listener in the original sound field would hear if the listener changed posture in the same way. In addition, the user U is presented with the CG image viewed from the listening position in the original sound field, and when the user U changes the playback posture, the space of the original sound field as it would be seen by a listener who changed posture in the same way is presented as the CG image. Therefore, the user U can perform the audio production work and check the production result while listening to realistic audio and viewing the CG video. Moreover, since the playback process (playback mode) is switched automatically as the playback posture changes, the user U can reduce the effort of switching the playback mode, and work efficiency is significantly improved.
<Processing procedure of the third embodiment of playback mode switching of the playback processing device 11>
 FIG. 8 is a flowchart showing an example of the processing procedure of the third embodiment of playback mode switching of the playback processing device 11. In step S11, the BRTF file acquisition unit 101 acquires the BRTF file generated by the measurement processing device 81 of FIG. 5. In step S12, the spatial information reading unit 121 reads the spatial shape data and the position information of the measurement sound source (original sound source) included in the spatial information from the BRTF file acquired in step S11. In step S13, the measurement position information reading unit 124 reads the measurement position information from the BRTF file acquired in step S11. In step S14, the playback posture information acquisition unit 126 acquires the current playback posture information of the user U.
 In step S15, based on the spatial information (the spatial shape data and the position information of the measurement sound source) read in step S12, the CG model acquisition unit 122 acquires, from the CG data storage unit 123, material data of 3D models corresponding to the objects existing in the original sound field (walls, floor, ceiling, speakers, screen, seats, etc.), and generates a CG model imitating the space of the original sound field in a virtual space (CG space). In step S16, the CG space drawing unit 125 sets the position and posture of the virtual camera for rendering the CG space based on the measurement position information read in step S13 and the playback posture information acquired in step S14, and generates a 2D CG image by rendering. In step S17, the video drawing processing unit 127 generates a video signal for displaying the CG image generated in step S16 on the display device connected to the playback processing device 11, and outputs it to the display device.
 In step S18, based on the playback posture information acquired in step S14, the coefficient reading unit 111 reads the coefficient data corresponding to the playback posture of the user U from the BRTF file acquired in step S11. In step S19, the convolution processing unit 112 sets the coefficients of the FIR filter based on the coefficient data read in step S18, and performs convolution processing (binaural processing) using the FIR filter on the audio signal supplied from the multi-channel audio playback software 24 of FIG. 1. In step S20, the audio playback processing unit 113 outputs the audio signal binaurally processed by the convolution processing unit 112 to the audio playback device 12 of FIG. 1. When the processing of step S20 ends, the processing of this flowchart ends. The processing of this flowchart is executed repeatedly.
 Note that in the third embodiment of playback mode switching, only the switching or adjustment of the HRTF applied to binaural processing is performed as the playback mode switching. However, this is not a limitation, and the playback mode switching may also include switching to mixdown processing. In that case, when the playback mode is switched to mixdown processing, a live-action image or a CG image of the production work place may be displayed on the display device.
<Fourth embodiment of playback mode switching of the playback processing device 11>
 In the fourth embodiment of playback mode switching of the playback processing device 11, the HRTF (or BRTF) applied to the binaural processing is adjusted based on sensing information indicating the state of the user U, and the audio playback mode is thereby switched. In the fourth embodiment, as in the second embodiment, the video presented to the user U on the VR goggles 51 is switched between CG video and live-action video in conjunction with the switching of the playback mode.
 FIG. 9 is a block diagram showing a configuration example of the playback processing device 11 to which the fourth embodiment of playback mode switching is applied. In the figure, parts common to FIG. 7 are given the same reference numerals, and their description is omitted as appropriate. The playback processing device 11 of FIG. 9 is common to the playback processing device 11 of FIG. 7 in that it includes a BRTF file acquisition unit 101, an audio control unit 102, and a display control unit 103. The audio control unit 102 of FIG. 9 includes a coefficient reading unit 111, a convolution processing unit 112, an audio playback processing unit 113, a reverberation amount adjustment setting value reading unit 141, and a reverberation amount adjustment processing unit 142. The display control unit 103 of FIG. 9 includes a spatial information reading unit 121, a CG model acquisition unit 122, a CG data storage unit 123, a measurement position information reading unit 124, a playback posture information acquisition unit 126, a video drawing processing unit 127, a CG space drawing/video output switching unit 131, a hand space video acquisition unit 132, and a user operation information acquisition unit 133.
 Accordingly, the audio control unit 102 of FIG. 9 is common to the audio control unit 102 of FIG. 7 in that it includes the coefficient reading unit 111, the convolution processing unit 112, and the audio playback processing unit 113. However, the audio control unit 102 of FIG. 9 differs from the audio control unit 102 of FIG. 7 in that it additionally includes the reverberation amount adjustment setting value reading unit 141 and the reverberation amount adjustment processing unit 142. The display control unit 103 of FIG. 9 is common to the display control unit 103 of FIG. 7 in that it includes the spatial information reading unit 121, the CG model acquisition unit 122, the CG data storage unit 123, the measurement position information reading unit 124, the playback posture information acquisition unit 126, and the video drawing processing unit 127. However, the display control unit 103 of FIG. 9 differs from the display control unit 103 of FIG. 7 in that it includes the CG space drawing/video output switching unit 131 instead of the CG space drawing unit 125 of FIG. 7, and in that it additionally includes the hand space video acquisition unit 132 and the user operation information acquisition unit 133.
 In the audio control unit 102 of FIG. 9, the reverberation amount adjustment setting value reading unit 141 acquires the playback posture information of the user U, who is the producer, at the time of audio playback (the current time) from the playback posture information acquisition unit 126, and reads a reverberation adjustment setting value corresponding to the playback posture of the user U. The reverberation adjustment setting value is a value for adjusting the RTF (room transfer function) or the RIR (room impulse response) in the binaural processing. For example, when the coefficient data acquired from the BRTF file by the coefficient reading unit 111 is coefficient data corresponding to a transfer characteristic such as an HRTF that does not take the RTF into account, the reverberation adjustment setting value can be coefficient data based on the RTF (producing reverberation) that is added to the coefficient data acquired from the BRTF file. In this case, the reverberation adjustment setting value is set to a value predetermined according to the playback posture of the user U. For example, a first playback mode in which the audio heard in the environment of the original sound field is reproduced by binaural processing and a second playback mode in which the audio heard in the environment of the audio production work place is reproduced by binaural processing are switched according to the playback posture of the user U. In this case, in the first playback mode, a setting value is used such that coefficient data producing a large reverberation is added to the coefficient data acquired from the BRTF file. In the second playback mode, a setting value is used such that coefficient data producing a small reverberation is added. A reverberation adjustment setting value corresponding to the playback posture of the user U may also be used within the first playback mode and within the second playback mode. In the second playback mode, mixdown processing may be performed instead of binaural processing.
 On the other hand, when the coefficient data acquired from the BRTF file by the coefficient reading unit 111 is coefficient data corresponding to a transfer characteristic such as a BRTF that takes the RTF into account, the reverberation adjustment setting value can be a value (gain) that adjusts the magnitude of the coefficient data, among the coefficient data acquired from the BRTF file, that is strongly influenced by the RTF (that strongly influences the reverberation). In this case, the reverberation adjustment setting value is set to a value predetermined according to the playback posture of the user U. For example, as described above, the first playback mode and the second playback mode are switched according to the playback posture of the user U. In the first playback mode, a gain producing a large reverberation is used as the reverberation adjustment setting value for the coefficient data acquired from the BRTF file. In the second playback mode, a gain producing a small reverberation is used. A reverberation adjustment setting value corresponding to the playback posture of the user U may also be used within the first playback mode and within the second playback mode. In the second playback mode, mixdown processing may be performed instead of binaural processing.
 The reverberation amount adjustment processing unit 142 adjusts the coefficient data acquired from the BRTF file by the coefficient reading unit 111, based on that coefficient data and the reverberation adjustment setting value read by the reverberation amount adjustment setting value reading unit 141, so that a reverberation characteristic corresponding to the reverberation adjustment setting value is added, and generates FIR filter coefficients that take into account the reverberation characteristic corresponding to the reverberation adjustment setting value (the playback posture). The generated coefficients are set as the coefficients of the FIR filter in the convolution processing unit 112, and binaural processing is performed.
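 One simple way to realize the gain-based adjustment described for RTF-influenced coefficients is to scale the late part of the impulse response, which is dominated by room reverberation, while leaving the early part (direct sound and early reflections) untouched. The following is a minimal sketch under that assumption; the split point and gain values are illustrative and are not taken from this description.

```python
import numpy as np

def adjust_reverberation(brir, sample_rate, reverb_gain, early_ms=5.0):
    """Scale the reverberant tail of a binaural room impulse response (one ear).

    brir: 1-D numpy array of FIR coefficients.
    reverb_gain: e.g. 1.0 to keep the measured reverberation (first playback mode),
                 a small value such as 0.1 to suppress it (second playback mode).
    early_ms: assumed boundary between the early part and the reverberant tail.
    """
    split = int(sample_rate * early_ms / 1000.0)
    adjusted = brir.copy()
    adjusted[split:] *= reverb_gain
    return adjusted
```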
 Here, the coefficient data acquired from the BRTF file by the coefficient reading unit 111 may be coefficient data associated with the measurement posture corresponding to the playback posture of the user U, as in the third embodiment, or may be coefficient data associated with fixed measurement posture information regardless of the playback posture of the user U. The reverberation adjustment setting value may be a value corresponding to the line-of-sight direction of the user U rather than a value corresponding to the playback posture of the user U. In addition, the mode may be switched to the above-described first playback mode when the user U faces somewhere other than the hands (downward) and to the above-described second playback mode when the user U faces the hands (downward). For example, when the user U faces the hands (downward), the reverberation adjustment setting value may be a value that produces almost no reverberation so that the audio production work is easy to perform, and when the user U faces upward, it may be a value that produces the reverberation occurring in the original sound field so that the production result can be checked appropriately. The case where the user U faces upward or toward the hands is not limited to the case where the front of the head of the user U faces upward or toward the hands, and may be the case where the line-of-sight direction of the user U is directed upward or toward the hands.
 In the display control unit 103 of FIG. 9, the CG space drawing/video output switching unit 131, like the CG space drawing unit 125 of the third embodiment, sets the position and posture of the virtual camera based on the measurement position information acquired by the measurement position information reading unit 124 and the playback posture information acquired by the playback posture information acquisition unit 126, renders the CG space generated by the CG model acquisition unit 122, and generates a 2D CG image. The generated CG image is supplied to the video drawing processing unit 127 (in the case of the first playback mode described above). On the other hand, in the case of the second playback mode described above, for example, when the CG space drawing/video output switching unit 131 detects, based on the playback posture information acquired by the playback posture information acquisition unit 126, that the user U has turned toward the hands, it switches from generating the CG image and instead acquires the live-action image of the space at the hands of the user U obtained by the hand space video acquisition unit 132, and supplies it to the video drawing processing unit 127. The hand space video acquisition unit 132 can acquire, for example, an image captured by an outward-facing camera installed on the VR goggles or the like as the live-action image of the space at hand. The video drawing processing unit 127 generates a video signal for displaying the CG image or the live-action image from the CG space drawing/video output switching unit 131 on the display device connected to the playback processing device 11, and outputs it to the display device.
 ユーザ操作情報取得部133は、ミキシングコンソール等の図2のコンソール機41が操作された場合や、モニタ等に表示されたGUIに対する操作が行われた場合に、その操作内容等をユーザ操作情報として取得する。CG空間描画/映像出力切替え部131は、実写映像を映像描画処理部127に供給して表示装置に表示させている際に、ユーザ操作情報取得部133からのユーザ操作情報に基づいて、ユーザ操作の操作内容を示す情報(操作情報)を実写映像に重畳させる。これにより、表示装置に実写映像に重ねて操作情報が提示される。 When a console device 41 in FIG. 2 such as a mixing console is operated, or when an operation is performed on a GUI displayed on a monitor or the like, the user operation information acquisition unit 133 acquires the operation content as user operation information. When the CG space rendering/video output switching unit 131 supplies live-action video to the video rendering processing unit 127 for display on the display device, it superimposes information indicating the user's operation content (operation information) on the live-action video based on the user operation information from the user operation information acquisition unit 133. As a result, the operation information is presented on the display device superimposed on the live-action video.
 図10は、コンソール機41をユーザUが操作した際にコンソール機41の実写映像に重ねて表示される操作情報を例示した図である。図10においてコンソール機41は実写映像である。これに対して操作情報161が重ねて表示される。操作情報161は、コンソール機41の操作された部分の拡大図(CG映像)と、操作によって変更された数値(編集値)の情報とを含む。このような操作情報161が実写映像に重ねてユーザUに提示されることで、実写映像だけでは操作し難いコンソール機41等の操作が行い易くなる。なお、CG空間描画/映像出力切替え部131がCG映像を映像描画処理部127に供給して表示装置に表示させている場合においても、図10のような操作情報をCG映像に重畳させてユーザUに提示されるようにしてもよい。操作情報は、操作によって変更された数値のみであってもよく、操作情報は、図10に示した形態に限らない。また、フェーダーコントローラでトラックのボリュームを編集する場合、マウスやエンコーダなどでパニングを編集する場合、キーボードでトラック名や数値などを入力する場合等においても、それらの操作装置の実写映像に重ねて操作情報を表示させるようにしてもよい。コンソール機41については、実写映像ではなく、CG映像を採用するとともに、そのコンソール機41のCG映像に、操作情報を重畳して表示し、ユーザUの手元には、コンソール機41ではなく、コンソール機41を模した平面や、箱、凹凸のあるパネル等を配置することができる。加えて、コンソール機41のフェーダやつまみなどを操作する感覚を、触覚再現ハプティクス技術で再現することができる。この場合、実際にはユーザUの手元にないコンソール機41があるかのように見せかけることができる。さらに、ユーザUはVRゴーグルを被って作業を行うことで、音響制作作業場所が、実物のコンソール機41等の音響機器があるスタジオや作業室等ではなくても、そのようなスタジオ等にいるかのような感覚を享受することができる。 FIG. 10 is a diagram illustrating an example of operation information displayed superimposed on the live-action image of the console device 41 when the user U operates the console device 41. In FIG. 10, the console device 41 is shown as a live-action image, and operation information 161 is displayed superimposed on it. The operation information 161 includes an enlarged view (CG image) of the operated part of the console device 41 and information on the numerical value (edited value) changed by the operation. By presenting such operation information 161 to the user U superimposed on the live-action image, it becomes easier to operate the console device 41 and the like, which would be difficult to operate from the live-action image alone. Note that even when the CG space rendering/video output switching unit 131 supplies the CG image to the video rendering processing unit 127 to display it on the display device, operation information such as that shown in FIG. 10 may be superimposed on the CG image and presented to the user U. The operation information may be only the numerical value changed by the operation, and the operation information is not limited to the form shown in FIG. 10. In addition, when editing the volume of a track with a fader controller, editing panning with a mouse or encoder, or inputting a track name, a numerical value, or the like with a keyboard, the operation information may be displayed superimposed on the live-action image of the corresponding operation device. For the console device 41, a CG image may be adopted instead of a live-action image, with the operation information displayed superimposed on the CG image of the console device 41; in that case, instead of the console device 41 itself, a flat surface, a box, an uneven panel, or the like imitating the console device 41 can be placed at the user U's hands. In addition, the sensation of operating the faders, knobs, and the like of the console device 41 can be reproduced by haptic reproduction technology. In this way, it is possible to make it appear to the user U as if the console device 41 were at hand even though it is not actually there. Furthermore, by wearing VR goggles while working, the user U can enjoy the sensation of being in a studio or workroom containing audio equipment such as the actual console device 41, even if the audio production work place is not such a studio or workroom.
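 One simple way to superimpose the operation information on the live-action video is alpha blending of a small rendered patch onto the camera frame. The sketch below assumes HxWx3 uint8 NumPy arrays and is only one possible realization; the patch contents (enlarged control and edited value) would be rendered separately.

```python
import numpy as np

def overlay_operation_info(frame, info_patch, top_left, alpha=0.8):
    """Alpha-blend a small rendered patch showing the operation information
    (e.g. the enlarged control and its edited value) onto a live-action frame.
    Both arrays are assumed to be HxWx3 uint8 images."""
    y, x = top_left
    h, w, _ = info_patch.shape
    roi = frame[y:y + h, x:x + w].astype(np.float32)
    blended = alpha * info_patch.astype(np.float32) + (1.0 - alpha) * roi
    frame[y:y + h, x:x + w] = blended.astype(np.uint8)
    return frame

# Example: blend a 64x128 patch near the top-left corner of a camera frame.
camera_frame = np.zeros((480, 640, 3), dtype=np.uint8)
patch = np.full((64, 128, 3), 255, dtype=np.uint8)
camera_frame = overlay_operation_info(camera_frame, patch, top_left=(10, 10))
```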
 再生処理装置11の再生モード切替えの第4実施例によれば、音響制作作業場所で音響制作を行っている制作者であるユーザUは、多ch音声再生ソフト24により制作した音声を原音場で再生した場合に聴取する音声を、バイノーラル処理された音声により聴取することができる。また、ユーザUが再生姿勢を変えることで、再生姿勢に応じた残響特性が考慮されたHRTFに調整され、原音場の聴取者が同様に姿勢を変えた場合に聴取する音声を聴取することができる。また、ユーザUには原音場での聴取位置で視認するCG映像と、手元空間の実写映像とが自動的に切り替えられて提示される。したがって、再生モードの自動切替えと連動してユーザUに提示される映像も自動で切り替えられるので、ユーザUが音響制作の作業と制作結果の確認作業とでそれらを手動で切り替える手間を低減することができ、作業効率が格段に向上する。 According to the fourth embodiment of the playback mode switching of the playback processing device 11, the user U, who is a producer performing audio production at the audio production work place, can listen, as binaurally processed audio, to the sound that would be heard if the audio produced with the multi-channel audio playback software 24 were played back in the original sound field. In addition, when the user U changes the playback posture, the HRTF is adjusted to take into account the reverberation characteristics corresponding to that playback posture, so that the user U can hear the sound that a listener in the original sound field would hear if the listener similarly changed posture. Furthermore, the CG image viewed from the listening position in the original sound field and the live-action image of the hand space are automatically switched and presented to the user U. Therefore, the image presented to the user U is switched automatically in conjunction with the automatic switching of the playback mode, so that the effort required for the user U to manually switch between the audio production work and the checking of the production results is reduced, and work efficiency is significantly improved.
<再生処理装置11の再生モード切替えの第4実施例の処理手順>
 図11は、再生処理装置11の再生処理切替えの第4実施例の処理手順例を示したフローチャートである。尚、図11において、ステップS41乃至S44、及び、ステップS46乃至S48は、第3実施例における図8のステップS11乃至S17と共通するので、説明を省略する。ステップS45では、CG空間描画/映像出力切替え部131は、ステップS44で取得された再生姿勢情報に基づいて、出力映像がCG映像か実写映像(手元映像)かを判定する。例えば、ユーザUの姿勢(頭部正面)が上を向いている場合には、出力映像がCG映像であると判定し、ユーザUの姿勢が手元を向いている場合には、出力映像が実写映像であると判定する。ステップS45において、出力映像が実写映像であると判定された場合、処理はステップS49に進み、CG空間描画/映像出力切替え部131は、手元空間映像取得部132から手元空間の実写映像を取得する。ステップS50では、CG空間描画/映像出力切替え部131は、ユーザ操作情報取得部133からユーザ操作情報を取得する。CG空間描画/映像出力切替え部131は、ユーザ操作情報に基づいてユーザ操作が行われたことを検出した場合には、その操作内容を示す操作情報をステップS49で取得した実写映像に重畳する。ステップS51では、映像描画処理部127は、ステップS49で取得された実写映像、又は、ステップS50で操作情報が重畳された実写映像を表示装置に表示するための映像信号を生成して表示装置に出力する。処理はステップS51からステップS52に進む。
<Processing Procedure of Playback Mode Switching in the Playback Processing Device 11 in the Fourth Example>
 FIG. 11 is a flow chart showing an example of a processing procedure of the fourth embodiment of the playback process switching of the playback processing device 11. In FIG. 11, steps S41 to S44 and steps S46 to S48 are common to steps S11 to S17 in FIG. 8 in the third embodiment, so the description will be omitted. In step S45, the CG space rendering/video output switching unit 131 judges whether the output image is a CG image or a live-action image (hand image) based on the playback posture information acquired in step S44. For example, if the posture (head front) of the user U is facing up, it is judged that the output image is a CG image, and if the posture of the user U is facing the hands, it is judged that the output image is a live-action image. In step S45, if it is judged that the output image is a live-action image, the process proceeds to step S49, and the CG space rendering/video output switching unit 131 acquires a live-action image of the hand space from the hand space image acquisition unit 132. In step S50, the CG space rendering/video output switching unit 131 acquires user operation information from the user operation information acquisition unit 133. When the CG space rendering/video output switching unit 131 detects that a user operation has been performed based on the user operation information, it superimposes operation information indicating the operation content on the live-action video acquired in step S49. In step S51, the video rendering processing unit 127 generates a video signal for displaying on the display device the live-action video acquired in step S49 or the live-action video on which the operation information has been superimposed in step S50, and outputs the video signal to the display device. The process proceeds from step S51 to step S52.
 ステップS52では、残響量調整設定値読込部141は、ステップS44で取得された再生姿勢情報に基づいて、ユーザUの再生姿勢に対応する残響音調整設定値を読み込む。ステップS53では、係数読込部111は、ステップS44で取得された再生姿勢情報に基づいて、ユーザUの再生姿勢に対応する係数データをステップS41で取得されたBRTFファイルから読み込む。なお、係数読込部111が読み込む係数データは、ユーザUの再生姿勢とは関係なく所定の再生姿勢に対応する係数データであってもよい。ステップS54では、残響量調整処理部142は、ステップS53で取得された係数データと、ステップS52で取得された残響音調整設定値とに基づいて、係数データを調整し、再生姿勢に応じた残響量を考慮したFIRフィルタの係数を生成する。ステップS55では、畳み込み処理部112は、ステップS54で生成された係数を、FIRフィルタの係数として設定し、図1の多ch音声再生ソフト24から供給される音声信号に対して、FIRフィルタを用いた畳み込み処理(バイノーラル処理)を行う。ステップS56では、音声再生処理部113は、畳み込み処理部112によりバイノーラル処理された音声信号を図1の音声再生装置12に出力する。ステップS56の処理が終了すると、本フローチャートの処理が終了する。本フローチャートの処理は繰り返し実行される。 In step S52, the reverberation adjustment setting value reading unit 141 reads the reverberation adjustment setting value corresponding to the playback posture of user U based on the playback posture information acquired in step S44. In step S53, the coefficient reading unit 111 reads coefficient data corresponding to the playback posture of user U from the BRTF file acquired in step S41 based on the playback posture information acquired in step S44. Note that the coefficient data read by the coefficient reading unit 111 may be coefficient data corresponding to a specific playback posture regardless of the playback posture of user U. In step S54, the reverberation adjustment processing unit 142 adjusts the coefficient data based on the coefficient data acquired in step S53 and the reverberation adjustment setting value acquired in step S52, and generates coefficients of an FIR filter that take into account the reverberation amount according to the playback posture. In step S55, the convolution processing unit 112 sets the coefficients generated in step S54 as the coefficients of an FIR filter, and performs convolution processing (binaural processing) using the FIR filter on the audio signal supplied from the multi-channel audio playback software 24 in FIG. 1. In step S56, the audio playback processing unit 113 outputs the audio signal that has been binaurally processed by the convolution processing unit 112 to the audio playback device 12 in FIG. 1. When the processing of step S56 ends, the processing of this flowchart ends. The processing of this flowchart is executed repeatedly.
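 The reverberation adjustment and convolution of steps S52 to S55 can be sketched as follows. The split between the direct part and the reverberant tail of the BRTF impulse response, the single-channel input, and the use of scipy.signal.fftconvolve are simplifying assumptions for illustration; a real implementation would sum the contributions of all playback channels per ear and process audio block by block.

```python
import numpy as np
from scipy.signal import fftconvolve

def adjust_reverberation(brtf_ir, reverb_gain, direct_len=2048):
    """Scale the late (reverberant) part of a BRTF impulse response.
    The split point between direct sound and reverberation is an assumed value."""
    ir = brtf_ir.copy()
    ir[direct_len:] *= reverb_gain
    return ir

def binaural_render(audio_mono, brtf_left, brtf_right, reverb_gain):
    """Convolve one input channel with posture-dependent BRTF coefficients
    whose reverberation amount has been adjusted (cf. steps S52 to S55)."""
    h_l = adjust_reverberation(brtf_left, reverb_gain)
    h_r = adjust_reverberation(brtf_right, reverb_gain)
    out_l = fftconvolve(audio_mono, h_l, mode="full")
    out_r = fftconvolve(audio_mono, h_r, mode="full")
    return np.stack([out_l, out_r])

# Example with random data standing in for the real signal and BRTF file.
rng = np.random.default_rng(0)
audio = rng.standard_normal(48000)
ir_l = rng.standard_normal(4096)
ir_r = rng.standard_normal(4096)
stereo_out = binaural_render(audio, ir_l, ir_r, reverb_gain=0.2)
```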
<他の実施例>
 再生処理装置11における再生モード切替えは、再生処理装置11に接続された任意の操作装置の操作に連携して行われるようにしてもよい。図12には、再生処理装置11に接続される操作装置の例として、コントローラ171やコンソール機172が例示されている。例えば、コントローラ171のジョイスティック171Aの傾斜角度を図1のセンシング部22が取得することとする。このとき、トリガ生成部23は、ジョイスティック171Aの特定の傾斜角度を検出した場合に再生モード切替えのトリガ信号を出力する。
<Other Examples>
 The switching of the playback mode in the playback processing device 11 may be performed in coordination with the operation of an arbitrary operation device connected to the playback processing device 11. FIG. 12 illustrates a controller 171 and a console device 172 as examples of operation devices connected to the playback processing device 11. For example, it is assumed that the sensing unit 22 in FIG. 1 acquires the tilt angle of the joystick 171A of the controller 171. At this time, the trigger generation unit 23 outputs a trigger signal for switching the playback mode when a specific tilt angle of the joystick 171A is detected.
 具体例として、ジョイスティック171Aの傾斜角度を原音場における聴取者の頭部中心から原音源までの距離に連携させる。このとき、ジョイスティック171Aの傾斜角度が、頭部と原音源とが重なる角度又は頭部中心と原音源とが十分に近いとみなせる角度であることが検出された場合には、トリガ生成部23は、再生モードがミックスダウン処理に設定され、又は、バイノーラル処理に適用するHRTFが変化するようにトリガ信号を出力する。ジョイスティック171Aの傾斜角度が、頭部中心と原音源とが所定距離以上離れたとみなせる角度であることが検出された場合には、トリガ生成部23は、再生モードがバイノーラル処理に設定され、又は、バイノーラル処理に適用するHRTFが変化するようにトリガ信号を出力する。 As a specific example, the tilt angle of the joystick 171A is linked to the distance from the center of the listener's head to the original sound source in the original sound field. At this time, when it is detected that the tilt angle of the joystick 171A is an angle at which the head and the original sound source overlap, or an angle at which the center of the head and the original sound source can be regarded as sufficiently close, the trigger generation unit 23 outputs a trigger signal so that the playback mode is set to the mixdown processing or the HRTF applied to the binaural processing is changed. When it is detected that the tilt angle of the joystick 171A is an angle at which the center of the head and the original sound source can be regarded as separated by a predetermined distance or more, the trigger generation unit 23 outputs a trigger signal so that the playback mode is set to the binaural processing or the HRTF applied to the binaural processing is changed.
 また、ジョイスティック171Aを原音場における原音源(又は音像)の座標位置を変更する操作に使用することができる。例えば、原音場の所定の位置の原点に対する原音源の距離の操作や、原点に対する原音源の水平方向又は仰角方向の操作に用いることができる。このとき、トリガ生成部23は、原音源と原点との距離に応じてトリガ信号を出力するようにしてもよい。また、コンソール機172のスライダ172Aの位置を原音場における聴取者の頭部中心から原音源までの距離に連携させ、トリガ生成部23は、スライダ172Aの位置に応じてトリガ信号を出力してもよい。 The joystick 171A can also be used to change the coordinate position of the original sound source (or sound image) in the original sound field. For example, it can be used to manipulate the distance of the original sound source relative to an origin at a predetermined position in the original sound field, or to manipulate the horizontal or elevation angle of the original sound source relative to the origin. In this case, the trigger generation unit 23 may output a trigger signal according to the distance between the original sound source and the origin. Furthermore, the position of the slider 172A of the console device 172 may be linked to the distance from the center of the listener's head in the original sound field to the original sound source, and the trigger generation unit 23 may output a trigger signal according to the position of the slider 172A.
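 A possible mapping from the joystick tilt angle to the listener-to-source distance and then to a playback-mode trigger is sketched below. The distance thresholds, the maximum tilt, and the trigger labels are illustrative assumptions, not values specified by the embodiment.

```python
# Distance thresholds (meters) are illustrative assumptions.
NEAR_DISTANCE_M = 0.2   # head center and original sound source regarded as overlapping
FAR_DISTANCE_M = 1.0    # head center and original sound source regarded as separated

def tilt_to_distance(tilt_deg, max_tilt_deg=30.0, max_distance_m=5.0):
    """Map the joystick tilt angle to a listener-to-source distance in the original sound field."""
    return min(abs(tilt_deg) / max_tilt_deg, 1.0) * max_distance_m

def generate_trigger(tilt_deg):
    """Return a playback-mode trigger derived from the joystick tilt, or None."""
    distance = tilt_to_distance(tilt_deg)
    if distance <= NEAR_DISTANCE_M:
        return "mixdown"    # switch to 2ch mixdown (non-3D) processing
    if distance >= FAR_DISTANCE_M:
        return "binaural"   # switch to (or keep) binaural processing
    return None             # no mode switch; the applied HRTF may instead be adjusted

print(generate_trigger(1.0), generate_trigger(25.0))  # -> mixdown binaural
```

 A slider such as the slider 172A could be mapped to the same distance in place of the tilt angle.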
 <コンピュータの構成例>
 上述した一連の処理は、ハードウェアにより実行することもできるし、ソフトウェアにより実行することもできる。一連の処理をソフトウェアにより実行する場合には、そのソフトウェアを構成するプログラムが、コンピュータにインストールされる。ここで、コンピュータには、専用のハードウェアに組み込まれているコンピュータや、各種のプログラムをインストールすることで、各種の機能を実行することが可能な、例えば汎用のパーソナルコンピュータなどが含まれる。
<Example of computer configuration>
The above-mentioned series of processes can be executed by hardware or software. When the series of processes is executed by software, the program constituting the software is installed in a computer. Here, the computer includes a computer built into dedicated hardware, and a general-purpose personal computer, for example, capable of executing various functions by installing various programs.
 図13は、上述した一連の処理をプログラムにより実行するコンピュータのハードウェアの構成例を示すブロック図である。 FIG. 13 is a block diagram showing an example of the hardware configuration of a computer that executes the above-mentioned series of processes using a program.
 コンピュータにおいて、CPU(Central Processing Unit)201,ROM(Read Only Memory)202,RAM(Random Access Memory)203は、バス204により相互に接続されている。 In a computer, a CPU (Central Processing Unit) 201, a ROM (Read Only Memory) 202, and a RAM (Random Access Memory) 203 are interconnected by a bus 204.
 バス204には、さらに、入出力インタフェース205が接続されている。入出力インタフェース205には、入力部206、出力部207、記憶部208、通信部209、及びドライブ210が接続されている。 Further connected to the bus 204 is an input/output interface 205. Connected to the input/output interface 205 are an input unit 206, an output unit 207, a storage unit 208, a communication unit 209, and a drive 210.
 入力部206は、キーボード、マウス、マイクロフォンなどよりなる。出力部207は、ディスプレイ、スピーカなどよりなる。記憶部208は、ハードディスクや不揮発性のメモリなどよりなる。通信部209は、ネットワークインタフェースなどよりなる。ドライブ210は、磁気ディスク、光ディスク、光磁気ディスク、又は半導体メモリなどのリムーバブルメディア211を駆動する。 The input unit 206 includes a keyboard, mouse, microphone, etc. The output unit 207 includes a display, speaker, etc. The storage unit 208 includes a hard disk, non-volatile memory, etc. The communication unit 209 includes a network interface, etc. The drive 210 drives removable media 211 such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory.
 以上のように構成されるコンピュータでは、CPU201が、例えば、記憶部208に記憶されているプログラムを、入出力インタフェース205及びバス204を介して、RAM203にロードして実行することにより、上述した一連の処理が行われる。 In a computer configured as described above, the CPU 201 loads a program stored in the storage unit 208, for example, into the RAM 203 via the input/output interface 205 and the bus 204, and executes the program, thereby performing the above-mentioned series of processes.
 コンピュータ(CPU201)が実行するプログラムは、例えば、パッケージメディア等としてのリムーバブルメディア211に記録して提供することができる。また、プログラムは、ローカルエリアネットワーク、インターネット、デジタル衛星放送といった、有線または無線の伝送媒体を介して提供することができる。 The program executed by the computer (CPU 201) can be provided by being recorded on removable media 211, such as package media, for example. The program can also be provided via a wired or wireless transmission medium, such as a local area network, the Internet, or digital satellite broadcasting.
 コンピュータでは、プログラムは、リムーバブルメディア211をドライブ210に装着することにより、入出力インタフェース205を介して、記憶部208にインストールすることができる。また、プログラムは、有線または無線の伝送媒体を介して、通信部209で受信し、記憶部208にインストールすることができる。その他、プログラムは、ROM202や記憶部208に、あらかじめインストールしておくことができる。 In a computer, a program can be installed in the storage unit 208 via the input/output interface 205 by inserting the removable medium 211 into the drive 210. The program can also be received by the communication unit 209 via a wired or wireless transmission medium and installed in the storage unit 208. Alternatively, the program can be pre-installed in the ROM 202 or storage unit 208.
 なお、コンピュータが実行するプログラムは、本明細書で説明する順序に沿って時系列に処理が行われるプログラムであっても良いし、並列に、あるいは呼び出しが行われたとき等の必要なタイミングで処理が行われるプログラムであっても良い。 The program executed by the computer may be a program in which processing is performed chronologically in the order described in this specification, or a program in which processing is performed in parallel or at the required timing, such as when called.
 ここで、本明細書において、コンピュータがプログラムに従って行う処理は、必ずしもフローチャートとして記載された順序に沿って時系列に行われる必要はない。すなわち、コンピュータがプログラムに従って行う処理は、並列的あるいは個別に実行される処理(例えば、並列処理あるいはオブジェクトによる処理)も含む。 In this specification, the processing performed by a computer according to a program does not necessarily have to be performed in chronological order according to the order described in the flowchart. In other words, the processing performed by a computer according to a program also includes processing that is executed in parallel or individually (for example, parallel processing or processing by objects).
 また、プログラムは、1のコンピュータ(プロセッサ)により処理されるものであっても良いし、複数のコンピュータによって分散処理されるものであっても良い。さらに、プログラムは、遠方のコンピュータに転送されて実行されるものであっても良い。 The program may be processed by one computer (processor), or may be distributed among multiple computers. Furthermore, the program may be transferred to a remote computer for execution.
 さらに、本明細書において、システムとは、複数の構成要素(装置、モジュール(部品)等)の集合を意味し、すべての構成要素が同一筐体中にあるか否かは問わない。したがって、別個の筐体に収納され、ネットワークを介して接続されている複数の装置、及び、1つの筐体の中に複数のモジュールが収納されている1つの装置は、いずれも、システムである。 Furthermore, in this specification, a system refers to a collection of multiple components (devices, modules (parts), etc.), regardless of whether all the components are in the same housing. Therefore, multiple devices housed in separate housings and connected via a network, and a single device in which multiple modules are housed in a single housing, are both systems.
 また、例えば、1つの装置(または処理部)として説明した構成を分割し、複数の装置(または処理部)として構成するようにしてもよい。逆に、以上において複数の装置(または処理部)として説明した構成をまとめて1つの装置(または処理部)として構成されるようにしてもよい。また、各装置(または各処理部)の構成に上述した以外の構成を付加するようにしてももちろんよい。さらに、システム全体としての構成や動作が実質的に同じであれば、ある装置(または処理部)の構成の一部を他の装置(または他の処理部)の構成に含めるようにしてもよい。 Also, for example, the configuration described above as one device (or processing unit) may be divided and configured as multiple devices (or processing units). Conversely, the configurations described above as multiple devices (or processing units) may be combined and configured as one device (or processing unit). Of course, configurations other than those described above may also be added to the configuration of each device (or processing unit). Furthermore, as long as the configuration and operation of the system as a whole are substantially the same, part of the configuration of one device (or processing unit) may be included in the configuration of another device (or other processing unit).
 また、例えば、本技術は、1つの機能を、ネットワークを介して複数の装置で分担、共同して処理するクラウドコンピューティングの構成をとることができる。 Also, for example, this technology can be configured as cloud computing, in which a single function is shared and processed collaboratively by multiple devices via a network.
 また、例えば、上述したプログラムは、任意の装置において実行することができる。その場合、その装置が、必要な機能(機能ブロック等)を有し、必要な情報を得ることができるようにすればよい。 Furthermore, for example, the above-mentioned program can be executed in any device. In that case, it is sufficient that the device has the necessary functions (functional blocks, etc.) and is able to obtain the necessary information.
 また、例えば、上述のフローチャートで説明した各ステップは、1つの装置で実行する他、複数の装置で分担して実行することができる。さらに、1つのステップに複数の処理が含まれる場合には、その1つのステップに含まれる複数の処理は、1つの装置で実行する他、複数の装置で分担して実行することができる。換言するに、1つのステップに含まれる複数の処理を、複数のステップの処理として実行することもできる。逆に、複数のステップとして説明した処理を1つのステップとしてまとめて実行することもできる。 Furthermore, for example, each step described in the above flowchart can be executed by one device, or can be shared and executed by multiple devices. Furthermore, if one step includes multiple processes, the multiple processes included in that one step can be executed by one device, or can be shared and executed by multiple devices. In other words, multiple processes included in one step can be executed as multiple step processes. Conversely, processes described as multiple steps can be executed collectively as one step.
 なお、コンピュータが実行するプログラムは、プログラムを記述するステップの処理が、本明細書で説明する順序に沿って時系列に実行されるようにしても良いし、並列に、あるいは呼び出しが行われたとき等の必要なタイミングで個別に実行されるようにしても良い。つまり、矛盾が生じない限り、各ステップの処理が上述した順序と異なる順序で実行されるようにしてもよい。さらに、このプログラムを記述するステップの処理が、他のプログラムの処理と並列に実行されるようにしても良いし、他のプログラムの処理と組み合わせて実行されるようにしても良い。 In addition, the processing of the steps that describe a program executed by a computer may be executed chronologically in the order described in this specification, or may be executed in parallel, or individually at the required timing, such as when a call is made. In other words, as long as no contradictions arise, the processing of each step may be executed in an order different from the order described above. Furthermore, the processing of the steps that describe this program may be executed in parallel with the processing of other programs, or may be executed in combination with the processing of other programs.
 なお、本明細書において複数説明した本技術は、矛盾が生じない限り、それぞれ独立に単体で実施することができる。もちろん、任意の複数の本技術を併用して実施することもできる。例えば、いずれかの実施の形態において説明した本技術の一部または全部を、他の実施の形態において説明した本技術の一部または全部と組み合わせて実施することもできる。また、上述した任意の本技術の一部または全部を、上述していない他の技術と併用して実施することもできる。 Note that the multiple present technologies described in this specification can be implemented independently and individually, provided no contradictions arise. Of course, any multiple present technologies can also be implemented in combination. For example, part or all of the present technologies described in any embodiment can be implemented in combination with part or all of the present technologies described in other embodiments. Also, part or all of any of the present technologies described above can be implemented in combination with other technologies not described above.
 <構成の組み合わせ例>
 なお、本技術は以下のような構成も取ることができる。
(1)
 入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する再生信号生成部
 を有する情報処理装置。
(2)
 前記3D音響再生処理は、前記入力された音声信号に対して空間の音響特性を反映させる処理である
 前記(1)に記載の情報処理装置。
(3)
 前記非3D音響再生処理は、前記入力された音声信号のチャンネル数を変更する処理である
 前記(1)又は(2)に記載の情報処理装置。
(4)
 前記再生信号生成部は、前記ユーザの状態を示すセンシング情報に基づいて選択された再生処理を実行する
 前記(1)乃至(3)のいずれかに記載の情報処理装置。
(5)
 前記ユーザの状態は、前記ユーザの頭部の姿勢又は視線方向に関する状態である
 前記(1)乃至(4)のいずれかに記載の情報処理装置。
(6)
 前記ユーザの状態は、前記再生信号生成部により実行される前記再生処理の切替え以外の用途に使用される操作部材に対する操作状態である
 前記(1)乃至(5)のいずれかに記載の情報処理装置。
(7)
 前記3D音響再生処理は、前記音響特性に対応した伝達関数を前記入力された音声信号に対して畳み込む処理である
 前記(2)に記載の情報処理装置。
(8)
 前記3D音響再生処理は、FIRフィルタを用いた処理である
 前記(7)に記載の情報処理装置。
(9)
 前記伝達関数は、頭部伝達関数、バイノーラル伝達関数、又は、頭部伝達関数と室内伝達関数との組合せである
 前記(7)又は(8)に記載の情報処理装置。
(10)
 前記再生信号生成部は、前記空間で実測された前記音響特性を用いる
 前記(2)、及び、(7)乃至(9)のいずれかに記載の情報処理装置。
(11)
 前記再生信号生成部は、複数の姿勢で実測された前記音響特性のうち、現時点での前記ユーザの姿勢に対応する前記音響特性を用いる
 前記(10)に記載の情報処理装置。
(12)
 前記再生信号生成部は、前記伝達関数を変更し、又は、調整することにより実行する前記再生処理を切り替える
 前記(7)乃至(11)のいずれかに記載の情報処理装置。
(13)
 前記3D音響再生処理により反映する前記音響特性を有する前記空間のCG映像を出力する表示制御部
 を有する
 前記(2)、及び、(7)乃至(12)のいずれかに記載の情報処理装置。
(14)
 前記表示制御部は、前記再生信号生成部による前記3D音響再生処理の実行に連動して前記CG映像を出力する
 前記(13)に記載の情報処理装置。
(15)
 前記CG映像は、前記3D音響再生処理が前記入力された音声信号に対して反映させる前記音響特性を有する前記空間を再現したCG空間を仮想カメラで撮影した映像である
 前記(14)に記載の情報処理装置。
(16)
 前記CG映像は、前記ユーザの姿勢に応じて前記仮想カメラの姿勢が変更されて撮影された映像である
 前記(15)に記載の情報処理装置。
(17)
 前記表示制御部は、前記再生信号生成部による前記非3D音響再生処理の実行に連動して実写映像を出力する
 前記(14)乃至(16)のいずれかに記載の情報処理装置。
(18)
 前記実写映像は、前記ユーザの周囲をカメラで撮影した映像である
 前記(17)に記載の情報処理装置。
(19)
 再生信号生成部
 を有する
 情報処理装置の
 前記再生信号生成部が、入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する
 情報処理方法。
(20)
 コンピュータを
 入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する再生信号生成部
 として機能させるためのプログラム。
<Examples of configuration combinations>
The present technology can also be configured as follows.
(1)
An information processing device having a playback signal generation unit that performs a playback process corresponding to a user's state from among multiple types of playback processes including 3D audio playback processing and non-3D audio playback processing on an input audio signal to generate an audio signal for playback.
(2)
The information processing device according to (1), wherein the 3D sound reproduction process is a process for reflecting acoustic characteristics of a space in the input audio signal.
(3)
The information processing device according to (1) or (2), wherein the non-3D sound reproduction process is a process for changing a number of channels of the input audio signal.
(4)
The information processing device according to any one of (1) to (3), wherein the playback signal generation unit executes a playback process selected based on sensing information indicating a state of the user.
(5)
The information processing device according to any one of (1) to (4), wherein the state of the user is a state related to a head posture or a line of sight direction of the user.
(6)
The information processing device according to any one of (1) to (5), wherein the state of the user is an operation state of an operation member used for purposes other than switching the playback process executed by the playback signal generating section.
(7)
The information processing device according to (2), wherein the 3D sound reproduction process is a process of convolving a transfer function corresponding to the acoustic characteristics with the input audio signal.
(8)
The information processing device according to (7), wherein the 3D sound reproduction process is a process using an FIR filter.
(9)
The information processing device according to (7) or (8), wherein the transfer function is a head-related transfer function, a binaural transfer function, or a combination of a head-related transfer function and a room transfer function.
(10)
The information processing device according to any one of (2) and (7) to (9), wherein the reproduction signal generation unit uses the acoustic characteristics actually measured in the space.
(11)
The information processing device according to (10), wherein the reproduction signal generation unit uses the acoustic characteristic corresponding to the current posture of the user among the acoustic characteristics actually measured in a plurality of postures.
(12)
The information processing device according to any one of (7) to (11), wherein the reproduction signal generation unit switches the reproduction process to be executed by changing or adjusting the transfer function.
(13)
The information processing device according to any one of (2) and (7) to (12), further comprising a display control unit that outputs a CG image of the space having the acoustic characteristics reflected by the 3D audio reproduction process.
(14)
The information processing device according to (13), wherein the display control unit outputs the CG image in conjunction with execution of the 3D sound reproduction process by the reproduction signal generation unit.
(15)
The information processing device described in (14), wherein the CG image is an image captured by a virtual camera of a CG space that reproduces the space having the acoustic characteristics that the 3D audio playback processing reflects on the input audio signal.
(16)
The information processing device according to (15), wherein the CG image is an image captured by changing the posture of the virtual camera in accordance with the posture of the user.
(17)
The information processing device according to any one of (14) to (16), wherein the display control unit outputs a live-action video in conjunction with execution of the non-3D sound reproduction process by the reproduction signal generation unit.
(18)
The information processing device according to (17), wherein the live-action image is an image captured by a camera around the user.
(19)
An information processing method of an information processing device having a playback signal generation unit, the playback signal generation unit executing a playback process corresponding to a user's state from among multiple types of playback processes including 3D sound playback processing and non-3D sound playback processing on an input audio signal, to generate an audio signal for playback.
(20)
A program for causing a computer to function as a playback signal generating unit that generates an audio signal for playback by executing a playback process corresponding to the user's state from among multiple types of playback processes, including 3D audio playback processing and non-3D audio playback processing, on an input audio signal.
 なお、本実施の形態は、上述した実施の形態に限定されるものではなく、本開示の要旨を逸脱しない範囲において種々の変更が可能である。また、本明細書に記載された効果はあくまで例示であって限定されるものではなく、他の効果があってもよい。 Note that this embodiment is not limited to the above-described embodiment, and various modifications are possible without departing from the gist of this disclosure. Furthermore, the effects described in this specification are merely examples and are not limiting, and other effects may also be present.
 1 姿勢, 11 再生処理装置, 12 音声再生装置, 21 再生信号生成部, 22 センシング部, 23 トリガ生成部, 24 多ch音声再生ソフト, 31 トリガ受信部, 32 切替処理部, 33 バイノーラル処理部, 34 2chミックスダウン/ステレオ再生処理部, 35 パススルー処理部, 41 コンソール機, 42A,42B モニタ, 43 カメラ, 44 ヘッドフォン, 45 スピーカ, 51 VRゴーグル, 81 測定処理装置, 91 スピーカ, 101 BRTFファイル取得部, 102 音声制御部, 103 表示制御部, 111 係数読込部, 112 畳み込み処理部, 113 音声再生処理部, 121 空間情報読込部, 122 CGモデル取得部, 123 CGデータ記憶部, 124 測定位置情報読込部, 125 CG空間描画部, 126 再生姿勢情報取得部, 127 映像描画処理部, 131 映像出力切替え部, 132 手元空間映像取得部, 133 ユーザ操作情報取得部, 141 残響量調整設定値読込部, 142 残響量調整処理部 1 Posture, 11 Playback processing device, 12 Audio playback device, 21 Playback signal generation unit, 22 Sensing unit, 23 Trigger generation unit, 24 Multi-channel audio playback software, 31 Trigger reception unit, 32 Switching processing unit, 33 Binaural processing unit, 34 2ch mixdown/stereo playback processing unit, 35 Pass-through processing unit, 41 Console device, 42A, 42B Monitor, 43 Camera, 44 Headphones, 45 Speaker, 51 VR goggles, 81 Measurement processing unit, 91 Speaker, 101 BR TF file acquisition unit, 102, audio control unit, 103, display control unit, 111, coefficient reading unit, 112, convolution processing unit, 113, audio playback processing unit, 121, spatial information reading unit, 122, CG model acquisition unit, 123, CG data storage unit, 124, measurement position information reading unit, 125, CG space drawing unit, 126, playback posture information acquisition unit, 127, video drawing processing unit, 131, video output switching unit, 132, handheld space video acquisition unit, 133, user operation information acquisition unit, 141, reverberation amount adjustment setting value reading unit, 142, reverberation amount adjustment processing unit

Claims (20)

  1.  入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する再生信号生成部
     を有する情報処理装置。
    An information processing device having a playback signal generation unit that performs a playback process corresponding to a user's state from among multiple types of playback processes including 3D audio playback processing and non-3D audio playback processing on an input audio signal to generate an audio signal for playback.
  2.  前記3D音響再生処理は、前記入力された音声信号に対して空間の音響特性を反映させる処理である
     請求項1に記載の情報処理装置。
    The information processing device according to claim 1 , wherein the 3D sound reproduction process is a process for reflecting acoustic characteristics of a space in the input audio signal.
  3.  前記非3D音響再生処理は、前記入力された音声信号のチャンネル数を変更する処理である
     請求項1に記載の情報処理装置。
    The information processing device according to claim 1 , wherein the non-3D sound reproduction process is a process for changing a number of channels of the input audio signal.
  4.  前記再生信号生成部は、前記ユーザの状態を示すセンシング情報に基づいて選択された再生処理を実行する
     請求項1に記載の情報処理装置。
    The information processing device according to claim 1 , wherein the playback signal generating section executes a playback process selected based on sensing information indicating a state of the user.
  5.  前記ユーザの状態は、前記ユーザの頭部の姿勢又は視線方向に関する状態である
     請求項1に記載の情報処理装置。
    The information processing device according to claim 1 , wherein the state of the user is a state related to a head posture or a line of sight direction of the user.
  6.  前記ユーザの状態は、前記再生信号生成部により実行される前記再生処理の切替え以外の用途に使用される操作部材に対する操作状態である
     請求項1に記載の情報処理装置。
    The information processing apparatus according to claim 1 , wherein the user's state is an operation state of an operation member used for purposes other than switching the playback process executed by the playback signal generating section.
  7.  前記3D音響再生処理は、前記音響特性に対応した伝達関数を前記入力された音声信号に対して畳み込む処理である
     請求項2に記載の情報処理装置。
    The information processing device according to claim 2 , wherein the 3D sound reproduction process is a process of convolving a transfer function corresponding to the acoustic characteristics with the input audio signal.
  8.  前記3D音響再生処理は、FIRフィルタを用いた処理である
     請求項7に記載の情報処理装置。
    The information processing device according to claim 7 , wherein the 3D sound reproduction process is a process using an FIR filter.
  9.  前記伝達関数は、頭部伝達関数、バイノーラル伝達関数、又は、頭部伝達関数と室内伝達関数との組合せである
     請求項7に記載の情報処理装置。
    The information processing device according to claim 7 , wherein the transfer function is a head-related transfer function, a binaural transfer function, or a combination of a head-related transfer function and a room transfer function.
  10.  前記再生信号生成部は、前記空間で実測された前記音響特性を用いる
     請求項2に記載の情報処理装置。
    The information processing device according to claim 2 , wherein the reproduction signal generating section uses the acoustic characteristics actually measured in the space.
  11.  前記再生信号生成部は、複数の姿勢で実測された前記音響特性のうち、現時点での前記ユーザの姿勢に対応する前記音響特性を用いる
     請求項10に記載の情報処理装置。
    The information processing device according to claim 10 , wherein the reproduction signal generating section uses the acoustic characteristic corresponding to the current posture of the user from among the acoustic characteristics actually measured in a plurality of postures.
  12.  前記再生信号生成部は、前記伝達関数を変更し、又は、調整することにより実行する前記再生処理を切り替える
     請求項7に記載の情報処理装置。
    The information processing device according to claim 7 , wherein the reproduction signal generating section switches the reproduction process to be executed by changing or adjusting the transfer function.
  13.  前記3D音響再生処理により反映する前記音響特性を有する前記空間のCG映像を出力する表示制御部
     を有する
     請求項2に記載の情報処理装置。
    The information processing device according to claim 2 , further comprising a display control unit that outputs a CG image of the space having the acoustic characteristics reflected by the 3D audio reproduction process.
  14.  前記表示制御部は、前記再生信号生成部による前記3D音響再生処理の実行に連動して前記CG映像を出力する
     請求項13に記載の情報処理装置。
    The information processing device according to claim 13 , wherein the display control unit outputs the CG image in conjunction with execution of the 3D sound reproduction process by the reproduction signal generation unit.
  15.  前記CG映像は、前記3D音響再生処理が前記入力された音声信号に対して反映させる前記音響特性を有する前記空間を再現したCG空間を仮想カメラで撮影した映像である
     請求項14に記載の情報処理装置。
    The information processing device according to claim 14 , wherein the CG image is an image captured by a virtual camera of a CG space that reproduces the space having the acoustic characteristics that the 3D audio reproduction process reflects on the input audio signal.
  16.  前記CG映像は、前記ユーザの姿勢に応じて前記仮想カメラの姿勢が変更されて撮影された映像である
     請求項15に記載の情報処理装置。
    The information processing device according to claim 15 , wherein the CG image is an image captured by changing the posture of the virtual camera in accordance with the posture of the user.
  17.  前記表示制御部は、前記再生信号生成部による前記非3D音響再生処理の実行に連動して実写映像を出力する
     請求項14に記載の情報処理装置。
    The information processing device according to claim 14 , wherein the display control unit outputs an actual video image in conjunction with execution of the non-3D sound reproduction process by the reproduction signal generation unit.
  18.  前記実写映像は、前記ユーザの周囲をカメラで撮影した映像である
     請求項17に記載の情報処理装置。
    The information processing device according to claim 17 , wherein the actual image is an image of the surroundings of the user captured by a camera.
  19.  再生信号生成部
     を有する
     情報処理装置の
     前記再生信号生成部が、入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する
     情報処理方法。
    An information processing method of an information processing device having a playback signal generation unit, the playback signal generation unit executing a playback process corresponding to a user's state from among multiple types of playback processes including 3D sound playback processing and non-3D sound playback processing on an input audio signal, to generate an audio signal for playback.
  20.  コンピュータを
     入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する再生信号生成部
     として機能させるためのプログラム。
    A program for causing a computer to function as a playback signal generating unit that generates an audio signal for playback by executing a playback process corresponding to the user's state from among multiple types of playback processes, including 3D audio playback processing and non-3D audio playback processing, on an input audio signal.
PCT/JP2024/001050 2023-02-01 2024-01-17 Information processing device, information processing method, and program WO2024161992A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2023-013802 2023-02-01
JP2023013802 2023-02-01

Publications (1)

Publication Number Publication Date
WO2024161992A1 true WO2024161992A1 (en) 2024-08-08

Family

ID=92146432

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2024/001050 WO2024161992A1 (en) 2023-02-01 2024-01-17 Information processing device, information processing method, and program

Country Status (1)

Country Link
WO (1) WO2024161992A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006128816A (en) * 2004-10-26 2006-05-18 Victor Co Of Japan Ltd Recording program and reproducing program corresponding to stereoscopic video and stereoscopic audio, recording apparatus and reproducing apparatus, and recording medium
JP2009159073A (en) * 2007-12-25 2009-07-16 Panasonic Corp Acoustic playback apparatus and acoustic playback method
JP2010118978A (en) * 2008-11-14 2010-05-27 Victor Co Of Japan Ltd Controller of localization of sound, and method of controlling localization of sound
WO2022124084A1 (en) * 2020-12-09 2022-06-16 ソニーグループ株式会社 Reproduction apparatus, reproduction method, information processing apparatus, information processing method, and program


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24749946

Country of ref document: EP

Kind code of ref document: A1