
WO2024161992A1 - Information processing device, information processing method, and program - Google Patents

Information processing device, information processing method, and program Download PDF

Info

Publication number
WO2024161992A1
WO2024161992A1 (PCT/JP2024/001050)
Authority
WO
WIPO (PCT)
Prior art keywords
playback
user
processing device
processing
audio
Prior art date
Application number
PCT/JP2024/001050
Other languages
French (fr)
Japanese (ja)
Inventor
彬人 中井
亨 中川
越 沖本
Original Assignee
ソニーグループ株式会社
Priority date
Filing date
Publication date
Application filed by ソニーグループ株式会社
Publication of WO2024161992A1 publication Critical patent/WO2024161992A1/en


Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04S — STEREOPHONIC SYSTEMS
    • H04S 7/00 — Indicating arrangements; Control arrangements, e.g. balance control

Definitions

  • This technology relates to an information processing device, an information processing method, and a program, and in particular to an information processing device, an information processing method, and a program that enable automatic switching between multiple types of playback processes for audio playback, including 3D audio playback.
  • Patent Document 1 discloses a 3D sound reproduction technology that applies a head-related transfer function (HRTF) to an audio signal to reproduce the original sound field in a reproduction environment such as a concert hall or movie theater in a reproduction sound field that differs in time or space from the original sound field.
  • This technology was developed in light of these circumstances, and makes it possible to automatically switch between multiple types of playback processing for audio playback, including 3D audio playback.
  • The information processing device or program of the present technology is an information processing device having a playback signal generation unit that applies, to an input audio signal, the playback process corresponding to the user's state from among multiple types of playback processes, including 3D audio playback processing and non-3D audio playback processing, and thereby generates an audio signal for playback, or a program for causing a computer to function as such an information processing device.
  • The information processing method of the present technology is an information processing method in which an information processing device having a playback signal generation unit applies, to an input audio signal, the playback process corresponding to the user's state from among multiple types of playback processes, including 3D audio playback processing and non-3D audio playback processing, and generates an audio signal for playback.
  • In the information processing device, information processing method, and program of the present technology, the playback process corresponding to the user's state is executed on the input audio signal from among multiple types of playback processes, including 3D audio playback processing and non-3D audio playback processing, and an audio signal for playback is generated.
  • FIG. 1 is a block diagram showing an example configuration of a playback processing device according to an embodiment to which the present technology is applied.
  • FIG. 2 is a diagram for explaining a first embodiment of playback mode switching of the playback processing device.
  • FIG. 3 is a diagram for explaining a second embodiment of playback mode switching of the playback processing device.
  • FIG. 4 is a flowchart showing an example of a processing procedure of the second embodiment of playback mode switching of the playback processing device.
  • FIG. 5 is a diagram showing an example of the configuration of an audio production system to which third and fourth embodiments of playback mode switching are applied.
  • FIG. 6 is a diagram showing a measurement flow in the measurement environment.
  • FIG. 7 is a block diagram showing an example of the configuration of a playback processing device to which the third embodiment of playback mode switching is applied.
  • FIG. 8 is a flowchart showing an example of a processing procedure of the third embodiment of playback mode switching of the playback processing device.
  • FIG. 9 is a block diagram showing an example of the configuration of a playback processing device to which the fourth embodiment of playback mode switching is applied.
  • FIG. 10 is a diagram illustrating an example of operation information that is displayed superimposed on an actual image of the console device when a user U operates the console device.
  • FIG. 11 is a flowchart showing an example of a processing procedure of the fourth embodiment of playback mode switching of the playback processing device.
  • FIG. 12 is a diagram illustrating an example of switching of the playback mode on an operation device connected to the playback processing device.
  • FIG. 13 is a block diagram showing an example of the configuration of an embodiment of a computer to which the present technology is applied.
  • FIG. 1 is a block diagram showing an example of the configuration of a playback processing device according to an embodiment to which the present technology is applied.
  • A playback processing device 11 is used for producing or editing the sound of content such as movies (hereinafter, the term "sound production" includes editing).
  • An audio playback device 12 such as headphones or speakers is connected to the playback processing device 11.
  • the playback processing device 11 has a playback signal generation unit 21, a sensing unit 22, a trigger generation unit 23, and multi-channel audio playback software 24.
  • the playback signal generating unit 21 may also have a function of generating multimedia information such as video and GUI related to the audio and supplying it to a video output device connected to the audio playback device 12.
  • video output devices include display devices such as monitors, VR goggles (HMD: Head Mounted Display), and AR goggles.
  • the sensing unit 22 detects (acquires) information indicating the state of the user for determining whether to switch the playback process in the playback signal generating unit 21.
  • Examples of information detected by the sensing unit 22 include an image of the user captured by a camera, the user's head posture detected by a head tracker, the user's line of sight detected by an eye tracker, user operations on a GUI (Graphical User Interface), and user operations on switches, buttons, etc.
  • the information (sensing information) acquired by the sensing unit 22 is supplied to the trigger generating unit 23.
  • the sensing unit 22 detects (acquires) any information necessary for determining whether to switch the playback process and indicating the state of an object related to the user as sensing information.
  • Sensing information that can be used appropriately includes biometric sensing information (head tracking, gaze direction, focal position, posture and position tracking), appearance information (person recognition, face recognition, headphone recognition using images obtained from a camera), positioning information using GPS or ultrasound, device input information (GUI information, equipment such as keyboards and controllers, headphone type information), etc.
  • the trigger generation unit 23 generates a trigger signal that instructs switching of the playback process based on the sensing information from the sensing unit 22, and supplies it to the playback signal generation unit 21. For example, when the sensing information matches a predetermined condition, the trigger generation unit 23 supplies the trigger signal to the playback signal generation unit 21.
  • the trigger generation unit 23 may be incorporated in the same hardware as the sensor that detects the sensing information, or may be incorporated in separate hardware or software that receives the sensing information, or may be included in the trigger receiving unit 31 of the playback signal generation unit 21.
  • the trigger signal corresponds to a signal that specifies the playback process to be executed by the playback signal generation unit 21 from among multiple types of playback processes that can be executed by the playback signal generation unit 21.
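  • As a way of picturing how sensing information could map to a trigger signal, here is a minimal, hypothetical sketch; the thresholds, field names, and the PlaybackMode enum are assumptions for illustration, not details from the publication.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class PlaybackMode(Enum):
    BINAURAL = auto()        # 3D audio playback processing
    MIXDOWN = auto()         # non-3D audio playback processing (2ch stereo)
    PASS_THROUGH = auto()


@dataclass
class SensingInfo:
    head_pitch_deg: float    # e.g. from a head tracker (positive = looking up)
    headphones_worn: bool    # e.g. from headphone recognition in a camera image


def generate_trigger(sensing: SensingInfo) -> Optional[PlaybackMode]:
    """Return the playback mode to switch to when a preset condition is met,
    or None when no switch is requested."""
    if sensing.head_pitch_deg < -20.0:      # looking down at the equipment at hand
        return PlaybackMode.MIXDOWN
    if sensing.head_pitch_deg > 10.0 and sensing.headphones_worn:
        return PlaybackMode.BINAURAL        # looking up while wearing headphones
    return None
```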
  • the multi-channel audio playback software 24 represents a processing unit that executes multi-channel audio playback software such as a DAW (Digital Audio Workstation), and generates (or edits) multi-channel (including 1-channel) audio signals.
  • The generated multi-channel audio signals are supplied to the playback signal generation unit 21.
  • the multi-channel audio playback software may be plug-in software that runs on a DAW, or may be separated from a DAW and run as a standalone application. In this case, the multi-channel audio signals are output from the DAW and input to the standalone application.
  • software routines other than DAWs may be used as the multi-channel audio playback software as long as they output multi-channel audio signals (such as data after rendering of object audio).
  • the playback signal generation unit 21 has a trigger receiving unit 31, a switching processing unit 32, a binaural processing unit 33, a 2ch mixdown/stereo playback processing unit 34, and a pass-through processing unit 35.
  • the trigger receiving unit 31 receives a trigger signal from the trigger generation unit 23 and determines whether or not a trigger signal has been supplied from the trigger generation unit 23.
  • the switching processing unit 32 switches between the binaural processing unit 33, the 2ch mixdown/stereo playback processing unit 34, and the pass-through processing unit 35, whichever processing unit performs playback processing on the multi-channel audio signals from the multi-channel audio playback software 24.
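  • To picture the role of the switching processing unit 32, the following is a minimal dispatch sketch; the audio is assumed to be a NumPy array of shape (channels, samples), the binaural branch is only a placeholder, and all function names are illustrative rather than taken from the publication.

```python
import numpy as np


def binaural_process(x: np.ndarray) -> np.ndarray:
    # Placeholder: a real implementation would convolve each channel with HRIRs.
    mono = x.mean(axis=0)
    return np.stack([mono, mono])


def mixdown_process(x: np.ndarray) -> np.ndarray:
    mono = x.mean(axis=0)                    # naive equal-weight 2ch mixdown
    return np.stack([mono, mono])


def apply_playback_process(mode: str, x: np.ndarray) -> np.ndarray:
    """Route the multi-channel signal to the processing selected by the trigger."""
    if mode == "binaural":
        return binaural_process(x)
    if mode == "mixdown":
        return mixdown_process(x)
    return x                                 # pass-through
```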
  • the binaural processing unit 33 performs binaural playback processing (binaural processing), which is one of the 3D sound playback methods.
  • the 3D sound playback method is an audio playback technology that reproduces input signals to both ears of a listener in an original sound field (including a virtual original sound field) such as a concert hall or movie theater, using headphones or speakers at the entrance of the listener's ear canal in a playback sound field that is different in time or space from the original sound field.
  • the binaural processing unit 33 performs filtering processing to convolve a head-related transfer function (HRTF) in order to reflect the transfer characteristics of a specific space (original sound field) in the audio signal from the multi-channel audio playback software 24 as binaural processing.
  • binaural playback is premised on playback using headphones, but the binaural processing unit 33 is not limited to binaural playback and also includes cases where any playback processing (3D audio playback processing) classified as a 3D sound playback method is performed.
  • Examples of playback processing classified as 3D sound playback methods include, in addition to binaural playback, transaural playback processing (transaural processing) that assumes playback using two speakers.
  • transaural processing includes processing to remove crosstalk, but the binaural processing unit 33 may also perform transaural processing.
  • the binaural processing unit 33 may perform processing of an appropriate 3D sound playback method depending on the type of audio playback device 12 connected to the playback processing device 11.
  • the transfer characteristics of sound that can be applied to the binaural processing (3D sound reproduction processing) of the binaural processing unit 33, or to 3D sound reproduction processing instead of binaural processing, include HRTF (Head-Related Transfer Function), HRIR (Head-Related Impulse Response), a combination of HRTF and RTF (Room transfer function), a combination of HRIR and RIR (Room Impulse Response), BRTF (Binaural Room Transfer Function), BRIR (Binaural Room Impulse Response), and any of the transfer functions or their impulse responses from the headphones to the eardrum (entrance of the ear canal), or a combination of these.
  • The 2ch mixdown/stereo playback processor 34 performs mixdown playback processing (mixdown processing) on the multi-channel audio signal from the multi-channel audio playback software 24 to generate an audio signal for 2ch stereo playback. Furthermore, if the audio signal from the multi-channel audio playback software 24 is a 1ch (monaural) audio signal, the 2ch mixdown/stereo playback processor 34 generates an audio signal for 2ch stereo playback from it. Note that in the following description, the 2ch mixdown/stereo playback processor 34 is assumed to perform only mixdown processing; the case where a monaural audio signal is supplied from the multi-channel audio playback software 24 is not considered.
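  • As one concrete form the mixdown processing could take (a conventional downmix, not one specified in the publication), the following sketch downmixes an assumed 5.1 channel layout to 2ch stereo with typical gains.

```python
import numpy as np


# Channel order assumed here: L, R, C, LFE, Ls, Rs (an assumption, not from the text).
def mixdown_5_1_to_stereo(x: np.ndarray) -> np.ndarray:
    """Downmix a (6, samples) 5.1 signal to 2ch stereo with conventional gains."""
    l, r, c, lfe, ls, rs = x
    g = 1.0 / np.sqrt(2.0)                   # -3 dB for center and surround channels
    left = l + g * c + g * ls
    right = r + g * c + g * rs
    return np.stack([left, right])           # LFE is commonly omitted from a 2ch downmix
```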
  • the pass-through processing unit 35 supplies the multi-channel audio signals from the multi-channel audio playback software 24 directly to the corresponding channels of the audio playback device 12 without performing mix-down processing or the like on the multi-channel audio signals from the multi-channel audio playback software 24. Audio signals from the multi-channel audio playback software 24 corresponding to channels that do not exist in the audio playback device 12 are not supplied from the pass-through processing unit 35 to the audio playback device 12. However, the pass-through processing unit 35 can also perform signal routing processing. In this case, the multi-channel audio signals from the multi-channel audio playback software 24 are each supplied to the specified channels of the audio playback device 12.
  • the playback signal generating unit 21 switches the playback process (acoustic process) of the multi-channel audio signal from the multi-channel audio playback software 24 between binaural processing in the binaural processing unit 33 and mixdown processing in the 2ch mixdown/stereo playback processing unit 34 based on a trigger signal from the trigger generating unit 23.
  • the playback process method executed by the playback signal generating unit 21 is referred to as the playback mode (or audio playback mode) of the playback processing device 11 or the playback signal generating unit 21, and the playback mode is switched between binaural processing and mixdown processing.
  • Switching the HRTF applied to the binaural processing in the binaural processing unit 33 or adjusting the applied HRTF (parameter) also corresponds to switching the playback mode.
  • the playback mode of the playback processing device 11 or the playback signal generating unit 21 is also simply referred to as the playback mode.
  • This technology can be applied to cases where multiple types of playback processes, including 3D sound playback processes such as binaural processing and non-3D sound playback processes such as 2ch mixdown processing and pass-through processing, are switched and executed according to the user's state, and there is no particular limit to the types and number of playback processes that can be switched and executed.
  • Fig. 2 is a diagram for explaining a first embodiment of switching of the playback mode of the playback processing device 11.
  • Fig. 2 illustrates various peripheral devices of an audio production system including the playback processing device 11, and a user U.
  • the user U is a producer who uses the playback processing device 11 to produce the audio of content such as a movie.
  • the console machine 41 is a device connected to the playback processing device 11 and inputs the user U's operations on the multi-channel audio playback software 24.
  • the console machine 41 may be, for example, an operating device such as a mixing console, a keyboard, or a mouse.
  • Monitors 42A and 42B are connected to the playback processing device 11, and work in conjunction with the multi-channel audio playback software 24 to display images such as GUI information and production screens to the user U.
  • the number of monitors 42A and 42B is not limited to two, and may be one or three or more.
  • the camera 43 is connected to the playback processing device 11, and mainly supplies images of the user U to the playback processing device 11 as images detected by the sensing unit 22.
  • the speaker 45 is connected to the playback processing device 11, and outputs the audio signal supplied from the playback processing device 11 as sound waves.
  • the headphones 44 are connected to the playback processing device 11 and are worn on the head of the user U.
  • the headphones 44 output the 2ch audio signals supplied from the playback processing device 11 as sound waves near the left and right ears (external ear inlets).
  • the speaker 45 outputs the audio signals supplied from the playback processing device 11 instead of the headphones 44 or in addition to the headphones 44.
  • a user U registers the following setting information a through d in advance regarding switching of playback modes (playback processes).
  • the setting information a sets whether or not automatic switching of the playback mode is enabled according to the setting information b to d.
  • the playback mode is switched according to the setting information b to d only when automatic switching of the playback mode is set to ON.
  • The setting information b specifies the playback mode (type of playback processing) that is initially set, and the playback mode that is set when the playback mode is not determined by c or d.
  • the setting information c is the condition (condition c) for setting the playback mode to mixdown processing (playback processing of c)
  • the setting information d is the condition (condition d) for setting the playback mode to binaural processing (playback processing of d).
  • For conditions c and d, a specific state of the user U is set that can be identified from the sensing information acquired by the sensing unit 22 of the playback processing device 11, and that is a state (including actions, operations, etc.) other than an operation performed solely for the purpose of switching the playback mode.
  • the specific position is set as the position of the gaze point that determines the condition c or d.
  • the position where the user U is gazing at may be specified based on the information of the head tracker and eye tracker acquired by the sensing unit 22, the captured image of the camera, and the like.
  • the specific direction is set as the facial direction that determines the condition c or d.
  • the facial direction of the user U may be specified based on the information of the head tracker and eye tracker acquired by the sensing unit 22, the captured image of the camera, and the like. For example, when two monitors 42A and 42B are used as shown in FIG. 2, the position of the gaze point or the facial direction of the user U on one monitor (e.g., monitor 42A) may be registered as the setting information of c in which the playback mode is set to mixdown processing.
  • the position of the gaze point or the facial direction of the user U on the other monitor may be registered as the setting information of d in which the playback mode is set to binaural processing.
  • When the user U selects a specific window from one or more windows displayed on the screen of the monitor 42A or 42B, it may be determined that condition c or d is satisfied; in this case, the specific window is set as the type of selected window that determines condition c or d.
  • the type of selected window selected by the user U can be identified from GUI information acquired by the sensing unit 22, information on user operation, and the like.
  • the direction of the user U's line of sight may be a specific direction (between multiple displays, outside the display screen, a specific direction in the room) that is registered as the setting condition of c or d, or the state of the headphones attached or detached (worn or removed) for the user U may be registered as the setting condition of c or d.
  • The setting information b to d may be preset rather than being set by the user.
  • The following modes can be adopted for switching the playback mode, for example (a minimal sketch of the first mode follows this list).
  • In a first mode, the playback process of c is set as the playback mode while the condition of c is satisfied, the playback process of d is set as the playback mode while the condition of d is satisfied, and when neither condition is satisfied, the playback mode is set to the default playback mode set in b.
  • In a second mode, once set, the playback process of c remains the playback mode until the condition of d is satisfied, and the playback process of d remains the playback mode until the condition of c is satisfied.
  • In a third mode, only one of the setting information of c and d is set. For example, when the setting information of c is set, the playback process of c is set as the playback mode while the condition of c is satisfied, and the playback process of d is set as the playback mode otherwise.
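  • The sketch below illustrates the first mode, assuming hypothetical condition callables for c and d; the function name and the gaze-based example conditions are assumptions for illustration, not details from the publication.

```python
from typing import Callable, Dict


def select_playback_mode(
    auto_switch_enabled: bool,               # setting information a
    default_mode: str,                       # setting information b
    condition_c: Callable[[Dict], bool],     # condition c -> mixdown processing
    condition_d: Callable[[Dict], bool],     # condition d -> binaural processing
    sensing: Dict,
    current_mode: str,
) -> str:
    """First mode: apply c or d while its condition holds, otherwise fall back to b."""
    if not auto_switch_enabled:
        return current_mode
    if condition_c(sensing):
        return "mixdown"
    if condition_d(sensing):
        return "binaural"
    return default_mode


# Example: condition c = gazing at monitor 42A, condition d = gazing at monitor 42B.
mode = select_playback_mode(
    auto_switch_enabled=True,
    default_mode="mixdown",
    condition_c=lambda s: s.get("gazed_monitor") == "monitor_A",
    condition_d=lambda s: s.get("gazed_monitor") == "monitor_B",
    sensing={"gazed_monitor": "monitor_B"},
    current_mode="mixdown",
)
```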
  • Since multiple types of playback processes related to audio playback, including 3D audio playback, are automatically switched based on preset conditions, the user does not need to manually switch between playback processes. For example, in audio production work it is easier for the producer to work while listening to audio that has not been subjected to 3D audio playback processing, whereas when checking the created audio it is easier to judge its quality by listening to audio that has been subjected to 3D audio playback processing, which reproduces the audio heard in the actual playback environment (original sound field). The producer often repeats the production work and the confirmation work many times, and manually switching the playback process (playback mode) each time is troublesome and inefficient.
  • In the present embodiment, the playback mode is automatically switched according to preset conditions. For example, if the state of the producer when performing audio production work is a state of facing down toward his or her hands (gazing at the production/editing equipment at hand), the playback mode is set to mixdown processing under the condition of that state. If the state of the producer when checking the production result is a state of looking up (not gazing at the production/editing equipment at hand), the playback mode is set to binaural processing under the condition of that state.
  • The playback mode may also be switched according to other states, such as the user's line of sight or the position of the gaze point, rather than the facial direction of the user U (the direction the front of the head faces).
  • To determine which range a facial direction or gaze position falls into, the orientations and positions that form the boundaries of each range are determined in advance, and the orientation or position to be judged is compared with those boundary orientations and positions (a minimal sketch of such a range test follows).
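  • A minimal sketch of such a range test; the boundary values and monitor names below are purely illustrative assumptions.

```python
def classify_head_yaw(yaw_deg: float) -> str:
    """Compare a head yaw angle against pre-registered range boundaries."""
    boundaries = {"monitor_A": (-60.0, -5.0), "monitor_B": (5.0, 60.0)}  # assumed values
    for name, (low, high) in boundaries.items():
        if low <= yaw_deg <= high:
            return name
    return "none"
```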
  • FIG. 3 is a diagram for explaining a second embodiment of the switching of the playback mode of the playback processing device 11.
  • the switching of the image presented to the user is performed in conjunction with the switching of the audio playback mode.
  • the user U is a producer at a place where the work of sound production is performed (production work place) and a listener who listens to the sound of the playback sound field.
  • the user U wears VR goggles (HMD) 51 with headphones.
  • the VR goggles 51 are connected to the playback processing device 11, and the playback processing device 11 supplies the VR goggles 51 with a 2ch audio signal to be played by the headphones of the VR goggles 51 and a video signal to be displayed on the VR goggles 51.
  • the video signal supplied from the playback processing device 11 to the VR goggles 51 is switched between a video signal of a CG (Computer Graphics) image generated by the playback processing device 11 as shown in A of FIG. 3 and a video signal of a real-life image captured by a camera (VR outside world camera) as shown in B of FIG. 3.
  • A in FIG. 3 is a CG image of a virtual space (CG space) that reproduces, with CG, the space of the original sound field of the sound produced using the multi-channel audio playback software 24, for example a CG image reproducing a movie theater screen from the viewpoint of a listener in a specified seat.
  • The image of A in FIG. 3 need not be a CG image; it may be a live-action image obtained by photographing the space of the original sound field.
  • B in FIG. 3 is a live-action image captured by the VR external camera of the VR goggles 51 in the front direction (or line of sight) of the head of the user U who produces sound with the playback processing device 11.
  • the live-action image shows, for example, the production equipment of the sound production system (peripheral equipment connected to the playback processing device 11, etc.) that is placed in the production work area, such as the console machine 41 and monitors 42A and 42B shown in FIG. 2.
  • B in FIG. 3 may not be a live-action image of the production work area, but may be a CG image that imitates the production work area.
  • the production work location simulated as a CG image is not limited to an actual production work location, but may be a virtual production work location, and when a virtual production work location is used, the production equipment (monitors, input devices, etc.) used in the audio production system may also be virtual equipment (equipment that does not actually exist).
  • the images displayed on these VR goggles 51 are automatically switched in conjunction with the switching of the audio playback mode.
  • The playback processing device 11 switches to a different playback mode when the user U looks up and when the user U looks toward his or her hands (down). Specifically, when the user U looks up, the playback mode is set to binaural processing, and when the user U looks toward his or her hands, the playback mode is set to mixdown processing. Here, looking up or toward the hands refers to the face (front of the head) or line of sight (point of gaze) of the user U being directed up or toward the hands (down), and this is determined based on the sensing information acquired by the sensing unit 22.
  • In conjunction with this, the image displayed on the VR goggles 51 is switched between images A and B in Fig. 3: when the playback mode is binaural processing, a CG image like that of Fig. 3A, which reproduces the original sound field space using CG, is displayed, and when the playback mode is mixdown processing, a live-action image like that of Fig. 3B, which shows the user U's hands (surroundings) at the production work location, is displayed.
  • multiple types of playback processes related to audio playback are automatically switched based on preset conditions, eliminating the need for the user to manually switch between playback processes.
  • When the user U looks down to perform audio production work or other tasks in the area where the peripheral devices are present, the user U can view the live-action video displayed on the VR goggles 51 and can easily operate the peripheral devices through that video.
  • the user U can listen to the mixdown-processed audio that has not been subjected to 3D audio playback processing through headphones, and can perform audio production work while listening to audio suitable for audio production.
  • When the user U looks up to check the production results or the like, the user U can view the CG video of the original sound field space displayed on the VR goggles 51 and can visually recognize that space.
  • At the same time, the user U can listen through headphones to the binaurally processed audio, that is, audio subjected to 3D audio playback processing.
  • The user U can thus check the created audio as it would be heard when played back in the original sound field environment. Therefore, the user U can properly judge the quality of the created audio while listening to it as it would sound in the original sound field and while visually grasping the space of the original sound field through the CG image.
  • the CG image of the space seen by the listener in the movie theater is displayed as the CG image of A in FIG. 3, which is displayed on the VR goggles 51 and presented to user U. This allows user U to visually grasp the spatial state of the original sound field, such as the listening position of the listener in the movie theater and the arrangement of the screen and speakers relative to the listening position.
  • User U can then listen to the sound output from the movie theater speakers and heard by the listener as the sound of the playback sound field that has been binaurally processed. Thus, user U can confirm whether the sound produced using the multi-channel sound playback software 24 is appropriate when listened to at the listening position in the movie theater recognized by the CG image. If the confirmed audio is not appropriate, the user U repeats the audio production (editing) work using the multi-channel audio playback software 24 until it is appropriate.
  • the audio and video viewed by the user U are automatically switched between mixdown processed audio and video (video from the production work site) suitable for the audio production work, and binaural processed audio and video (video from the movie theater) suitable for confirming the production results, thus dramatically improving work efficiency.
  • the video presented to user U can be automatically switched in conjunction with the automatic switching of the playback mode in the playback processing device 11, which reduces the effort required for user U to manually switch between audio production work and checking the production results, significantly improving work efficiency.
  • In step S1, the sensing unit 22 acquires sensing information indicating the state of the user.
  • In step S2, when the trigger generation unit 23 detects, based on the sensing information acquired in step S1 and the predetermined setting information, that a condition for switching from one of binaural processing and mixdown processing to the other is satisfied, it supplies a trigger signal indicating that to the playback signal generation unit 21.
  • When the trigger reception unit 31 of the playback signal generation unit 21 receives the trigger signal, it notifies the switching processing unit 32 of the switch to the playback processing indicated by the trigger signal.
  • The switching processing unit 32 determines the playback mode to be set based on the information from the trigger reception unit 31 and the current playback mode (playback processing).
  • In step S3, the switching processing unit 32 judges whether the playback mode to be set is binaural processing. If the judgment in step S3 is affirmative, the process proceeds to step S4; if negative, the process proceeds to step S7.
  • In step S4, the switching processing unit 32 enables binaural processing in the binaural processing unit 33.
  • The binaural processing unit 33 performs binaural processing on the multi-channel audio signal supplied from the multi-channel audio playback software 24.
  • In step S5, the binaural processing unit 33 generates an audio signal (2-channel audio signal) for playback on the audio playback device 12 based on the binaurally processed audio signal, and outputs it to the audio playback device 12.
  • In step S6, the playback signal generation unit 21 generates a CG image of the space of the original sound field, and outputs the video signal of the generated CG image to a display device, such as VR goggles, that the user views.
  • After step S6, the process of this flowchart ends. Note that the process of this flowchart is executed repeatedly.
  • In step S7, to which the process proceeds when the judgment in step S3 is negative, the switching processing unit 32 enables mixdown processing in the 2ch mixdown/stereo playback processing unit 34.
  • The 2ch mixdown/stereo playback processing unit 34 performs mixdown processing on the multi-channel audio signal supplied from the multi-channel audio playback software 24.
  • In step S8, the 2ch mixdown/stereo playback processing unit 34 outputs the 2ch audio signal after mixdown processing to the audio playback device 12 as the audio signal for playback on the audio playback device 12.
  • In step S9, the playback signal generating unit 21 acquires live-action video of the production work location (the space at hand) from the sensing unit 22 (camera) and outputs the video signal of the live-action video to a display device, such as VR goggles, that the user views (a hypothetical end-to-end sketch of steps S1 to S9 is given below).
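  • The following is a hypothetical, self-contained sketch of steps S1 to S9 for one block of audio; the pitch threshold, signal shapes, and helper names are assumptions, and the binaural step is reduced to a naive per-channel HRIR convolution rather than the publication's actual processing.

```python
import numpy as np


def binaural_process(x: np.ndarray, hrir_l: np.ndarray, hrir_r: np.ndarray) -> np.ndarray:
    """Convolve each input channel with its left/right HRIR and sum to 2ch."""
    n = x.shape[1]
    left = sum(np.convolve(ch, h)[:n] for ch, h in zip(x, hrir_l))
    right = sum(np.convolve(ch, h)[:n] for ch, h in zip(x, hrir_r))
    return np.stack([left, right])


def mixdown_process(x: np.ndarray) -> np.ndarray:
    """Equal-weight 2ch mixdown of a (channels, samples) signal."""
    mono = x.mean(axis=0)
    return np.stack([mono, mono])


def playback_cycle(sensing: dict, x: np.ndarray, hrir_l: np.ndarray, hrir_r: np.ndarray):
    """One pass of steps S1 to S9 for a block of multi-channel audio."""
    # S1-S2: derive the requested playback mode from the sensing information
    looking_up = sensing.get("head_pitch_deg", 0.0) > 10.0
    if looking_up:                                       # S3: binaural branch
        audio = binaural_process(x, hrir_l, hrir_r)      # S4-S5
        video = "cg_image_of_original_sound_field"       # S6
    else:                                                # S7: mixdown branch
        audio = mixdown_process(x)                       # S7-S8
        video = "live_action_image_of_work_space"        # S9
    return audio, video
```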
  • In the third and fourth embodiments of playback mode switching of the playback processing device 11, not only is the playback mode switched between binaural processing and mixdown processing based on sensing information indicating the state of the user, but the HRTF applied to the binaural processing is also switched or adjusted.
  • In the third and fourth embodiments, switching the playback mode to mixdown processing is not necessarily required. Therefore, in the explanation of the third and fourth embodiments, it is assumed that the playback processing device 11 only switches or adjusts the HRTF applied to the binaural processing as the playback mode switching.
  • The processing of the first or second embodiment may be combined so that the playback mode is also switched to mixdown processing.
  • Fig. 5 is a diagram showing an example of the configuration of an audio production system to which the third and fourth embodiments of playback mode switching are applied.
  • the measurement environment represents the measurement environment when the HRTF applied to the binaural processing of the playback processing device 11 is actually measured in advance.
  • the transfer characteristic of audio such as BRTF can be applied to the binaural processing instead of the HRTF, and in that case, the transfer characteristic applied to the binaural processing instead of the HRTF may be measured in the measurement environment.
  • the measurement environment is exemplified by a movie theater as an original sound field.
  • a movie theater as an original sound field is also called a dubbing stage, and is the space of the original sound field that is reproduced as the sound of the playback sound field in sound production.
  • the playback environment represents the playback environment in which the sound of the original sound field is reproduced as the sound of the playback sound field in the sound production location used in sound production.
  • the sound production location is a location different from the original sound field, such as a studio or the producer's home, but it may be the same location as the original sound field.
  • the measurement processing device 81 shown in the measurement environment acquires an HRTF corresponding to the acoustic characteristics of the original sound field, such as a movie theater, and generates a BRTF file (described later).
  • the measurement processing device 81 also acquires condition information indicating the conditions when the HRTF was measured, and stores the condition information in the BRTF file together with the HRTF.
  • the playback processing device 11 in the playback environment corresponds to the playback processing device 11 in FIG. 1, and the headphones 44 are one form of the audio playback device 12 in FIG. 1 connected to the playback processing device 11.
  • the headphones 44 may be headphones attached to the VR goggles 51 in FIG. 3, or may be other audio playback devices.
  • the playback processing device 11 acquires the BRTF file generated by the measurement processing device 81, and sets parameters to be used in binaural processing based on the data in the BRTF file.
  • the playback processing device 11 may be able to acquire the BRTF file via a network such as the Internet, or may be able to acquire the BRTF file by using a recording medium such as a flash memory.
  • Figure 6 shows the measurement flow in the measurement environment.
  • HRTF measurements are taken with the subject sitting in a designated seat in the cinema, with a microphone attached to their ear.
  • playback sound is output from the cinema's speaker 91, and the HRTF from the speaker 91 to the ear (e.g., ear canal position, eardrum position) is measured.
  • spatial shape data indicating the shape of the movie theater is acquired as condition information.
  • the width, height, and depth of the theater are recorded as spatial shape data as the smallest elements that indicate the shape of the theater.
  • information indicating more detailed shapes such as vertex information or point clouds, may also be recorded as spatial shape data.
  • position information of speaker 91 which is the measurement sound source (original sound source) used in measuring HRTF, is acquired as condition information. For example, coordinates indicating the position of speaker 91 in the movie theater and the position on the spatial shape data of the theater that corresponds to the origin of those coordinates are recorded as position information of speaker 91.
  • measurement position information indicating the subject's position (measurement position) when the HRTF is measured and measurement posture information indicating the posture (measurement posture) are acquired as condition information.
  • For example, coordinates indicating the subject's position in the movie theater, and the position on the spatial shape data of the theater that corresponds to the origin of those coordinates, are recorded as measurement position information.
  • the Euler angles of the subject's head are recorded as measurement posture information.
  • the measurement processing device 81 stores the HRTF and condition information measured in the above manner in a BRTF file.
  • the BRTF file stores group data consisting of the same type of data for each combination of positions A to C and postures 1 to 3, for example.
  • the group data for each combination includes spatial shape data, position information of the measurement sound source (original sound source), measurement position information, measurement posture information, transfer characteristic data from the headphones 44 to the ears, and HRTF measurement data measured with the subject sitting in a seat at each position in each measurement posture.
  • the spatial shape data, position information of the measurement sound source, and transfer characteristic data from the headphones 44 to the ears are common regardless of the combination of positions A to C and postures 1 to 3, so they may be stored in the BRTF file as data outside the group data.
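  • The BRTF file layout described above could be pictured with a data structure along the following lines; every field name here is an assumption made for illustration, not an actual file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class GroupData:
    measurement_position: Tuple[float, float, float]    # e.g. seat position A, B, or C
    measurement_posture: Tuple[float, float, float]     # head Euler angles (posture 1-3)
    hrtf_coefficients: List[List[float]]                # measured HRTF data per speaker/ear


@dataclass
class BRTFFile:
    spatial_shape: Dict[str, float]                     # e.g. width/height/depth of the theater
    source_positions: List[Tuple[float, float, float]]  # positions of the measurement speakers
    headphone_transfer: List[float]                     # headphones-to-ear transfer characteristic
    groups: List[GroupData] = field(default_factory=list)
```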
  • <Third embodiment of playback mode switching of the playback processing device 11> In the third embodiment of playback mode switching of the playback processing device 11, the HRTF applied to binaural processing is switched, and the audio playback mode is thereby switched, based on sensing information indicating the state of the user U.
  • In the third embodiment, the images presented to the user U through the VR goggles 51 are only CG images. Note that live-action images (such as images of peripheral devices at the production work site) may also be displayed depending on the state of the user U.
  • FIG. 7 is a block diagram showing an example configuration of a playback processing device 11 to which a third embodiment of playback mode switching is applied.
  • FIG. 7 shows blocks that are not shown in the block diagram of the playback processing device 11 in FIG. 1, but some of the blocks in FIG. 7 are subdivisions of the blocks shown in FIG. 1, and some of the blocks shown in FIG. 1 are omitted in FIG. 7.
  • the playback processing device 11 has a BRTF file acquisition unit 101, an audio control unit 102, and a display control unit 103. Note that the audio control unit 102 and the display control unit 103 are, with some exceptions, included in the playback signal generation unit 21 in FIG. 1.
  • The BRTF file acquisition unit 101 acquires a BRTF file generated by the measurement processing device 81 of FIG. 5.
  • the acquired BRTF file is preferably a file that stores measurement data measured using a producer (user U) who produces audio using the playback processing device 11 as the subject, but is not limited to this.
  • the BRTF file includes coefficient data, spatial information, and measurement posture information.
  • the coefficient data corresponds to the HRTF measurement data.
  • Binaural processing can be performed by convolution processing using an FIR (Finite Impulse Response) filter. At that time, the coefficient of the FIR filter is set based on the characteristics of the HRTF to be applied to the binaural processing.
  • the coefficient data of the BRTF file represents the HRTF measurement data as coefficient data of the FIR filter, and the process of calculating the coefficient of the FIR filter from the HRTF measurement data may be performed by the audio control unit 102 or the like after reading the HRTF measurement data from the BRTF file.
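  • As an illustration of how HRTF measurement data might be turned into FIR filter coefficients, the following sketch takes an inverse FFT of a measured spectrum and windows the result; the function name, tap count, and Hann fade-out are assumptions, not the publication's procedure.

```python
import numpy as np


def hrtf_to_fir(hrtf_spectrum: np.ndarray, n_taps: int = 512) -> np.ndarray:
    """Turn a single-ear HRTF (complex one-sided spectrum) into FIR coefficients:
    inverse FFT to an impulse response (HRIR), then truncate with a fade-out window."""
    hrir = np.fft.irfft(hrtf_spectrum)
    taps = hrir[: min(n_taps, len(hrir))]
    fade = np.hanning(2 * len(taps))[len(taps):]   # decaying half of a Hann window
    return taps * fade
```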
  • the spatial information includes spatial shape data, position information of the measurement sound source (original sound source), and measurement position information.
  • the coefficient data, spatial information, and measurement posture information each include a plurality of data measured at a plurality of measurement positions (positions A to C in FIG. 6) and a plurality of measurement postures (postures 1 to 3 in FIG. 6) that are linked (associated) with the measurement positions and measurement postures.
  • the reproduction processing device 11 acquires FIR filter coefficient data from the BRTF file as measurement data of HRTF that specifies the content of binaural processing, but the measurement data acquired to specify the content of binaural processing does not have to be FIR filter coefficient data. Since the content of binaural processing can be specified by acoustic characteristics (transfer characteristics) such as HRTF in the original sound field, the reproduction processing device 11 may acquire information on the acoustic characteristics in the original sound field.
  • the reproduction processing device 11 may theoretically calculate information on the acoustic characteristics in the original sound field based on the spatial shape, the position of the original sound source, the measurement position (listening position), etc., instead of acquiring information on the acoustic characteristics in the original sound field obtained by actual measurement from the BRTF file.
  • the audio control unit 102 includes the binaural processing unit 33 of FIG. 1.
  • the audio control unit 102 has a coefficient reading unit 111, a convolution processing unit 112, and an audio playback processing unit 113.
  • the coefficient reading unit 111 obtains information (playback posture information) on the posture (playback posture) of the user U, who is the producer, at the time of audio playback (current time) from the playback posture information acquisition unit 126, and reads coefficient data (HRTF measurement data) corresponding to the playback posture of the user U from the BRTF file.
  • the coefficient data corresponding to the playback posture is coefficient data corresponding to the HRTF measured in a measurement posture close to the playback posture.
  • the coefficient data obtained by the coefficient reading unit 111 is, for example, coefficient data measured at a measurement position specified in advance by the user U.
  • The BRTF file may include only data measured at one measurement position. In that case, the coefficient reading unit 111 reads, from among the coefficient data corresponding to the HRTFs measured at that measurement position, the coefficient data measured in the measurement posture closest to the playback posture (a lookup sketch follows).
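  • A minimal sketch of such a nearest-posture lookup; the function name and the use of Euclidean distance over Euler angles are assumptions for illustration.

```python
import numpy as np


def select_coefficients(playback_posture, groups):
    """groups: iterable of (measurement_posture, coefficient_data) pairs; return the
    coefficient data whose measurement posture is closest to the playback posture."""
    target = np.asarray(playback_posture, dtype=float)
    best = min(groups, key=lambda g: np.linalg.norm(np.asarray(g[0], dtype=float) - target))
    return best[1]
```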
  • the convolution processing unit 112 sets the coefficients of the FIR filter based on the coefficient data read by the coefficient reading unit 111.
  • the convolution processing unit 112 performs convolution processing using an FIR filter on the audio signal supplied from the multi-channel audio playback software 24 in FIG. 1. This performs binaural processing on the audio signal supplied from the multi-channel audio playback software 24, applying an HRTF according to the posture of the user U.
  • the audio playback processing unit 113 outputs the audio signal binaurally processed by the convolution processing unit 112 to the audio playback device 12 in FIG. 1.
  • the display control unit 103 generates CG images to be displayed on a display device such as VR goggles.
  • the display control unit 103 has a spatial information reading unit 121, a CG model acquisition unit 122, a CG data storage unit 123, a measurement position information reading unit 124, a CG space drawing unit 125, a playback attitude information acquisition unit 126, and an image drawing processing unit 127.
  • the spatial information reading unit 121 reads the spatial shape data contained in the spatial information and the position information of the measurement sound source (original sound source) from the BRTF file acquired by the BRTF file acquisition unit 101.
  • the CG model acquisition unit 122 acquires material data of a 3D model corresponding to objects (walls, floors, ceilings, speakers, screens, seats, etc.) present in the original sound field from the CG data storage unit 123 based on the spatial shape data and the positional information of the measured sound source (original sound source) from the spatial information reading unit 121, and generates a CG model that mimics the space of the original sound field in a virtual space (CG space).
  • the measurement position information reading unit 124 reads the measurement position information included in the spatial information from the BRTF file acquired by the BRTF file acquisition unit 101.
  • the CG space drawing unit 125 renders the CG space generated by the CG model acquisition unit 122 to generate a 2D CG image.
  • the position of the virtual camera (viewpoint) during rendering is set to a position in the CG space corresponding to the measurement position in the original sound field based on the measurement position information acquired by the measurement position information reading unit 124.
  • the attitude of the virtual camera (viewpoint) during rendering is set to a posture corresponding to the posture of the user U at the production work site based on the playback posture of the user U at the current time acquired by the playback posture information acquisition unit 126.
  • measurement position information specified by the user U in advance is read by the measurement position information reading unit 124, and the virtual camera during rendering is set to a position in the CG space corresponding to the measurement position information.
  • the measurement position information referenced as the position of the virtual camera in the CG space is the same as the measurement position information associated with the coefficient data acquired by the coefficient reading unit 111.
  • the CG space rendering unit 125 generates, by rendering, a 2D CG image captured by a virtual camera set in the CG space.
  • the playback posture information acquisition unit 126 acquires playback posture information of the user U at the time of audio playback by the playback processing device 11 (current time) based on the sensing information of the sensing unit 22 in FIG. 1.
  • the playback posture information of the user U is, for example, the posture of the user U's head.
  • the posture of the user U's head can be recognized from head tracker information acquired by the sensing unit 22.
  • the head tracker can detect the posture of the user U's head using an IMU (Inertial Measurement Unit) installed in the VR goggles.
  • When the sensing unit 22 acquires an image captured by a camera that captures the user U, the posture of the user U's head may be detected from that captured image.
  • the video rendering processing unit 127 generates a video signal for displaying the 2D CG video generated by the CG space rendering unit 125 on a display device such as VR goggles connected to the playback processing device 11, and outputs the signal to the display device.
  • With the above configuration, the user U, who is a producer performing audio production at the audio production work site, can listen, through binaurally processed audio, to the sound that would be heard when the audio produced by the multi-channel audio playback software 24 is played in the original sound field.
  • When the user U changes the playback posture, the HRTF is changed according to that posture, and the user U hears the sound that a listener in the original sound field would hear if he or she changed posture in the same way.
  • The user U is also presented with a CG image viewed from the listening position in the original sound field, and when the user U changes the playback posture, the space of the original sound field that the listener would see after the same change of posture is presented as a CG image. Therefore, the user U can perform audio production work and check the production results while experiencing realistic audio and CG images.
  • Since the playback process (playback mode) is switched automatically, the effort required for the user U to switch playback modes is reduced, and work efficiency is significantly improved.
  • Fig. 8 is a flow chart showing an example of a processing procedure of the third embodiment of the playback mode switching of the playback processing device 11.
  • In step S11, the BRTF file acquisition unit 101 acquires the BRTF file generated by the measurement processing device 81 of Fig. 5.
  • In step S12, the spatial information reading unit 121 reads the spatial shape data included in the spatial information and the position information of the measurement sound source (original sound source) from the BRTF file acquired in step S11.
  • In step S13, the measurement position information reading unit 124 reads the measurement position information from the BRTF file acquired in step S11.
  • In step S14, the playback posture information acquisition unit 126 acquires the playback posture information of the user U at the current time.
  • In step S15, the CG model acquisition unit 122 acquires material data of a 3D model corresponding to the objects (walls, floors, ceilings, speakers, screens, seats, etc.) that exist in the original sound field from the CG data storage unit 123 based on the spatial information (spatial shape data and position information of the measured sound source) read in step S12, and generates a CG model that imitates the space of the original sound field in a virtual space (CG space).
  • In step S16, the CG space rendering unit 125 sets the position and attitude of the virtual camera for rendering the CG space based on the measurement position information read in step S13 and the playback posture information acquired in step S14, and generates a 2D CG image by rendering.
  • In step S17, the image rendering processing unit 127 generates an image signal for displaying the CG image generated in step S16 on a display device connected to the playback processing device 11, and outputs it to the display device.
  • In step S18, the coefficient reading unit 111 reads coefficient data corresponding to the playback posture of the user U from the BRTF file acquired in step S11, based on the playback posture information acquired in step S14.
  • In step S19, the convolution processing unit 112 sets the coefficients of an FIR filter based on the coefficient data read in step S18, and performs convolution processing (binaural processing) using the FIR filter on the audio signal supplied from the multi-channel audio playback software 24 in FIG. 1.
  • In step S20, the audio playback processing unit 113 outputs the audio signal binaurally processed by the convolution processing unit 112 to the audio playback device 12 in FIG. 1.
  • In the third embodiment, only switching or adjustment of the HRTF applied to binaural processing is performed as the playback mode switching, but the playback mode switching may also include switching to mixdown processing.
  • In that case, when mixdown processing is selected, live-action footage or CG footage of the production work site may be displayed on the display device.
  • <Fourth embodiment of playback mode switching of the playback processing device 11> In the fourth embodiment of playback mode switching of the playback processing device 11, the HRTF (or BRTF) applied to binaural processing is adjusted, and the audio playback mode is thereby switched, based on sensing information indicating the state of the user U. Also, in the fourth embodiment, the image presented to the user U through the VR goggles 51 is switched between CG images and real-life images in conjunction with the switching of the playback mode, as in the second embodiment.
  • FIG 9 is a block diagram showing an example configuration of a playback processing device 11 to which the fourth embodiment of the playback mode switching is applied.
  • the playback processing device 11 in Figure 9 is common to the playback processing device 11 in Figure 7 in that it has a BRTF file acquisition unit 101, an audio control unit 102, and a display control unit 103.
  • the audio control unit 102 in Figure 9 has a coefficient reading unit 111, a convolution processing unit 112, an audio playback processing unit 113, a reverberation amount adjustment setting value reading unit 141, and a reverberation amount adjustment processing unit 142.
  • the display control unit 103 in FIG. 9 also includes a space information reading unit 121, a CG model acquisition unit 122, a CG data storage unit 123, a measurement position information reading unit 124, a playback attitude information acquisition unit 126, a video rendering processing unit 127, a CG space rendering/video output switching unit 131, a hand space video acquisition unit 132, and a user operation information acquisition unit 133.
  • the audio control unit 102 in Fig. 9 is common to the audio control unit 102 in Fig. 7 in that it has a coefficient reading unit 111, a convolution processing unit 112, and an audio playback processing unit 113.
  • the audio control unit 102 in Fig. 9 differs from the audio control unit 102 in Fig. 7 in that it newly has a reverberation amount adjustment setting value reading unit 141 and a reverberation amount adjustment processing unit 142.
  • The display control unit 103 in Fig. 9 is common to the display control unit 103 in Fig. 7 in that it has a spatial information reading unit 121, a CG model acquisition unit 122, a CG data storage unit 123, a measurement position information reading unit 124, a playback posture information acquisition unit 126, and a video rendering processing unit 127.
  • the display control unit 103 in FIG. 9 differs from the display control unit 103 in FIG. 7 in that it has a CG space rendering/video output switching unit 131 instead of the CG space rendering unit 125 in FIG. 7, and in that it newly has a hand space video acquisition unit 132 and a user operation information acquisition unit 133.
  • the reverberation adjustment setting value reading unit 141 obtains playback posture information of the user U, who is the producer at the time of audio playback (current time), from the playback posture information acquisition unit 126, and reads the reverberation adjustment setting value corresponding to the playback posture of the user U.
  • the reverberation adjustment setting value is a value that adjusts the RTF (room transfer function) or RIR (room impulse response) in binaural processing.
  • the reverberation adjustment setting value can be coefficient data (that generates reverberation) according to the RTF that is added to the coefficient data acquired from the BRTF file.
  • the reverberation adjustment setting value is set to a value that is predetermined according to the playback posture of the user U.
  • a first playback mode in which the sound heard in the original sound field environment is reproduced by binaural processing, and a second playback mode in which the sound heard in the environment of the sound production work place is reproduced by binaural processing are switched depending on the playback posture of the user U.
  • In the first playback mode, a setting value is used as the reverberation adjustment setting value such that coefficient data that generates a large reverberation is added to the coefficient data acquired from the BRTF file.
  • In the second playback mode, a setting value is used such that coefficient data that generates a small reverberation is added.
  • a reverberation adjustment setting value according to the playback posture of the user U may be used.
  • a mixdown process may be performed instead of binaural processing.
  • the reverberation adjustment setting value can be a value (gain) that adjusts the magnitude of coefficient data acquired from the BRTF file that is greatly influenced by RTF (has a large influence on reverberation).
  • the reverberation adjustment setting value is set to a value that is predetermined according to the playback posture of the user U. For example, as in the above, the first playback mode and the second playback mode are switched according to the playback posture of the user U.
  • In the first playback mode, a gain that generates a large reverberation is used as the reverberation adjustment setting value for the coefficient data acquired from the BRTF file.
  • In the second playback mode, a gain that generates a small reverberation is used.
  • a reverberation adjustment setting value according to the playback posture of the user U may be used.
  • a mixdown process may be performed instead of a binaural process.
  • the reverberation amount adjustment processing unit 142 adjusts the coefficient data based on the coefficient data acquired from the BRTF file by the coefficient reading unit 111 and the reverberation adjustment setting value read by the reverberation amount adjustment setting value reading unit 141, thereby adjusting the coefficient data so that reverberation characteristics according to the reverberation adjustment setting value are added, and generates coefficients of an FIR filter that take into account the reverberation characteristics according to the reverberation adjustment setting value (playback posture).
  • the generated coefficients are set as coefficients of the FIR filter in the convolution processing unit 112, and binaural processing is performed.
  • the coefficient data acquired by the coefficient reading unit 111 from the BRTF file may be coefficient data associated with a measurement posture corresponding to the playback posture of the user U, as in the third embodiment, or may be coefficient data associated with fixed measurement posture information regardless of the playback posture of the user U.
  • the reverberation adjustment setting value may be a value corresponding to the line of sight of the user U, rather than a value corresponding to the playback posture of the user U.
  • the first playback mode may be selected when the user U faces a direction other than toward the hands (down), and the second playback mode may be selected when the user U faces toward the hands (down).
  • When the user U faces toward the hands (down), the reverberation adjustment setting value may be a value that generates almost no reverberation so that sound production work is easy, and when the user U faces up, a value that generates the reverberation generated in the original sound field so that the production results can be properly confirmed.
  • the case where the user U faces up or toward the hands is not limited to the case where the front of the user U's head faces up or toward the hands, but may be the case where the line of sight of the user U faces up or toward the hands.
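As an illustration of how a reverberation adjustment setting value of the kind described above could be combined with the coefficient data read from the BRTF file, the following is a minimal Python/NumPy sketch. It is not the patent's implementation: the array layout, the helper names build_fir_coefficients and binaural_filter, and the idea of representing the setting value as a simple gain on an added reverberation tail are assumptions made only for illustration.

```python
import numpy as np

def build_fir_coefficients(brir: np.ndarray, reverb_tail: np.ndarray, reverb_gain: float) -> np.ndarray:
    """Add reverberation-generating coefficient data to the BRIR read from the BRTF file."""
    length = max(len(brir), len(reverb_tail))
    fir = np.zeros(length)
    fir[:len(brir)] += brir
    fir[:len(reverb_tail)] += reverb_gain * reverb_tail   # setting value scales the added reverberation
    return fir

def binaural_filter(audio: np.ndarray, fir: np.ndarray) -> np.ndarray:
    """Convolve one channel of audio with the FIR filter (binaural processing for one ear)."""
    return np.convolve(audio, fir)[:len(audio)]
```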
  • the CG space rendering/video output switching unit 131 like the CG space rendering unit 125 of the third embodiment, sets the position and orientation of the virtual camera based on the measurement position information acquired by the measurement position information reading unit 124 and the playback orientation information acquired by the playback orientation information acquisition unit 126, renders the CG space generated by the CG model acquisition unit 122, and generates a 2D CG image.
  • the generated CG image is supplied to the video rendering processing unit 127 (in the case of the first playback mode described above).
  • the CG space rendering/video output switching unit 131 when the CG space rendering/video output switching unit 131 detects that the user U has turned toward his/her hands based on the playback orientation information acquired by the playback orientation information acquisition unit 126, the CG space rendering/video output switching unit 131 switches from generating CG images to acquiring a live-action image of the user U's hand space acquired by the hand space image acquisition unit 132, and supplies it to the video rendering processing unit 127.
  • the hand space image acquisition unit 132 can acquire, for example, an image captured by an external camera installed in VR goggles or the like as a live-action image of the hand space.
  • the image rendering processing unit 127 generates an image signal for displaying the CG image or live-action image from the CG space rendering/image output switching unit 131 on a display device connected to the playback processing device 11, and outputs the signal to the display device.
  • When the user U operates an operation device such as the console device 41, the user operation information acquisition unit 133 acquires the operation content as user operation information.
  • When the CG space rendering/video output switching unit 131 supplies live-action video to the video rendering processing unit 127 for display on the display device, it superimposes information indicating the user's operation content (operation information) on the live-action video based on the user operation information from the user operation information acquisition unit 133.
  • As a result, the operation information is presented on the display device superimposed on the live-action video.
  • FIG. 10 is a diagram illustrating an example of operation information displayed superimposed on the live-action image of the console device 41 when the user U operates the console device 41.
  • In FIG. 10, the console device 41 is shown as a live-action image.
  • Operation information 161 is displayed superimposed on this.
  • the operation information 161 includes an enlarged view (CG image) of the operated part of the console device 41 and information on the numerical value (edited value) changed by the operation.
  • the operation information as shown in FIG. 10 may be superimposed on the CG image and presented to the user U.
  • the operation information may be only the numerical value changed by the operation, and the operation information is not limited to the form shown in FIG. 10.
  • the operation information may be displayed superimposed on the live-action image of the operation device.
  • Alternatively, a CG image may be used instead of a live-action image, with the operation information displayed superimposed on a CG image of the console device 41; in that case, instead of the console device 41, a flat surface, a box, a panel with unevenness, or the like that imitates the console device 41 can be placed at the user U's hands.
  • Furthermore, the sensation of operating the faders, knobs, etc. of the console device 41 can be reproduced by haptic reproduction technology. In this case, even though the console device 41 is not actually at the user U's hands, it can be made to appear to the user U as if it were.
  • the user U can enjoy the sensation of being in a studio or a workroom where sound production work is carried out, even if the sound production work place is not a studio or a workroom where sound equipment such as the actual console 41 is located.
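A rough sketch of superimposing operation information on a live-action frame is shown below, assuming OpenCV is available and that the frame is a NumPy image array; the function name, the panel layout, and the label/value strings are illustrative assumptions rather than part of the described device.

```python
import cv2
import numpy as np

def overlay_operation_info(frame: np.ndarray, label: str, value: str) -> np.ndarray:
    """Superimpose the operated parameter name and its edited value on a live-action frame."""
    out = frame.copy()
    cv2.rectangle(out, (20, 20), (420, 90), (0, 0, 0), -1)                 # background panel
    cv2.putText(out, f"{label}: {value}", (30, 65),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2, cv2.LINE_AA)
    return out
```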
  • the user U who is a producer performing audio production at the audio production work site, can listen to the binaurally processed audio that is the same as the audio that would be heard if the audio produced by the multi-channel audio playback software 24 were played in the original sound field.
  • the HRTF is adjusted to take into account the reverberation characteristics according to the playback posture, and the user U can listen to the audio that a listener in the original sound field would hear if they similarly changed their posture.
  • the user U is presented with automatic switching between the CG image viewed at the listening position in the original sound field and the live-action image of the space in front of him/her. Therefore, the image presented to the user U is automatically switched in conjunction with the automatic switching of the playback mode, so that the effort required for the user U to manually switch between the audio production work and the work of checking the production results can be reduced, and work efficiency is significantly improved.
  • In step S45, the CG space drawing/video output switching unit 131 judges whether the output image is a CG image or a live-action image (hand image) based on the playback posture information acquired in step S44.
  • If it is judged in step S45 that the output image is a live-action image, the process proceeds to step S49, and the CG space drawing/video output switching unit 131 acquires a live-action image of the hand space from the hand space image acquisition unit 132.
  • In step S50, the CG space rendering/video output switching unit 131 acquires user operation information from the user operation information acquisition unit 133.
  • In step S51, the video rendering processing unit 127 generates a video signal for displaying on the display device the live-action video acquired in step S49 or the live-action video on which the operation information has been superimposed in step S50, and outputs the video signal to the display device. The process proceeds from step S51 to step S52.
  • In step S52, the reverberation adjustment setting value reading unit 141 reads the reverberation adjustment setting value corresponding to the playback posture of user U based on the playback posture information acquired in step S44.
  • In step S53, the coefficient reading unit 111 reads coefficient data corresponding to the playback posture of user U from the BRTF file acquired in step S41 based on the playback posture information acquired in step S44. Note that the coefficient data read by the coefficient reading unit 111 may be coefficient data corresponding to a specific playback posture regardless of the playback posture of user U.
  • In step S54, the reverberation adjustment processing unit 142 adjusts the coefficient data based on the coefficient data acquired in step S53 and the reverberation adjustment setting value acquired in step S52, and generates coefficients of an FIR filter that take into account the reverberation amount according to the playback posture.
  • In step S55, the convolution processing unit 112 sets the coefficients generated in step S54 as the coefficients of an FIR filter, and performs convolution processing (binaural processing) using the FIR filter on the audio signal supplied from the multi-channel audio playback software 24 in FIG. 1.
  • In step S56, the audio playback processing unit 113 outputs the audio signal that has been binaurally processed by the convolution processing unit 112 to the audio playback device 12 in FIG. 1.
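The following sketch strings steps S44 to S56 together as one hypothetical processing iteration. It reuses the build_fir_coefficients helper sketched earlier, represents the playback posture as a single downward pitch angle, and keys the BRTF coefficient data and reverberation settings by a coarse 'front'/'hands' label; all of these representations are simplifying assumptions and not the device's actual data model.

```python
import numpy as np

def process_block(audio_block, pitch_deg, brtf_coeffs, reverb_settings,
                  cg_renderer, hand_camera, hands_threshold_deg=60.0):
    """One hypothetical iteration covering roughly steps S44-S56.

    audio_block     : (channels, samples) array from the multi-channel playback software.
    pitch_deg       : downward head pitch in degrees, standing in for the playback posture.
    brtf_coeffs     : {'front'/'hands': (left BRIR, right BRIR)} coefficient data.
    reverb_settings : {'front'/'hands': (reverb tail, gain)} reverberation adjustment values.
    cg_renderer     : callable(posture) -> CG frame of the original sound field.
    hand_camera     : callable() -> live-action frame of the hand space.
    """
    facing_hands = pitch_deg > hands_threshold_deg                     # S45: CG or hand image?
    frame = hand_camera() if facing_hands else cg_renderer(pitch_deg)  # S46-S49

    key = 'hands' if facing_hands else 'front'
    tail, gain = reverb_settings[key]                                  # S52: read setting value
    brir_l, brir_r = brtf_coeffs[key]                                  # S53: read coefficient data
    fir_l = build_fir_coefficients(brir_l, tail, gain)                 # S54 (helper sketched earlier)
    fir_r = build_fir_coefficients(brir_r, tail, gain)

    mono = audio_block.mean(axis=0)                                    # simplification: fold channels
    left = np.convolve(mono, fir_l)[:mono.shape[0]]                    # S55: binaural processing
    right = np.convolve(mono, fir_r)[:mono.shape[0]]
    return frame, np.stack([left, right])                              # S51 video / S56 audio output
```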
  • the switching of the playback mode in the playback processing device 11 may be performed in coordination with the operation of an arbitrary operation device connected to the playback processing device 11.
  • In FIG. 12, a controller 171 and a console device 172 are illustrated as examples of operation devices connected to the playback processing device 11.
  • the sensing unit 22 in Fig. 1 acquires the tilt angle of a joystick 171A of the controller 171.
  • the trigger generation unit 23 outputs a trigger signal for switching the playback mode when a specific tilt angle of the joystick 171A is detected.
  • the tilt angle of the joystick 171A is linked to the distance from the center of the listener's head in the original sound field to the original sound source.
  • When the distance corresponding to the tilt angle of the joystick 171A falls within one range, the trigger generation unit 23 sets the playback mode to mixdown processing, or outputs a trigger signal to change the HRTF applied to binaural processing.
  • When the distance falls within the other range, the trigger generation unit 23 sets the playback mode to binaural processing, or outputs a trigger signal to change the HRTF applied to binaural processing.
  • the joystick 171A can also be used to change the coordinate position of the original sound source (or sound image) in the original sound field. For example, it can be used to manipulate the distance of the original sound source relative to an origin at a predetermined position in the original sound field, or to manipulate the horizontal or elevation angle of the original sound source relative to the origin.
  • the trigger generation unit 23 may output a trigger signal according to the distance between the original sound source and the origin.
  • the position of the slider 172A of the console device 172 may be linked to the distance from the center of the listener's head in the original sound field to the original sound source, and the trigger generation unit 23 may output a trigger signal according to the position of the slider 172A.
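A minimal sketch of deriving a trigger from such an operation device is given below, assuming the joystick tilt or slider position is normalized to a 0-1 value linked to the source distance; the threshold, the mode labels, the mapping direction, and the function signature are illustrative assumptions.

```python
def trigger_from_control(value: float, threshold: float, current_mode: str):
    """Map a joystick tilt angle or fader/slider position to a playback mode.

    value        : normalized control value (0.0-1.0) linked to the distance from the
                   centre of the listener's head to the original sound source.
    threshold    : distance boundary at which the playback mode should change.
    current_mode : playback mode currently in effect ('binaural' or 'mixdown').
    Returns (new_mode, emit_trigger).
    """
    new_mode = 'mixdown' if value < threshold else 'binaural'
    return new_mode, new_mode != current_mode
```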
  • the above-mentioned series of processes can be executed by hardware or software.
  • When the series of processes is executed by software, the program constituting the software is installed in a computer.
  • the computer includes a computer built into dedicated hardware, and a general-purpose personal computer, for example, capable of executing various functions by installing various programs.
  • FIG. 13 is a block diagram showing an example of the hardware configuration of a computer that executes the above-mentioned series of processes using a program.
  • In the computer, a CPU (Central Processing Unit) 201, a ROM (Read Only Memory) 202, and a RAM (Random Access Memory) 203 are interconnected by a bus 204.
  • Further connected to the bus 204 is an input/output interface 205. Connected to the input/output interface 205 are an input unit 206, an output unit 207, a storage unit 208, a communication unit 209, and a drive 210.
  • the input unit 206 includes a keyboard, mouse, microphone, etc.
  • the output unit 207 includes a display, speaker, etc.
  • the storage unit 208 includes a hard disk, non-volatile memory, etc.
  • the communication unit 209 includes a network interface, etc.
  • the drive 210 drives removable media 211 such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory.
  • the CPU 201 loads a program stored in the storage unit 208, for example, into the RAM 203 via the input/output interface 205 and the bus 204, and executes the program, thereby performing the above-mentioned series of processes.
  • the program executed by the computer (CPU 201) can be provided by being recorded on removable media 211, such as package media, for example.
  • the program can also be provided via a wired or wireless transmission medium, such as a local area network, the Internet, or digital satellite broadcasting.
  • a program can be installed in the storage unit 208 via the input/output interface 205 by inserting the removable medium 211 into the drive 210.
  • the program can also be received by the communication unit 209 via a wired or wireless transmission medium and installed in the storage unit 208.
  • the program can be pre-installed in the ROM 202 or storage unit 208.
  • the program executed by the computer may be a program in which processing is performed chronologically in the order described in this specification, or a program in which processing is performed in parallel or at the required timing, such as when called.
  • the processing performed by a computer according to a program does not necessarily have to be performed in chronological order according to the order described in the flowchart.
  • the processing performed by a computer according to a program also includes processing that is executed in parallel or individually (for example, parallel processing or processing by objects).
  • the program may be processed by one computer (processor), or may be distributed among multiple computers. Furthermore, the program may be transferred to a remote computer for execution.
  • a system refers to a collection of multiple components (devices, modules (parts), etc.), regardless of whether all the components are in the same housing. Therefore, multiple devices housed in separate housings and connected via a network, and a single device in which multiple modules are housed in a single housing, are both systems.
  • the configuration described above as one device (or processing unit) may be divided and configured as multiple devices (or processing units).
  • the configurations described above as multiple devices (or processing units) may be combined and configured as one device (or processing unit).
  • configurations other than those described above may also be added to the configuration of each device (or processing unit).
  • part of the configuration of one device (or processing unit) may be included in the configuration of another device (or other processing unit).
  • this technology can be configured as cloud computing, in which a single function is shared and processed collaboratively by multiple devices via a network.
  • the above-mentioned program can be executed in any device.
  • In that case, it is sufficient that the device has the necessary functions (functional blocks, etc.) and is able to obtain the necessary information.
  • each step described in the above flowchart can be executed by one device, or can be shared and executed by multiple devices.
  • When one step includes multiple processes, the multiple processes included in that one step can be executed by one device, or can be shared and executed by multiple devices.
  • multiple processes included in one step can be executed as multiple step processes.
  • processes described as multiple steps can be executed collectively as one step.
  • processing of the steps that describe a program executed by a computer may be executed chronologically in the order described in this specification, or may be executed in parallel, or individually at the required timing, such as when a call is made. In other words, as long as no contradictions arise, the processing of each step may be executed in an order different from the order described above. Furthermore, the processing of the steps that describe this program may be executed in parallel with the processing of other programs, or may be executed in combination with the processing of other programs.
  • the present technology can also be configured as follows.
  • An information processing device having a playback signal generation unit that performs a playback process corresponding to a user's state from among multiple types of playback processes including 3D audio playback processing and non-3D audio playback processing on an input audio signal to generate an audio signal for playback.
  • the 3D sound reproduction process is a process for reflecting acoustic characteristics of a space in the input audio signal.
  • the non-3D sound reproduction process is a process for changing a number of channels of the input audio signal.
  • the information processing device executes a playback process selected based on sensing information indicating a state of the user.
  • the information processing device according to any one of (1) to (5), wherein the state of the user is an operation state of an operation member used for purposes other than switching the playback process executed by the playback signal generating section.
  • the 3D sound reproduction process is a process of convolving a transfer function corresponding to the acoustic characteristics with the input audio signal.
  • (8) The information processing device according to (7), wherein the 3D sound reproduction process is a process using an FIR filter.
  • the information processing device, wherein the transfer function is a head-related transfer function, a binaural transfer function, or a combination of a head-related transfer function and a room transfer function.
  • the information processing device, wherein the reproduction signal generation unit uses the acoustic characteristics actually measured in the space.
  • the reproduction signal generation unit uses the acoustic characteristic corresponding to the current posture of the user among the acoustic characteristics actually measured in a plurality of postures.
  • the information processing device according to any one of (7) to (11), wherein the reproduction signal generation unit switches the reproduction process to be executed by changing or adjusting the transfer function.
  • the information processing device according to any one of (2) and (7) to (12), further comprising a display control unit that outputs a CG image of the space having the acoustic characteristics reflected by the 3D audio reproduction process.
  • the display control unit outputs the CG image in conjunction with execution of the 3D sound reproduction process by the reproduction signal generation unit.
  • the CG image is an image captured by a virtual camera of a CG space that reproduces the space having the acoustic characteristics that the 3D audio playback processing reflects on the input audio signal.
  • the CG image is an image captured by changing the posture of the virtual camera in accordance with the posture of the user.
  • the display control unit outputs a live-action video in conjunction with execution of the non-3D sound reproduction process by the reproduction signal generation unit.
  • the live-action image is an image captured by a camera around the user.
  • An information processing method of an information processing device having a playback signal generation unit, the method including the playback signal generation unit executing a playback process corresponding to a user's state from among multiple types of playback processes including 3D sound playback processing and non-3D sound playback processing on an input audio signal, to generate an audio signal for playback.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

The present technology relates to an information processing device, an information processing method, and a program designed to enable automatic switching between multiple types of playback processing related to audio playback, including 3D sound playback. Out of multiple types of playback processing including 3D sound playback processing and non-3D sound playback processing, playback processing corresponding to the state of a user is executed on an inputted audio signal to generate an audio signal for playback.

Description

情報処理装置、情報処理方法、及び、プログラムInformation processing device, information processing method, and program
 本技術は、情報処理装置、情報処理方法、及び、プログラムに関し、特に、3D音響再生を含む音声再生に関する複数種の再生処理の切替えを自動的に行えるようにした情報処理装置、情報処理方法、及び、プログラムに関する。 This technology relates to an information processing device, an information processing method, and a program, and in particular to an information processing device, an information processing method, and a program that enable automatic switching between multiple types of playback processes for audio playback, including 3D audio playback.
 特許文献1には、音声信号に対して頭部伝達関数(HRTF:Head-Related Transfer Function)を適用することで、コンサートホールや映画館などの再生環境での原音場を、原音場とは時間又は空間が異なる再生音場で再現する3D音響再生の技術が開示されている。 Patent Document 1 discloses a 3D sound reproduction technology that applies a head-related transfer function (HRTF) to an audio signal to reproduce the original sound field in a reproduction environment such as a concert hall or movie theater in a reproduction sound field that differs in time or space from the original sound field.
特開2017-195581号公報JP 2017-195581 A
 音響制作や編集等で制作した音声を、実際の再生環境で聴取して確認したい場合や、再生環境に影響されていない状態で聴取したい場合などがあり、制作した音声に対して複数種の再生処理(フィルタ処理)を適用して聴取することがあるが、適用する再生処理を切り替える作業は音響制作の作業効率の低下を招く。 There are times when you want to listen to the audio produced during sound production or editing in an actual playback environment, or when you want to listen to it without being affected by the playback environment, and so on. In these cases, multiple types of playback processing (filter processing) may be applied to the produced audio before listening to it, but the task of switching the playback processing applied reduces the work efficiency of sound production.
 本技術はこのような状況に鑑みてなされたものであり、3D音響再生を含む音声再生に関する複数種の再生処理の切替えを自動的に行えるようにする。 This technology was developed in light of these circumstances, and makes it possible to automatically switch between multiple types of playback processing for audio playback, including 3D audio playback.
 本技術の情報処理装置、又は、プログラムは、入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する再生信号生成部を有する情報処理装置、又は、そのような情報処理装置として、コンピュータを機能させるためのプログラムである。 The information processing device or program of this technology is an information processing device having a playback signal generation unit that performs playback processing corresponding to the user's state from among multiple types of playback processing, including 3D audio playback processing and non-3D audio playback processing, on an input audio signal to generate an audio signal for playback, or a program for causing a computer to function as such an information processing device.
 本技術の情報処理方法は、再生信号生成部を有する情報処理装置の前記再生信号生成部が、入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する情報処理方法である。 The information processing method of the present technology is an information processing method in which an information processing device having a playback signal generation unit performs a playback process corresponding to the user's state from among multiple types of playback processes, including 3D sound playback processing and non-3D sound playback processing, on an input audio signal to generate an audio signal for playback.
 本技術の情報処理装置、情報処理方法、及び、プログラムにおいては、入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理が実行されて再生用の音声信号が生成される。 In the information processing device, information processing method, and program of the present technology, a playback process that corresponds to the user's state is executed from among multiple types of playback processes, including 3D sound playback processing and non-3D sound playback processing, for the input audio signal, and an audio signal for playback is generated.
FIG. 1 is a block diagram showing an example configuration of a playback processing device according to an embodiment to which the present technology is applied.
FIG. 2 is a diagram for explaining a first embodiment of switching of the playback mode of the playback processing device.
FIG. 3 is a diagram for explaining a second embodiment of switching of the playback mode of the playback processing device.
FIG. 4 is a flowchart showing an example of a processing procedure of the second embodiment of the playback process switching of the playback processing device.
FIG. 5 is a diagram showing an example of the configuration of an audio production system to which the third and fourth embodiments of the playback mode switching are applied.
FIG. 6 is a diagram showing a measurement flow in a measurement environment.
FIG. 7 is a block diagram showing an example of the configuration of a playback processing device to which the third embodiment of the playback mode switching is applied.
FIG. 8 is a flowchart showing an example of a processing procedure of the third embodiment of the playback process switching of the playback processing device.
FIG. 9 is a block diagram showing an example of the configuration of a playback processing device to which the fourth embodiment of the playback mode switching is applied.
FIG. 10 is a diagram illustrating an example of operation information displayed superimposed on a live-action image of the console device when the user U operates the console device.
FIG. 11 is a flowchart showing an example of a processing procedure of the fourth embodiment of the playback process switching of the playback processing device.
FIG. 12 is a diagram illustrating an example of switching of the playback mode with an operation device connected to the playback processing device.
FIG. 13 is a block diagram showing an example of the configuration of an embodiment of a computer to which the present technology is applied.
 以下、図面を参照しながら本技術の実施の形態について説明する。 Below, we will explain the implementation of this technology with reference to the drawings.
<<Reproduction Processing Device According to the Present Embodiment>>
FIG. 1 is a block diagram showing an example of the configuration of a playback processing device according to an embodiment to which the present technology is applied.
 図1において、本実施の形態に係る再生処理装置11は、映画などのコンテンツの音声の制作又は編集(以下、編集も含む意味で音響制作という)に用いられる。再生処理装置11には、ヘッドフォンやスピーカなどの音声再生装置12が接続される。再生処理装置11は、再生信号生成部21、センシング部22、トリガ生成部23、多ch(チャンネル)音声再生ソフト24を有する。 In FIG. 1, a playback processing device 11 according to this embodiment is used for producing or editing the sound of content such as movies (hereinafter, the term "sound production" includes editing). An audio playback device 12 such as headphones or speakers is connected to the playback processing device 11. The playback processing device 11 has a playback signal generation unit 21, a sensing unit 22, a trigger generation unit 23, and multi-channel audio playback software 24.
 再生信号生成部21は、多ch音声再生ソフト24からの多chの音声信号に対して、バイノーラル処理(3D音響再生処理)、2chミックスダウン/ステレオ再生処理(非3D音響再生処理)、及び、パススルー処理(非3D音響再生処理)のうちのいずれかの再生処理を切り替えて実行する。これらの再生処理の切り替えは、ユーザの状態に基づいて行われ、具体的には、トリガ生成部23からのトリガ信号に基づいて行われる。再生信号生成部21は、再生処理によって再生用の音声信号を生成して音声再生装置12に供給する。 The playback signal generation unit 21 switches between playback processes among binaural processing (3D audio playback processing), 2ch mixdown/stereo playback processing (non-3D audio playback processing), and pass-through processing (non-3D audio playback processing) for the multi-channel audio signals from the multi-channel audio playback software 24. These playback processes are switched based on the user's state, and more specifically, based on a trigger signal from the trigger generation unit 23. The playback signal generation unit 21 generates an audio signal for playback by the playback process and supplies it to the audio playback device 12.
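To make the switching behaviour of the playback signal generation unit concrete, here is a minimal Python/NumPy sketch of a dispatcher that applies binaural processing, a mixdown, or pass-through according to the most recent trigger. The class name, the mono fold-down before HRIR convolution, and the equal-gain mixdown are simplifying assumptions for illustration, not the actual implementation.

```python
import numpy as np

class PlaybackSignalGenerator:
    """Applies the playback process selected by the most recent trigger signal."""

    def __init__(self, default_mode='mixdown'):
        self.mode = default_mode

    def on_trigger(self, mode):
        # mode is 'binaural', 'mixdown' or 'passthrough'
        self.mode = mode

    def process(self, multich, hrir_left, hrir_right):
        """multich: (channels, samples) array from the multi-channel playback software."""
        if self.mode == 'binaural':                      # 3D audio playback processing
            mono = multich.mean(axis=0)                  # simplification: single HRIR pair
            n = mono.shape[0]
            left = np.convolve(mono, hrir_left)[:n]
            right = np.convolve(mono, hrir_right)[:n]
            return np.stack([left, right])
        if self.mode == 'mixdown':                       # non-3D: fold down to 2ch stereo
            mix = multich.mean(axis=0)
            return np.stack([mix, mix])
        return multich                                   # passthrough: channels unchanged
```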
 また、再生信号生成部21は、音声に関連する映像やGUI等のマルチメディア情報を生成して音声再生装置12の接続された映像出力装置に供給する機能を有していてもよい。映像出力装置としては、モニタ、VRゴーグル(HMD:Head Mounted Display)、ARゴーグル等の表示装置が該当する。センシング部22は、再生信号生成部21における再生処理の切替えを判定するためのユーザの状態を示す情報を検出(取得)する。センシング部22が検出する情報としては、例えば、カメラで撮影されたユーザ等の画像、ヘッドトラッカにより検出されたユーザの頭部の姿勢、アイトラッカにより検出されたユーザの視線方向、GUI(Graphical User Interface)に対するユーザ操作、スイッチやボタン等に対するユーザ操作等の情報が該当する。センシング部22で取得された情報(センシング情報)はトリガ生成部23に供給される。なお、センシング部22は、再生処理の切替えの判定に必要な情報であって、かつ、ユーザに関わる対象の状態の示す任意の情報をセンシング情報として検出(取得)することとする。センシング情報としては、生体センシング情報(ヘッドトラッキング、視線方向、焦点位置、姿勢および位置トラッキング)、外観情報(カメラから得た画像を用いた人物認識、顔認識、ヘッドフォン認識)、GPSや超音波を使用した測位情報、機器の入力情報(GUI情報、キーボードやコントローラーなどの機材、ヘッドフォン種別情報)等が適宜使用され得る。 The playback signal generating unit 21 may also have a function of generating multimedia information such as video and GUI related to the audio and supplying it to a video output device connected to the audio playback device 12. Examples of video output devices include display devices such as monitors, VR goggles (HMD: Head Mounted Display), and AR goggles. The sensing unit 22 detects (acquires) information indicating the state of the user for determining whether to switch the playback process in the playback signal generating unit 21. Examples of information detected by the sensing unit 22 include an image of the user captured by a camera, the user's head posture detected by a head tracker, the user's line of sight detected by an eye tracker, user operations on a GUI (Graphical User Interface), and user operations on switches, buttons, etc. The information (sensing information) acquired by the sensing unit 22 is supplied to the trigger generating unit 23. The sensing unit 22 detects (acquires) any information necessary for determining whether to switch the playback process and indicating the state of an object related to the user as sensing information. Sensing information that can be used appropriately includes biometric sensing information (head tracking, gaze direction, focal position, posture and position tracking), appearance information (person recognition, face recognition, headphone recognition using images obtained from a camera), positioning information using GPS or ultrasound, device input information (GUI information, equipment such as keyboards and controllers, headphone type information), etc.
 トリガ生成部23は、センシング部22からのセンシング情報に基づいて、再生処理の切替えを指示するトリガ信号を生成し、再生信号生成部21に供給する。トリガ生成部23は、例えば、センシング情報が事前に決められた条件に合致した場合に、トリガ信号を再生信号生成部21に供給する。トリガ生成部23は、センシング情報を検出するセンサと同一ハードウェアに組み込まれていてもよいし、センシング情報を受け取った別ハードウェア又はソフトウェアに組み込まれていてもよいし、再生信号生成部21のトリガ受信部31に含めてもよい。なお、トリガ信号は、再生信号生成部21において実行可能な複数種の再生処理のうち、再生信号生成部21が実行する再生処理を指定する信号に相当する。 The trigger generation unit 23 generates a trigger signal that instructs switching of the playback process based on the sensing information from the sensing unit 22, and supplies it to the playback signal generation unit 21. For example, when the sensing information matches a predetermined condition, the trigger generation unit 23 supplies the trigger signal to the playback signal generation unit 21. The trigger generation unit 23 may be incorporated in the same hardware as the sensor that detects the sensing information, or may be incorporated in separate hardware or software that receives the sensing information, or may be included in the trigger receiving unit 31 of the playback signal generation unit 21. The trigger signal corresponds to a signal that specifies the playback process to be executed by the playback signal generation unit 21 from among multiple types of playback processes that can be executed by the playback signal generation unit 21.
 多ch音声再生ソフト24は、DAW(Digital Audio Workstaion)等の多ch音声再生ソフトを実行する処理部を表し、多ch(1chの場合も含む)の音声信号を生成(又は編集)する。生成された多チャンネル(多ch)の音声信号は再生信号生成部21に供給される。なお、多ch音声再生ソフトはDAW等上で動作するプラグインソフトウェアであってもよいし、DAW等とは切り離し、スタンドアローンアプリケーションとして動作してもよい。この場合、DAW等から多chの音声信号を出力し、スタンドアローンアプリケーションに入力する。また、多ch音声再生ソフトとしては、多chの音声信号を出力するソフトウェアであればDAW以外のソフトウェア・ルーチンを用いてもよい(オブジェクトオーディオのレンダリング後データなど) The multi-channel audio playback software 24 represents a processing unit that executes multi-channel audio playback software such as a DAW (Digital Audio Workstation), and generates (or edits) multi-channel (including 1-channel) audio signals. The generated multi-channel (multi-channel) audio signals are supplied to the playback signal generation unit 21. The multi-channel audio playback software may be plug-in software that runs on a DAW, or may be separated from a DAW and run as a standalone application. In this case, the multi-channel audio signals are output from the DAW and input to the standalone application. In addition, software routines other than DAWs may be used as the multi-channel audio playback software as long as they output multi-channel audio signals (such as data after rendering of object audio).
 再生信号生成部21は、トリガ受信部31、切替処理部32、バイノーラル処理部33、2chミックスダウン/ステレオ再生処理部34、及び、パススルー処理部35を有する。トリガ受信部31は、トリガ生成部23からのトリガ信号を受信し、トリガ生成部23からトリガ信号が供給されたか否かを判定する。切替処理部32は、トリガ受信部31がトリガ信号を受信した際に、バイノーラル処理部33、2chミックスダウン/ステレオ再生処理部34、及び、パススルー処理部35のうち、多ch音声再生ソフト24からの多chの音声信号に対して再生処理を実行する処理部を切り替える。 The playback signal generation unit 21 has a trigger receiving unit 31, a switching processing unit 32, a binaural processing unit 33, a 2ch mixdown/stereo playback processing unit 34, and a pass-through processing unit 35. The trigger receiving unit 31 receives a trigger signal from the trigger generation unit 23 and determines whether or not a trigger signal has been supplied from the trigger generation unit 23. When the trigger receiving unit 31 receives a trigger signal, the switching processing unit 32 switches between the binaural processing unit 33, the 2ch mixdown/stereo playback processing unit 34, and the pass-through processing unit 35, whichever processing unit performs playback processing on the multi-channel audio signals from the multi-channel audio playback software 24.
 バイノーラル処理部33は、3D音響再生方式の1つであるバイノーラル再生の処理(バイノーラル処理)を行う。3D音響再生方式とは、コンサートホールや映画館などの原音場(仮想の原音場を含む)の聴取者の両耳の入力信号を、原音場とは時間又は空間が異なる再生音場の聴取者の外耳道入口にヘッドフォンやスピーカで再現する音声再生技術である。バイノーラル処理部33は、バイノーラル処理として多ch音声再生ソフト24からの音声信号に対して所定の空間(原音場)の伝達特性を反映させるために、頭部伝達関数(HRTF:Head-Related Transfer Function)を畳み込むフィルタ処理を行う。ところで、バイノーラル再生は、ヘッドフォンによる再生を前提とするが、バイノーラル処理部33は、バイノーラル再生に限らず、3D音響再生方式として分類される任意の再生処理(3D音響再生処理)を行う場合も含む。3D音響再生方式に分類される再生処理として、例えば、バイノーラル再生以外に、2個のスピーカによる再生を前提としたトランスオーラル再生の処理(トランスオーラル処理)がある。トランスオーラル処理では、バイノーラル処理に加えてクロストークを除去する処理等が含まれるが、バイノーラル処理部33がトランスオーラル処理を行う場合であってもよい。また、バイノーラル処理部33は、再生処理装置11に接続される音声再生装置12の種類に応じて適切な3D音響再生方式の処理を行うようにしてよい。 The binaural processing unit 33 performs binaural playback processing (binaural processing), which is one of the 3D sound playback methods. The 3D sound playback method is an audio playback technology that reproduces input signals to both ears of a listener in an original sound field (including a virtual original sound field) such as a concert hall or movie theater, using headphones or speakers at the entrance of the listener's ear canal in a playback sound field that is different in time or space from the original sound field. The binaural processing unit 33 performs filtering processing to convolve a head-related transfer function (HRTF) in order to reflect the transfer characteristics of a specific space (original sound field) in the audio signal from the multi-channel audio playback software 24 as binaural processing. Incidentally, binaural playback is premised on playback using headphones, but the binaural processing unit 33 is not limited to binaural playback and also includes cases where any playback processing (3D audio playback processing) classified as a 3D sound playback method is performed. Examples of playback processing classified as 3D sound playback methods include, in addition to binaural playback, transaural playback processing (transaural processing) that assumes playback using two speakers. In addition to binaural processing, transaural processing includes processing to remove crosstalk, but the binaural processing unit 33 may also perform transaural processing. In addition, the binaural processing unit 33 may perform processing of an appropriate 3D sound playback method depending on the type of audio playback device 12 connected to the playback processing device 11.
 また、バイノーラル処理部33のバイノーラル処理(3D音響再生処理)、又は、バイノーラル処理に代わる3D音響再生処理に適用され得る音声の伝達特性(伝達関数又はインパルス応答)としては、HRTF(Head-Related Transfer Function:頭部伝達関数)、HRIR(Head-Related Impulse Response:頭部インパルス応答)、HRTFとRTF(Room transfer function:室内伝達関数)との組合せ、HRIRとRIR(Room Impluse Response:室内インパルス応答)との組合せ、BRTF(Binaural Room Transfer Function:バイノーラル室内伝達関数)、BRIR(Binaural Room Impluse Response:バイノーラル室内インパルス応答)、及び、ヘッドフォンから鼓膜(外耳道入口)までの伝達関数又はそのインパルス応答のうちのいずれか、又は、これらの組合せがある。本技術の説明では、バイノーラル処理部33のバイノーラル処理(3D音響再生処理)としてHRTFが適用されることとして説明する。 In addition, the transfer characteristics of sound (transfer functions or impulse responses) that can be applied to the binaural processing (3D sound reproduction processing) of the binaural processing unit 33, or to 3D sound reproduction processing instead of binaural processing, include HRTF (Head-Related Transfer Function), HRIR (Head-Related Impulse Response), a combination of HRTF and RTF (Room transfer function), a combination of HRIR and RIR (Room Impulse Response), BRTF (Binaural Room Transfer Function), BRIR (Binaural Room Impulse Response), and any of the transfer functions or their impulse responses from the headphones to the eardrum (entrance of the ear canal), or a combination of these. In the explanation of this technology, it is assumed that HRTF is applied as the binaural processing (3D sound reproduction processing) of the binaural processing unit 33.
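As a slightly fuller sketch of the binaural (3D audio) processing itself, the following assumes each input channel is treated as a virtual speaker with its own left/right HRIR pair; the data layout and function name are assumptions, and the headphone-correction and room components mentioned above are omitted for brevity.

```python
import numpy as np

def binaural_render(multich, hrir_pairs):
    """Convolve each channel with its own (left, right) HRIR, one virtual speaker per channel.

    multich    : (channels, samples) array.
    hrir_pairs : list of (left HRIR, right HRIR) arrays, one pair per channel.
    """
    n = multich.shape[1]
    left = np.zeros(n)
    right = np.zeros(n)
    for channel, (h_l, h_r) in zip(multich, hrir_pairs):
        left += np.convolve(channel, h_l)[:n]
        right += np.convolve(channel, h_r)[:n]
    return np.stack([left, right])
```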
 2chミックスダウン/ステレオ再生処理部34は、多ch音声再生ソフト24からの多chの音声信号に対してミックスダウンの再生処理(ミックスダウン処理)を行い、2chのステレオ再生の音声信号を生成する。また、2chミックスダウン/ステレオ再生処理部34は、多ch音声再生ソフト24からの音声信号が1ch(モノラル)の音声信号の場合には、2chのステレオ再生の音声信号を生成する。なお、以下において、多ch音声再生ソフト24からのモノラルの音声信号が供給される場合については考慮せずに、2chミックスダウン/ステレオ再生処理部34は、ミックスダウン処理のみを行うこととする。 The 2ch mixdown/stereo playback processor 34 performs mixdown playback processing (mixdown processing) on the multi-channel audio signal from the multi-channel audio playback software 24 to generate an audio signal for 2ch stereo playback. Furthermore, if the audio signal from the multi-channel audio playback software 24 is a 1ch (monaural) audio signal, the 2ch mixdown/stereo playback processor 34 generates an audio signal for 2ch stereo playback. Note that in the following, the 2ch mixdown/stereo playback processor 34 will only perform mixdown processing, without considering the case where a mono audio signal is supplied from the multi-channel audio playback software 24.
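For the non-3D path, a conventional 5.1-to-stereo fold-down might look like the sketch below; the -3 dB coefficients follow a common downmix convention and the handling of the LFE channel is left as a choice, so this is an illustration rather than the unit's actual mixdown.

```python
import numpy as np

def downmix_51_to_stereo(fl, fr, c, lfe, sl, sr):
    """Fold 5.1 channels (FL, FR, C, LFE, SL, SR) down to 2-channel stereo."""
    g = 0.7071                       # about -3 dB for the centre and surround channels
    left = fl + g * c + g * sl
    right = fr + g * c + g * sr
    # The LFE channel is often dropped in a stereo fold-down; add it at reduced gain if wanted:
    # left = left + g * lfe; right = right + g * lfe
    return np.stack([left, right])
```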
 パススルー処理部35は、多ch音声再生ソフト24からの多chの音声信号に対して、ミックスダウン処理などを行わずに、多ch音声再生ソフト24からの多chの音声信号をそのまま音声再生装置12の対応するchに供給する。音声再生装置12に存在しないchに対応する多ch音声再生ソフト24からの音声信号は、パススルー処理部35から音声再生装置12に供給されない。ただし、パススルー処理部35は、信号ルーティングの処理も実行し得る。この場合に、多ch音声再生ソフト24からの多chの音声信号がそれぞれ音声再生装置12の指定されたchに供給される。 The pass-through processing unit 35 supplies the multi-channel audio signals from the multi-channel audio playback software 24 directly to the corresponding channels of the audio playback device 12 without performing mix-down processing or the like on the multi-channel audio signals from the multi-channel audio playback software 24. Audio signals from the multi-channel audio playback software 24 corresponding to channels that do not exist in the audio playback device 12 are not supplied from the pass-through processing unit 35 to the audio playback device 12. However, the pass-through processing unit 35 can also perform signal routing processing. In this case, the multi-channel audio signals from the multi-channel audio playback software 24 are each supplied to the specified channels of the audio playback device 12.
 なお、以下において、再生信号生成部21は、トリガ生成部23からのトリガ信号に基づいて、多ch音声再生ソフト24からの多chの音声信号に対する再生信号生成部21での再生処理(音響処理)が、バイノーラル処理部33でのバイノーラル処理と、2chミックスダウン/ステレオ再生処理部34でのミックスダウン処理とで切り替えられることとする。また、再生信号生成部21で実行される再生処理の方式を再生処理装置11又は再生信号生成部21の再生モード(又は音声再生モード)といい、再生モードはバイノーラル処理とミックスダウン処理とで切り替えられることとする。また、バイノーラル処理部33でのバイノーラル処理に適用されるHRTFが切り替えれる場合や適用されるHRTF(パラメータ)の調整が行われる場合も再生モードの切替えに該当するものとする。また、再生処理装置11又は再生信号生成部21の再生モードを単に再生モードとも称する。本技術は、バイノーラル処理のような3D音響再生処理と、2chミックスダウン処理やパススルー処理のような非3D音響再生処理とを含む複数種の再生処理をユーザの状態に応じて切り替えて実行する場合に適用することができ、切り替えて実行可能とする再生処理の種類や数は特に限定されない。 In the following, the playback signal generating unit 21 switches the playback process (acoustic process) of the multi-channel audio signal from the multi-channel audio playback software 24 between binaural processing in the binaural processing unit 33 and mixdown processing in the 2ch mixdown/stereo playback processing unit 34 based on a trigger signal from the trigger generating unit 23. The playback process method executed by the playback signal generating unit 21 is referred to as the playback mode (or audio playback mode) of the playback processing device 11 or the playback signal generating unit 21, and the playback mode is switched between binaural processing and mixdown processing. Switching the HRTF applied to the binaural processing in the binaural processing unit 33 or adjusting the applied HRTF (parameter) also corresponds to switching the playback mode. The playback mode of the playback processing device 11 or the playback signal generating unit 21 is also simply referred to as the playback mode. This technology can be applied to cases where multiple types of playback processes, including 3D sound playback processes such as binaural processing and non-3D sound playback processes such as 2ch mixdown processing and pass-through processing, are switched and executed according to the user's state, and there is no particular limit to the types and number of playback processes that can be switched and executed.
<First embodiment of switching of playback mode of playback processing device 11>
Fig. 2 is a diagram for explaining a first embodiment of switching of the playback mode of the playback processing device 11. Fig. 2 illustrates various peripheral devices of an audio production system including the playback processing device 11, and a user U. The user U is a producer who uses the playback processing device 11 to produce the audio of content such as a movie. The console machine 41 is a device connected to the playback processing device 11 and inputs the user U's operations on the multi-channel audio playback software 24. The console machine 41 may be, for example, an operating device such as a mixing console, a keyboard, or a mouse.
 モニタ42A及び42Bは、再生処理装置11に接続され、多ch音声再生ソフト24に連携してGUI情報や制作画面等の画像をユーザUに表示する。モニタ42A及び42Bは2台に限らず、1台又は3台以上であってよい。カメラ43は、再生処理装置11に接続され、主にユーザUを撮影した画像をセンシング部22が検出する画像として再生処理装置11に供給する。スピーカ45は、再生処理装置11に接続され、再生処理装置11から供給された音声信号を音波として出力する。 Monitors 42A and 42B are connected to the playback processing device 11, and work in conjunction with the multi-channel audio playback software 24 to display images such as GUI information and production screens to the user U. The number of monitors 42A and 42B is not limited to two, and may be one or three or more. The camera 43 is connected to the playback processing device 11, and mainly supplies images of the user U to the playback processing device 11 as images detected by the sensing unit 22. The speaker 45 is connected to the playback processing device 11, and outputs the audio signal supplied from the playback processing device 11 as sound waves.
 ヘッドフォン44は、再生処理装置11に接続され、ユーザUの頭部に装着される。ヘッドフォン44は、再生処理装置11から供給される2chの音声信号をそれぞれ左右の耳(外耳導入口)付近で音波として出力する。スピーカ45は、ヘッドフォン44の代わりに、又は、ヘッドフォン44と併せて再生処理装置11から供給された音声信号を出力する。 The headphones 44 are connected to the playback processing device 11 and are worn on the head of the user U. The headphones 44 output the 2ch audio signals supplied from the playback processing device 11 as sound waves near the left and right ears (external ear inlets). The speaker 45 outputs the audio signals supplied from the playback processing device 11 instead of the headphones 44 or in addition to the headphones 44.
 図2のような再生処理装置11を含む音響制作システムにおいて、ユーザUは再生モード(再生処理)の切替えに関して事前に以下のa乃至dの設定情報を登録しておく。 In an audio production system including a playback processing device 11 as shown in FIG. 2, a user U registers the following setting information a through d in advance regarding switching of playback modes (playback processes).
(Setting information for a to d)
a. Automatic switching of playback modes ON/OFF
b. Default playback mode
c. The state of the user when setting the playback mode to mixdown processing (position of the gaze point, face direction, type of selected window, etc.)
d. The user's state when setting the playback mode to binaural processing (position of gaze point, face direction, type of selected window, etc.)
 aの設定情報は、b乃至dの設定情報に従って再生モードの自動切替えを有効にするか否かを設定する。再生モードの自動切換えがONに設定された場合にのみb乃至dの設定情報に従って再生モードが切り替えられる。bの設定情報は、初期に設定される再生モード(再生処理の種類)、及び、c又はdによる再生モードの設定が行われていないときに設定される再生モード(再生処理の種類)である。cの設定情報は、再生モードをミックスダウン処理(cの再生処理)に設定する条件(cの条件)であり、dの設定情報は、再生モードをバイノーラル処理(dの再生処理)に設定する条件(dの条件)である。cの条件及びdの条件としては、再生処理装置11のセンシング部22が取得するセンシング情報から特定可能なユーザUの状態(動作、操作等も含む)であって、かつ、ユーザが再生モードの切替えのみを意図した操作等以外の特定の状態が設定される。例えば、ユーザUが特定の位置を注視した場合に、c又はdの条件が満たされたこととし、その特定の位置がc又はdの条件を決定する注視点の位置として設定される。ユーザUが注視している位置は、センシング部22が取得するヘッドトラッカ及びアイトラッカの情報やカメラの撮像画像等に基づいて特定され得る。または、ユーザが顔(頭部正面)を特定の向きに向けた場合に、c又はdの条件が満たされたこととし、その特定の向きがc又はdの条件を決定する顔の向きとして設定される。ユーザUの顔の向きは、センシング部22が取得するヘッドトラッカ及びアイトラッカの情報やカメラの撮像画像等に基づいて特定され得る。例えば、図2のように2つのモニタ42A及び42Bが使用されている場合に、ユーザUの注視点の位置又は顔の向きが、一方のモニタ(例えばモニタ42A)である状態を、再生モードがミックスダウン処理に設定されるcの設定情報として登録され得る。ユーザUの注視点の位置又は顔の向きが、他方のモニタ(例えばモニタ42B)である状態を、再生モードがバイノーラル処理に設定されるdの設定情報として登録され得る。また、ユーザUがモニタ42A又は42Bの画面上に表示された1又は複数のウインドウのうち、特定のウインドウを選択した場合に、c又はdの条件が満たされたこととし、その特定のウインドウがc又はdの条件を決定する選択ウインドウの種類として設定される。ユーザUが選択した選択ウインドウの種類は、センシング部22が取得するGUI情報やユーザ操作の情報等から特定され得る。これらの他に、ユーザUの視線方向が特定の方向(複数のディスプレイの間、ディスプレイの画面外、部屋の中の特定の方向)がc又はdの設定条件として登録される場合であってよいし、ユーザUに対するヘッドフォンの着脱状態(装着状態又は離脱状態)がc又はdの設定条件として登録される場合であってよい。また、b乃至dの設定状態は、ユーザが設定するのではなく、予め設定されている場合であってもよい。 The setting information a sets whether or not automatic switching of the playback mode is enabled according to the setting information b to d. The playback mode is switched according to the setting information b to d only when automatic switching of the playback mode is set to ON. The setting information b is the playback mode (type of playback processing) that is initially set, and the playback mode (type of playback processing) that is set when the playback mode is not set by c or d. The setting information c is the condition (condition c) for setting the playback mode to mixdown processing (playback processing of c), and the setting information d is the condition (condition d) for setting the playback mode to binaural processing (playback processing of d). As the condition c and the condition d, a specific state is set that is a state (including actions, operations, etc.) of the user U that can be specified from the sensing information acquired by the sensing unit 22 of the playback processing device 11, and is other than an operation, etc., that the user intends only to switch the playback mode. For example, when the user U gazes at a specific position, it is determined that the condition c or d is satisfied, and the specific position is set as the position of the gaze point that determines the condition c or d. The position where the user U is gazing at may be specified based on the information of the head tracker and eye tracker acquired by the sensing unit 22, the captured image of the camera, and the like. Alternatively, when the user turns his/her face (head front) in a specific direction, it is determined that the condition c or d is satisfied, and the specific direction is set as the facial direction that determines the condition c or d. The facial direction of the user U may be specified based on the information of the head tracker and eye tracker acquired by the sensing unit 22, the captured image of the camera, and the like. For example, when two monitors 42A and 42B are used as shown in FIG. 2, the position of the gaze point or the facial direction of the user U on one monitor (e.g., monitor 42A) may be registered as the setting information of c in which the playback mode is set to mixdown processing. 
The position of the gaze point or the facial direction of the user U on the other monitor (e.g., monitor 42B) may be registered as the setting information of d in which the playback mode is set to binaural processing. Furthermore, when the user U selects a specific window from one or more windows displayed on the screen of the monitor 42A or 42B, it is determined that the condition c or d is satisfied, and the specific window is set as the type of selected window that determines the condition c or d. The type of selected window selected by the user U can be identified from GUI information acquired by the sensing unit 22, information on user operation, and the like. In addition to these, the direction of the user U's line of sight may be a specific direction (between multiple displays, outside the display screen, a specific direction in the room) that is registered as the setting condition of c or d, or the state of the headphones attached or detached (worn or removed) for the user U may be registered as the setting condition of c or d. Furthermore, the setting states b to d may be preset, rather than being set by the user.
 ここで、再生モードの切替えは、例えば、次のような形態が採用され得る。第1の形態としては、cの再生処理は、cの条件が満たされているときに再生モードとして設定され、dの再生処理は、dの条件が満たされているときに再生モードとして設定されることとする。この場合、c及びdのいずれの条件も満たされていないときには再生モードが、bで設定されたデフォルトの再生モードに設定される。第2の形態としては、cの条件が満たされた後、dの条件が満たされるまではcの再生処理が再生モードとして設定され、dの条件が満たされた後、cの条件が満たされるまではdの再生処理が再生モードとして設定される。第3の形態としては、cとdとの設定情報のうちのいずれか一方のみが設定される。例えばcの設定情報が設定されたとする。このとき、cの条件が満たされているときにcの再生処理が再生モードとして設定され、cの条件が満たされていないときにはdの再生処理が再生モードとして設定される。 Here, the following modes can be adopted for switching the playback mode, for example. In a first mode, the playback process of c is set as the playback mode when the condition of c is satisfied, and the playback process of d is set as the playback mode when the condition of d is satisfied. In this case, when neither the condition of c nor the condition of d is satisfied, the playback mode is set to the default playback mode set in b. In a second mode, after the condition of c is satisfied, the playback process of c is set as the playback mode until the condition of d is satisfied, and after the condition of d is satisfied, the playback process of d is set as the playback mode until the condition of c is satisfied. In a third mode, only one of the setting information of c and d is set. For example, it is assumed that the setting information of c is set. In this case, when the condition of c is satisfied, the playback process of c is set as the playback mode, and when the condition of c is not satisfied, the playback process of d is set as the playback mode.
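A minimal sketch of evaluating the registered setting information a to d against the sensed user state (following the first form described above) could look like the following; the dictionary keys, the gaze-target labelling, and the mode strings are assumptions introduced only for illustration.

```python
def select_playback_mode(settings, gaze_target, current_mode):
    """Evaluate the registered setting information a-d against the sensed user state.

    settings    : e.g. {'auto_switch': True, 'default': 'binaural',
                        'mixdown_when': 'monitor_A', 'binaural_when': 'monitor_B'}
    gaze_target : label of what the user is looking at, derived from head/eye tracking
                  or camera images.
    """
    if not settings.get('auto_switch', False):            # setting a
        return current_mode
    if gaze_target == settings.get('mixdown_when'):       # condition c
        return 'mixdown'
    if gaze_target == settings.get('binaural_when'):      # condition d
        return 'binaural'
    return settings.get('default', current_mode)          # setting b (first form)
```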
 According to the first embodiment described above, multiple types of playback processes related to audio playback, including 3D audio playback, are automatically switched based on preset conditions, so the user does not need to manually switch between playback processes. For example, for audio production work, it is easier for the producer to work while listening to audio that has not been subjected to 3D audio playback processing. For checking the produced audio, it is easier to make an appropriate quality judgment by listening to audio that has been subjected to 3D audio playback processing, which reproduces the audio as it would be heard in the actual playback environment (the original sound field). The producer often repeats the audio production work and the confirmation work many times, and manually switching the playback process (playback mode) each time is cumbersome and inefficient. In the first embodiment, the playback mode is switched automatically according to preset conditions. Therefore, if the state of the producer while performing audio production work is, for example, a state of facing downward (toward the hands), that is, gazing at the production (editing) equipment at hand, the playback mode is set to mixdown processing on the condition that the producer is in that state. If the state of the producer while checking the production result is, for example, a state of facing upward (a state of not gazing at the production (editing) equipment at hand), the playback mode is set to binaural processing on the condition that the producer is in that state. As a result, the producer does not need to switch the playback mode manually, and appropriate playback processing is performed according to whether the producer is performing audio production work or checking the production result. Furthermore, when head tracker information (tracking data) is used as the sensing information for detecting the user's state, the tracking data can also be used for the absolute positions of object audio, so the two uses can be combined. Note that the playback mode may be switched according to other states, such as the user's line-of-sight direction or the position of the point of gaze, instead of the orientation of the user U (the direction the head faces forward). In addition, when the playback mode is switched according to which range the orientation or position of the determination target, such as the orientation of the front of the head (face), the line-of-sight direction, or the position of the point of gaze, falls into, the orientations and positions that form the boundaries of the respective ranges are determined in advance. The orientation or position of the determination target is compared with the orientations and positions forming the boundaries of those ranges to determine which range it falls into. In the description of the present technology, the orientations and positions forming the range boundaries are not mentioned further.
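 As one concrete way to realize the range comparison described above, the pitch angle of the head can be compared with a predetermined boundary angle. The following is a minimal sketch under that assumption; the boundary value and the use of a pitch angle are illustrative and are not specified by this description.

```python
LOOK_DOWN_PITCH_DEG = -20.0  # hypothetical boundary between "facing forward/up" and "facing the hands (down)"

def select_playback_process(head_pitch_deg: float) -> str:
    """Return the playback process for the current head orientation.

    head_pitch_deg: pitch of the head-front direction, negative when looking down.
    """
    if head_pitch_deg <= LOOK_DOWN_PITCH_DEG:
        # Looking at the editing equipment at hand: non-3D playback is easier to work with.
        return "mixdown"
    # Looking up / away from the equipment: reproduce the original sound field.
    return "binaural"
```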
<Second embodiment of playback mode switching of the playback processing device 11>
 FIG. 3 is a diagram for explaining a second embodiment of playback mode switching of the playback processing device 11. In the second embodiment, the video presented to the user is switched in conjunction with the switching of the audio playback mode. In A and B of FIG. 3, the user U is a producer at a place where audio production work is performed (the production work place) and is also a listener who listens to the sound of the playback sound field. The user U wears VR goggles (HMD) 51 with headphones. The VR goggles 51 are connected to the playback processing device 11, and the playback processing device 11 supplies the VR goggles 51 with a 2-channel audio signal to be played back by the headphones of the VR goggles 51 and a video signal to be displayed on the VR goggles 51. The video signal supplied from the playback processing device 11 to the VR goggles 51 is switched between a video signal of a CG (Computer Graphics) image generated by the playback processing device 11 as shown in A of FIG. 3 and a video signal of a live-action image captured by a camera (VR outward-facing camera) as shown in B of FIG. 3.
 A of FIG. 3 is a CG image of a virtual space (CG space) in which the space of the original sound field of the audio produced using the multi-channel audio playback software 24 is reproduced by CG; for example, a CG image that reproduces a movie theater screen from the viewpoint of a listener seated at a predetermined position. Note that A of FIG. 3 may be a live-action image obtained by photographing the space of the original sound field instead of a CG image. B of FIG. 3 is a live-action image captured by the VR outward-facing camera of the VR goggles 51 in the frontal direction of the head (or the line-of-sight direction) of the user U who performs audio production with the playback processing device 11. The live-action image shows, for example, the production equipment of the audio production system placed at the production work place (peripheral equipment connected to the playback processing device 11, etc.), such as the console machine 41 and the monitors 42A and 42B shown in FIG. 2. Note that B of FIG. 3 may be a CG image imitating the production work place instead of a live-action image of the production work place. The production work place imitated as a CG image is not limited to the actual production work place and may be a virtual production work place; in that case, the production equipment used in the audio production system (monitors, input devices, etc.) may also be virtual equipment (equipment that does not actually exist).
 The images displayed on the VR goggles 51 are switched automatically in conjunction with the switching of the audio playback mode. For example, suppose that the playback processing device 11 is switched to different playback modes when the user U faces upward and when the user U faces his/her hands (downward). Specifically, when the user U faces upward, the playback mode is set to binaural processing, and when the user U faces his/her hands, the playback mode is set to mixdown processing. Note that the user U facing upward or toward the hands means that the user U's face (front of the head) or line of sight (point of gaze) is directed upward or toward the hands (downward), and whether the user U is facing upward or toward the hands is determined based on the sensing information acquired by the sensing unit 22.
 In conjunction with such switching of the playback mode of the playback processing device 11, the image displayed on the VR goggles 51 is switched between the images of A and B of FIG. 3. Specifically, when the user U faces upward, a CG image such as A of FIG. 3, in which the space of the original sound field is reproduced by CG, is displayed on the VR goggles 51. When the user U faces his/her hands, a live-action image such as B of FIG. 3, obtained by photographing the area at the hands of (around) the user U at the production work place, is displayed.
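 The linkage between the audio playback mode and the presented video described above can be expressed as a single decision that drives both outputs. The following is a minimal sketch of that coupling, assuming hypothetical helper objects for mixdown/binaural rendering, CG rendering, and the outward-facing camera; it is an illustration, not the actual device software.

```python
def update_presentation(user_facing_down: bool, multich_audio, renderer, camera):
    """Switch audio processing and displayed video together, as in the second embodiment."""
    if user_facing_down:
        # Production work: non-3D audio plus a live-action view of the equipment at hand.
        audio_out = renderer.mixdown_to_2ch(multich_audio)      # hypothetical API
        video_out = camera.capture_outward_view()               # hypothetical API
    else:
        # Checking the result: binaural audio plus a CG view of the original sound field.
        audio_out = renderer.binauralize(multich_audio)         # hypothetical API
        video_out = renderer.render_original_field_cg()         # hypothetical API
    return audio_out, video_out
```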
 According to the second embodiment, as in the first embodiment, multiple types of playback processes related to audio playback, including 3D audio playback, are automatically switched based on preset conditions, so the user does not need to manually switch between playback processes. In addition, when the user U faces downward, where the peripheral devices at hand are located, in order to perform audio production work or the like, the user U can view the live-action image of the area at hand displayed on the VR goggles 51 and can easily operate the peripheral devices and so on through the live-action image. At this time, the user U can listen through the headphones to mixdown-processed audio that has not been subjected to 3D audio playback processing, and can perform the production work while listening to audio suitable for audio production. On the other hand, when the user U faces upward in order to check the production result or the like, the user U can view the CG image of the space of the original sound field displayed on the VR goggles 51 and can visually recognize the space of the original sound field. At this time, the user U can listen through the headphones to binaurally processed audio, which is 3D audio playback processing. That is, the user U can check the produced audio by means of the audio that would be heard when it is played back in the environment of the original sound field. Therefore, the user U can appropriately judge the quality of the produced audio while listening, via the binaurally processed audio, to the audio as it would sound when played back in the original sound field environment, and while visually grasping the space of the original sound field through the CG image.
 For example, assume a listener watching a movie in a seat at a predetermined position in a movie theater, which is the original sound field, and assume that the audio heard by that listener in the movie theater is reproduced by binaural processing as the audio of the playback sound field heard by the user U, the producer at the production work place. In this case, a CG image of the space as seen by the assumed listener in the movie theater is displayed as the CG image of A of FIG. 3 presented to the user U on the VR goggles 51. This allows the user U to visually grasp the state of the space of the original sound field, such as the listener's viewing position in the movie theater and the arrangement of the screen and speakers relative to that viewing position. The user U can then listen to the audio that is output from the movie theater speakers and heard by the listener, as the binaurally processed audio of the playback sound field. Therefore, the user U can confirm whether the audio produced using the multi-channel audio playback software 24 is appropriate when heard at the viewing position in the movie theater recognized through the CG image. If the confirmed audio is not appropriate, the user U repeats the audio production (editing) work using the multi-channel audio playback software 24 until it becomes appropriate. When such audio production work and confirmation of the production result are repeated, the audio and video presented to the user U are automatically switched between mixdown-processed audio and video suitable for the audio production work (video of the production work place) and binaurally processed audio and video suitable for confirming the production result (video of the movie theater), so work efficiency is dramatically improved.
 As described above, the video presented to the user U is also switched automatically in conjunction with the automatic switching of the playback mode in the playback processing device 11, so the effort of manually switching between them for the audio production work and the work of checking the production result is reduced, and work efficiency is significantly improved.
<Processing procedure of the second embodiment of playback mode switching of the playback processing device 11>
 FIG. 4 is a flowchart showing an example of the processing procedure of the second embodiment of playback processing switching of the playback processing device 11. In step S1, the sensing unit 22 acquires sensing information indicating the state of the user. In step S2, when the trigger generation unit 23 detects, based on the sensing information acquired in step S1 and setting information determined in advance, that a condition for switching from one of binaural processing and mixdown processing to the other has been satisfied, it supplies a trigger signal indicating this to the playback signal generation unit 21. When the trigger reception unit 31 of the playback signal generation unit 21 receives the trigger signal, it supplies the switching to the playback process indicated by the trigger signal to the switching processing unit 32. The switching processing unit 32 determines the playback mode to be set based on the information from the trigger reception unit 31 and the current playback mode (playback process).
 In step S3, the switching processing unit 32 determines whether the playback mode to be set is binaural processing. If the determination in step S3 is affirmative, the process proceeds to step S4; if negative, the process proceeds to step S7. In step S4, the switching processing unit 32 enables binaural processing in the binaural processing unit 33. The binaural processing unit 33 performs binaural processing on the multi-channel audio signal supplied from the multi-channel audio playback software 24. In step S5, the binaural processing unit 33 generates an audio signal for playback on the audio playback device 12 (a 2-channel audio signal) based on the audio signal after binaural processing, and outputs it to the audio playback device 12. In step S6, the playback signal generation unit 21 generates a CG image of the space of the original sound field and outputs the video signal of the generated CG image to a display device, such as VR goggles, viewed by the user. After step S6, the processing of this flowchart ends. Note that the processing of this flowchart is executed repeatedly.
 In step S7, which is reached when the determination in step S3 is negative, the switching processing unit 32 enables mixdown processing in the 2ch mixdown/stereo playback processing unit 34. The 2ch mixdown/stereo playback processing unit 34 performs mixdown processing on the multi-channel audio signal supplied from the multi-channel audio playback software 24. In step S8, the 2ch mixdown/stereo playback processing unit 34 outputs the 2-channel audio signal after the mixdown processing to the audio playback device 12 as the audio signal for playback on the audio playback device 12. In step S9, the playback signal generation unit 21 acquires a live-action image of the production work place (the space at hand) from the sensing unit 22 (camera) and outputs the video signal of the live-action image to a display device, such as VR goggles, viewed by the user. After step S9, the processing of this flowchart ends. Note that the first embodiment of playback processing switching of the playback processing device 11 differs from the second embodiment in that the processing of steps S6 and S9 is not performed.
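 Steps S1 to S9 above form one pass of a repeating loop. The following is a minimal sketch of that loop, assuming hypothetical objects standing in for the sensing unit, trigger generation unit, processing units, display, and audio device; it mirrors the flowchart rather than the actual product code.

```python
def playback_switching_pass(sensing, trigger_gen, binaural, mixdown,
                            display, audio_dev, multich_signal):
    # S1: acquire sensing information indicating the user's state.
    state = sensing.acquire()                              # hypothetical API
    # S2/S3: decide which playback process the preset conditions call for.
    mode = trigger_gen.decide_mode(state)                  # hypothetical API, "binaural" or "mixdown"
    if mode == "binaural":
        # S4-S5: binaural processing and output of the 2ch playback signal.
        audio_dev.play(binaural.process(multich_signal))
        # S6: CG image of the original sound field space.
        display.show(display.render_original_field_cg())
    else:
        # S7-S8: mixdown processing and output of the 2ch playback signal.
        audio_dev.play(mixdown.process(multich_signal))
        # S9: live-action image of the production work place.
        display.show(sensing.capture_camera_image())
```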
<Third and fourth embodiments of playback mode switching of the playback processing device 11>
 In the third and fourth embodiments of playback mode switching of the playback processing device 11, based on sensing information indicating the state of the user, not only is the playback mode switched between binaural processing and mixdown processing, but the HRTF applied to the binaural processing is also switched or adjusted. In the third and fourth embodiments, switching to mixdown processing as a playback mode is not necessarily a requirement. Therefore, in the description of the third and fourth embodiments, it is assumed that the playback processing device 11 performs only the switching or adjustment of the HRTF applied to binaural processing as the playback mode switching. However, in the third and fourth embodiments, the processing of the first or second embodiment may be combined so that the playback mode can also be switched to mixdown processing.
<Audio production system to which the third and fourth embodiments are applied>
 FIG. 5 is a diagram showing a configuration example of an audio production system to which the third and fourth embodiments of playback mode switching are applied. In FIG. 5, the measurement environment represents the environment in which the HRTF applied to the binaural processing of the playback processing device 11 is acquired in advance by actual measurement. As described above, an audio transfer characteristic such as a BRTF can be applied to the binaural processing instead of an HRTF, and in that case the transfer characteristic to be applied to the binaural processing may be measured in the measurement environment instead of the HRTF.
 In the measurement environment, a movie theater is illustrated as an example of the original sound field. The movie theater as the original sound field is also called a dubbing stage or the like, and is the space of the original sound field that is reproduced as the audio of the playback sound field in audio production. The playback environment represents the environment in which the audio of the original sound field is reproduced as the audio of the playback sound field at the audio production place used for audio production. The audio production place is a place different from the original sound field, such as a studio or the producer's home, but it may be the same place as the original sound field. The measurement processing device 81 shown in the measurement environment acquires an HRTF corresponding to the acoustic characteristics of the original sound field, such as a movie theater, and generates a BRTF file (described later). The measurement processing device 81 also acquires condition information indicating the conditions at the time of measuring the HRTF, and stores the condition information in the BRTF file together with the HRTF.
 The playback processing device 11 in the playback environment corresponds to the playback processing device 11 in FIG. 1, and the headphones 44 are one form of the audio playback device 12 in FIG. 1 connected to the playback processing device 11. The headphones 44 may be the headphones attached to the VR goggles 51 in FIG. 3, or may be another audio playback device. The playback processing device 11 acquires the BRTF file generated by the measurement processing device 81 and sets the parameters used in binaural processing based on the data in the BRTF file. The BRTF file may be made obtainable by the playback processing device 11 via a network such as the Internet, or via a recording medium such as a flash memory.
 FIG. 6 is a diagram showing the flow of measurement in the measurement environment. The HRTF measurement is performed with the person being measured sitting in a predetermined seat in the movie theater and with microphones attached to the ear holes. In this state, sound is played back from a speaker 91 of the movie theater, and the HRTF from the speaker 91 to the ear (for example, the ear hole position or the eardrum position) is measured.
 For example, as shown in balloon #1 of FIG. 6, assume that HRTF measurement is performed with the person being measured sitting in the seat at position A in each of postures 1 to 3. Also, as shown in balloon #2, assume that HRTF measurement is performed with the person being measured sitting in the seat at position B in each of postures 1 to 3. Further, assume that HRTF measurement is performed with the person being measured sitting in the seat at position C in each of postures 1 to 3.
 As shown in balloon #4, spatial shape data indicating the shape of the movie theater is acquired as condition information. For example, the width, height, and depth of the movie theater are recorded as the spatial shape data, as the minimum elements indicating the shape of the theater. Note that information indicating a more detailed shape, such as vertex information or a point cloud, may be recorded as the spatial shape data.
 As shown in balloon #5, position information of the speaker 91, which is the measurement sound source (original sound source) used for the HRTF measurement, is acquired as condition information. For example, coordinates indicating the position of the speaker 91 in the movie theater, and the position on the spatial shape data of the movie theater corresponding to the origin of those coordinates, are recorded as the position information of the speaker 91.
 As shown in balloon #6, measurement position information indicating the position (measurement position) of the person being measured at the time of the HRTF measurement and measurement posture information indicating the posture (measurement posture) are acquired as condition information. For example, coordinates indicating the position of the person being measured in the movie theater, and the position on the spatial shape data of the movie theater corresponding to the origin of those coordinates, are recorded as the measurement position information. For example, the Euler angles of the head of the person being measured are recorded as the measurement posture information.
 The measurement processing device 81 stores the HRTFs and condition information measured as described above in a BRTF file. In the BRTF file, for example, group data consisting of the same types of data is stored for each combination of the positions A to C and the postures 1 to 3. The group data for each combination includes the spatial shape data, the position information of the measurement sound source (original sound source), the measurement position information, the measurement posture information, the transfer characteristic data from the headphones 44 to the ears, and the HRTF measurement data measured with the person being measured sitting in the seat at each position in each measurement posture. However, since the spatial shape data, the position information of the measurement sound source, and the transfer characteristic data from the headphones 44 to the ears are common regardless of the combination of the positions A to C and the postures 1 to 3, they may be stored in the BRTF file as data outside the group data.
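 The contents of the BRTF file described above can be pictured as a small set of records keyed by measurement position and measurement posture. The following is a minimal sketch using Python dataclasses; the field names and types are assumptions chosen for illustration and are not the file format actually used.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class GroupData:
    measurement_position: Tuple[float, float, float]        # coordinates in the theater
    measurement_posture_euler: Tuple[float, float, float]   # head Euler angles
    hrtf_coefficients: List[float]                          # FIR coefficients derived from the measured HRTF

@dataclass
class BRTFFile:
    space_shape: Tuple[float, float, float]                 # width, height, depth of the theater
    source_positions: List[Tuple[float, float, float]]      # positions of the measurement speakers
    headphone_to_ear_response: List[float]                  # transfer characteristic from the headphones 44 to the ears
    groups: Dict[Tuple[str, str], GroupData]                # keyed by (position label, posture label), e.g. ("A", "1")
```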
<Third embodiment of playback mode switching of the playback processing device 11>
 In the third embodiment of playback mode switching of the playback processing device 11, the HRTF applied to the binaural processing is switched based on sensing information indicating the state of the user U, and the audio playback mode is thereby switched. In the third embodiment, it is assumed that the video presented to the user U on the VR goggles 51 is CG video only. Note that a live-action image (such as an image of the area at hand showing the peripheral devices at the production work place) may be displayed depending on the state of the user U.
 FIG. 7 is a block diagram showing a configuration example of the playback processing device 11 to which the third embodiment of playback mode switching is applied. FIG. 7 shows blocks that do not appear in the block diagram of the playback processing device 11 in FIG. 1, but some of the blocks in FIG. 7 are subdivided representations of the blocks shown in FIG. 1, and some of the blocks shown in FIG. 1 are omitted in FIG. 7. In FIG. 7, the playback processing device 11 includes a BRTF file acquisition unit 101, an audio control unit 102, and a display control unit 103. Note that the audio control unit 102 and the display control unit 103 are, with some exceptions, included in the playback signal generation unit 21 in FIG. 1.
 The BRTF file acquisition unit 101 acquires the BRTF file generated by the measurement processing device 81 of FIG. 5. The BRTF file to be acquired is desirably a file storing measurement data measured with the producer (user U) who performs audio production using the playback processing device 11 as the person being measured, but is not limited to this. The BRTF file contains coefficient data, spatial information, and measurement posture information. The coefficient data corresponds to the HRTF measurement data. Binaural processing can be performed by convolution processing using an FIR (Finite Impulse Response) filter. In that case, the coefficients of the FIR filter are set based on the characteristics of the HRTF applied to the binaural processing. The coefficient data in the BRTF file represents the HRTF measurement data as FIR filter coefficient data; alternatively, the process of calculating the FIR filter coefficients from the HRTF measurement data may be performed by the audio control unit 102 or the like after the HRTF measurement data is read from the BRTF file. The spatial information includes the spatial shape data, the position information of the measurement sound source (original sound source), and the measurement position information. In the coefficient data, the spatial information, and the measurement posture information, a plurality of data measured at a plurality of measurement positions (positions A to C in FIG. 6) and in a plurality of measurement postures (postures 1 to 3 in FIG. 6) are recorded in association with the measurement positions and measurement postures, as described with reference to FIG. 6. Note that although the playback processing device 11 acquires FIR filter coefficient data from the BRTF file as the HRTF measurement data specifying the content of the binaural processing, the measurement data acquired to specify the content of the binaural processing does not have to be FIR filter coefficient data. Since the content of the binaural processing can be specified by acoustic characteristics (transfer characteristics) such as the HRTF in the original sound field, the playback processing device 11 may acquire information on the acoustic characteristics in the original sound field. Furthermore, instead of acquiring information on the acoustic characteristics in the original sound field obtained by actual measurement from the BRTF file, the playback processing device 11 may theoretically calculate the information on the acoustic characteristics in the original sound field based on the spatial shape, the position of the original sound source, the measurement position (listening position), and the like.
 The audio control unit 102 includes the binaural processing unit 33 of FIG. 1. The audio control unit 102 includes a coefficient reading unit 111, a convolution processing unit 112, and an audio playback processing unit 113. The coefficient reading unit 111 acquires information (playback posture information) on the posture (playback posture) of the user U, who is the producer, at the time of audio playback (the current time) from the playback posture information acquisition unit 126, and reads the coefficient data (HRTF measurement data) corresponding to the playback posture of the user U from the BRTF file. The coefficient data corresponding to the playback posture is the coefficient data corresponding to the HRTF measured in a measurement posture close to the playback posture. Note that when the coefficient data in the BRTF file includes a plurality of coefficient data measured at a plurality of measurement positions, the coefficient data acquired by the coefficient reading unit 111 is, for example, the coefficient data measured at the measurement position specified in advance by the user U. In the third embodiment, the BRTF file may also contain only data measured at a single measurement position. In that case, among the coefficient data corresponding to the HRTFs measured at that measurement position, the coefficient data measured in the measurement posture close to the playback posture is read by the coefficient reading unit 111.
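 Selecting the coefficient data whose measurement posture is closest to the current playback posture can be realized with a simple nearest-neighbor search over the stored postures. The following is a minimal sketch under that assumption; it reuses the hypothetical BRTFFile/GroupData layout sketched above, and the angular distance metric is an illustrative simplification.

```python
def angular_distance(a, b):
    # Sum of absolute yaw/pitch/roll differences; a simple stand-in for a proper rotation distance.
    return sum(abs(x - y) for x, y in zip(a, b))

def select_coefficients(brtf, position_label, playback_posture_euler):
    """Pick the FIR coefficients measured in the posture closest to the current playback posture."""
    candidates = [g for (pos, _), g in brtf.groups.items() if pos == position_label]
    best = min(candidates,
               key=lambda g: angular_distance(g.measurement_posture_euler, playback_posture_euler))
    return best.hrtf_coefficients
```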
 The convolution processing unit 112 sets the coefficients of the FIR filter from the coefficient data read by the coefficient reading unit 111. The convolution processing unit 112 performs convolution processing using the FIR filter on the audio signal supplied from the multi-channel audio playback software 24 of FIG. 1. As a result, binaural processing in which an HRTF corresponding to the posture of the user U is applied is performed on the audio signal supplied from the multi-channel audio playback software 24. The audio playback processing unit 113 outputs the audio signal binaurally processed by the convolution processing unit 112 to the audio playback device 12 of FIG. 1.
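 The convolution itself is an ordinary FIR filtering of each input channel with left-ear and right-ear impulse responses, followed by summation into a 2-channel signal. The following is a minimal NumPy sketch of that operation for one block of audio; it assumes per-channel left/right coefficient arrays and omits details such as block-wise overlap handling.

```python
import numpy as np

def binauralize_block(channels, left_firs, right_firs):
    """channels: list of 1-D numpy arrays (one per speaker channel).
    left_firs / right_firs: FIR coefficients per channel (from the HRTF measurement data).
    Returns a (2, N) array holding the left-ear and right-ear signals."""
    n = len(channels[0])
    left = np.zeros(n)
    right = np.zeros(n)
    for x, hl, hr in zip(channels, left_firs, right_firs):
        # Convolve each channel with its ear-specific impulse response and accumulate.
        left += np.convolve(x, hl)[:n]
        right += np.convolve(x, hr)[:n]
    return np.stack([left, right])
```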
 The display control unit 103 generates the CG video to be displayed on a display device such as VR goggles. The display control unit 103 includes a spatial information reading unit 121, a CG model acquisition unit 122, a CG data storage unit 123, a measurement position information reading unit 124, a CG space drawing unit 125, a playback posture information acquisition unit 126, and a video drawing processing unit 127. The spatial information reading unit 121 reads the spatial shape data and the position information of the measurement sound source (original sound source) included in the spatial information from the BRTF file acquired by the BRTF file acquisition unit 101.
 Based on the spatial shape data and the position information of the measurement sound source (original sound source) read by the spatial information reading unit 121, the CG model acquisition unit 122 acquires, from the CG data storage unit 123, material data of 3D models corresponding to the objects existing in the original sound field (walls, floor, ceiling, speakers, screen, seats, etc.), and generates a CG model imitating the space of the original sound field in a virtual space (CG space).
 The measurement position information reading unit 124 reads the measurement position information included in the spatial information from the BRTF file acquired by the BRTF file acquisition unit 101. The CG space drawing unit 125 renders the CG space generated by the CG model acquisition unit 122 and generates a 2D CG image. The position of the virtual camera (viewpoint) for rendering is set, based on the measurement position information acquired by the measurement position information reading unit 124, to the position in the CG space corresponding to the measurement position in the original sound field. The posture of the virtual camera (viewpoint) for rendering is set, based on the current playback posture of the user U acquired by the playback posture information acquisition unit 126, to a posture corresponding to the posture of the user U at the production work place. Note that when a plurality of pieces of measurement position information are included as the spatial information of the BRTF file, for example, the measurement position information specified in advance by the user U is read by the measurement position information reading unit 124, and the virtual camera for rendering is set at the position in the CG space corresponding to that measurement position information. The measurement position information referred to as the position of the virtual camera in the CG space is the same as the measurement position information associated with the coefficient data acquired by the coefficient reading unit 111. The CG space drawing unit 125 generates, by rendering, a 2D CG image captured by the virtual camera set in the CG space.
 The playback posture information acquisition unit 126 acquires the playback posture information of the user U at the time of audio playback by the playback processing device 11 (the current time), based on the sensing information of the sensing unit 22 of FIG. 1. The playback posture information of the user U is, for example, the posture of the user U's head. The posture of the user U's head can be recognized from head tracker information acquired by the sensing unit 22. When the user U is wearing VR goggles (HMD), the head tracker can detect the posture of the user U's head with an IMU (Inertial Measurement Unit) installed in the VR goggles. When the sensing unit 22 acquires images captured by a camera that photographs the user U, the posture of the user U's head may instead be detected from those captured images.
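 When the head posture is delivered by an IMU as a quaternion, it can be converted into the Euler-angle form used for the measurement posture information. A minimal sketch of one common conversion is shown below; the quaternion input and the Z-Y-X axis convention are assumptions and would have to match the convention used when recording the measurement postures.

```python
import math

def quaternion_to_euler(w, x, y, z):
    """Convert a unit quaternion to (yaw, pitch, roll) in degrees, Z-Y-X convention."""
    yaw = math.atan2(2 * (w * z + x * y), 1 - 2 * (y * y + z * z))
    pitch = math.asin(max(-1.0, min(1.0, 2 * (w * y - z * x))))
    roll = math.atan2(2 * (w * x + y * z), 1 - 2 * (x * x + y * y))
    return tuple(math.degrees(a) for a in (yaw, pitch, roll))
```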
 The video drawing processing unit 127 generates a video signal for displaying the 2D CG image generated by the CG space drawing unit 125 on a display device, such as VR goggles, connected to the playback processing device 11, and outputs it to the display device.
 According to the third embodiment of playback mode switching of the playback processing device 11, the user U, who is the producer performing audio production at the audio production work place, can listen, via binaurally processed audio, to the audio that would be heard if the audio produced with the multi-channel audio playback software 24 were played back in the original sound field. When the user U changes the playback posture, the HRTF is changed according to the playback posture, so the user U can hear the audio that a listener in the original sound field would hear if the listener changed posture in the same way. In addition, the user U is presented with the CG image viewed from the listening position in the original sound field, and when the user U changes the playback posture, the space of the original sound field as it would be seen by a listener who changed posture in the same way is presented as the CG image. Therefore, the user U can perform the audio production work and check the production result while listening to realistic audio and viewing the CG video. Moreover, since the playback process (playback mode) is switched automatically as the playback posture changes, the user U can reduce the effort of switching the playback mode, and work efficiency is significantly improved.
<Processing procedure of the third embodiment of playback mode switching of the playback processing device 11>
 FIG. 8 is a flowchart showing an example of the processing procedure of the third embodiment of playback mode switching of the playback processing device 11. In step S11, the BRTF file acquisition unit 101 acquires the BRTF file generated by the measurement processing device 81 of FIG. 5. In step S12, the spatial information reading unit 121 reads the spatial shape data and the position information of the measurement sound source (original sound source) included in the spatial information from the BRTF file acquired in step S11. In step S13, the measurement position information reading unit 124 reads the measurement position information from the BRTF file acquired in step S11. In step S14, the playback posture information acquisition unit 126 acquires the current playback posture information of the user U.
 In step S15, based on the spatial information (the spatial shape data and the position information of the measurement sound source) read in step S12, the CG model acquisition unit 122 acquires, from the CG data storage unit 123, material data of 3D models corresponding to the objects existing in the original sound field (walls, floor, ceiling, speakers, screen, seats, etc.), and generates a CG model imitating the space of the original sound field in a virtual space (CG space). In step S16, the CG space drawing unit 125 sets the position and posture of the virtual camera for rendering the CG space based on the measurement position information read in step S13 and the playback posture information acquired in step S14, and generates a 2D CG image by rendering. In step S17, the video drawing processing unit 127 generates a video signal for displaying the CG image generated in step S16 on the display device connected to the playback processing device 11, and outputs it to the display device.
 In step S18, based on the playback posture information acquired in step S14, the coefficient reading unit 111 reads the coefficient data corresponding to the playback posture of the user U from the BRTF file acquired in step S11. In step S19, the convolution processing unit 112 sets the coefficients of the FIR filter based on the coefficient data read in step S18, and performs convolution processing (binaural processing) using the FIR filter on the audio signal supplied from the multi-channel audio playback software 24 of FIG. 1. In step S20, the audio playback processing unit 113 outputs the audio signal binaurally processed by the convolution processing unit 112 to the audio playback device 12 of FIG. 1. When the processing of step S20 ends, the processing of this flowchart ends. The processing of this flowchart is executed repeatedly.
 Note that in the third embodiment of playback mode switching, only the switching or adjustment of the HRTF applied to binaural processing is performed as the playback mode switching. However, this is not a limitation, and the playback mode switching may also include switching to mixdown processing. In that case, when the playback mode is switched to mixdown processing, a live-action image or a CG image of the production work place may be displayed on the display device.
<Fourth embodiment of playback mode switching of the playback processing device 11>
 In the fourth embodiment of playback mode switching of the playback processing device 11, the HRTF (or BRTF) applied to the binaural processing is adjusted based on sensing information indicating the state of the user U, and the audio playback mode is thereby switched. In the fourth embodiment, as in the second embodiment, the video presented to the user U on the VR goggles 51 is switched between CG video and live-action video in conjunction with the switching of the playback mode.
 FIG. 9 is a block diagram showing a configuration example of the playback processing device 11 to which the fourth embodiment of playback mode switching is applied. In the figure, parts common to FIG. 7 are given the same reference numerals, and their description is omitted as appropriate. The playback processing device 11 of FIG. 9 is common to the playback processing device 11 of FIG. 7 in that it includes a BRTF file acquisition unit 101, an audio control unit 102, and a display control unit 103. The audio control unit 102 of FIG. 9 includes a coefficient reading unit 111, a convolution processing unit 112, an audio playback processing unit 113, a reverberation amount adjustment setting value reading unit 141, and a reverberation amount adjustment processing unit 142. The display control unit 103 of FIG. 9 includes a spatial information reading unit 121, a CG model acquisition unit 122, a CG data storage unit 123, a measurement position information reading unit 124, a playback posture information acquisition unit 126, a video drawing processing unit 127, a CG space drawing/video output switching unit 131, a hand space video acquisition unit 132, and a user operation information acquisition unit 133.
 Accordingly, the audio control unit 102 of FIG. 9 is common to the audio control unit 102 of FIG. 7 in that it includes the coefficient reading unit 111, the convolution processing unit 112, and the audio playback processing unit 113. However, the audio control unit 102 of FIG. 9 differs from the audio control unit 102 of FIG. 7 in that it additionally includes the reverberation amount adjustment setting value reading unit 141 and the reverberation amount adjustment processing unit 142. The display control unit 103 of FIG. 9 is common to the display control unit 103 of FIG. 7 in that it includes the spatial information reading unit 121, the CG model acquisition unit 122, the CG data storage unit 123, the measurement position information reading unit 124, the playback posture information acquisition unit 126, and the video drawing processing unit 127. However, the display control unit 103 of FIG. 9 differs from the display control unit 103 of FIG. 7 in that it includes the CG space drawing/video output switching unit 131 instead of the CG space drawing unit 125 of FIG. 7, and in that it additionally includes the hand space video acquisition unit 132 and the user operation information acquisition unit 133.
 In the audio control unit 102 of FIG. 9, the reverberation amount adjustment setting value reading unit 141 acquires the playback posture information of the user U, who is the producer, at the time of audio playback (the current time) from the playback posture information acquisition unit 126, and reads a reverberation adjustment setting value corresponding to the playback posture of the user U. The reverberation adjustment setting value is a value for adjusting the RTF (room transfer function) or the RIR (room impulse response) in the binaural processing. For example, when the coefficient data acquired from the BRTF file by the coefficient reading unit 111 is coefficient data corresponding to a transfer characteristic such as an HRTF that does not take the RTF into account, the reverberation adjustment setting value can be coefficient data based on the RTF (producing reverberation) that is added to the coefficient data acquired from the BRTF file. In this case, the reverberation adjustment setting value is set to a value predetermined according to the playback posture of the user U. For example, a first playback mode in which the audio heard in the environment of the original sound field is reproduced by binaural processing and a second playback mode in which the audio heard in the environment of the audio production work place is reproduced by binaural processing are switched according to the playback posture of the user U. In this case, in the first playback mode, a setting value is used such that coefficient data producing a large reverberation is added to the coefficient data acquired from the BRTF file. In the second playback mode, a setting value is used such that coefficient data producing a small reverberation is added. A reverberation adjustment setting value corresponding to the playback posture of the user U may also be used within the first playback mode and within the second playback mode. In the second playback mode, mixdown processing may be performed instead of binaural processing.
 On the other hand, when the coefficient data acquired from the BRTF file by the coefficient reading unit 111 is coefficient data corresponding to a transfer characteristic such as a BRTF that takes the RTF into account, the reverberation adjustment setting value can be a value (gain) that adjusts the magnitude of the coefficient data, among the coefficient data acquired from the BRTF file, that is strongly influenced by the RTF (that strongly influences the reverberation). In this case, the reverberation adjustment setting value is set to a value predetermined according to the playback posture of the user U. For example, as described above, the first playback mode and the second playback mode are switched according to the playback posture of the user U. In the first playback mode, a gain producing a large reverberation is used as the reverberation adjustment setting value for the coefficient data acquired from the BRTF file. In the second playback mode, a gain producing a small reverberation is used. A reverberation adjustment setting value corresponding to the playback posture of the user U may also be used within the first playback mode and within the second playback mode. In the second playback mode, mixdown processing may be performed instead of binaural processing.
 The reverberation amount adjustment processing unit 142 adjusts the coefficient data acquired from the BRTF file by the coefficient reading unit 111, based on that coefficient data and the reverberation adjustment setting value read by the reverberation amount adjustment setting value reading unit 141, so that a reverberation characteristic corresponding to the reverberation adjustment setting value is added, and generates FIR filter coefficients that take into account the reverberation characteristic corresponding to the reverberation adjustment setting value (the playback posture). The generated coefficients are set as the coefficients of the FIR filter in the convolution processing unit 112, and binaural processing is performed.
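 One simple way to realize the gain-based adjustment described for RTF-influenced coefficients is to scale the late part of the impulse response, which is dominated by room reverberation, while leaving the early part (direct sound and early reflections) untouched. The following is a minimal sketch under that assumption; the split point and gain values are illustrative and are not taken from this description.

```python
import numpy as np

def adjust_reverberation(brir, sample_rate, reverb_gain, early_ms=5.0):
    """Scale the reverberant tail of a binaural room impulse response (one ear).

    brir: 1-D numpy array of FIR coefficients.
    reverb_gain: e.g. 1.0 to keep the measured reverberation (first playback mode),
                 a small value such as 0.1 to suppress it (second playback mode).
    early_ms: assumed boundary between the early part and the reverberant tail.
    """
    split = int(sample_rate * early_ms / 1000.0)
    adjusted = brir.copy()
    adjusted[split:] *= reverb_gain
    return adjusted
```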
 Here, the coefficient data acquired from the BRTF file by the coefficient reading unit 111 may be coefficient data associated with the measurement posture corresponding to the playback posture of the user U, as in the third embodiment, or may be coefficient data associated with fixed measurement posture information regardless of the playback posture of the user U. The reverberation adjustment setting value may be a value corresponding to the line-of-sight direction of the user U rather than a value corresponding to the playback posture of the user U. In addition, the mode may be switched to the above-described first playback mode when the user U faces somewhere other than the hands (downward) and to the above-described second playback mode when the user U faces the hands (downward). For example, when the user U faces the hands (downward), the reverberation adjustment setting value may be a value that produces almost no reverberation so that the audio production work is easy to perform, and when the user U faces upward, it may be a value that produces the reverberation occurring in the original sound field so that the production result can be checked appropriately. The case where the user U faces upward or toward the hands is not limited to the case where the front of the head of the user U faces upward or toward the hands, and may be the case where the line-of-sight direction of the user U is directed upward or toward the hands.
 In the display control unit 103 of FIG. 9, the CG space drawing/video output switching unit 131, like the CG space drawing unit 125 of the third embodiment, sets the position and posture of the virtual camera based on the measurement position information acquired by the measurement position information reading unit 124 and the playback posture information acquired by the playback posture information acquisition unit 126, renders the CG space generated by the CG model acquisition unit 122, and generates a 2D CG image. The generated CG image is supplied to the video drawing processing unit 127 (in the case of the first playback mode described above). On the other hand, in the case of the second playback mode described above, for example, when the CG space drawing/video output switching unit 131 detects, based on the playback posture information acquired by the playback posture information acquisition unit 126, that the user U has turned toward the hands, it switches from generating the CG image and instead acquires the live-action image of the space at the hands of the user U obtained by the hand space video acquisition unit 132, and supplies it to the video drawing processing unit 127. The hand space video acquisition unit 132 can acquire, for example, an image captured by an outward-facing camera installed on the VR goggles or the like as the live-action image of the space at hand. The video drawing processing unit 127 generates a video signal for displaying the CG image or the live-action image from the CG space drawing/video output switching unit 131 on the display device connected to the playback processing device 11, and outputs it to the display device.
 ユーザ操作情報取得部133は、ミキシングコンソール等の図2のコンソール機41が操作された場合や、モニタ等に表示されたGUIに対する操作が行われた場合に、その操作内容等をユーザ操作情報として取得する。CG空間描画/映像出力切替え部131は、実写映像を映像描画処理部127に供給して表示装置に表示させている際に、ユーザ操作情報取得部133からのユーザ操作情報に基づいて、ユーザ操作の操作内容を示す情報(操作情報)を実写映像に重畳させる。これにより、表示装置に実写映像に重ねて操作情報が提示される。 When a console device 41 in FIG. 2 such as a mixing console is operated, or when an operation is performed on a GUI displayed on a monitor or the like, the user operation information acquisition unit 133 acquires the operation content as user operation information. When the CG space rendering/video output switching unit 131 supplies live-action video to the video rendering processing unit 127 for display on the display device, it superimposes information indicating the user's operation content (operation information) on the live-action video based on the user operation information from the user operation information acquisition unit 133. As a result, the operation information is presented on the display device superimposed on the live-action video.
 図10は、コンソール機41をユーザUが操作した際にコンソール機41の実写映像に重ねて表示される操作情報を例示した図である。図10においてコンソール機41は実写映像である。これに対して操作情報161が重ねて表示される。操作情報161は、コンソール機41の操作された部分の拡大図(CG映像)と、操作によって変更された数値(編集値)の情報とを含む。このような操作情報161が実写映像に重ねてユーザUに提示されることで、実写映像だけでは操作し難いコンソール機41等の操作が行い易くなる。なお、CG空間描画/映像出力切替え部131がCG映像を映像描画処理部127に供給して表示装置に表示させている場合においても、図10のような操作情報をCG映像に重畳させてユーザUに提示されるようにしてもよい。操作情報は、操作によって変更された数値のみであってもよく、操作情報は、図10に示した形態に限らない。また、フェーダーコントローラでトラックのボリュームを編集する場合、マウスやエンコーダなどでパニングを編集する場合、キーボードでトラック名や数値などを入力する場合等においても、それらの操作装置の実写映像に重ねて操作情報を表示させるようにしてもよい。コンソール機41については、実写映像ではなく、CG映像を採用するとともに、そのコンソール機41のCG映像に、操作情報を重畳して表示し、ユーザUの手元には、コンソール機41ではなく、コンソール機41を模した平面や、箱、凹凸のあるパネル等を配置することができる。加えて、コンソール機41のフェーダやつまみなどを操作する感覚を、触覚再現ハプティクス技術で再現することができる。この場合、実際にはユーザUの手元にないコンソール機41があるかのように見せかけることができる。さらに、ユーザUはVRゴーグルを被って作業を行うことで、音響制作作業場所が、実物のコンソール機41等の音響機器があるスタジオや作業室等ではなくても、そのようなスタジオ等にいるかのような感覚を享受することができる。 FIG. 10 is a diagram illustrating an example of operation information displayed superimposed on the live-action image of the console device 41 when the user U operates the console device 41. In FIG. 10, the console device 41 is shown as a live-action image, and operation information 161 is displayed superimposed on it. The operation information 161 includes an enlarged view (CG image) of the operated part of the console device 41 and information on the numerical value (edited value) changed by the operation. By presenting such operation information 161 to the user U superimposed on the live-action image, it becomes easier to operate the console device 41 and the like, which would be difficult to operate from the live-action image alone. Note that even when the CG space rendering/video output switching unit 131 supplies the CG image to the video rendering processing unit 127 to display it on the display device, operation information such as that shown in FIG. 10 may be superimposed on the CG image and presented to the user U. The operation information may be only the numerical value changed by the operation, and the operation information is not limited to the form shown in FIG. 10. In addition, when editing the volume of a track with a fader controller, editing panning with a mouse or encoder, or inputting a track name, a numerical value, or the like with a keyboard, the operation information may be displayed superimposed on the live-action image of the corresponding operation device. For the console device 41, a CG image may be adopted instead of a live-action image, with the operation information displayed superimposed on the CG image of the console device 41; in that case, instead of the console device 41 itself, a flat surface, a box, an uneven panel, or the like imitating the console device 41 can be placed at the user U's hands. In addition, the sensation of operating the faders, knobs, and the like of the console device 41 can be reproduced by haptic reproduction technology. In this way, it is possible to make it appear to the user U as if the console device 41 were at hand even though it is not actually there. Furthermore, by wearing VR goggles while working, the user U can enjoy the sensation of being in a studio or workroom containing audio equipment such as the actual console device 41, even if the audio production work place is not such a studio or workroom.
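 One simple way to superimpose the operation information on the live-action video is alpha blending of a small rendered patch onto the camera frame. The sketch below assumes HxWx3 uint8 NumPy arrays and is only one possible realization; the patch contents (enlarged control and edited value) would be rendered separately.

```python
import numpy as np

def overlay_operation_info(frame, info_patch, top_left, alpha=0.8):
    """Alpha-blend a small rendered patch showing the operation information
    (e.g. the enlarged control and its edited value) onto a live-action frame.
    Both arrays are assumed to be HxWx3 uint8 images."""
    y, x = top_left
    h, w, _ = info_patch.shape
    roi = frame[y:y + h, x:x + w].astype(np.float32)
    blended = alpha * info_patch.astype(np.float32) + (1.0 - alpha) * roi
    frame[y:y + h, x:x + w] = blended.astype(np.uint8)
    return frame

# Example: blend a 64x128 patch near the top-left corner of a camera frame.
camera_frame = np.zeros((480, 640, 3), dtype=np.uint8)
patch = np.full((64, 128, 3), 255, dtype=np.uint8)
camera_frame = overlay_operation_info(camera_frame, patch, top_left=(10, 10))
```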
 再生処理装置11の再生モード切替えの第4実施例によれば、音響制作作業場所で音響制作を行っている制作者であるユーザUは、多ch音声再生ソフト24により制作した音声を原音場で再生した場合に聴取する音声を、バイノーラル処理された音声により聴取することができる。また、ユーザUが再生姿勢を変えることで、再生姿勢に応じた残響特性が考慮されたHRTFに調整され、原音場の聴取者が同様に姿勢を変えた場合に聴取する音声を聴取することができる。また、ユーザUには原音場での聴取位置で視認するCG映像と、手元空間の実写映像とが自動的に切り替えられて提示される。したがって、再生モードの自動切替えと連動してユーザUに提示される映像も自動で切り替えられるので、ユーザUが音響制作の作業と制作結果の確認作業とでそれらを手動で切り替える手間を低減することができ、作業効率が格段に向上する。 According to the fourth embodiment of the playback mode switching of the playback processing device 11, the user U, who is a producer performing audio production at the audio production work place, can listen, as binaurally processed audio, to the sound that would be heard if the audio produced with the multi-channel audio playback software 24 were played back in the original sound field. In addition, when the user U changes the playback posture, the HRTF is adjusted to take into account the reverberation characteristics corresponding to that playback posture, so that the user U can hear the sound that a listener in the original sound field would hear if the listener similarly changed posture. Furthermore, the CG image viewed from the listening position in the original sound field and the live-action image of the hand space are automatically switched and presented to the user U. Therefore, the image presented to the user U is switched automatically in conjunction with the automatic switching of the playback mode, so that the effort required for the user U to manually switch between the audio production work and the checking of the production results is reduced, and work efficiency is significantly improved.
<再生処理装置11の再生モード切替えの第4実施例の処理手順>
 図11は、再生処理装置11の再生処理切替えの第4実施例の処理手順例を示したフローチャートである。尚、図11において、ステップS41乃至S44、及び、ステップS46乃至S48は、第3実施例における図8のステップS11乃至S17と共通するので、説明を省略する。ステップS45では、CG空間描画/映像出力切替え部131は、ステップS44で取得された再生姿勢情報に基づいて、出力映像がCG映像か実写映像(手元映像)かを判定する。例えば、ユーザUの姿勢(頭部正面)が上を向いている場合には、出力映像がCG映像であると判定し、ユーザUの姿勢が手元を向いている場合には、出力映像が実写映像であると判定する。ステップS45において、出力映像が実写映像であると判定された場合、処理はステップS49に進み、CG空間描画/映像出力切替え部131は、手元空間映像取得部132から手元空間の実写映像を取得する。ステップS50では、CG空間描画/映像出力切替え部131は、ユーザ操作情報取得部133からユーザ操作情報を取得する。CG空間描画/映像出力切替え部131は、ユーザ操作情報に基づいてユーザ操作が行われたことを検出した場合には、その操作内容を示す操作情報をステップS49で取得した実写映像に重畳する。ステップS51では、映像描画処理部127は、ステップS49で取得された実写映像、又は、ステップS50で操作情報が重畳された実写映像を表示装置に表示するための映像信号を生成して表示装置に出力する。処理はステップS51からステップS52に進む。
<Processing Procedure of Playback Mode Switching in the Playback Processing Device 11 in the Fourth Example>
 FIG. 11 is a flow chart showing an example of a processing procedure of the fourth embodiment of the playback process switching of the playback processing device 11. In FIG. 11, steps S41 to S44 and steps S46 to S48 are common to steps S11 to S17 in FIG. 8 in the third embodiment, so the description will be omitted. In step S45, the CG space rendering/video output switching unit 131 judges whether the output image is a CG image or a live-action image (hand image) based on the playback posture information acquired in step S44. For example, if the posture (head front) of the user U is facing up, it is judged that the output image is a CG image, and if the posture of the user U is facing the hands, it is judged that the output image is a live-action image. In step S45, if it is judged that the output image is a live-action image, the process proceeds to step S49, and the CG space rendering/video output switching unit 131 acquires a live-action image of the hand space from the hand space image acquisition unit 132. In step S50, the CG space rendering/video output switching unit 131 acquires user operation information from the user operation information acquisition unit 133. When the CG space rendering/video output switching unit 131 detects that a user operation has been performed based on the user operation information, it superimposes operation information indicating the operation content on the live-action video acquired in step S49. In step S51, the video rendering processing unit 127 generates a video signal for displaying on the display device the live-action video acquired in step S49 or the live-action video on which the operation information has been superimposed in step S50, and outputs the video signal to the display device. The process proceeds from step S51 to step S52.
 ステップS52では、残響量調整設定値読込部141は、ステップS44で取得された再生姿勢情報に基づいて、ユーザUの再生姿勢に対応する残響音調整設定値を読み込む。ステップS53では、係数読込部111は、ステップS44で取得された再生姿勢情報に基づいて、ユーザUの再生姿勢に対応する係数データをステップS41で取得されたBRTFファイルから読み込む。なお、係数読込部111が読み込む係数データは、ユーザUの再生姿勢とは関係なく所定の再生姿勢に対応する係数データであってもよい。ステップS54では、残響量調整処理部142は、ステップS53で取得された係数データと、ステップS52で取得された残響音調整設定値とに基づいて、係数データを調整し、再生姿勢に応じた残響量を考慮したFIRフィルタの係数を生成する。ステップS55では、畳み込み処理部112は、ステップS54で生成された係数を、FIRフィルタの係数として設定し、図1の多ch音声再生ソフト24から供給される音声信号に対して、FIRフィルタを用いた畳み込み処理(バイノーラル処理)を行う。ステップS56では、音声再生処理部113は、畳み込み処理部112によりバイノーラル処理された音声信号を図1の音声再生装置12に出力する。ステップS56の処理が終了すると、本フローチャートの処理が終了する。本フローチャートの処理は繰り返し実行される。 In step S52, the reverberation adjustment setting value reading unit 141 reads the reverberation adjustment setting value corresponding to the playback posture of user U based on the playback posture information acquired in step S44. In step S53, the coefficient reading unit 111 reads coefficient data corresponding to the playback posture of user U from the BRTF file acquired in step S41 based on the playback posture information acquired in step S44. Note that the coefficient data read by the coefficient reading unit 111 may be coefficient data corresponding to a specific playback posture regardless of the playback posture of user U. In step S54, the reverberation adjustment processing unit 142 adjusts the coefficient data based on the coefficient data acquired in step S53 and the reverberation adjustment setting value acquired in step S52, and generates coefficients of an FIR filter that take into account the reverberation amount according to the playback posture. In step S55, the convolution processing unit 112 sets the coefficients generated in step S54 as the coefficients of an FIR filter, and performs convolution processing (binaural processing) using the FIR filter on the audio signal supplied from the multi-channel audio playback software 24 in FIG. 1. In step S56, the audio playback processing unit 113 outputs the audio signal that has been binaurally processed by the convolution processing unit 112 to the audio playback device 12 in FIG. 1. When the processing of step S56 ends, the processing of this flowchart ends. The processing of this flowchart is executed repeatedly.
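 The reverberation adjustment and convolution of steps S52 to S55 can be sketched as follows. The split between the direct part and the reverberant tail of the BRTF impulse response, the single-channel input, and the use of scipy.signal.fftconvolve are simplifying assumptions for illustration; a real implementation would sum the contributions of all playback channels per ear and process audio block by block.

```python
import numpy as np
from scipy.signal import fftconvolve

def adjust_reverberation(brtf_ir, reverb_gain, direct_len=2048):
    """Scale the late (reverberant) part of a BRTF impulse response.
    The split point between direct sound and reverberation is an assumed value."""
    ir = brtf_ir.copy()
    ir[direct_len:] *= reverb_gain
    return ir

def binaural_render(audio_mono, brtf_left, brtf_right, reverb_gain):
    """Convolve one input channel with posture-dependent BRTF coefficients
    whose reverberation amount has been adjusted (cf. steps S52 to S55)."""
    h_l = adjust_reverberation(brtf_left, reverb_gain)
    h_r = adjust_reverberation(brtf_right, reverb_gain)
    out_l = fftconvolve(audio_mono, h_l, mode="full")
    out_r = fftconvolve(audio_mono, h_r, mode="full")
    return np.stack([out_l, out_r])

# Example with random data standing in for the real signal and BRTF file.
rng = np.random.default_rng(0)
audio = rng.standard_normal(48000)
ir_l = rng.standard_normal(4096)
ir_r = rng.standard_normal(4096)
stereo_out = binaural_render(audio, ir_l, ir_r, reverb_gain=0.2)
```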
<他の実施例>
 再生処理装置11における再生モード切替えは、再生処理装置11に接続された任意の操作装置の操作に連携して行われるようにしてもよい。図12には、再生処理装置11に接続される操作装置の例として、コントローラ171やコンソール機172が例示されている。例えば、コントローラ171のジョイスティック171Aの傾斜角度を図1のセンシング部22が取得することとする。このとき、トリガ生成部23は、ジョイスティック171Aの特定の傾斜角度を検出した場合に再生モード切替えのトリガ信号を出力する。
<Other Examples>
 The switching of the playback mode in the playback processing device 11 may be performed in coordination with the operation of an arbitrary operation device connected to the playback processing device 11. FIG. 12 illustrates a controller 171 and a console device 172 as examples of operation devices connected to the playback processing device 11. For example, it is assumed that the sensing unit 22 in FIG. 1 acquires the tilt angle of the joystick 171A of the controller 171. At this time, the trigger generation unit 23 outputs a trigger signal for switching the playback mode when a specific tilt angle of the joystick 171A is detected.
 具体例として、ジョイスティック171Aの傾斜角度を原音場における聴取者の頭部中心から原音源までの距離に連携させる。このとき、ジョイスティック171Aの傾斜角度が、頭部と原音源とが重なる角度又は頭部中心と原音源とが十分に近いとみなせる角度であることが検出された場合には、トリガ生成部23は、再生モードがミックスダウン処理に設定され、又は、バイノーラル処理に適用するHRTFが変化するようにトリガ信号を出力する。ジョイスティック171Aの傾斜角度が、頭部中心と原音源とが所定距離以上離れたとみなせる角度であることが検出された場合には、トリガ生成部23は、再生モードがバイノーラル処理に設定され、又は、バイノーラル処理に適用するHRTFが変化するようにトリガ信号を出力する。 As a specific example, the tilt angle of the joystick 171A is linked to the distance from the center of the listener's head to the original sound source in the original sound field. At this time, when it is detected that the tilt angle of the joystick 171A is an angle at which the head and the original sound source overlap, or an angle at which the center of the head and the original sound source can be regarded as sufficiently close, the trigger generation unit 23 outputs a trigger signal so that the playback mode is set to the mixdown processing or the HRTF applied to the binaural processing is changed. When it is detected that the tilt angle of the joystick 171A is an angle at which the center of the head and the original sound source can be regarded as separated by a predetermined distance or more, the trigger generation unit 23 outputs a trigger signal so that the playback mode is set to the binaural processing or the HRTF applied to the binaural processing is changed.
 また、ジョイスティック171Aを原音場における原音源(又は音像)の座標位置を変更する操作に使用することができる。例えば、原音場の所定の位置の原点に対する原音源の距離の操作や、原点に対する原音源の水平方向又は仰角方向の操作に用いることができる。このとき、トリガ生成部23は、原音源と原点との距離に応じてトリガ信号を出力するようにしてもよい。また、コンソール機172のスライダ172Aの位置を原音場における聴取者の頭部中心から原音源までの距離に連携させ、トリガ生成部23は、スライダ172Aの位置に応じてトリガ信号を出力してもよい。 The joystick 171A can also be used to change the coordinate position of the original sound source (or sound image) in the original sound field. For example, it can be used to manipulate the distance of the original sound source relative to an origin at a predetermined position in the original sound field, or to manipulate the horizontal or elevation angle of the original sound source relative to the origin. In this case, the trigger generation unit 23 may output a trigger signal according to the distance between the original sound source and the origin. Furthermore, the position of the slider 172A of the console device 172 may be linked to the distance from the center of the listener's head in the original sound field to the original sound source, and the trigger generation unit 23 may output a trigger signal according to the position of the slider 172A.
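 A possible mapping from the joystick tilt angle to the listener-to-source distance and then to a playback-mode trigger is sketched below. The distance thresholds, the maximum tilt, and the trigger labels are illustrative assumptions, not values specified by the embodiment.

```python
# Distance thresholds (meters) are illustrative assumptions.
NEAR_DISTANCE_M = 0.2   # head center and original sound source regarded as overlapping
FAR_DISTANCE_M = 1.0    # head center and original sound source regarded as separated

def tilt_to_distance(tilt_deg, max_tilt_deg=30.0, max_distance_m=5.0):
    """Map the joystick tilt angle to a listener-to-source distance in the original sound field."""
    return min(abs(tilt_deg) / max_tilt_deg, 1.0) * max_distance_m

def generate_trigger(tilt_deg):
    """Return a playback-mode trigger derived from the joystick tilt, or None."""
    distance = tilt_to_distance(tilt_deg)
    if distance <= NEAR_DISTANCE_M:
        return "mixdown"    # switch to 2ch mixdown (non-3D) processing
    if distance >= FAR_DISTANCE_M:
        return "binaural"   # switch to (or keep) binaural processing
    return None             # no mode switch; the applied HRTF may instead be adjusted

print(generate_trigger(1.0), generate_trigger(25.0))  # -> mixdown binaural
```

 A slider such as the slider 172A could be mapped to the same distance in place of the tilt angle.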
 <コンピュータの構成例>
 上述した一連の処理は、ハードウェアにより実行することもできるし、ソフトウェアにより実行することもできる。一連の処理をソフトウェアにより実行する場合には、そのソフトウェアを構成するプログラムが、コンピュータにインストールされる。ここで、コンピュータには、専用のハードウェアに組み込まれているコンピュータや、各種のプログラムをインストールすることで、各種の機能を実行することが可能な、例えば汎用のパーソナルコンピュータなどが含まれる。
<Example of computer configuration>
The above-mentioned series of processes can be executed by hardware or software. When the series of processes is executed by software, the program constituting the software is installed in a computer. Here, the computer includes a computer built into dedicated hardware, and a general-purpose personal computer, for example, capable of executing various functions by installing various programs.
 図13は、上述した一連の処理をプログラムにより実行するコンピュータのハードウェアの構成例を示すブロック図である。 FIG. 13 is a block diagram showing an example of the hardware configuration of a computer that executes the above-mentioned series of processes using a program.
 コンピュータにおいて、CPU(Central Processing Unit)201,ROM(Read Only Memory)202,RAM(Random Access Memory)203は、バス204により相互に接続されている。 In a computer, a CPU (Central Processing Unit) 201, a ROM (Read Only Memory) 202, and a RAM (Random Access Memory) 203 are interconnected by a bus 204.
 バス204には、さらに、入出力インタフェース205が接続されている。入出力インタフェース205には、入力部206、出力部207、記憶部208、通信部209、及びドライブ210が接続されている。 Further connected to the bus 204 is an input/output interface 205. Connected to the input/output interface 205 are an input unit 206, an output unit 207, a storage unit 208, a communication unit 209, and a drive 210.
 入力部206は、キーボード、マウス、マイクロフォンなどよりなる。出力部207は、ディスプレイ、スピーカなどよりなる。記憶部208は、ハードディスクや不揮発性のメモリなどよりなる。通信部209は、ネットワークインタフェースなどよりなる。ドライブ210は、磁気ディスク、光ディスク、光磁気ディスク、又は半導体メモリなどのリムーバブルメディア211を駆動する。 The input unit 206 includes a keyboard, mouse, microphone, etc. The output unit 207 includes a display, speaker, etc. The storage unit 208 includes a hard disk, non-volatile memory, etc. The communication unit 209 includes a network interface, etc. The drive 210 drives removable media 211 such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory.
 以上のように構成されるコンピュータでは、CPU201が、例えば、記憶部208に記憶されているプログラムを、入出力インタフェース205及びバス204を介して、RAM203にロードして実行することにより、上述した一連の処理が行われる。 In a computer configured as described above, the CPU 201 loads a program stored in the storage unit 208, for example, into the RAM 203 via the input/output interface 205 and the bus 204, and executes the program, thereby performing the above-mentioned series of processes.
 コンピュータ(CPU201)が実行するプログラムは、例えば、パッケージメディア等としてのリムーバブルメディア211に記録して提供することができる。また、プログラムは、ローカルエリアネットワーク、インターネット、デジタル衛星放送といった、有線または無線の伝送媒体を介して提供することができる。 The program executed by the computer (CPU 201) can be provided by being recorded on removable media 211, such as package media, for example. The program can also be provided via a wired or wireless transmission medium, such as a local area network, the Internet, or digital satellite broadcasting.
 コンピュータでは、プログラムは、リムーバブルメディア211をドライブ210に装着することにより、入出力インタフェース205を介して、記憶部208にインストールすることができる。また、プログラムは、有線または無線の伝送媒体を介して、通信部209で受信し、記憶部208にインストールすることができる。その他、プログラムは、ROM202や記憶部208に、あらかじめインストールしておくことができる。 In a computer, a program can be installed in the storage unit 208 via the input/output interface 205 by inserting the removable medium 211 into the drive 210. The program can also be received by the communication unit 209 via a wired or wireless transmission medium and installed in the storage unit 208. Alternatively, the program can be pre-installed in the ROM 202 or storage unit 208.
 なお、コンピュータが実行するプログラムは、本明細書で説明する順序に沿って時系列に処理が行われるプログラムであっても良いし、並列に、あるいは呼び出しが行われたとき等の必要なタイミングで処理が行われるプログラムであっても良い。 The program executed by the computer may be a program in which processing is performed chronologically in the order described in this specification, or a program in which processing is performed in parallel or at the required timing, such as when called.
 ここで、本明細書において、コンピュータがプログラムに従って行う処理は、必ずしもフローチャートとして記載された順序に沿って時系列に行われる必要はない。すなわち、コンピュータがプログラムに従って行う処理は、並列的あるいは個別に実行される処理(例えば、並列処理あるいはオブジェクトによる処理)も含む。 In this specification, the processing performed by a computer according to a program does not necessarily have to be performed in chronological order according to the order described in the flowchart. In other words, the processing performed by a computer according to a program also includes processing that is executed in parallel or individually (for example, parallel processing or processing by objects).
 また、プログラムは、1のコンピュータ(プロセッサ)により処理されるものであっても良いし、複数のコンピュータによって分散処理されるものであっても良い。さらに、プログラムは、遠方のコンピュータに転送されて実行されるものであっても良い。 The program may be processed by one computer (processor), or may be distributed among multiple computers. Furthermore, the program may be transferred to a remote computer for execution.
 さらに、本明細書において、システムとは、複数の構成要素(装置、モジュール(部品)等)の集合を意味し、すべての構成要素が同一筐体中にあるか否かは問わない。したがって、別個の筐体に収納され、ネットワークを介して接続されている複数の装置、及び、1つの筐体の中に複数のモジュールが収納されている1つの装置は、いずれも、システムである。 Furthermore, in this specification, a system refers to a collection of multiple components (devices, modules (parts), etc.), regardless of whether all the components are in the same housing. Therefore, multiple devices housed in separate housings and connected via a network, and a single device in which multiple modules are housed in a single housing, are both systems.
 また、例えば、1つの装置(または処理部)として説明した構成を分割し、複数の装置(または処理部)として構成するようにしてもよい。逆に、以上において複数の装置(または処理部)として説明した構成をまとめて1つの装置(または処理部)として構成されるようにしてもよい。また、各装置(または各処理部)の構成に上述した以外の構成を付加するようにしてももちろんよい。さらに、システム全体としての構成や動作が実質的に同じであれば、ある装置(または処理部)の構成の一部を他の装置(または他の処理部)の構成に含めるようにしてもよい。 Also, for example, the configuration described above as one device (or processing unit) may be divided and configured as multiple devices (or processing units). Conversely, the configurations described above as multiple devices (or processing units) may be combined and configured as one device (or processing unit). Of course, configurations other than those described above may also be added to the configuration of each device (or processing unit). Furthermore, as long as the configuration and operation of the system as a whole are substantially the same, part of the configuration of one device (or processing unit) may be included in the configuration of another device (or other processing unit).
 また、例えば、本技術は、1つの機能を、ネットワークを介して複数の装置で分担、共同して処理するクラウドコンピューティングの構成をとることができる。 Also, for example, this technology can be configured as cloud computing, in which a single function is shared and processed collaboratively by multiple devices via a network.
 また、例えば、上述したプログラムは、任意の装置において実行することができる。その場合、その装置が、必要な機能(機能ブロック等)を有し、必要な情報を得ることができるようにすればよい。 Furthermore, for example, the above-mentioned program can be executed in any device. In that case, it is sufficient that the device has the necessary functions (functional blocks, etc.) and is able to obtain the necessary information.
 また、例えば、上述のフローチャートで説明した各ステップは、1つの装置で実行する他、複数の装置で分担して実行することができる。さらに、1つのステップに複数の処理が含まれる場合には、その1つのステップに含まれる複数の処理は、1つの装置で実行する他、複数の装置で分担して実行することができる。換言するに、1つのステップに含まれる複数の処理を、複数のステップの処理として実行することもできる。逆に、複数のステップとして説明した処理を1つのステップとしてまとめて実行することもできる。 Furthermore, for example, each step described in the above flowchart can be executed by one device, or can be shared and executed by multiple devices. Furthermore, if one step includes multiple processes, the multiple processes included in that one step can be executed by one device, or can be shared and executed by multiple devices. In other words, multiple processes included in one step can be executed as multiple step processes. Conversely, processes described as multiple steps can be executed collectively as one step.
 なお、コンピュータが実行するプログラムは、プログラムを記述するステップの処理が、本明細書で説明する順序に沿って時系列に実行されるようにしても良いし、並列に、あるいは呼び出しが行われたとき等の必要なタイミングで個別に実行されるようにしても良い。つまり、矛盾が生じない限り、各ステップの処理が上述した順序と異なる順序で実行されるようにしてもよい。さらに、このプログラムを記述するステップの処理が、他のプログラムの処理と並列に実行されるようにしても良いし、他のプログラムの処理と組み合わせて実行されるようにしても良い。 In addition, the processing of the steps that describe a program executed by a computer may be executed chronologically in the order described in this specification, or may be executed in parallel, or individually at the required timing, such as when a call is made. In other words, as long as no contradictions arise, the processing of each step may be executed in an order different from the order described above. Furthermore, the processing of the steps that describe this program may be executed in parallel with the processing of other programs, or may be executed in combination with the processing of other programs.
 なお、本明細書において複数説明した本技術は、矛盾が生じない限り、それぞれ独立に単体で実施することができる。もちろん、任意の複数の本技術を併用して実施することもできる。例えば、いずれかの実施の形態において説明した本技術の一部または全部を、他の実施の形態において説明した本技術の一部または全部と組み合わせて実施することもできる。また、上述した任意の本技術の一部または全部を、上述していない他の技術と併用して実施することもできる。 Note that the multiple present technologies described in this specification can be implemented independently and individually, provided no contradictions arise. Of course, any multiple present technologies can also be implemented in combination. For example, part or all of the present technologies described in any embodiment can be implemented in combination with part or all of the present technologies described in other embodiments. Also, part or all of any of the present technologies described above can be implemented in combination with other technologies not described above.
 <構成の組み合わせ例>
 なお、本技術は以下のような構成も取ることができる。
(1)
 入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する再生信号生成部
 を有する情報処理装置。
(2)
 前記3D音響再生処理は、前記入力された音声信号に対して空間の音響特性を反映させる処理である
 前記(1)に記載の情報処理装置。
(3)
 前記非3D音響再生処理は、前記入力された音声信号のチャンネル数を変更する処理である
 前記(1)又は(2)に記載の情報処理装置。
(4)
 前記再生信号生成部は、前記ユーザの状態を示すセンシング情報に基づいて選択された再生処理を実行する
 前記(1)乃至(3)のいずれかに記載の情報処理装置。
(5)
 前記ユーザの状態は、前記ユーザの頭部の姿勢又は視線方向に関する状態である
 前記(1)乃至(4)のいずれかに記載の情報処理装置。
(6)
 前記ユーザの状態は、前記再生信号生成部により実行される前記再生処理の切替え以外の用途に使用される操作部材に対する操作状態である
 前記(1)乃至(5)のいずれかに記載の情報処理装置。
(7)
 前記3D音響再生処理は、前記音響特性に対応した伝達関数を前記入力された音声信号に対して畳み込む処理である
 前記(2)に記載の情報処理装置。
(8)
 前記3D音響再生処理は、FIRフィルタを用いた処理である
 前記(7)に記載の情報処理装置。
(9)
 前記伝達関数は、頭部伝達関数、バイノーラル伝達関数、又は、頭部伝達関数と室内伝達関数との組合せである
 前記(7)又は(8)に記載の情報処理装置。
(10)
 前記再生信号生成部は、前記空間で実測された前記音響特性を用いる
 前記(2)、及び、(7)乃至(9)のいずれかに記載の情報処理装置。
(11)
 前記再生信号生成部は、複数の姿勢で実測された前記音響特性のうち、現時点での前記ユーザの姿勢に対応する前記音響特性を用いる
 前記(10)に記載の情報処理装置。
(12)
 前記再生信号生成部は、前記伝達関数を変更し、又は、調整することにより実行する前記再生処理を切り替える
 前記(7)乃至(11)のいずれかに記載の情報処理装置。
(13)
 前記3D音響再生処理により反映する前記音響特性を有する前記空間のCG映像を出力する表示制御部
 を有する
 前記(2)、及び、(7)乃至(12)のいずれかに記載の情報処理装置。
(14)
 前記表示制御部は、前記再生信号生成部による前記3D音響再生処理の実行に連動して前記CG映像を出力する
 前記(13)に記載の情報処理装置。
(15)
 前記CG映像は、前記3D音響再生処理が前記入力された音声信号に対して反映させる前記音響特性を有する前記空間を再現したCG空間を仮想カメラで撮影した映像である
 前記(14)に記載の情報処理装置。
(16)
 前記CG映像は、前記ユーザの姿勢に応じて前記仮想カメラの姿勢が変更されて撮影された映像である
 前記(15)に記載の情報処理装置。
(17)
 前記表示制御部は、前記再生信号生成部による前記非3D音響再生処理の実行に連動して実写映像を出力する
 前記(14)乃至(16)のいずれかに記載の情報処理装置。
(18)
 前記実写映像は、前記ユーザの周囲をカメラで撮影した映像である
 前記(17)に記載の情報処理装置。
(19)
 再生信号生成部
 を有する
 情報処理装置の
 前記再生信号生成部が、入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する
 情報処理方法。
(20)
 コンピュータを
 入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する再生信号生成部
 として機能させるためのプログラム。
<Examples of configuration combinations>
The present technology can also be configured as follows.
(1)
An information processing device having a playback signal generation unit that performs a playback process corresponding to a user's state from among multiple types of playback processes including 3D audio playback processing and non-3D audio playback processing on an input audio signal to generate an audio signal for playback.
(2)
The information processing device according to (1), wherein the 3D sound reproduction process is a process for reflecting acoustic characteristics of a space in the input audio signal.
(3)
The information processing device according to (1) or (2), wherein the non-3D sound reproduction process is a process for changing a number of channels of the input audio signal.
(4)
The information processing device according to any one of (1) to (3), wherein the playback signal generation unit executes a playback process selected based on sensing information indicating a state of the user.
(5)
The information processing device according to any one of (1) to (4), wherein the state of the user is a state related to a head posture or a line of sight direction of the user.
(6)
The information processing device according to any one of (1) to (5), wherein the state of the user is an operation state of an operation member used for purposes other than switching the playback process executed by the playback signal generating section.
(7)
The information processing device according to (2), wherein the 3D sound reproduction process is a process of convolving a transfer function corresponding to the acoustic characteristics with the input audio signal.
(8)
The information processing device according to (7), wherein the 3D sound reproduction process is a process using an FIR filter.
(9)
The information processing device according to (7) or (8), wherein the transfer function is a head-related transfer function, a binaural transfer function, or a combination of a head-related transfer function and a room transfer function.
(10)
The information processing device according to any one of (2) and (7) to (9), wherein the reproduction signal generation unit uses the acoustic characteristics actually measured in the space.
(11)
The information processing device according to (10), wherein the reproduction signal generation unit uses the acoustic characteristic corresponding to the current posture of the user among the acoustic characteristics actually measured in a plurality of postures.
(12)
The information processing device according to any one of (7) to (11), wherein the reproduction signal generation unit switches the reproduction process to be executed by changing or adjusting the transfer function.
(13)
The information processing device according to any one of (2) and (7) to (12), further comprising a display control unit that outputs a CG image of the space having the acoustic characteristics reflected by the 3D audio reproduction process.
(14)
The information processing device according to (13), wherein the display control unit outputs the CG image in conjunction with execution of the 3D sound reproduction process by the reproduction signal generation unit.
(15)
The information processing device described in (14), wherein the CG image is an image captured by a virtual camera of a CG space that reproduces the space having the acoustic characteristics that the 3D audio playback processing reflects on the input audio signal.
(16)
The information processing device according to (15), wherein the CG image is an image captured by changing the posture of the virtual camera in accordance with the posture of the user.
(17)
The information processing device according to any one of (14) to (16), wherein the display control unit outputs a live-action video in conjunction with execution of the non-3D sound reproduction process by the reproduction signal generation unit.
(18)
The information processing device according to (17), wherein the live-action image is an image captured by a camera around the user.
(19)
An information processing method of an information processing device having a playback signal generation unit, the playback signal generation unit executing a playback process corresponding to a user's state from among multiple types of playback processes including 3D sound playback processing and non-3D sound playback processing on an input audio signal, to generate an audio signal for playback.
(20)
A program for causing a computer to function as a playback signal generating unit that generates an audio signal for playback by executing a playback process corresponding to the user's state from among multiple types of playback processes, including 3D audio playback processing and non-3D audio playback processing, on an input audio signal.
 なお、本実施の形態は、上述した実施の形態に限定されるものではなく、本開示の要旨を逸脱しない範囲において種々の変更が可能である。また、本明細書に記載された効果はあくまで例示であって限定されるものではなく、他の効果があってもよい。 Note that this embodiment is not limited to the above-described embodiment, and various modifications are possible without departing from the gist of this disclosure. Furthermore, the effects described in this specification are merely examples and are not limiting, and other effects may also be present.
 1 姿勢, 11 再生処理装置, 12 音声再生装置, 21 再生信号生成部, 22 センシング部, 23 トリガ生成部, 24 多ch音声再生ソフト, 31 トリガ受信部, 32 切替処理部, 33 バイノーラル処理部, 34 2chミックスダウン/ステレオ再生処理部, 35 パススルー処理部, 41 コンソール機, 42A,42B モニタ, 43 カメラ, 44 ヘッドフォン, 45 スピーカ, 51 VRゴーグル, 81 測定処理装置, 91 スピーカ, 101 BRTFファイル取得部, 102 音声制御部, 103 表示制御部, 111 係数読込部, 112 畳み込み処理部, 113 音声再生処理部, 121 空間情報読込部, 122 CGモデル取得部, 123 CGデータ記憶部, 124 測定位置情報読込部, 125 CG空間描画部, 126 再生姿勢情報取得部, 127 映像描画処理部, 131 映像出力切替え部, 132 手元空間映像取得部, 133 ユーザ操作情報取得部, 141 残響量調整設定値読込部, 142 残響量調整処理部 1 Posture, 11 Playback processing device, 12 Audio playback device, 21 Playback signal generation unit, 22 Sensing unit, 23 Trigger generation unit, 24 Multi-channel audio playback software, 31 Trigger reception unit, 32 Switching processing unit, 33 Binaural processing unit, 34 2ch mixdown/stereo playback processing unit, 35 Pass-through processing unit, 41 Console device, 42A, 42B Monitor, 43 Camera, 44 Headphones, 45 Speaker, 51 VR goggles, 81 Measurement processing unit, 91 Speaker, 101 BR TF file acquisition unit, 102, audio control unit, 103, display control unit, 111, coefficient reading unit, 112, convolution processing unit, 113, audio playback processing unit, 121, spatial information reading unit, 122, CG model acquisition unit, 123, CG data storage unit, 124, measurement position information reading unit, 125, CG space drawing unit, 126, playback posture information acquisition unit, 127, video drawing processing unit, 131, video output switching unit, 132, handheld space video acquisition unit, 133, user operation information acquisition unit, 141, reverberation amount adjustment setting value reading unit, 142, reverberation amount adjustment processing unit

Claims (20)

  1.  入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する再生信号生成部
     を有する情報処理装置。
    An information processing device having a playback signal generation unit that performs a playback process corresponding to a user's state from among multiple types of playback processes including 3D audio playback processing and non-3D audio playback processing on an input audio signal to generate an audio signal for playback.
  2.  前記3D音響再生処理は、前記入力された音声信号に対して空間の音響特性を反映させる処理である
     請求項1に記載の情報処理装置。
    The information processing device according to claim 1 , wherein the 3D sound reproduction process is a process for reflecting acoustic characteristics of a space in the input audio signal.
  3.  前記非3D音響再生処理は、前記入力された音声信号のチャンネル数を変更する処理である
     請求項1に記載の情報処理装置。
    The information processing device according to claim 1 , wherein the non-3D sound reproduction process is a process for changing a number of channels of the input audio signal.
  4.  前記再生信号生成部は、前記ユーザの状態を示すセンシング情報に基づいて選択された再生処理を実行する
     請求項1に記載の情報処理装置。
    The information processing device according to claim 1 , wherein the playback signal generating section executes a playback process selected based on sensing information indicating a state of the user.
  5.  前記ユーザの状態は、前記ユーザの頭部の姿勢又は視線方向に関する状態である
     請求項1に記載の情報処理装置。
    The information processing device according to claim 1 , wherein the state of the user is a state related to a head posture or a line of sight direction of the user.
  6.  前記ユーザの状態は、前記再生信号生成部により実行される前記再生処理の切替え以外の用途に使用される操作部材に対する操作状態である
     請求項1に記載の情報処理装置。
    The information processing apparatus according to claim 1 , wherein the user's state is an operation state of an operation member used for purposes other than switching the playback process executed by the playback signal generating section.
  7.  前記3D音響再生処理は、前記音響特性に対応した伝達関数を前記入力された音声信号に対して畳み込む処理である
     請求項2に記載の情報処理装置。
    The information processing device according to claim 2 , wherein the 3D sound reproduction process is a process of convolving a transfer function corresponding to the acoustic characteristics with the input audio signal.
  8.  前記3D音響再生処理は、FIRフィルタを用いた処理である
     請求項7に記載の情報処理装置。
    The information processing device according to claim 7 , wherein the 3D sound reproduction process is a process using an FIR filter.
  9.  前記伝達関数は、頭部伝達関数、バイノーラル伝達関数、又は、頭部伝達関数と室内伝達関数との組合せである
     請求項7に記載の情報処理装置。
    The information processing device according to claim 7 , wherein the transfer function is a head-related transfer function, a binaural transfer function, or a combination of a head-related transfer function and a room transfer function.
  10.  前記再生信号生成部は、前記空間で実測された前記音響特性を用いる
     請求項2に記載の情報処理装置。
    The information processing device according to claim 2 , wherein the reproduction signal generating section uses the acoustic characteristics actually measured in the space.
  11.  前記再生信号生成部は、複数の姿勢で実測された前記音響特性のうち、現時点での前記ユーザの姿勢に対応する前記音響特性を用いる
     請求項10に記載の情報処理装置。
    The information processing device according to claim 10 , wherein the reproduction signal generating section uses the acoustic characteristic corresponding to the current posture of the user from among the acoustic characteristics actually measured in a plurality of postures.
  12.  前記再生信号生成部は、前記伝達関数を変更し、又は、調整することにより実行する前記再生処理を切り替える
     請求項7に記載の情報処理装置。
    The information processing device according to claim 7 , wherein the reproduction signal generating section switches the reproduction process to be executed by changing or adjusting the transfer function.
  13.  前記3D音響再生処理により反映する前記音響特性を有する前記空間のCG映像を出力する表示制御部
     を有する
     請求項2に記載の情報処理装置。
    The information processing device according to claim 2 , further comprising a display control unit that outputs a CG image of the space having the acoustic characteristics reflected by the 3D audio reproduction process.
  14.  前記表示制御部は、前記再生信号生成部による前記3D音響再生処理の実行に連動して前記CG映像を出力する
     請求項13に記載の情報処理装置。
    The information processing device according to claim 13 , wherein the display control unit outputs the CG image in conjunction with execution of the 3D sound reproduction process by the reproduction signal generation unit.
  15.  前記CG映像は、前記3D音響再生処理が前記入力された音声信号に対して反映させる前記音響特性を有する前記空間を再現したCG空間を仮想カメラで撮影した映像である
     請求項14に記載の情報処理装置。
    The information processing device according to claim 14 , wherein the CG image is an image captured by a virtual camera of a CG space that reproduces the space having the acoustic characteristics that the 3D audio reproduction process reflects on the input audio signal.
  16.  前記CG映像は、前記ユーザの姿勢に応じて前記仮想カメラの姿勢が変更されて撮影された映像である
     請求項15に記載の情報処理装置。
    The information processing device according to claim 15 , wherein the CG image is an image captured by changing the posture of the virtual camera in accordance with the posture of the user.
  17.  前記表示制御部は、前記再生信号生成部による前記非3D音響再生処理の実行に連動して実写映像を出力する
     請求項14に記載の情報処理装置。
    The information processing device according to claim 14 , wherein the display control unit outputs an actual video image in conjunction with execution of the non-3D sound reproduction process by the reproduction signal generation unit.
  18.  前記実写映像は、前記ユーザの周囲をカメラで撮影した映像である
     請求項17に記載の情報処理装置。
    The information processing device according to claim 17 , wherein the actual image is an image of the surroundings of the user captured by a camera.
  19.  再生信号生成部
     を有する
     情報処理装置の
     前記再生信号生成部が、入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する
     情報処理方法。
    An information processing method of an information processing device having a playback signal generation unit, the playback signal generation unit executing a playback process corresponding to a user's state from among multiple types of playback processes including 3D sound playback processing and non-3D sound playback processing on an input audio signal, to generate an audio signal for playback.
  20.  コンピュータを
     入力された音声信号に対して3D音響再生処理と非3D音響再生処理とを含む複数種の再生処理のうち、ユーザの状態に対応した再生処理を実行して再生用の音声信号を生成する再生信号生成部
     として機能させるためのプログラム。
    A program for causing a computer to function as a playback signal generating unit that generates an audio signal for playback by executing a playback process corresponding to the user's state from among multiple types of playback processes, including 3D audio playback processing and non-3D audio playback processing, on an input audio signal.
PCT/JP2024/001050 2023-02-01 2024-01-17 Information processing device, information processing method, and program WO2024161992A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2023-013802 2023-02-01
JP2023013802 2023-02-01

Publications (1)

Publication Number Publication Date
WO2024161992A1 true WO2024161992A1 (en) 2024-08-08

Family

ID=92146432

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2024/001050 WO2024161992A1 (en) 2023-02-01 2024-01-17 Information processing device, information processing method, and program

Country Status (1)

Country Link
WO (1) WO2024161992A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006128816A (en) * 2004-10-26 2006-05-18 Victor Co Of Japan Ltd Recording program and reproducing program corresponding to stereoscopic video and stereoscopic audio, recording apparatus and reproducing apparatus, and recording medium
JP2009159073A (en) * 2007-12-25 2009-07-16 Panasonic Corp Acoustic playback apparatus and acoustic playback method
JP2010118978A (en) * 2008-11-14 2010-05-27 Victor Co Of Japan Ltd Controller of localization of sound, and method of controlling localization of sound
WO2022124084A1 (en) * 2020-12-09 2022-06-16 ソニーグループ株式会社 Reproduction apparatus, reproduction method, information processing apparatus, information processing method, and program


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24749946

Country of ref document: EP

Kind code of ref document: A1