EP3419021A1 - Device and method for distinguishing natural and artificial sound - Google Patents
- Publication number
- EP3419021A1 (application EP17305754.8A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- sound
- signal
- descriptor related
- artificial
- windows
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Definitions
- The second descriptor, related to silence, for distinguishing sound is the proportion of 'silent' windows: (number of silent windows) / K, where K is the number of windows of the signal subject to examination.
- The detection of the silent windows may occur before the calculation of the descriptor CV(P) explained above, and this descriptor may be computed only on the windows which are marked as 'non-silent'.
- Both descriptors described above are expected to have a high value in case of a natural sound (high variation of the power, large number of silent windows), and a low value in case of an artificial sound (power constantly high, nearly no silent windows). This will be used in a classification system as exposed hereafter.
- a first possibility is to take the first and second descriptors as input to a supervised classifier that is trained to separate the natural sound from the artificial sound.
- the supervised classifier may for instance be based on a decision tree, using two thresholds corresponding to the two descriptors.
- a second possibility is to use a set of conditions such as:
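By way of illustration, such a set of conditions could be sketched as follows; the threshold values and the function name are placeholders chosen for this sketch, not values given by the present principles (in practice they could for instance be learned by the decision tree mentioned above):

```python
def is_artificial(cv_power, silence_level,
                  cv_threshold=0.5, silence_threshold=0.1):
    """Rule-based decision from the two descriptors.

    cv_power:      coefficient of variation of the per-window power, CV(P)
    silence_level: proportion of 'silent' windows over K windows
    The threshold values here are illustrative placeholders.
    """
    # Artificial sound: nearly constant power OR almost no silence.
    return cv_power < cv_threshold or silence_level < silence_threshold
```

A sound is then classified as artificial when either descriptor falls below its threshold, matching the disjunctive test stated in the abstract.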
- FIG. 4 illustrates a device for audio distinction 400 according to the present principles.
- the device 400 comprises at least one hardware processing unit (“processor") 410 configured to execute instructions of a first software program and to process audio for distinction, as described herein.
- the device 400 further comprises at least one memory 420 (for example ROM, RAM and Flash or a combination thereof) configured to store the software program and data required to distinguish sound.
- The device 400 can also comprise at least one user communications interface ("User I/O") 430 for interfacing with a user.
- the device 400 further comprises an input interface 440 and an output interface 450.
- the input interface 440 is configured to obtain audio for distinguishing; the input interface 440 can be adapted to capture audio, for example a microphone, but it can also be an interface adapted to receive captured audio.
- the output interface 450 is configured to output information about distinguished audio - is it natural or artificial sound - for example for presentation on a screen or by transfer to a further device.
- Non-transitory, computer-readable storage medium 460 includes a computer program with instructions that, when executed by the processor 410, perform the methods described herein.
- the processor 410 can also be configured to use the distinction to determine user activity as described in the background part of the description.
- the device 400 is preferably implemented as a single device such as a gateway, but its functionality can also be distributed over a plurality of devices.
- the processor 410 may have access to other data and use this data to determine that the sound has been incorrectly classified, for example in case the sound was classified as natural and the data originates from another device and indicates that artificial sound is indeed rendered in the environment where the processor 410 is located. If this occurs regularly, it could mean that the classification model used by the processor 410 is not accurate enough.
- The device 400 can send anonymized descriptors that caused incorrect classification to a server, so that the global model can be adapted to these descriptors (i.e. recomputed with those new inputs). The global model can then be distributed to the individual devices.
- a stream processing big-data infrastructure such as Storm or Spark is particularly relevant.
- FIG. 5 illustrates a flowchart for a method of audio distinction according to the present principles.
- the device 400 obtains captured sound, either by capturing it itself or receiving captured sound from another device.
- the processor 410 calculates power standard deviation, i.e., the first descriptor, (as a measure of amplitude standard deviation) as already explained.
- the processor 410 calculates the silence level, i.e., the second descriptor, as already described.
- In step S540, the processor uses the first and second descriptors to determine if the captured sound is natural or artificial, as already described.
- the processor 410 can then for example output information on whether the sound is natural or artificial through the output interface 450 or use this information internally as input to other functions.
- the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces.
- the phrase "coupled" is defined to mean directly connected to or indirectly connected with through one or more intermediate components. Such intermediate components may include both hardware and software based components.
- “Processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage.
- any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
- any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function.
- the disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.
Abstract
Device (400) and method for determining if sound is artificial. A hardware input interface (440) obtains (S510) a signal corresponding to sound in an environment, and at least one hardware processor (410) calculates (S520, S530) from the signal at least one of a descriptor related to loudness and a descriptor related to silence, and determines (S540) that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.
Description
- The present disclosure relates generally to audio recognition and in particular to determining if sound is natural or artificial.
- This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present disclosure that are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
- Audio (acoustic, sound) recognition is particularly suitable for monitoring people activity as it is relatively non-intrusive, does not require other detectors than microphones and is relatively accurate.
- Figure 1 illustrates a generic conventional audio classification pipeline 100 that comprises an audio sensor 110 capturing a raw audio signal, a preprocessing module 120 that prepares the captured audio for a features extraction module 130 that outputs extracted features to a classifier module 140 that uses entries in an audio database 150 to label audio that is then output.
- The labelled audio can then be used to, for example, determine the activities of persons (and even pets) in the location where the audio was captured. Knowledge of the activities can be used in situations like e-health, care of children or the elderly, and home security. In addition, parents could use the knowledge to determine what their children do when they are alone at home: for instance, after school, are they doing their homework or watching television?
- In some of these cases, it can be important to distinguish between natural sound - for example talking or singing persons, or a barking dog - and artificial sound, i.e. sound that is rendered by a rendering device such as a radio, a television or a hi-fi system. It will be appreciated that persons talking on the television could be mistaken for real persons having a conversation. So far, this issue appears to have no suitable conventional solution.
- It will be appreciated that there is a desire for a solution that addresses this problem. The present principles provide such a solution.
- In a first aspect, the present principles are directed to a method for determining if sound is artificial. At a device, a hardware input interface obtains a signal corresponding to sound in an environment and at least one hardware processor calculates from the signal at least one of a descriptor related to loudness and a descriptor related to silence, and determines that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.
- Various embodiments of the first aspect include:
- That the descriptor related to silence is a ratio of windows of the signal that are silent to windows of the signal that are non-silent. Adjacent windows can be overlapping. A window can be deemed as silent in case its Root Mean Square (RMS) power is below a third threshold.
- That the descriptor related to loudness is a standard deviation for power of the signal.
- In a second aspect, the present principles are directed to a device for determining if sound is artificial, comprising a hardware input interface configured to obtain a signal corresponding to sound in an environment, and at least one hardware processor configured to calculate from the signal at least one of a descriptor related to loudness and a descriptor related to silence, and determine that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.
- Various embodiments of the second aspect include:
- That the descriptor related to silence is a ratio of windows of the signal that are silent to windows of the signal that are non-silent. Adjacent windows can be overlapping. A window can be deemed as silent in case its Root Mean Square (RMS) power is below a third threshold.
- That the descriptor related to loudness is a standard deviation for power of the signal.
- That the input interface is configured to capture the sound. The input interface can comprise a microphone.
- That the device further comprises an output interface for outputting information about whether the sound is natural or artificial.
- In a third aspect, the present principles are directed to a computer program comprising program code instructions executable by a processor for implementing the method according to the first aspect.
- In a fourth aspect, the present principles are directed to a computer program product which is stored on a non-transitory computer readable medium and comprises program code instructions executable by a processor for implementing the method according to the first aspect.
- Preferred features of the present principles will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
- Figure 1 illustrates a generic conventional audio classification pipeline;
- Figure 2 illustrates conventional downward compression with a hard knee;
- Figure 3 illustrates a signal without dynamic range compression and the same signal with dynamic range compression;
- Figure 4 illustrates a device for audio distinction according to the present principles; and
- Figure 5 illustrates a flowchart for a method of audio distinction according to the present principles.
- One way of monitoring a person in order to, for instance, anticipate problems, is to verify if the habits of the person are followed. To do this, it can be useful to classify ambient sound in the person's location as:
- no sound, i.e., silence.
- natural ambient sound, such as for example physical people speaking, cooking, dog barking.
- artificial ambient sound, such as sound coming from a radio, a television, or a hi-fi system. In this context, "artificial" means that the sound was processed for broadcast or recording and subsequent rendering.
- To detect artificial ambient sound, the present principles rely on the fact that most artificial audio sources use dynamic range compression to enhance the sound and make it more present. Compression may for example be applied to avoid a clipping effect, to prevent amplifier chain saturation, or to better fit the Frequency Modulation (FM) standard, which has a limited frequency spectrum range.
- Dynamic range compression, which is a very common technique in the broadcast chain and in media content workflows, amplifies parts of the audio signal with low amplitude (upward compression), reduces the loud parts of the sound (downward compression), or both. On the other hand, natural sounds tend to be characterized by a wider dynamic range, which typically means that more low power sounds tend to be present in a natural audio signal than in a dynamic range compressed audio signal. Hence, detecting such dynamic differences within the sound can help differentiating artificial and natural sound.
- Dynamic Range Compression (DRC) for audio will now be described in further detail. As already mentioned, DRC can amplify low sounds, attenuate high sounds, or both.
- Figure 2 illustrates conventional downward compression with a hard knee. A compression function curve 210 has a first part 212 that is neutral - i.e., an input level transformed by this part results in an equal output level. The curve further has a second part 214 that meets the first part 212 at a hard knee 216. The second part 214 performs downward compression, which means that an input level L_I is transformed into a lower output level L_O. Figure 2 also shows a threshold 220 that lies between the first part 212 and the second part 214. In this example, the threshold 220 coincides with the hard knee 216, but it will be appreciated that in case a soft knee is used, this will extend around the threshold 220 and comprise part of the first part 212 and the second part 214 as well.
- It will also be understood that for upward compression, the first part of the curve would be flatter so that an input level results in a higher output level (except perhaps at the hard knee). It will further be understood that the function can allow both downward and upward compression, in which case the first part and the second part can have identical slopes or different slopes.
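A hard-knee downward compression curve of the kind shown in Figure 2 can be sketched as a simple level mapping; the threshold and ratio values below are illustrative, not values from the present principles:

```python
def downward_compress_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Map an input level (in dB) to an output level with a hard knee.

    Below the threshold the curve is neutral (output equals input); above
    the threshold, the excess over it is divided by the compression ratio,
    producing the flatter second part of the curve.
    """
    if level_db <= threshold_db:
        return level_db  # first, neutral part of the curve (212)
    # second part (214): downward compression above the hard knee (216)
    return threshold_db + (level_db - threshold_db) / ratio
```

For instance, with a -20 dB threshold and a 4:1 ratio, an input level of -12 dB (8 dB above the threshold) is mapped to -18 dB (2 dB above it), while levels below the threshold pass through unchanged.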
- DRC can for example be used:
- In public spaces to make music sound louder without having to increase the peak amplitude.
- In music production for a better mix between vocals and instruments.
- In voice processing to avoid sibilance.
- In broadcasting to fit a broadcast signal with narrow range, as will be explained in more detail.
- In marketing to increase the impact of commercials.
- To protect circuitry in devices with amplifiers, and also to avoid clipping or saturation effects.
- In hearing aids and headphones to make certain sounds more audible while others are attenuated.
- Figure 3 illustrates a signal 310 without dynamic range compression and the same signal 320 with dynamic range compression. As can be seen, the loud parts have been attenuated (downward compression) and the low parts have been amplified (upward compression).
- In the case of FM (Frequency Modulation) radio broadcasting, the characteristics of the frequency modulation limit the frequency spectrum range, which in turn limits the acoustic dynamic range. If the frequency spectrum range is not respected, this will result in spectrum overlaps and audio distortion. Simply reducing the amplitude of the signal fed to the modulator so that it never clips requires a significant reduction of the input signal, which results in a lower signal-to-noise ratio (SNR). A lower SNR in turn means that a listener will hear more transmission noise, especially during the quieter parts of the transmission.
- The effect on FM also applies to digital radio that includes an ADC (Analog to Digital Converter) in front of the modulator and for which dynamic range is limited.
- DRC can also preserve the audio amplification chain, as well as any speakers, from saturation when they are not dimensioned to render the natural dynamic range.
- In addition, compressing FM radio or TV broadcasts enables the high-power amplifier transmitter required to broadcast the signal over the air to transmit using a more constant output power. Doing so can increase the lifetime of the amplifier. Indeed, the standardization community tries to find the best compromise between audio quality for the end user and the economics of the broadcasting infrastructure.
- Further, Automatic Gain Control (AGC) is useful for microphone capture when a speaker talks over a low background sound that should be shared with the audience. AGC aims to provide a constant-level output signal regardless of the input signal. In other words, weak input signals are amplified and loud input signals are attenuated. The outcome is a less dynamic sound that is suitable for network broadcasting.
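A much-simplified AGC can be sketched as a one-pole envelope tracker driving a per-sample gain; the parameter names and values here are illustrative assumptions, not part of the present principles:

```python
import numpy as np

def simple_agc(x, target_level=0.1, alpha=0.01):
    """Minimal AGC sketch: track the signal magnitude with a one-pole
    smoother and scale each sample so the envelope approaches target_level.

    Weak input signals are amplified and loud input signals are attenuated,
    yielding a less dynamic output, as described above.
    """
    env = 1e-6  # running estimate of the signal magnitude
    out = np.empty(len(x), dtype=float)
    for n, sample in enumerate(x):
        env = (1 - alpha) * env + alpha * abs(sample)  # envelope follower
        gain = target_level / max(env, 1e-6)           # control the level
        out[n] = sample * gain
    return out
```

After the envelope settles, a loud constant-amplitude input converges toward the target level, illustrating the reduction of dynamic range.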
- It will be understood that an audio signal that is broadcast or streamed over the air, a cellular network or a broadband network typically has a compressed acoustic dynamic range. Hence, a music or voice audio signal listened to through a speaker has different dynamic properties than audio produced by natural sources like human voices, animal sounds and (non-amplified) instruments.
- As an effect of DRC is an amplification of sounds below a first amplitude threshold and an attenuation of sounds above a second amplitude threshold (possibly the same as the first threshold), it can be seen that sounds with DRC have a smaller amplitude variance than natural sounds. In addition, most broadcast sources - television, radio, music - tend to avoid silence. Therefore, the proportion of silence will be low for artificial sound sources.
- Hence to distinguish artificial ambient sound from natural ambient sound, a device can analyse captured ambient sound to determine at least one of:
- if the variance of the amplitude is above (natural ambient sound) or below (artificial ambient sound) a variance threshold value, and
- if the level of silence is above (natural ambient sound) or below (artificial ambient sound) a silence threshold value.
- One way of calculating the amplitude variance is as follows, but it will be appreciated that other ways exist. First, the captured sound is divided into a number of sections (or windows); the windows can be distinct, but they generally overlap, with each subsequent window starting at the middle of the previous one. Each window has an index, noted k. The windows all have the same size, noted w. The captured sound, i.e. the part for which it should be determined whether it is natural or artificial, is thus divided into a set of K possibly overlapping windows.
- Pk = sqrt((1/w) Σn xn²), i.e. the Root Mean Square (RMS) power of window k, where the xn are the w samples of the window.
- The size w may take the value of 1024, but other values such as 2048 have also been contemplated.
- The output for the different windows defines a series of instantaneous power values Pk for the K windows of the captured sound signal.
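As a sketch of this windowing and per-window power computation (the half-window hop follows the overlap description above; the function name `window_powers` and its default size are illustrative):

```python
import math

def window_powers(samples, w=1024):
    """Split `samples` into half-overlapping windows of size `w`
    and return the RMS power Pk of each window."""
    hop = w // 2  # each window starts at the middle of the previous one
    powers = []
    for start in range(0, len(samples) - w + 1, hop):
        window = samples[start:start + w]
        rms = math.sqrt(sum(x * x for x in window) / w)
        powers.append(rms)
    return powers
```

For example, 2048 samples with w = 1024 yield three overlapping windows and therefore three power values.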
- σP = sqrt((1/K) Σk (Pk − μP)²), where μP is the mean of the Pk values.
- Since amplitude and power are inextricably linked, the standard deviation for the power is also an indirect measure of the standard deviation for the amplitude.
- CV(P) = σP / μP, i.e. the standard deviation of the Pk values divided by their mean.
- This coefficient of variation of the power is a first descriptor, related to loudness, used to distinguish natural sounds from artificial sounds.
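A minimal sketch of this first descriptor, assuming the usual definition of the coefficient of variation (population standard deviation divided by the mean; the exact variant is an assumption, as the description only names the coefficient of variation):

```python
import math

def coefficient_of_variation(powers):
    """CV(P): population standard deviation of the window powers
    divided by their mean."""
    mean = sum(powers) / len(powers)
    var = sum((p - mean) ** 2 for p in powers) / len(powers)
    return math.sqrt(var) / mean
```

A constant power series gives CV(P) = 0, while a highly dynamic natural sound gives a large value.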
- To calculate the level of silence, first the windows whose RMS power is below a given threshold τ are marked as 'silent'.
- Then in an optional step, consecutive windows marked 'silent' are grouped in 'silent' groups, and consecutive windows marked 'non-silent' are grouped in 'non-silent' groups. The signal is therefore seen as a series of interleaved 'silent' and 'non-silent' groups. To clean the signal of anomalous or outlying events, groups of 'non-silent' windows smaller than a certain size (such as a few windows, e.g. three) are marked 'silent'.
- The level of silence is then the ratio of the number of windows marked 'silent' to the number of windows marked 'non-silent'.
- In a variation, the detection of the silent windows may occur before the calculation of the descriptor CV(P) explained above, and this descriptor may be computed only on the windows marked as 'non-silent'.
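The silence-level computation, including the optional grouping step, might be sketched as follows; returning the ratio of silent to non-silent windows follows the windows-based definition, and the default `min_group` of three matches the 'few windows' example above:

```python
def silence_ratio(powers, tau, min_group=3):
    """Mark windows with power below `tau` as silent, relabel short
    non-silent groups as silent, and return the silent/non-silent ratio."""
    labels = [p < tau for p in powers]  # True means 'silent'

    # Group consecutive windows carrying the same label.
    groups = []
    for lab in labels:
        if groups and groups[-1][0] == lab:
            groups[-1][1] += 1
        else:
            groups.append([lab, 1])

    # Clean outliers: small 'non-silent' groups become 'silent'.
    for g in groups:
        if not g[0] and g[1] < min_group:
            g[0] = True

    silent = sum(n for lab, n in groups if lab)
    non_silent = len(powers) - silent
    return silent / non_silent if non_silent else float('inf')
```

A natural sound with frequent pauses yields a high ratio; a compressed broadcast signal, which avoids silence, yields a ratio close to zero.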
- Both descriptors described above are expected to have a high value for a natural sound (high variation of the power for the first, large number of silent windows for the second) and a low value for an artificial sound (power constantly high, nearly no silent windows). This will be used in a classification system as exposed hereafter.
- To classify the sound, different possibilities exist. A first possibility is to take the first and second descriptors as input to a supervised classifier that is trained to separate the natural sound from the artificial sound. The supervised classifier may for instance be based on a decision tree, using two thresholds corresponding to the two descriptors.
- A second possibility is to use a set of conditions such as:
- IF descriptor1 > threshold1 THEN sound is artificial
- ELSE IF descriptor2 > threshold2 THEN sound is natural
- ELSE IF descriptor1 > threshold3 AND descriptor2 < threshold4 THEN sound is artificial
- ELSE IF descriptor1 < threshold5 AND descriptor2 > threshold6 THEN sound is natural
- Naturally, there are many ways of expressing the conditions, using different thresholds that in addition may depend on many things such as locality and equipment.
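For illustration, the condition set above can be transcribed literally; the threshold values are pure placeholders, since, as noted, they may depend on locality and equipment:

```python
def classify(descriptor1, descriptor2, thresholds):
    """Literal transcription of the example condition set.
    `thresholds` is a dict with keys 't1'..'t6' (placeholder values).
    Returns 'artificial', 'natural', or None when no rule fires."""
    t = thresholds
    if descriptor1 > t['t1']:
        return 'artificial'
    if descriptor2 > t['t2']:
        return 'natural'
    if descriptor1 > t['t3'] and descriptor2 < t['t4']:
        return 'artificial'
    if descriptor1 < t['t5'] and descriptor2 > t['t6']:
        return 'natural'
    return None
```

Returning None when no rule matches leaves room for a fallback decision, e.g. deferring to a trained classifier as in the first possibility.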
-
Figure 4 illustrates a device for audio distinction 400 according to the present principles. The device 400 comprises at least one hardware processing unit ("processor") 410 configured to execute instructions of a first software program and to process audio for distinction, as described herein. The device 400 further comprises at least one memory 420 (for example ROM, RAM and Flash or a combination thereof) configured to store the software program and the data required to distinguish sound. The device 400 can also comprise at least one user communications interface ("User I/O") 430 for interfacing with a user.
- The device 400 further comprises an input interface 440 and an output interface 450. The input interface 440 is configured to obtain audio for distinguishing; it can be adapted to capture audio, for example a microphone, but it can also be an interface adapted to receive captured audio. The output interface 450 is configured to output information about distinguished audio - whether it is natural or artificial sound - for example for presentation on a screen or by transfer to a further device.
- A non-transitory, computer-readable storage medium 460 includes a computer program with instructions that, when executed by the processor 410, perform the methods described herein.
- The processor 410 can also be configured to use the distinction to determine user activity as described in the background part of the description.
- The device 400 is preferably implemented as a single device such as a gateway, but its functionality can also be distributed over a plurality of devices.
- In some cases, the processor 410 may have access to other data and use this data to determine that the sound has been incorrectly classified, for example when the sound was classified as natural but data originating from another device indicates that artificial sound is indeed rendered in the environment where the processor 410 is located. If this occurs regularly, it could mean that the classification model used by the processor 410 is not accurate enough. In this case, the device 400 can send the anonymized descriptors that caused the incorrect classification to a server, so that the global model can be adapted to these descriptors (i.e., recomputed with those new inputs). The global model can then be distributed to the individual devices. In such an implementation, a stream-processing big-data infrastructure such as Storm or Spark is particularly relevant.
- Figure 5 illustrates a flowchart for a method of audio distinction according to the present principles. In step S510, the device 400 obtains captured sound, either by capturing it itself or by receiving captured sound from another device. In step S520, the processor 410 calculates the power standard deviation, i.e., the first descriptor (as a measure of the amplitude standard deviation), as already explained. In step S530, the processor 410 calculates the silence level, i.e., the second descriptor, as already described. Finally, in step S540, the processor uses the first and second descriptors to determine whether the captured sound is natural or artificial, as already described.
- The processor 410 can then, for example, output information on whether the sound is natural or artificial through the output interface 450, or use this information internally as input to other functions.
- It will be appreciated that the present principles can provide a solution for audio recognition that can enable:
- Respect of users' privacy since the sound can be distinguished in a device located in the users' location rather than being sent to a device "in the cloud".
- A small footprint on the distinguishing device since it is sufficient to retain the model, some variables and the present sound windows.
- It should be understood that the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces. Herein, the phrase "coupled" is defined to mean directly connected to or indirectly connected with through one or more intermediate components. Such intermediate components may include both hardware and software based components.
- The present description illustrates the principles of the present disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the disclosure and are included within its scope.
- All examples and conditional language recited herein are intended for educational purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions.
- Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
- Thus, for example, it will be appreciated by those skilled in the art that the block diagrams presented herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudocode, and the like represent various processes which may be substantially represented in computer readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
- The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage.
- Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
- In the claims hereof, any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.
Claims (15)
- A method for determining if sound is artificial, the method comprising at a device (400): obtaining (S510), by a hardware input interface (440), a signal corresponding to sound in an environment; calculating (S520, S530), by at least one hardware processor (410), from the signal, at least one of a descriptor related to loudness and a descriptor related to silence; and determining (S540), by the at least one hardware processor (410), that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.
- The method of claim 1, wherein the descriptor related to silence is a ratio of windows of the signal that are silent to windows of the signal that are non-silent.
- The method of claim 2, wherein adjacent windows are overlapping.
- The method of claim 2 or 3, wherein a window is silent in case its Root Mean Square (RMS) power is below a third threshold.
- The method of any one of claims 1 to 4, wherein the descriptor related to loudness is a standard deviation for power of the signal.
- A device (400) for determining if sound is natural, comprising: a hardware input interface (440) configured to obtain a signal corresponding to sound in an environment; and at least one hardware processor (410) configured to: calculate from the signal at least one of a descriptor related to loudness and a descriptor related to silence; and determine that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.
- The device of claim 6, wherein the descriptor related to silence is a ratio of windows of the signal that are silent to windows of the signal that are non-silent.
- The device of claim 7, wherein adjacent windows are overlapping.
- The device of claim 7 or 8, wherein a window is silent in case its Root Mean Square (RMS) power is below a third threshold.
- The device of any one of claims 6 to 9, wherein the descriptor related to loudness is a standard deviation for power of the signal.
- The device of any one of claims 6 to 10, wherein the input interface (440) is configured to capture the sound.
- The device of claim 11, wherein the input interface (440) comprises a microphone.
- The device of any one of claims 6 to 12, further comprising an output interface (450) for outputting information about whether the sound is natural or artificial.
- A computer program comprising instructions that, when executed, cause at least one hardware processor (410) to perform the method of any one of claims 1-5.
- A non-transitory, computer-readable storage medium (460) including instructions that, when executed, cause at least one hardware processor (410) to perform the method of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP17305754.8A EP3419021A1 (en) | 2017-06-20 | 2017-06-20 | Device and method for distinguishing natural and artificial sound |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3419021A1 true EP3419021A1 (en) | 2018-12-26 |
Family
ID=59298418
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP17305754.8A Withdrawn EP3419021A1 (en) | 2017-06-20 | 2017-06-20 | Device and method for distinguishing natural and artificial sound |
Country Status (1)
Country | Link |
---|---|
EP (1) | EP3419021A1 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130204607A1 (en) * | 2011-12-08 | 2013-08-08 | Forrest S. Baker III Trust | Voice Detection For Automated Communication System |
Non-Patent Citations (1)
Title |
---|
ADAM GREENHALL ET AL: "Cepstral mean based speech source discrimination", ACOUSTICS SPEECH AND SIGNAL PROCESSING (ICASSP), 2010 IEEE INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 14 March 2010 (2010-03-14), pages 4490 - 4493, XP031697529, ISBN: 978-1-4244-4295-9 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109682676A (en) * | 2018-12-29 | 2019-04-26 | 上海工程技术大学 | A kind of feature extracting method of the acoustic emission signal of fiber tension failure |
EP3828888A1 (en) * | 2019-11-27 | 2021-06-02 | Thomson Licensing | Method for recognizing at least one naturally emitted sound produced by a real-life sound source in an environment comprising at least one artificial sound source, corresponding apparatus, computer program product and computer-readable carrier medium |
WO2021104818A1 (en) * | 2019-11-27 | 2021-06-03 | Thomson Licensing | Method for recognizing at least one naturally emitted sound produced by a real-life sound source in an environment comprising at least one artificial sound source, corresponding apparatus, computer program product and computer-readable carrier medium |
US11930332B2 (en) | 2019-11-27 | 2024-03-12 | Thomson Licensing | Method for recognizing at least one naturally emitted sound produced by a real-life sound source in an environment comprising at least one artificial sound source, corresponding apparatus, computer program product and computer-readable carrier medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
AX | Request for extension of the european patent |
Extension state: BA ME |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20190627 |