EP3419021A1 - Device and method for distinguishing natural and artificial sound - Google Patents

Info

Publication number
EP3419021A1
EP3419021A1 EP17305754.8A EP17305754A EP3419021A1 EP 3419021 A1 EP3419021 A1 EP 3419021A1 EP 17305754 A EP17305754 A EP 17305754A EP 3419021 A1 EP3419021 A1 EP 3419021A1
Authority
EP
European Patent Office
Prior art keywords
sound
signal
descriptor related
artificial
windows
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP17305754.8A
Other languages
German (de)
French (fr)
Inventor
Jean-Ronan Vigouroux
Alexey Ozerov
Erwan Le Merrer
Philippe Gilberton
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thomson Licensing SAS
Original Assignee
Thomson Licensing SAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thomson Licensing SAS
Priority to EP17305754.8A
Publication of EP3419021A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Device (400) and method for determining if sound is artificial. A hardware input interface (440) obtains (S510) a signal corresponding to sound in an environment, and at least one hardware processor (410) calculates (S520, S530) from the signal at least one of a descriptor related to loudness and a descriptor related to silence, and determines (S540) that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.

Description

    TECHNICAL FIELD
  • The present disclosure relates generally to audio recognition and in particular to determining if sound is natural or artificial.
  • BACKGROUND
  • This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present disclosure that are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
  • Audio (acoustic, sound) recognition is particularly suitable for monitoring people's activity as it is relatively non-intrusive, requires no detectors other than microphones and is relatively accurate.
  • Figure 1 illustrates a generic conventional audio classification pipeline 100. It comprises an audio sensor 110 that captures a raw audio signal and a preprocessing module 120 that prepares the captured audio for a feature extraction module 130. The extracted features are passed to a classifier module 140, which uses entries in an audio database 150 to label the audio that is then output.
  • The labelled audio can then be used to, for example, determine the activities of persons (and even pets) in the location where the audio was captured. Knowledge of the activities can be used in situations like e-health, care of children or the elderly, and home security. In addition, parents could use the knowledge to determine what their children do when they are alone at home: for instance, after school, are they doing their homework or watching television?
  • In some of these cases, it can be important to distinguish between natural sound - for example talking or singing persons, or a barking dog - and artificial sound, i.e. sound that is rendered by a rendering device such as a radio, a television or a hi-fi system. It will be appreciated that persons talking on the television could be mistaken for real persons having a conversation. So far, this issue appears to have no suitable conventional solution.
  • It will be appreciated that there is a desire for a solution that addresses this problem. The present principles provide such a solution.
  • SUMMARY OF DISCLOSURE
  • In a first aspect, the present principles are directed to a method for determining if sound is artificial. At a device, a hardware input interface obtains a signal corresponding to sound in an environment and at least one hardware processor calculates from the signal at least one of a descriptor related to loudness and a descriptor related to silence, and determines that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.
  • Various embodiments of the first aspect include:
    • That the descriptor related to silence is a ratio of windows of the signal that are silent to windows of the signal that are non-silent. Adjacent windows can be overlapping. A window can be deemed as silent in case its Root Mean Square (RMS) power is below a third threshold.
    • That the descriptor related to loudness is a standard deviation for power of the signal.
  • In a second aspect, the present principles are directed to a device for determining if sound is artificial, comprising a hardware input interface configured to obtain a signal corresponding to sound in an environment, and at least one hardware processor configured to calculate from the signal at least one of a descriptor related to loudness and a descriptor related to silence, and determine that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.
  • Various embodiments of the second aspect include:
    • That the descriptor related to silence is a ratio of windows of the signal that are silent to windows of the signal that are non-silent. Adjacent windows can be overlapping. A window can be deemed as silent in case its Root Mean Square (RMS) power is below a third threshold.
    • That the descriptor related to loudness is a standard deviation for power of the signal.
    • That the input interface is configured to capture the sound. The input interface can comprise a microphone.
    • That the device further comprises an output interface for outputting information about whether the sound is natural or artificial.
  • In a third aspect, the present principles are directed to a computer program comprising program code instructions executable by a processor for implementing the method according to the first aspect.
  • In a fourth aspect, the present principles are directed to a computer program product which is stored on a non-transitory computer readable medium and comprises program code instructions executable by a processor for implementing the method according to the first aspect.
  • BRIEF DESCRIPTION OF DRAWINGS
  • Preferred features of the present principles will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
    • Figure 1 illustrates a generic conventional audio classification pipeline;
    • Figure 2 illustrates conventional downward compression with a hard knee;
    • Figure 3 illustrates a signal without dynamic range compression and the same signal with dynamic range compression;
    • Figure 4 illustrates a device for audio distinction according to the present principles; and
    • Figure 5 illustrates a flowchart for a method of audio distinction according to the present principles.
    DESCRIPTION OF EMBODIMENTS
  • One way of monitoring a person in order to, for instance, anticipate problems, is to verify if the habits of the person are followed. To do this, it can be useful to classify ambient sound in the person's location as:
    • no sound, i.e., silence.
    • natural ambient sound, such as for example people who are physically present speaking or cooking, or a dog barking.
    • artificial ambient sound, such as sound coming from a radio, a television, or a hi-fi system. In this context, "artificial" means that the sound was processed for broadcast or recording and subsequent rendering.
  • To detect artificial ambient sound, the present principles rely on the fact that most artificial audio sources use dynamic range compression to enhance the sound and to make it more present. It is for example possible to enhance the sound to avoid a clipping effect or amplifier chain saturation, or to better fit a Frequency Modulation standard that has a limited frequency spectrum range.
  • Dynamic range compression, which is a very common technique in the broadcast chain and in media content workflows, amplifies parts of the audio signal with low amplitude (upward compression), reduces the loud parts of the sound (downward compression), or both. On the other hand, natural sounds tend to be characterized by a wider dynamic range, which typically means that more low power sounds tend to be present in a natural audio signal than in a dynamic range compressed audio signal. Hence, detecting such dynamic differences within the sound can help differentiate artificial and natural sound.
  • Dynamic Range Compression
  • Dynamic Range Compression (DRC) for audio will now be described in further detail. As already mentioned, DRC can amplify quiet sounds, attenuate loud sounds, or both.
  • Figure 2 illustrates conventional downward compression with a hard knee. A compression function curve 210 has a first part 212 that is neutral - i.e., an input level transformed by this part results in an equal output level. The curve further has a second part 214 that meets the first part 212 at a hard knee 216. The second part 214 performs downward compression, which means that an input level Li is transformed into a lower output level Lo. Figure 2 also shows a threshold 220 that lies between the first part 212 and the second part 214. In this example, the threshold 220 coincides with the hard knee 216, but it will be appreciated that in case a soft knee is used, this will extend around the threshold 220 and comprise part of the first part 212 and the second part 214 as well.
  • It will also be understood that for upward compression, the first part of the curve would be flatter so that an input level results in a higher output level (except perhaps at the hard knee). It will further be understood that the function can allow both downward and upward compression, in which case the first part and the second part can have identical slopes or different slopes.
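As a rough illustration of the hard-knee downward compression curve described for Figure 2, the following minimal sketch maps an input level to an output level in decibels. The function name, the default threshold of -20 dB and the 4:1 ratio are assumptions chosen for illustration; the disclosure does not specify any of them.

```python
def compress_level_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Static downward compression with a hard knee (illustrative sketch).

    threshold_db and ratio are hypothetical values, not taken from the
    disclosure. Levels are expressed in decibels.
    """
    if level_db <= threshold_db:
        # Neutral part of the curve: output level equals input level.
        return level_db
    # Compressed part: the excess over the threshold is divided by the ratio,
    # so the output level is lower than the input level.
    return threshold_db + (level_db - threshold_db) / ratio


if __name__ == "__main__":
    for level in (-30.0, -20.0, -10.0, 0.0):
        print(level, "->", compress_level_db(level))
```

Below the threshold the curve is the identity; above it, the slope drops to 1/ratio, which is the hard knee. A soft knee would replace the abrupt change with a smooth transition around the threshold.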
  • DRC can for example be used:
    • In public spaces to make music sound louder without having to increase the peak amplitude.
    • In music production for a better mix between vocals and instruments.
    • In voice processing to avoid sibilance.
    • In broadcasting to fit a broadcast signal with narrow range, as will be explained in more detail.
    • In marketing to increase the impact of commercials.
    • To protect circuitry in devices with amplifiers, and also to avoid clipping or saturation effects.
    • In hearing aids and headphones to make certain sounds more audible while others are attenuated.
  • Figure 3 illustrates a signal 310 without dynamic range compression and the same signal 320 with dynamic range compression. As can be seen, the loud parts have been attenuated (downward compression) and the low parts have been amplified (upward compression).
  • In the case of FM (Frequency Modulation) radio broadcasting, the characteristic of the frequency modulation limits the frequency spectrum range, which in turn limits the acoustic dynamic range. If the frequency spectrum range is not respected, this will result in spectrum overlaps and audio distortion. Simply reducing the amplitude of the signal fed to the modulator so that it never clips the signal requires a substantial reduction of the input signal, which results in a reduction in the signal-to-noise ratio (SNR). A lower SNR in turn means that a listener will hear more transmission noise, especially during the quieter parts of the transmission.
  • The effect on FM also applies to digital radio that includes an ADC (Analog to Digital Converter) in front of the modulator and for which dynamic range is limited.
  • DRC can also protect the audio amplification chain, as well as any speakers, from saturation when they are not dimensioned to render the natural dynamic range.
  • In addition, compressing broadcast radio FM or broadcast TV enables the high-power amplifier transmitter required to broadcast the signal over the air to transmit using a more constant output power. Doing so can increase the lifetime of the amplifier. Indeed, the standardization community tries to find the best compromise between audio quality for the end user and economy when it comes to the broadcasting infrastructure.
  • Further, Automatic Gain Control (AGC) is useful for microphone capture when a speaker talks over a low background sound that should be shared with the audience. AGC aims to provide a controlled output signal level regardless of the input signal. In other words, weak input signals are amplified and loud input signals are attenuated. The outcome is a less dynamic sound that is suitable for network broadcasting.
  • It will be understood that an audio signal that is broadcast or streamed over the air, a cellular network or a broadband network typically has a compressed acoustic dynamic range. Hence, a music or voice audio signal listened to through a speaker has different dynamic properties than audio produced by natural sources like human voices, animal sounds and (non-amplified) instruments.
  • As an effect of DRC is an amplification of sounds below a first amplitude threshold and an attenuation of sounds above a second amplitude threshold (possibly the same as the first threshold), it can be seen that sounds with DRC have a smaller amplitude variance than natural sounds. In addition, most broadcast sources - television, radio, music - tend to avoid silence. Therefore, the proportion of silence will be low for artificial sound sources.
  • Hence, to distinguish artificial ambient sound from natural ambient sound, a device can analyse captured ambient sound to determine at least one of:
    • if the variance of the amplitude is above (natural ambient sound) or below (artificial ambient sound) a variance threshold value, and
    • if the level of silence is above (natural ambient sound) or below (artificial ambient sound) a silence threshold value.
  • One way of calculating the amplitude variance is as follows, but it will be appreciated that other ways exist. First, the captured sound is divided into a number of sections (or windows); the windows can be distinct, but are generally overlapping, with each subsequent window starting at the middle of the preceding one. Each window has an index, denoted k. The windows have the same size, denoted w. The captured sound, i.e. the part for which it should be determined if it is natural or artificial, is thus divided into a set of K possibly overlapping windows.
  • The Root Mean Square (RMS) power of the sound for the window k is defined as:
    $$P_k = \sqrt{\frac{1}{w} \sum_{i=0}^{w-1} s_i^2}$$
    where the s_i are the w contiguous samples of the sound in the window k.
  • The size w may take the value of 1024, but other values such as 2048 have also been contemplated.
  • The output for the different windows defines a series of instantaneous power values Pk for the K windows of the captured sound signal.
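The windowing and per-window RMS power described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, samples are taken to be a plain list of floats, and trailing samples that do not fill a complete window are dropped (the text does not say how they should be handled).

```python
import math


def split_windows(samples, w=1024, overlap=True):
    """Split samples into windows of size w.

    With overlap, each window starts at the middle of the previous one.
    Trailing samples that do not fill a window are dropped (assumption).
    """
    hop = w // 2 if overlap else w
    return [samples[start:start + w]
            for start in range(0, len(samples) - w + 1, hop)]


def rms_power(window):
    """RMS power P_k of one window: sqrt of the mean of squared samples."""
    return math.sqrt(sum(s * s for s in window) / len(window))


def window_powers(samples, w=1024, overlap=True):
    """Series of power values P_k for the K windows of the signal."""
    return [rms_power(win) for win in split_windows(samples, w, overlap)]


powers = window_powers([1.0] * 4096)
print(len(powers), powers[0])
```

Half-window overlap roughly doubles the number of power values K compared with distinct windows, which gives a smoother series at the cost of more computation.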
  • The mean power can then be calculated as
    $$\bar{P} = \frac{1}{K} \sum_{k=1}^{K} P_k$$
    and the standard deviation as
    $$\sigma_P = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left(P_k - \bar{P}\right)^2}$$
  • Since amplitude and power are inextricably linked, the standard deviation for the power is also an indirect measure of the standard deviation for the amplitude.
  • It is preferred to obtain a normalised measure by dividing the standard deviation by the mean power:
    $$CV(P) = \frac{\sigma_P}{\bar{P}}$$
  • This coefficient of variation of the power is a first descriptor, related to loudness, used to distinguish natural sounds from artificial sounds.
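As a sketch of the first descriptor, the function below computes the mean power, the (population) standard deviation and their ratio CV(P) from a list of window powers. The function name is hypothetical, and returning 0.0 when the mean power is zero is an assumption made here to avoid a division by zero; the text does not cover that case.

```python
import math


def coefficient_of_variation(powers):
    """First descriptor CV(P): standard deviation of P_k over mean of P_k."""
    K = len(powers)
    mean = sum(powers) / K
    if mean == 0:
        return 0.0  # assumption: an all-zero signal has no variation
    std = math.sqrt(sum((p - mean) ** 2 for p in powers) / K)
    return std / mean


print(coefficient_of_variation([1.0, 3.0]))
```

Because the result is normalised by the mean, it does not depend on the overall recording gain: scaling every P_k by the same factor leaves CV(P) unchanged.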
  • To calculate the level of silence, first the windows whose RMS power is below a given threshold τ are marked as 'silent'.
  • Then, in an optional step, consecutive windows marked 'silent' are grouped in 'silent' groups, and consecutive windows marked 'non-silent' are grouped in 'non-silent' groups. The signal is therefore seen as a series of interleaved 'silent' and 'non-silent' groups. To clean the signal of anomalous or outlying events, groups of 'non-silent' windows smaller than a certain size (such as a few windows, e.g. three) are marked 'silent'.
  • Finally, the second descriptor, related to silence, for distinguishing sound is the proportion of 'silent' windows over the number of windows K of the signal subject to examination:
    $$\frac{\text{number of 'silent' windows}}{K}$$
    where K is the number of windows considered.
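The silence descriptor, including the optional clean-up step, can be sketched as below. The function name and the parameter `min_run` are hypothetical; `min_run=3` reflects the "e.g. three" mentioned in the text, and the value of the threshold τ (`tau`) is left to the caller since the text does not give one.

```python
def silence_ratio(powers, tau, min_run=3):
    """Second descriptor: proportion of 'silent' windows among K windows.

    A window is 'silent' when its RMS power is below tau. Groups of
    consecutive 'non-silent' windows shorter than min_run are re-marked
    'silent' (the optional clean-up step).
    """
    silent = [p < tau for p in powers]
    n = len(silent)
    k = 0
    while k < n:
        if silent[k]:
            k += 1
            continue
        end = k
        while end < n and not silent[end]:
            end += 1                      # find the end of this non-silent group
        if end - k < min_run:
            for j in range(k, end):
                silent[j] = True          # short group treated as an outlier
        k = end
    return sum(silent) / n if n else 0.0


powers = [0.0, 0.0, 0.9, 0.0, 0.8, 0.8, 0.8, 0.0]
print(silence_ratio(powers, tau=0.1))
```

Setting `min_run=1` disables the clean-up, since no group of non-silent windows can be shorter than one window.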
  • In a variation, the detection of the silent windows may occur before the calculation of the descriptor CV(P) explained above, and this descriptor may be computed only on the windows which are marked as 'non silent'.
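A minimal sketch of this variation follows, assuming the same silence threshold τ (`tau`) is used to select the windows. The function name is hypothetical, and returning 0.0 when no non-silent window remains (or when the mean is zero) is an assumption not covered by the text.

```python
import math


def cv_non_silent(powers, tau):
    """CV(P) computed only over windows whose power is not below tau."""
    kept = [p for p in powers if p >= tau]
    if not kept:
        return 0.0  # assumption: nothing but silence, so no variation
    mean = sum(kept) / len(kept)
    if mean == 0:
        return 0.0  # only reachable when tau <= 0
    std = math.sqrt(sum((p - mean) ** 2 for p in kept) / len(kept))
    return std / mean


print(cv_non_silent([0.0, 0.0, 1.0, 3.0], tau=0.1))
```

Excluding silent windows keeps long pauses from inflating the power variation, so the two descriptors carry less overlapping information.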
  • Both descriptors described above are expected to have a high value for a natural sound (high variation of the power, large number of silent windows) and a low value for an artificial sound (power constantly high, nearly no silent windows). This is used in a classification system as described hereafter.
  • To classify the sound, different possibilities exist. A first possibility is to take the first and second descriptors as input to a supervised classifier that is trained to separate the natural sound from the artificial sound. The supervised classifier may for instance be based on a decision tree, using two thresholds corresponding to the two descriptors.
  • A second possibility is to use a set of conditions such as:
    • IF descriptor1 > threshold1 THEN sound is artificial
    • ELSE IF descriptor2 > threshold2 THEN sound is natural
    • ELSE IF descriptor1 > threshold3 AND descriptor2 < threshold4 THEN sound is artificial
    • ELSE IF descriptor1 < threshold5 AND descriptor2 > threshold6 THEN sound is natural
    where descriptor1 is the descriptor related to loudness and descriptor2 is the descriptor related to silence, and the various thresholds are thresholds in the model used for the determination.
  • Naturally, there are many ways of expressing the conditions, using different thresholds that in addition may depend on many things such as locality and equipment.
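One simple way of expressing such conditions is a two-threshold rule. The sketch below is not the specific chain of conditions listed above; it only follows the direction stated earlier for the two descriptors (a low power variation or a low proportion of silence suggests artificial sound). The function name and both threshold values are hypothetical placeholders that would have to be tuned for a given locality and equipment.

```python
def classify(cv_power, silence, cv_threshold=0.5, silence_threshold=0.1):
    """Two-threshold rule (threshold values are placeholders, not from the text).

    cv_power: first descriptor, coefficient of variation of the power.
    silence:  second descriptor, proportion of silent windows.
    """
    if cv_power < cv_threshold or silence < silence_threshold:
        return "artificial"
    return "natural"


print(classify(cv_power=0.1, silence=0.3))
print(classify(cv_power=0.9, silence=0.3))
```

A trained decision tree over the same two inputs would take the same form, with the thresholds learned from labelled recordings rather than set by hand.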
  • Figure 4 illustrates a device for audio distinction 400 according to the present principles. The device 400 comprises at least one hardware processing unit ("processor") 410 configured to execute instructions of a first software program and to process audio for distinction, as described herein. The device 400 further comprises at least one memory 420 (for example ROM, RAM and Flash or a combination thereof) configured to store the software program and data required to distinguish sound. The device 400 can also comprise at least one user communications interface ("User I/O") 430 for interfacing with a user.
  • The device 400 further comprises an input interface 440 and an output interface 450. The input interface 440 is configured to obtain audio for distinguishing; the input interface 440 can be adapted to capture audio, for example a microphone, but it can also be an interface adapted to receive captured audio. The output interface 450 is configured to output information about distinguished audio - is it natural or artificial sound - for example for presentation on a screen or by transfer to a further device.
  • A non-transitory, computer-readable storage medium 460 includes a computer program with instructions that, when executed by the processor 410, perform the methods described herein.
  • The processor 410 can also be configured to use the distinction to determine user activity as described in the background part of the description.
  • The device 400 is preferably implemented as a single device such as a gateway, but its functionality can also be distributed over a plurality of devices.
  • In some cases, the processor 410 may have access to other data and use this data to determine that the sound has been incorrectly classified, for example in case the sound was classified as natural and the data originates from another device and indicates that artificial sound is indeed rendered in the environment where the processor 410 is located. If this occurs regularly, it could mean that the classification model used by the processor 410 is not accurate enough. In this case, the device 400 can send the anonymized descriptors that caused the incorrect classification to a server, so that the global model can be adapted to these descriptors (i.e. recomputed with those new inputs). The global model can then be distributed to the individual devices. In such an implementation, a stream processing big-data infrastructure such as Storm or Spark is particularly relevant.
  • Figure 5 illustrates a flowchart for a method of audio distinction according to the present principles. In step S510 the device 400 obtains captured sound, either by capturing it itself or receiving captured sound from another device. In step S520, the processor 410 calculates power standard deviation, i.e., the first descriptor, (as a measure of amplitude standard deviation) as already explained. In step S530, the processor 410 calculates the silence level, i.e., the second descriptor, as already described. Finally, in step S540, the processor uses the first and second descriptors to determine if the captured sound is natural or artificial, as already described.
  • The processor 410 can then for example output information on whether the sound is natural or artificial through the output interface 450 or use this information internally as input to other functions.
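The steps of Figure 5 can be put together in a short end-to-end sketch. Everything below is an illustration under stated assumptions: the function names, the silence threshold `tau`, the two decision thresholds and the synthetic test signals are hypothetical, the optional clean-up of short non-silent groups is omitted for brevity, and the decision rule is a simple two-threshold rule rather than a trained model.

```python
import math


def window_powers(samples, w=1024):
    """RMS power of half-overlapping windows of size w."""
    hop = w // 2
    return [math.sqrt(sum(s * s for s in samples[i:i + w]) / w)
            for i in range(0, len(samples) - w + 1, hop)]


def first_descriptor(powers):
    """Coefficient of variation of the window powers."""
    mean = sum(powers) / len(powers)
    if mean == 0:
        return 0.0  # assumption: all-zero signal has no variation
    std = math.sqrt(sum((p - mean) ** 2 for p in powers) / len(powers))
    return std / mean


def second_descriptor(powers, tau):
    """Proportion of windows whose power is below tau."""
    return sum(p < tau for p in powers) / len(powers)


def distinguish(samples, w=1024, tau=0.01,
                cv_threshold=0.5, silence_threshold=0.1):
    """S510: samples are given; S520/S530: descriptors; S540: decision."""
    powers = window_powers(samples, w)
    cv = first_descriptor(powers)               # S520
    silence = second_descriptor(powers, tau)    # S530
    artificial = cv < cv_threshold or silence < silence_threshold  # S540
    return "artificial" if artificial else "natural"


# Synthetic stand-ins for captured sound (assumed 16 kHz sampling).
steady = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
bursty = [0.8 * math.sin(2 * math.pi * 440 * n / 16000)
          if (n // 4000) % 2 == 0 else 0.0 for n in range(16000)]
print(distinguish(steady), distinguish(bursty))
```

The steady tone stands in for compressed, silence-free broadcast audio, while the bursty signal alternates sound and silence as a crude stand-in for natural activity.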
  • It will be appreciated that the present principles can provide a solution for audio recognition that can enable:
    • Respect of users' privacy since the sound can be distinguished in a device located in the users' location rather than being sent to a device "in the cloud".
    • A small footprint on the distinguishing device since it is sufficient to retain the model, some variables and the present sound windows.
  • It should be understood that the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces. Herein, the phrase "coupled" is defined to mean directly connected to or indirectly connected with through one or more intermediate components. Such intermediate components may include both hardware and software based components.
  • The present description illustrates the principles of the present disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the disclosure and are included within its scope.
  • All examples and conditional language recited herein are intended for educational purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions.
  • Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
  • Thus, for example, it will be appreciated by those skilled in the art that the block diagrams presented herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudocode, and the like represent various processes which may be substantially represented in computer readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
  • The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage.
  • Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
  • In the claims hereof, any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.

Claims (15)

  1. A method for determining if sound is artificial, the method comprising at a device (400):
    obtaining (S510), by a hardware input interface (440) a signal corresponding to sound in an environment;
    calculating (S520, S530), by at least one hardware processor (410) from the signal at least one of a descriptor related to loudness and a descriptor related to silence; and
    determining (S540), by the at least one hardware processor (410), that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.
  2. The method of claim 1, wherein the descriptor related to silence is a ratio of windows of the signal that are silent to windows of the signal that are non-silent.
  3. The method of claim 2, wherein the adjacent windows are overlapping.
  4. The method of claim 2 or 3, wherein a window is silent in case its Root Mean Square (RMS) power is below a third threshold.
  5. The method of any one of claims 1 to 4, wherein the descriptor related to loudness is a standard deviation for power of the signal.
  6. A device (400) for determining if sound is natural, comprising:
    a hardware input interface (440) configured to obtain a signal corresponding to sound in an environment; and
    at least one hardware processor (410) configured to:
    calculate from the signal at least one of a descriptor related to loudness and a descriptor related to silence; and
    determine that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.
  7. The device of claim 6, wherein the descriptor related to silence is a ratio of windows of the signal that are silent to windows of the signal that are non-silent.
  8. The device of claim 7, wherein the adjacent windows are overlapping.
  9. The device of claim 7 or 8, wherein a window is silent in case its Root Mean Square (RMS) power is below a third threshold.
  10. The device of any one of claims 6 to 9, wherein the descriptor related to loudness is a standard deviation for power of the signal.
  11. The device of any one of claims 6 to 10, wherein the input interface (440) is configured to capture the sound.
  12. The device of claim 11, wherein the input interface (440) comprises a microphone.
  13. The device of any one of claims 6 to 12, further comprising an output interface (450) for outputting information about whether the sound is natural or artificial.
  14. A computer program comprising instructions that, when executed, cause at least one hardware processor (410) to perform the method of any one of claims 1-5.
  15. A non-transitory, computer-readable storage medium (460) including instructions that, when executed, cause at least one hardware processor (410) to perform the method of any one of claims 1-5.
EP17305754.8A 2017-06-20 2017-06-20 Device and method for distinguishing natural and artificial sound Withdrawn EP3419021A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP17305754.8A EP3419021A1 (en) 2017-06-20 2017-06-20 Device and method for distinguishing natural and artificial sound

Publications (1)

Publication Number Publication Date
EP3419021A1 true EP3419021A1 (en) 2018-12-26

Family

ID=59298418

Family Applications (1)

Application Number Title Priority Date Filing Date
EP17305754.8A Withdrawn EP3419021A1 (en) 2017-06-20 2017-06-20 Device and method for distinguishing natural and artificial sound

Country Status (1)

Country Link
EP (1) EP3419021A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130204607A1 (en) * 2011-12-08 2013-08-08 Forrest S. Baker III Trust Voice Detection For Automated Communication System

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ADAM GREENHALL ET AL: "Cepstral mean based speech source discrimination", ACOUSTICS SPEECH AND SIGNAL PROCESSING (ICASSP), 2010 IEEE INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 14 March 2010 (2010-03-14), pages 4490 - 4493, XP031697529, ISBN: 978-1-4244-4295-9 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109682676A (en) * 2018-12-29 2019-04-26 上海工程技术大学 A kind of feature extracting method of the acoustic emission signal of fiber tension failure
EP3828888A1 (en) * 2019-11-27 2021-06-02 Thomson Licensing Method for recognizing at least one naturally emitted sound produced by a real-life sound source in an environment comprising at least one artificial sound source, corresponding apparatus, computer program product and computer-readable carrier medium
WO2021104818A1 (en) * 2019-11-27 2021-06-03 Thomson Licensing Method for recognizing at least one naturally emitted sound produced by a real-life sound source in an environment comprising at least one artificial sound source, corresponding apparatus, computer program product and computer-readable carrier medium
US11930332B2 (en) 2019-11-27 2024-03-12 Thomson Licensing Method for recognizing at least one naturally emitted sound produced by a real-life sound source in an environment comprising at least one artificial sound source, corresponding apparatus, computer program product and computer-readable carrier medium

Similar Documents

Publication Publication Date Title
US9609441B2 (en) Smart hearing aid
CN102124758B (en) Hearing aid, hearing assistance system, walking detection method, and hearing assistance method
US10275209B2 (en) Sharing of custom audio processing parameters
Perez-Gonzalez et al. Automatic gain and fader control for live mixing
CN108235181B (en) Method for noise reduction in an audio processing apparatus
US10555069B2 (en) Approach for detecting alert signals in changing environments
US10853025B2 (en) Sharing of custom audio processing parameters
EP3419021A1 (en) Device and method for distinguishing natural and artificial sound
US11894006B2 (en) Compressor target curve to avoid boosting noise
CN109634554B (en) Method and device for outputting information
WO2019002417A1 (en) Sound responsive device and method
CN114902560B (en) Apparatus and method for automatic volume control with ambient noise compensation
US11064301B2 (en) Sound level control for hearing assistive devices
CN110022514B (en) Method, device and system for reducing noise of audio signal and computer storage medium
KR20210086217A (en) Hoarse voice noise filtering system
CN113259826B (en) Method and device for realizing hearing aid in electronic terminal
US11490211B2 (en) Directivity hearing-aid device and method thereof
WO2008075305A1 (en) Method and apparatus to address source of lombard speech
CN112887856B (en) Sound processing method and system for reducing noise
CN115967894B (en) Microphone sound processing method, system, terminal equipment and storage medium
CN213877572U (en) Human voice enhancement and environment prediction system based on deep learning
EP4303874A1 (en) Providing a measure of intelligibility of an audio signal
JPH08298698A (en) Environmental sound analyzer
Ilmi et al. Automatic control music amplifier using speech signal utilizing by TMS320C6713
TWI584275B (en) Electronic device and method for analyzing and playing sound signal

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20190627