CN110363148A - Method for fusion verification of face and voiceprint features - Google Patents
Method for fusion verification of face and voiceprint features
- Publication number
- CN110363148A CN110363148A CN201910641594.7A CN201910641594A CN110363148A CN 110363148 A CN110363148 A CN 110363148A CN 201910641594 A CN201910641594 A CN 201910641594A CN 110363148 A CN110363148 A CN 110363148A
- Authority
- CN
- China
- Prior art keywords
- frequency
- image
- time-domain signal
- cepstrum
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4038—Image mosaicing, e.g. composing plane images from plane sub-images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/60—Rotation of whole images or parts thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/10—Image enhancement or restoration using non-spatial domain filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2218/00—Aspects of pattern recognition specially adapted for signal processing
- G06F2218/02—Preprocessing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2218/00—Aspects of pattern recognition specially adapted for signal processing
- G06F2218/08—Feature extraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2218/00—Aspects of pattern recognition specially adapted for signal processing
- G06F2218/12—Classification; Matching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20048—Transform domain processing
- G06T2207/20052—Discrete cosine transform [DCT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20048—Transform domain processing
- G06T2207/20056—Discrete and fast Fourier transform, [DFT, FFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
- G06T2207/30201—Face
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Signal Processing (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Collating Specific Patterns (AREA)
- Image Analysis (AREA)
Abstract
The present invention provides a method for fusion verification of face and voiceprint features, comprising the following steps: parsing an input audio file into a time-domain signal of the sound; converting the time-domain signal into a frequency-domain signal through short-time Fourier transform with windowing and framing; converting the frequencies, through logarithmic spectrum transformation, into a scale perceived linearly by the human ear; separating the direct-current signal component and the sinusoidal signal components of the converted frequency-domain signal through cepstral analysis using the DCT transform; extracting the sound spectrum feature vector and converting the vector into an image; and fusing this image with a two-dimensional face image. The proposed method achieves the effect of performing verification only once, avoids the problem in application-layer joint verification where a false detection in one mode causes the whole verification to fail, and improves the user experience.
Description
Technical Field
The invention relates to the field of biometric recognition, and in particular to a method for fusion verification of face and voiceprint features.
Background
In recent years, with the continuous development and maturation of deep learning and computer vision technology, identity authentication technologies based on computer vision have developed rapidly within biometric recognition. Face recognition in particular, being contactless and fast, is widely applied in services that require identity verification. Voiceprint recognition, another form of biometric recognition, identifies a person from the acoustic characteristics of the speaker's voice; this identification is independent of accent and language and can be used for both speaker identification and speaker verification.
Although various biometric technologies have gradually been developed and applied in daily life and industry, they are typically used in isolation, and a single verification mode always suffers from missed detections and false detections. A current focus of research and application is therefore to use several biometric technologies jointly, such as face recognition together with voiceprint recognition, to achieve higher security and accuracy. However, existing approaches only combine the two recognition technologies at the application layer: for example, face recognition plus voiceprint recognition means that face recognition must pass first and voiceprint recognition must then also pass. The face features and the voiceprint features are not fused at the underlying feature level, so the effect of performing verification only once cannot be achieved, a false detection in either single mode causes the whole verification to fail, and the user experience is degraded.
Disclosure of Invention
In view of the defects in the prior art, the object of the invention is to provide a method for fusion verification of face and voiceprint features.
The technical solution adopted by the invention to achieve this object is as follows. A method for fusion verification of face and voiceprint features comprises the following steps:
parsing an input sound file into a time-domain signal of the sound;
converting the time-domain signal into a frequency-domain signal through short-time Fourier transform with windowing and framing;
converting the frequencies in the frequency-domain signal, through logarithmic spectrum transformation, into a scale perceived linearly by the human ear;
separating the direct-current signal component and the sinusoidal signal components of the converted frequency-domain signal through cepstral analysis using the DCT (discrete cosine transform);
extracting the sound spectrum feature vector and converting the vector into an image;
and fusing the image with a two-dimensional face image.
The time-domain signal is converted into a frequency-domain signal through short-time Fourier transform with windowing and framing, specifically:
selecting a time-frequency localized window function h(t), and calculating the power spectra at different moments through the short-time Fourier transform, wherein the formula of the short-time Fourier transform is as follows:
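A standard continuous-time form consistent with the symbols defined below (with ω denoting the angular frequency and h the window function) is:
STFT(t,ω) = ∫ f(τ)*h(τ-t)*e^(-jωτ) dτ
and the power spectrum at moment t is |STFT(t,ω)|².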
where f (τ) represents the time-domain signal of the input audio, τ represents the integration variable, and t represents the different time instants.
The window function is a Hamming window.
The conversion of the frequency, through logarithmic spectrum transformation, into a scale perceived linearly by the human ear is specifically as follows:
converting the frequency scale into a logarithmic frequency scale by the following formula, so that the human ear's perception of frequency becomes linear:
mel(f)=2595*log10(1+f/700)
where mel(f) represents a logarithmic frequency, and f represents a frequency obtained by the short-time Fourier transform.
Through cepstral analysis, the direct-current signal component and the sinusoidal signal components in the converted frequency-domain signal are separated using the DCT (discrete cosine transform), specifically:
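A standard DCT form consistent with the symbols defined below is:
mfcc(u) = Σ_{i=0…N-1} mel(i)*cos(π*u*(2i+1)/(2N)), u = 0, 1, …, N-1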
where mfcc (u) represents a cepstrum, mel (i) represents a logarithmic frequency, N represents the number of frequency bins, and u represents the frequency bins of the cepstrum.
The extraction of the sound spectrum feature vector and the conversion of the vector into an image are specifically as follows:
the range of the output vector is:
mfcc∈[min,max]
which is linearly transformed to the image range:
pixel∈[0,255]
thus, a cepstrum of a sound is obtained, the horizontal axis of which is time and the vertical axis of which is frequency; where mfcc denotes a cepstrum, min denotes a minimum value of mfcc, max denotes a maximum value of mfcc, and pixel denotes a pixel after conversion into an image.
The image and the two-dimensional face image are fused, and the method specifically comprises the following steps:
the cepstrum is rotated clockwise by 90 degrees; if the length of the horizontal axis of the two-dimensional face image is not consistent with that of the cepstrum rotated by 90 degrees, the two-dimensional face image is scaled so that the horizontal-axis lengths of the two images are consistent, and the two images are then spliced together.
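A minimal sketch of this splicing step, assuming the cepstrum and the face image are already available as 8-bit grayscale arrays; OpenCV and NumPy are used only for illustration (the filing does not name a toolkit), and splicing is interpreted as stacking the rotated cepstrum after the face image once their widths match:

```python
import cv2
import numpy as np

def fuse_face_and_cepstrum(face_img: np.ndarray, cepstrum_img: np.ndarray) -> np.ndarray:
    """Rotate the cepstrum image 90 degrees clockwise and splice it after the face image."""
    # Rotate the cepstrum clockwise by 90 degrees; its width is then the (fixed) number of frequency bins.
    rotated = cv2.rotate(cepstrum_img, cv2.ROTATE_90_CLOCKWISE)
    # If the horizontal-axis lengths differ, scale the face image so both widths match.
    if face_img.shape[1] != rotated.shape[1]:
        scale = rotated.shape[1] / face_img.shape[1]
        new_height = max(1, int(round(face_img.shape[0] * scale)))
        face_img = cv2.resize(face_img, (rotated.shape[1], new_height))
    # The spliced picture is the fused feature.
    return np.vstack([face_img, rotated])
```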
The invention has the following advantages and beneficial effects: the proposed method for fusion verification of face and voiceprint features fuses the face features and the voiceprint features at the underlying feature level, so the effect of performing verification only once is achieved; the problem in application-layer joint verification, where a false detection in one mode causes the whole verification to fail, does not arise, and the user experience is improved.
Drawings
FIG. 1 is a time domain signal diagram of sound;
FIG. 2 is a frequency domain signal of sound;
FIG. 3 is a chart of the Hamming window function of the present invention;
FIG. 4 is a graph of the cepstrum of a sound according to the present invention;
FIG. 5 is a network architecture diagram of the VGG16 of the present invention;
FIG. 6 is a flow chart of a method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The flow of the method for fusion verification of face and voiceprint features is shown in Fig. 6: an input sound file is parsed into a time-domain signal of the sound; the time-domain signal is converted into a frequency-domain signal through short-time Fourier transform with windowing and framing; the frequencies in the frequency-domain signal are converted, through logarithmic spectrum transformation, into a scale perceived linearly by the human ear; the direct-current signal component and the sinusoidal signal components of the converted frequency-domain signal are separated through cepstral analysis using the DCT (discrete cosine transform); the sound spectrum feature vector is extracted and converted into an image; and this image is fused with a two-dimensional face image.
Fig. 1 shows a time-domain signal of sound and Fig. 2 a frequency-domain signal of sound. A sound signal is a one-dimensional time-domain signal, from which the pattern of frequency changes is difficult to see intuitively. If it is transformed to the frequency domain by a Fourier transform, the frequency distribution of the signal can be seen, but the time-domain information is lost and the change of the frequency distribution over time cannot be observed. To solve this problem, the short-time Fourier transform (STFT) is used to determine the frequency and phase of the local sinusoidal content of the time-varying signal. The specific method is as follows: a time-frequency localized window function h(t) is selected and assumed to be stationary over a short time interval, so that f(t)h(t) is a stationary signal within each finite time window, and the power spectra at different moments can then be calculated. The window function is chosen as a Hamming window, a cosine-shaped window that reflects well how the energy around a given moment decays with time. The short-time Fourier transform is formulated as:
where f (τ) represents the time-domain signal of the input audio, τ represents the integration variable, and t represents the different time instants.
The window function h(t) is a Hamming window, where n indexes the discrete points of the window function and N represents the total number of discrete points, as shown in Fig. 3.
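A standard discrete form of the Hamming window with these symbols is:
w(n) = 0.54 - 0.46*cos(2πn/(N-1)), 0 ≤ n ≤ N-1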
The unit of frequency is Hertz (Hz), and the audible frequency range of the human ear is 20-20000 Hz; however, the human ear does not perceive frequency in Hz linearly. The ordinary frequency scale is therefore converted into a logarithmic frequency scale, with the mapping given by the following formula:
mel(f)=2595*log10(1+f/700)
where mel(f) represents the logarithmic (mel) frequency, and f represents a frequency obtained by the short-time Fourier transform. Through this formula, the human ear's perception of frequency becomes approximately linear: on this scale, if the mel frequencies of two segments of speech differ by a factor of two, the pitch perceived by the human ear also differs by roughly a factor of two.
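For illustration, the conversion in the formula above can be computed directly; the example frequencies below are arbitrary and only show how higher frequencies are compressed on the logarithmic scale:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to the logarithmic (mel) frequency used in the text."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

for f in (200.0, 2000.0, 8000.0):
    print(f, "->", round(hz_to_mel(f), 1))   # ~283.2, ~1521.4, ~2840.0
```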
On the basis of the logarithmic spectrum, the direct-current signal component and the sinusoidal signal components in the converted frequency-domain signal are separated using the DCT (discrete cosine transform), and the final result is called a cepstrum:
wherein,
where mfcc(u) represents the cepstrum, mel(i) represents a logarithmic frequency, N represents the number of frequency points, u represents a frequency point of the cepstrum, and u = 0 corresponds to the direct-current component.
Since the cepstrum is output as vectors, which cannot be displayed directly as a picture, it needs to be converted into an image matrix. The output vector lies in the range:
mfcc∈[min,max]
and is linearly transformed to the image range:
pixel∈[0,255]
This yields the cepstrum image of the sound shown in Fig. 4, in which the horizontal axis is time and the vertical axis is frequency; the brighter a point, the larger the value (the greater the energy). Here mfcc denotes the cepstrum, min denotes the minimum value of mfcc, max denotes the maximum value of mfcc, and pixel denotes a pixel value after conversion into an image.
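A minimal sketch of this conversion, assuming a standard MFCC implementation stands in for the cepstrum computation described above; librosa is used only for illustration, and the number of coefficients n_mfcc is an assumed parameter, not a value from the filing:

```python
import numpy as np
import librosa

def cepstrum_to_image(wav_path: str, n_mfcc: int = 40) -> np.ndarray:
    """Compute a cepstral feature matrix and linearly map it to an 8-bit image."""
    y, sr = librosa.load(wav_path, sr=None)           # time-domain signal of the input audio
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                window="hamming")      # rows: frequency bins, columns: time frames
    mn, mx = mfcc.min(), mfcc.max()
    # Linear transformation from [min, max] to the image range [0, 255].
    pixel = (mfcc - mn) / (mx - mn) * 255.0
    return pixel.astype(np.uint8)                      # horizontal axis: time, vertical axis: frequency
```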
In general, the number of extracted frequency points is fixed, i.e. the length of the vertical axis of the cepstrum image is fixed. The image in Fig. 4 is rotated clockwise by 90 degrees and spliced behind the face image; if the length of the horizontal axis of the face picture is not consistent with that of the cepstrum rotated by 90 degrees, the face picture is scaled so that the horizontal-axis lengths of the two are consistent. The picture obtained after splicing is the fused feature. Finally, a recognition model is trained with a convolutional neural network. A typical CNN classification model can be abbreviated as two steps:
z=CNN(x)
p=softmax(zW)
where x is the input picture and p is the output probability for each class. During training this is treated as a classification problem with inputs x and the corresponding one-hot labels p; at inference time, however, the whole model is not used: only the CNN(x) part is kept, which converts a picture into a fixed-length vector. With this conversion model (encoder), a new fused face-and-sound feature image can be encoded in any scenario, and verification is reduced to a comparison between encoding vectors, so the original classification model is no longer needed. This completes the algorithm for biometric verification using the fused features.
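A minimal sketch of this encode-and-compare usage, assuming a trained network whose convolutional part plays the role of CNN(x); PyTorch is used only for illustration, and the similarity threshold is a hypothetical value:

```python
import torch
import torch.nn.functional as F

def encode(cnn: torch.nn.Module, fused_image: torch.Tensor) -> torch.Tensor:
    """Map a fused face+cepstrum image tensor (C, H, W) to a fixed-length unit vector z = CNN(x)."""
    cnn.eval()
    with torch.no_grad():
        z = cnn(fused_image.unsqueeze(0))              # add a batch dimension
    return F.normalize(z.flatten(1), dim=1)

def verify(cnn: torch.nn.Module, enrolled: torch.Tensor, probe: torch.Tensor,
           threshold: float = 0.7):
    """Compare two fused-feature images by cosine similarity of their encodings."""
    score = torch.sum(encode(cnn, enrolled) * encode(cnn, probe)).item()
    return score >= threshold, score
```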
The following is one embodiment of the present invention:
A convolutional neural network is used to recognize the fused features; specifically, a VGG16 network with the structure shown in Fig. 5 is used. It has 16 weight layers (excluding pooling and softmax layers), all convolution kernels are 3 × 3, pooling is 2 × 2 max pooling with stride 2, and the channel depth of the convolutional stages is 64 -> 128 -> 256 -> 512 -> 512. VGG16 is only one possible network, and the training data may also differ, for example in which dataset it is drawn from and in its quantity.
The training data consisted of 1000 face pictures (50 people, 20 pictures per person) extracted from ImageNet and 1000 voice fragments (also 50 people, 20 fragments per person) extracted from AudioSet. The 50 face identities and the 50 voice identities were randomly paired, and the face pictures and voice fragments within each pair were randomly recombined, yielding 50 face-voice pairs with 20 samples each. After feature fusion, these samples were fed into a VGG16 network to build a 50-class model, and training produced a CNN-based feature encoder.
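A minimal sketch of this 50-class training setup, assuming the fused face-and-voice images have already been generated and arranged in per-class folders; torchvision's VGG16 is used for illustration, and the dataset path, input size, optimizer settings and number of epochs are assumptions rather than values from the filing:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Fused feature images arranged as fused_features/<class_id>/*.png (hypothetical layout).
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
dataset = datasets.ImageFolder("fused_features/", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = models.vgg16(num_classes=50)          # 16 weight layers, 3x3 kernels, 2x2 max pooling
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(10):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# After training, the convolutional part (model.features) is kept as the CNN-based feature encoder.
```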
Claims (7)
1. A method for fusion verification of face and voiceprint features, characterized by comprising the following steps:
parsing an input sound file into a time-domain signal of the sound;
converting the time-domain signal into a frequency-domain signal through short-time Fourier transform with windowing and framing;
converting the frequencies in the frequency-domain signal, through logarithmic spectrum transformation, into a scale perceived linearly by the human ear;
separating the direct-current signal component and the sinusoidal signal components of the converted frequency-domain signal through cepstral analysis using the DCT (discrete cosine transform);
extracting the sound spectrum feature vector and converting the vector into an image;
and fusing the image with a two-dimensional face image.
2. The method for fusion verification of face and voiceprint features according to claim 1, wherein the time-domain signal is converted into a frequency-domain signal through short-time Fourier transform with windowing and framing, specifically:
selecting a time-frequency localized window function h(t), and calculating the power spectra at different moments through the short-time Fourier transform, wherein the formula of the short-time Fourier transform is as follows:
where f (τ) represents the time-domain signal of the input audio, τ represents the integration variable, and t represents the different time instants.
3. The method of claim 2, wherein the window function is a Hamming window.
4. The method for fusion verification of face and voiceprint features according to claim 1, wherein the frequency is converted, through logarithmic spectrum transformation, into a scale perceived linearly by the human ear, specifically:
converting the frequency scale into a logarithmic frequency scale by the following formula, so that the human ear's perception of frequency becomes linear:
mel(f)=2595*log10(1+f/700)
where mel(f) represents a logarithmic frequency, and f represents a frequency obtained by the short-time Fourier transform.
5. The method for fusion verification of face and voiceprint features according to claim 1, wherein the direct-current signal component and the sinusoidal signal components in the converted frequency-domain signal are separated by a DCT transform through cepstral analysis, specifically:
wherein,
where mfcc (u) represents a cepstrum, mel (i) represents a logarithmic frequency, N represents the number of frequency bins, and u represents the frequency bins of the cepstrum.
6. The method for fusion verification of face and voiceprint features according to claim 1, wherein the extraction of the sound spectrum feature vector and the conversion of the vector into an image are specifically as follows:
the range of the output vector is:
mfcc∈[min,max]
which is linearly transformed to the image range:
pixel∈[0,255]
thus, a cepstrum of a sound is obtained, the horizontal axis of which is time and the vertical axis of which is frequency; where mfcc denotes a cepstrum, min denotes a minimum value of mfcc, max denotes a maximum value of mfcc, and pixel denotes a pixel after conversion into an image.
7. The method for fusion verification of face and voiceprint features according to claim 1, wherein the fusion of the image with a two-dimensional face image is specifically as follows:
rotating the cepstrum clockwise by 90 degrees; if the length of the horizontal axis of the two-dimensional face image is not consistent with that of the cepstrum rotated by 90 degrees, scaling the two-dimensional face image so that the horizontal-axis lengths of the two images are consistent, and splicing the two images.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910641594.7A CN110363148A (en) | 2019-07-16 | 2019-07-16 | A kind of method of face vocal print feature fusion verifying |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910641594.7A CN110363148A (en) | 2019-07-16 | 2019-07-16 | A kind of method of face vocal print feature fusion verifying |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110363148A true CN110363148A (en) | 2019-10-22 |
Family
ID=68219964
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910641594.7A Pending CN110363148A (en) | 2019-07-16 | 2019-07-16 | A kind of method of face vocal print feature fusion verifying |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110363148A (en) |
- 2019-07-16: CN CN201910641594.7A patent/CN110363148A/en, status: Pending
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104835507A (en) * | 2015-03-30 | 2015-08-12 | 渤海大学 | Serial-parallel combined multi-mode emotion information fusion and identification method |
CN105469253A (en) * | 2015-11-19 | 2016-04-06 | 桂林航天工业学院 | Handset NFC safety payment method based on integrated voiceprint and face characteristic encryption |
CN105976809A (en) * | 2016-05-25 | 2016-09-28 | 中国地质大学(武汉) | Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion |
CN106127156A (en) * | 2016-06-27 | 2016-11-16 | 上海元趣信息技术有限公司 | Robot interactive method based on vocal print and recognition of face |
CN107274887A (en) * | 2017-05-09 | 2017-10-20 | 重庆邮电大学 | Speaker's Further Feature Extraction method based on fusion feature MGFCC |
CN106952649A (en) * | 2017-05-14 | 2017-07-14 | 北京工业大学 | Method for distinguishing speek person based on convolutional neural networks and spectrogram |
CN108124488A (en) * | 2017-12-12 | 2018-06-05 | 福建联迪商用设备有限公司 | A kind of payment authentication method and terminal based on face and vocal print |
CN108399395A (en) * | 2018-03-13 | 2018-08-14 | 成都数智凌云科技有限公司 | The compound identity identifying method of voice and face based on end-to-end deep neural network |
CN108847251A (en) * | 2018-07-04 | 2018-11-20 | 武汉斗鱼网络科技有限公司 | A kind of voice De-weight method, device, server and storage medium |
CN108962231A (en) * | 2018-07-04 | 2018-12-07 | 武汉斗鱼网络科技有限公司 | A kind of method of speech classification, device, server and storage medium |
CN109910818A (en) * | 2019-02-15 | 2019-06-21 | 东华大学 | A kind of VATS Vehicle Anti-Theft System based on human body multiple features fusion identification |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111709004A (en) * | 2020-08-19 | 2020-09-25 | 北京远鉴信息技术有限公司 | Identity authentication method and device, electronic equipment and readable storage medium |
CN111709004B (en) * | 2020-08-19 | 2020-11-13 | 北京远鉴信息技术有限公司 | Identity authentication method and device, electronic equipment and readable storage medium |
CN111814128A (en) * | 2020-09-01 | 2020-10-23 | 北京远鉴信息技术有限公司 | Identity authentication method, device, equipment and storage medium based on fusion characteristics |
CN111814128B (en) * | 2020-09-01 | 2020-12-11 | 北京远鉴信息技术有限公司 | Identity authentication method, device, equipment and storage medium based on fusion characteristics |
CN112133311A (en) * | 2020-09-18 | 2020-12-25 | 科大讯飞股份有限公司 | Speaker recognition method, related device and readable storage medium |
CN113114417A (en) * | 2021-03-30 | 2021-07-13 | 深圳市冠标科技发展有限公司 | Audio transmission method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Adeel et al. | Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments | |
CN110363148A (en) | A kind of method of face vocal print feature fusion verifying | |
US20210327431A1 (en) | 'liveness' detection system | |
CN112151030B (en) | Multi-mode-based complex scene voice recognition method and device | |
CN109036382B (en) | Audio feature extraction method based on KL divergence | |
CN110909613A (en) | Video character recognition method and device, storage medium and electronic equipment | |
US20120303369A1 (en) | Energy-Efficient Unobtrusive Identification of a Speaker | |
US9947323B2 (en) | Synthetic oversampling to enhance speaker identification or verification | |
Kumar et al. | Harnessing ai for speech reconstruction using multi-view silent video feed | |
CN112786052A (en) | Speech recognition method, electronic device and storage device | |
US20220335944A1 (en) | Voice conversion apparatus, voice conversion learning apparatus, image generation apparatus, image generation learning apparatus, voice conversion method, voice conversion learning method, image generation method, image generation learning method, and computer program | |
CN109947971A (en) | Image search method, device, electronic equipment and storage medium | |
Mian Qaisar | Isolated speech recognition and its transformation in visual signs | |
Isyanto et al. | Voice biometrics for Indonesian language users using algorithm of deep learning CNN residual and hybrid of DWT-MFCC extraction features | |
Lee et al. | Seeing through the conversation: Audio-visual speech separation based on diffusion model | |
CN113782032B (en) | Voiceprint recognition method and related device | |
CN105741853A (en) | Digital speech perception hash method based on formant frequency | |
CN117935789A (en) | Speech recognition method, system, equipment and storage medium | |
Saleema et al. | Voice biometrics: the promising future of authentication in the internet of things | |
Anderson et al. | Robust tri-modal automatic speech recognition for consumer applications | |
Li et al. | Cross-Domain Audio Deepfake Detection: Dataset and Analysis | |
Nguyen et al. | Vietnamese speaker authentication using deep models | |
CN114512133A (en) | Sound object recognition method, sound object recognition device, server and storage medium | |
Yasmin et al. | Discrimination of male and female voice using occurrence pattern of spectral flux | |
He et al. | Automatic initial and final segmentation in cleft palate speech of Mandarin speakers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20191022 |