
CN110363148A - Method for fusion verification of face and voiceprint features - Google Patents

Method for fusion verification of face and voiceprint features

Info

Publication number
CN110363148A
CN110363148A
Authority
CN
China
Prior art keywords: frequency, image, time domain signal, cepstrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910641594.7A
Other languages
Chinese (zh)
Inventor
胡增
江大白
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Applied Technology Co Ltd
Original Assignee
China Applied Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Applied Technology Co Ltd filed Critical China Applied Technology Co Ltd
Priority to CN201910641594.7A
Publication of CN110363148A
Legal status: Pending


Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F18/00 Pattern recognition
                    • G06F18/20 Analysing
                        • G06F18/25 Fusion techniques
                            • G06F18/253 Fusion techniques of extracted features
                • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
                    • G06F2218/02 Preprocessing
                    • G06F2218/08 Feature extraction
                    • G06F2218/12 Classification; Matching
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T3/00 Geometric image transformations in the plane of the image
                    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
                        • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
                    • G06T3/60 Rotation of whole images or parts thereof
                • G06T5/00 Image enhancement or restoration
                    • G06T5/10 Image enhancement or restoration using non-spatial domain filtering
                • G06T2207/00 Indexing scheme for image analysis or image enhancement
                    • G06T2207/20 Special algorithmic details
                        • G06T2207/20048 Transform domain processing
                            • G06T2207/20052 Discrete cosine transform [DCT]
                            • G06T2207/20056 Discrete and fast Fourier transform [DFT, FFT]
                    • G06T2207/30 Subject of image; Context of image processing
                        • G06T2207/30196 Human being; Person
                            • G06T2207/30201 Face
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
                    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
                        • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
                            • G06V40/168 Feature extraction; Face representation
                            • G06V40/172 Classification, e.g. identification
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L17/00 Speaker identification or verification techniques
                • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L25/03 characterised by the type of extracted parameters
                        • G10L25/24 the extracted parameters being the cepstrum
                    • G10L25/45 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Collating Specific Patterns (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a method for fusion verification of face and voiceprint features, comprising the following steps: parsing an input audio file into a time-domain signal of the sound; converting the time-domain signal into a frequency-domain signal by windowed framing and a short-time Fourier transform; converting the frequency, by a logarithmic spectrum transform, onto a scale on which the human ear perceives frequency linearly; separating, by cepstral analysis using a DCT transform, the direct-current component and the sinusoidal components of the converted frequency-domain signal; extracting the sound spectrum feature vector and converting the vector into an image; and fusing this image with a two-dimensional face image. Because the features are fused at the underlying feature level, the proposed method needs only a single verification and avoids the problem in application-layer joint verification where a false detection in one modality causes the entire verification to fail, thereby improving the user experience.

Description

Method for fusion verification of face and voiceprint features
Technical Field
The invention relates to the field of biometric recognition, and in particular to a method for fusion verification of face and voiceprint features.
Background
In recent years, with the continuing development and maturation of deep learning and computer vision, identity-authentication technologies based on computer vision have advanced rapidly within biometric recognition. Face recognition in particular, being contactless and fast, is widely applied in services that require identity verification. Voiceprint recognition, another form of biometric recognition, identifies a person from the acoustic characteristics of the speaker's voice; it is largely independent of accent and language and can be used for both speaker identification and speaker verification.
Although various biometric technologies are gradually being developed and applied in daily life and production, they are typically used in isolation, and any single verification modality inevitably suffers missed detections and false detections. A current focus of research and application is therefore to combine several biometric technologies, such as face recognition and voiceprint recognition, to achieve higher security and accuracy. Existing approaches, however, combine the two recognition technologies only at the application layer: face recognition plus voiceprint recognition means that the face must pass first, after which the voiceprint is verified. The face features and the voiceprint features are not fused at the underlying feature level, so the effect of a single joint verification cannot be achieved; a false detection in either modality causes the whole verification to fail, which degrades the user experience.
Disclosure of Invention
To address these shortcomings of the prior art, the present invention aims to provide a method for fusion verification of face and voiceprint features.
The technical solution adopted by the invention to achieve this aim is as follows. A method for fusion verification of face and voiceprint features comprises the following steps:
parsing an input audio file into a time-domain signal of the sound;
converting the time-domain signal into a frequency-domain signal by windowed framing and a short-time Fourier transform;
converting the frequency in the frequency-domain signal, by a logarithmic spectrum transform, onto a scale perceived linearly by the human ear;
separating the direct-current component and the sinusoidal components of the converted frequency-domain signal by cepstral analysis using a DCT (discrete cosine transform);
extracting the sound spectrum feature vector and converting the vector into an image; and
fusing the image with a two-dimensional face image.
Converting the time-domain signal into a frequency-domain signal by windowed framing and a short-time Fourier transform is specifically:
selecting a time-frequency localized window function h(t) and computing the power spectrum at each moment by the short-time Fourier transform, whose formula is
STFT(t, f) = ∫ f(τ)·h(τ - t)·e^(-j2πfτ) dτ
where f(τ) represents the time-domain signal of the input audio, τ is the integration variable, and t denotes the different time instants.
The window function is a Hamming window.
Converting the frequency, by a logarithmic spectrum transform, onto a scale perceived linearly by the human ear is specifically:
the frequency scale is converted to a logarithmic frequency scale by the following formula, so that the human ear's perception of frequency becomes linear on the new scale:
mel(f) = 2595*log10(1 + f/700)
where mel(f) represents the logarithmic frequency and f the frequency obtained by the short-time Fourier transform.
Separating the direct-current component and the sinusoidal components of the converted frequency-domain signal by cepstral analysis using a DCT is specifically:
mfcc(u) = Σ_{i=0..N-1} mel(i)·cos((2i + 1)·u·π / (2N)),  u = 0, 1, …, N-1
where mfcc(u) represents the cepstrum, mel(i) the logarithmic (mel) frequency values, N the number of frequency bins, and u the frequency bin index of the cepstrum.
Extracting the sound spectrum feature vector and converting the vector into an image is specifically:
the range of the output vector,
mfcc ∈ [min, max]
is linearly transformed to the range of an image,
pixel ∈ [0, 255]
thereby obtaining a cepstrum image of the sound, whose horizontal axis is time and whose vertical axis is frequency; here mfcc denotes the cepstrum, min and max its minimum and maximum values, and pixel the pixel value after conversion into an image.
Fusing the image with the two-dimensional face image is specifically:
the cepstrum image is rotated 90 degrees clockwise; if the horizontal-axis length of the two-dimensional face image does not match that of the rotated cepstrum image, the face image is scaled so that the two horizontal axes have the same length, and the two images are then spliced together.
The invention has the following advantages and beneficial effects: the proposed method fuses the face features and the voiceprint features at the underlying feature level, so that only a single verification is required; it avoids the problem in application-layer joint verification where a false detection in one modality causes the entire verification to fail, and it therefore improves the user experience.
Drawings
FIG. 1 is a time-domain signal diagram of sound;
FIG. 2 is a frequency-domain signal diagram of sound;
FIG. 3 is a plot of the Hamming window function of the present invention;
FIG. 4 is a cepstrum image of sound according to the present invention;
FIG. 5 is a network architecture diagram of the VGG16 used in the present invention;
FIG. 6 is a flow chart of the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The flow of the method for fusion verification of face and voiceprint features is shown in FIG. 6: an input audio file is parsed into a time-domain signal of the sound; the time-domain signal is converted into a frequency-domain signal by windowed framing and a short-time Fourier transform; the frequency in the frequency-domain signal is converted, by a logarithmic spectrum transform, onto a scale perceived linearly by the human ear; the direct-current component and the sinusoidal components of the converted frequency-domain signal are separated by cepstral analysis using a DCT; the sound spectrum feature vector is extracted and converted into an image; and the image is fused with a two-dimensional face image.
FIG. 1 shows a time-domain signal of sound and FIG. 2 the corresponding frequency-domain signal. A sound signal is a one-dimensional time-domain signal from which the variation of frequency over time is hard to see directly. If it is transformed to the frequency domain by a Fourier transform, the frequency distribution of the signal becomes visible, but the time information is lost and the change of the frequency distribution over time can no longer be observed. To solve this problem, the short-time Fourier transform (STFT) is used to determine the frequency and phase of local sections of the time-varying signal. Concretely, a time-frequency localized window function is chosen; assuming the analysis window h(t) is stationary over a short interval, f(t)h(t) is a stationary signal within each finite time window, so a power spectrum can be computed at each moment. The window function is a Hamming window, a cosine window that reflects well how the energy at a given moment decays over time. The short-time Fourier transform is formulated as
STFT(t, f) = ∫ f(τ)·h(τ - t)·e^(-j2πfτ) dτ
where f(τ) represents the time-domain signal of the input audio, τ is the integration variable, and t denotes the different time instants.
The Hamming window function, plotted in FIG. 3, is
h(n) = 0.54 - 0.46·cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1
where n indexes the discrete points of the window function and N is the total number of discrete points.
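As an illustration of these two steps (not part of the original disclosure), the following Python sketch parses a WAV file into a time-domain signal and computes a Hamming-windowed STFT power spectrum; the file name and the 25 ms / 10 ms framing are assumptions.

# Minimal sketch (not from the patent): parse a WAV file into a time-domain
# signal and compute a Hamming-windowed short-time Fourier transform.
# The file name, 25 ms frame length and 10 ms hop are illustrative assumptions.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

sr, samples = wavfile.read("speech.wav")        # time-domain signal f(t)
samples = samples.astype(np.float64)
if samples.ndim > 1:                            # mix to mono if the file is stereo
    samples = samples.mean(axis=1)

frame_len = int(0.025 * sr)                     # 25 ms analysis window
hop_len = int(0.010 * sr)                       # 10 ms frame shift
freqs, times, Zxx = stft(samples, fs=sr, window="hamming",
                         nperseg=frame_len, noverlap=frame_len - hop_len)
power_spectrum = np.abs(Zxx) ** 2               # power spectrum at each moment
print(power_spectrum.shape)                     # (n_freq_bins, n_frames)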
The unit of frequency is Hertz (Hz), and the audible range of the human ear is roughly 20-20000 Hz, but the ear does not perceive frequency linearly in Hz. The ordinary frequency scale is therefore converted to a logarithmic frequency scale, with the mapping given by the following formula:
mel(f) = 2595*log10(1 + f/700)
where mel(f) represents the logarithmic frequency and f the frequency obtained by the short-time Fourier transform. Through this formula, the human ear's perception of frequency becomes approximately linear: on this scale, if the frequencies of two segments of speech differ by a factor of two, the pitch perceived by the ear will also differ by roughly a factor of two.
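A small Python sketch of this mapping follows; the helper names and test frequencies are illustrative, not from the patent.

# Sketch of the Hz-to-mel mapping mel(f) = 2595*log10(1 + f/700) and its
# inverse; the test frequencies below are arbitrary.
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(np.array([300.0, 600.0, 1200.0])))  # mel values grow roughly logarithmically with Hz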
Based on the logarithmic spectrum, the direct-current component and the sinusoidal components of the converted frequency-domain signal are separated by a DCT (discrete cosine transform); the result is called the cepstrum:
mfcc(u) = Σ_{i=0..N-1} mel(i)·cos((2i + 1)·u·π / (2N)),  u = 0, 1, …, N-1
where mfcc(u) represents the cepstrum, mel(i) the logarithmic (mel) frequency values, N the number of frequency bins, u the frequency bin index of the cepstrum, and u = 0 corresponds to the direct-current component.
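The following minimal sketch applies this separation with a type-II DCT along the mel axis; the placeholder input matrix, its size, and the orthonormal scaling convention are assumptions rather than details given in the patent.

# Minimal sketch: separate the slowly varying (direct-current) term from the
# sinusoidal terms with a type-II DCT along the mel axis. `log_mel` stands in
# for an (n_mels, n_frames) matrix of logarithmic mel spectral values; the
# placeholder random data and the orthonormal scaling are assumptions.
import numpy as np
from scipy.fft import dct

n_mels, n_frames = 40, 200
log_mel = np.random.rand(n_mels, n_frames)         # placeholder log-mel values

mfcc = dct(log_mel, type=2, axis=0, norm="ortho")  # cepstrum mfcc(u), shape (n_mels, n_frames)
dc_component = mfcc[0]                             # u = 0: direct-current component per frame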
Since the cepstrum is output as vectors, which cannot be displayed directly as a picture, it must be converted into an image matrix. The range of the output vector,
mfcc ∈ [min, max]
is linearly transformed to the range of an image,
pixel ∈ [0, 255]
This yields the cepstrum image of the sound shown in FIG. 4, whose horizontal axis is time and whose vertical axis is frequency; the brighter a point, the larger the value (the greater the energy). Here mfcc denotes the cepstrum, min and max its minimum and maximum values, and pixel the pixel value after conversion into an image.
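A short sketch of this min-max scaling, reusing the cepstrum matrix from the previous sketch; the epsilon guard against a zero range is an added assumption.

# Sketch of the linear mapping from mfcc in [min, max] to pixel in [0, 255];
# reuses the `mfcc` matrix from the previous sketch. The small epsilon is an
# added assumption to avoid division by zero when min == max.
import numpy as np

def mfcc_to_image(mfcc):
    lo, hi = mfcc.min(), mfcc.max()
    pixel = (mfcc - lo) / (hi - lo + 1e-12) * 255.0
    return pixel.astype(np.uint8)   # rows: frequency (vertical axis), columns: time (horizontal axis)

spectro_img = mfcc_to_image(mfcc)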
In general the number of extracted frequency bins is fixed, i.e. the length of the vertical axis is fixed. The image of FIG. 4 is rotated 90 degrees clockwise and spliced behind the face image; if the horizontal-axis length of the face picture does not match that of the rotated cepstrum, the face picture is scaled so that the two horizontal axes have the same length. The picture obtained after splicing is the fused feature. Finally, a recognition model is trained with a convolutional neural network. A typical CNN classification model can be abbreviated as two steps:
z=CNN(x)
p=softmax(zW)
where x is the input picture and p is the probability output for each class. During training this is treated as a classification problem on x and its corresponding one-hot label p, but at verification time the whole model is not used; only the part CNN(x) is kept, which converts a picture into a fixed-length vector. With this conversion model (encoder), the fused face-and-voice features of any new scene can be encoded, and verification reduces to a comparison between encoding vectors, so the original classification model is no longer relied on. This completes the algorithm for biometric verification using the fused features.
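Returning to the splicing step described above, the following sketch rotates the cepstrum image, rescales a hypothetical face picture so that the horizontal axes match, and stacks the two; the file names and the choice of vertical stacking are assumptions.

# Sketch of the fusion step: rotate the cepstrum image 90 degrees clockwise,
# scale the face picture so the horizontal axes match, then splice the two.
# `spectro_img` is the grayscale cepstrum image from the earlier sketch,
# "face.jpg" is a hypothetical face picture, and vertical stacking is one
# reading of "spliced behind the face image".
import cv2
import numpy as np

face = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)

rotated = cv2.rotate(spectro_img, cv2.ROTATE_90_CLOCKWISE)   # fixed frequency axis becomes the width
if face.shape[1] != rotated.shape[1]:
    scale = rotated.shape[1] / face.shape[1]                  # keep the face's aspect ratio
    face = cv2.resize(face, (rotated.shape[1], int(face.shape[0] * scale)))

fused = np.vstack([face, rotated])     # the spliced picture is the fused feature
cv2.imwrite("fused_feature.png", fused)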
The following is one embodiment of the present invention:
A convolutional neural network is used to recognize the fused features; specifically, a VGG16 network whose structure is shown in FIG. 5. The network has 16 weight layers (excluding pooling and softmax layers); all convolution kernels are 3 × 3, pooling is 2 × 2 max pooling with a stride of 2, and the channel depths of the convolutional stages are 64 -> 128 -> 256 -> 512 -> 512. VGG16 is only one possible neural network, and the training data may also differ, both in the library from which they are extracted and in their number.
The training data consist of 1000 face pictures (50 people, 20 pictures per person) extracted from ImageNet and 1000 voice fragments (also 50 people, 20 fragments per person) extracted from AudioSet. The 50 face identities are randomly paired with the 50 voice identities, and the paired face pictures and voice fragments are randomly recombined, finally giving 50 face-voice pairs with 20 samples each. After feature fusion these are input to the VGG16 network to build a 50-class model, and training yields a CNN-based feature encoder.
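The embodiment can be sketched in Keras as follows; this is an illustrative approximation rather than the patent's actual training code, and the input size, the three-channel replication of the grayscale fused image, and the optimizer settings are assumptions.

# Illustrative sketch (not the patent's actual training code): a 50-class
# VGG16 classifier p = softmax(zW) on top of z = CNN(x), after which only the
# convolutional trunk is kept as the feature encoder. The 224x224x3 input
# (grayscale fused images replicated to three channels) and the optimizer are
# assumptions.
import tensorflow as tf

num_classes = 50
inputs = tf.keras.Input(shape=(224, 224, 3))
backbone = tf.keras.applications.VGG16(include_top=False, weights=None,
                                       input_tensor=inputs, pooling="avg")
z = backbone.output                                              # z = CNN(x), fixed-length vector
p = tf.keras.layers.Dense(num_classes, activation="softmax")(z)  # p = softmax(zW)

classifier = tf.keras.Model(inputs, p)
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
# classifier.fit(fused_images, labels, ...)   # fused_images, labels: the 50 face-voice classes

encoder = tf.keras.Model(inputs, z)   # at verification time only CNN(x) is used;
                                      # identities are compared between encoding vectors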

Claims (7)

1. A method for fusion verification of face and voiceprint features, characterized by comprising the following steps:
parsing an input audio file into a time-domain signal of the sound;
converting the time-domain signal into a frequency-domain signal by windowed framing and a short-time Fourier transform;
converting the frequency in the frequency-domain signal, by a logarithmic spectrum transform, onto a scale perceived linearly by the human ear;
separating the direct-current component and the sinusoidal components of the converted frequency-domain signal by cepstral analysis using a DCT (discrete cosine transform);
extracting the sound spectrum feature vector and converting the vector into an image; and
fusing the image with a two-dimensional face image.
2. The method for fusion verification of face and voiceprint features according to claim 1, wherein the time-domain signal is converted into a frequency-domain signal by windowed framing and a short-time Fourier transform, specifically:
selecting a time-frequency localized window function h(t) and computing the power spectrum at each moment by the short-time Fourier transform, whose formula is
STFT(t, f) = ∫ f(τ)·h(τ - t)·e^(-j2πfτ) dτ
where f(τ) represents the time-domain signal of the input audio, τ is the integration variable, and t denotes the different time instants.
3. The method of claim 2, wherein the window function is a Hamming window.
4. The method for fusion verification of face and voiceprint features according to claim 1, wherein the frequency is converted, by a logarithmic spectrum transform, onto a scale perceived linearly by the human ear, specifically:
the frequency scale is converted to a logarithmic frequency scale by the following formula, so that the human ear's perception of frequency becomes linear on the new scale:
mel(f) = 2595*log10(1 + f/700)
where mel(f) represents the logarithmic frequency and f the frequency obtained by the short-time Fourier transform.
5. The method for fusion verification of face and voiceprint features according to claim 1, wherein the direct-current component and the sinusoidal components of the converted frequency-domain signal are separated by cepstral analysis using a DCT transform, specifically:
mfcc(u) = Σ_{i=0..N-1} mel(i)·cos((2i + 1)·u·π / (2N)),  u = 0, 1, …, N-1
where mfcc(u) represents the cepstrum, mel(i) the logarithmic (mel) frequency values, N the number of frequency bins, and u the frequency bin index of the cepstrum.
6. The method for fusion verification of face and voiceprint features according to claim 1, wherein the extracting of the sound spectrum feature vector and the converting of the vector into an image are specifically:
the range of the output vector,
mfcc ∈ [min, max]
is linearly transformed to the range of an image,
pixel ∈ [0, 255]
thereby obtaining a cepstrum image of the sound, whose horizontal axis is time and whose vertical axis is frequency; where mfcc denotes the cepstrum, min and max its minimum and maximum values, and pixel the pixel value after conversion into an image.
7. The method for fusion verification of face and voiceprint features according to claim 1, wherein the fusing of the image with a two-dimensional face image is specifically:
rotating the cepstrum image 90 degrees clockwise; if the horizontal-axis length of the two-dimensional face image does not match that of the rotated cepstrum image, scaling the two-dimensional face image so that the two horizontal axes have the same length, and splicing the two images together.
CN201910641594.7A 2019-07-16 2019-07-16 Method for fusion verification of face and voiceprint features Pending CN110363148A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910641594.7A CN110363148A (en) 2019-07-16 2019-07-16 Method for fusion verification of face and voiceprint features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910641594.7A CN110363148A (en) 2019-07-16 2019-07-16 Method for fusion verification of face and voiceprint features

Publications (1)

Publication Number Publication Date
CN110363148A true CN110363148A (en) 2019-10-22

Family

ID=68219964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910641594.7A Pending CN110363148A (en) 2019-07-16 2019-07-16 Method for fusion verification of face and voiceprint features

Country Status (1)

Country Link
CN (1) CN110363148A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104835507A (en) * 2015-03-30 2015-08-12 渤海大学 Serial-parallel combined multi-mode emotion information fusion and identification method
CN105469253A (en) * 2015-11-19 2016-04-06 桂林航天工业学院 Handset NFC safety payment method based on integrated voiceprint and face characteristic encryption
CN105976809A (en) * 2016-05-25 2016-09-28 中国地质大学(武汉) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
CN106127156A (en) * 2016-06-27 2016-11-16 上海元趣信息技术有限公司 Robot interactive method based on vocal print and recognition of face
CN107274887A (en) * 2017-05-09 2017-10-20 重庆邮电大学 Speaker's Further Feature Extraction method based on fusion feature MGFCC
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Method for distinguishing speek person based on convolutional neural networks and spectrogram
CN108124488A (en) * 2017-12-12 2018-06-05 福建联迪商用设备有限公司 A kind of payment authentication method and terminal based on face and vocal print
CN108399395A (en) * 2018-03-13 2018-08-14 成都数智凌云科技有限公司 The compound identity identifying method of voice and face based on end-to-end deep neural network
CN108847251A (en) * 2018-07-04 2018-11-20 武汉斗鱼网络科技有限公司 A kind of voice De-weight method, device, server and storage medium
CN108962231A (en) * 2018-07-04 2018-12-07 武汉斗鱼网络科技有限公司 A kind of method of speech classification, device, server and storage medium
CN109910818A (en) * 2019-02-15 2019-06-21 东华大学 A kind of VATS Vehicle Anti-Theft System based on human body multiple features fusion identification

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709004A (en) * 2020-08-19 2020-09-25 北京远鉴信息技术有限公司 Identity authentication method and device, electronic equipment and readable storage medium
CN111709004B (en) * 2020-08-19 2020-11-13 北京远鉴信息技术有限公司 Identity authentication method and device, electronic equipment and readable storage medium
CN111814128A (en) * 2020-09-01 2020-10-23 北京远鉴信息技术有限公司 Identity authentication method, device, equipment and storage medium based on fusion characteristics
CN111814128B (en) * 2020-09-01 2020-12-11 北京远鉴信息技术有限公司 Identity authentication method, device, equipment and storage medium based on fusion characteristics
CN112133311A (en) * 2020-09-18 2020-12-25 科大讯飞股份有限公司 Speaker recognition method, related device and readable storage medium
CN113114417A (en) * 2021-03-30 2021-07-13 深圳市冠标科技发展有限公司 Audio transmission method and device, electronic equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191022