CN114446326A - Swallowing disorder patient identification method and device based on time-frequency resolution - Google Patents
Swallowing disorder patient identification method and device based on time-frequency resolution
- Publication number
- CN114446326A CN114446326A CN202210097719.6A CN202210097719A CN114446326A CN 114446326 A CN114446326 A CN 114446326A CN 202210097719 A CN202210097719 A CN 202210097719A CN 114446326 A CN114446326 A CN 114446326A
- Authority
- CN
- China
- Prior art keywords
- data
- voice data
- frequency
- training data
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/26—Pre-filtering or post-filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The application relates to a swallowing disorder patient identification method and device based on time-frequency resolution, wherein the method comprises the following steps: acquiring training data that includes normal human voice data and dysphagia patient voice data; preprocessing the training data in the time domain; extracting multiple groups of features from the preprocessed training data in the frequency domain; training a recognition model on the feature set formed by these groups of features; and recognizing the voice data to be identified with the trained model. The features fed to the classifier for training include at least a frequency-domain energy distribution difference feature and a speech prosody difference feature. These feature parameters reflect the energy distribution and prosodic characteristics of the voice signal from different angles and better represent the differences in speech between dysphagia patients and normal subjects.
Description
Technical Field
The application relates to the technical field of dysphagia classification, in particular to a dysphagia patient identification method and equipment based on time-frequency resolution.
Background
When a dysphagia patient speaks, the impaired swallowing function causes the patient's speech signal to differ in distribution from that of a normal speaker, for example through a shift of the frequency band in which the signal energy is concentrated, an increase in the noise component, and a change in speaking rhythm. Significant differences in the fundamental frequency distribution and the harmonic-to-noise ratio distribution of the speech signals of dysphagia patients have been demonstrated in the prior art. A spectrogram is the combination of the per-frame spectra of a speech signal and contains many characteristics of the signal, such as the fundamental frequency, the energy distribution over the frequency bands, and the formants; it can indicate whether a segment is silent or voiced and reflect changes in articulation position. In the prior art, classification tests rely on the classical speech features commonly used in dysphagia classification, such as MFCC parameters and HNR: a series of classical speech features are used as classifier inputs for classification experiments, and the key features that characterize the speech changes of dysphagia patients are not explored.
Disclosure of Invention
In order to overcome the problem in the related art that only existing classical speech features are used for classification tests while the key features that characterize the speech changes of dysphagia patients remain unexplored, the present application provides a swallowing disorder patient identification method based on time-frequency resolution.
The scheme of the application is as follows:
according to a first aspect of the embodiments of the present application, there is provided a swallowing disorder patient identification method based on time-frequency resolution, including:
acquiring training data, wherein the training data comprises normal human voice data and dysphagia patient voice data;
preprocessing the training data based on a time domain;
extracting a plurality of groups of features from the preprocessed training data based on the frequency domain; the features at least include: a distribution difference characteristic and a voice prosody difference characteristic on frequency domain energy;
training an identification model according to a feature set formed by a plurality of groups of features;
and recognizing the voice data to be recognized based on the recognition model.
Preferably, in an implementable manner of the present application, the preprocessing the training data based on the time domain includes:
performing high bit clipping on the training data based on a time domain.
Preferably, in an implementable manner of the present application, the performing high bit clipping on the training data based on the time domain includes:
taking an absolute value for each data point of the training data;
calculating the average value of all data points of the training data after the absolute value is taken;
obtaining a high clipping self-adaptive threshold value based on a preset high clipping coefficient and the average value of all data points;
traversing each data point in the training data, the data point being retained when an absolute value of the data point is not above the high clipping adaptive threshold; replacing the data value of the data point with 0 when the absolute value of the data point is above the high clip adaptive threshold;
and outputting the training data after the high-order clipping.
Preferably, in an implementation manner of the present application, the extracting multiple sets of features from the preprocessed training data based on the frequency domain includes:
carrying out amplitude normalization on the preprocessed training data;
performing 2048-point Fourier transform on each data point of the training data after the amplitude normalization, and taking the first 1024 points as energy coefficients;
taking the first 200 points of the Fourier transform coefficients of each data point for a significance difference test, and obtaining, based on a preset confidence level, a frequency band in which the energy distributions of the normal human voice data and the dysphagia patient voice data differ significantly, wherein the frequency band is used as the first classification feature.
Preferably, in an implementation manner of the present application, the extracting multiple sets of features from the preprocessed training data based on the frequency domain further includes:
calculating normalized spectral coefficient envelope areas of the normal human voice data and the dysphagia patient voice data, respectively, based on the energy coefficients; the ordinate of the normalized spectral coefficient envelope area is an energy coefficient, and the abscissa is a frequency component corresponding to each energy coefficient;
and taking the corresponding relation between each group of energy coefficients and the frequency components as a second classification characteristic.
Preferably, in an implementation manner of the present application, the extracting multiple sets of features from the preprocessed training data based on the frequency domain further includes:
and calculating the distribution difference of the normal human voice data and the dysphagia patient voice data in different frequency bands in a frequency spectrum based on a preset algorithm to serve as a third classification characteristic.
Preferably, in an implementation manner of the present application, the extracting multiple sets of features from the preprocessed training data based on the frequency domain further includes:
performing short framing on the normal human voice data and the dysphagia patient voice data;
grouping corresponding frame signals of the normal human voice data and the dysphagia patient voice data;
and performing significance difference test on the extracted features of each group of frame signals, determining a frame signal sequence with significance difference between the normal human voice data and the dysphagia patient voice data based on a preset confidence coefficient, and taking the voice features corresponding to the frame sequence as fourth classification features.
Preferably, in an implementable manner of the present application, the preset algorithm includes:
determining an index A_total for evaluating the amplitude variation of each frequency component;
wherein fs represents the sampling frequency; S represents the spectral coefficients obtained by Fourier transform of the current voice data; f represents the corresponding frequency index; and d determines the center of symmetry of the frequency region and is an integer;
introducing a weight factor W, where W is the base-2 logarithm of the corresponding frequency coordinate scale;
calculating the third classification feature ILOG-SSDL.
preferably, in an implementable manner of the present application, the training data comprises a set of normal human voice data and two sets of dysphagia patient voice data; wherein a first set of the dysphagia patient voice data is used to extract a plurality of sets of features and a second set of the dysphagia patient voice data is used to validate the recognition model.
According to a second aspect of embodiments of the present application, there is provided a swallowing disorder patient identification device based on time-frequency resolution, comprising:
a processor and a memory;
the processor and the memory are connected through a communication bus:
the processor is used for calling and executing the program stored in the memory;
the memory for storing a program for at least performing a time-frequency resolution based dysphagia patient identification method as claimed in any of the above.
The technical scheme provided by the application can have the following beneficial effects. The swallowing disorder patient identification method based on time-frequency resolution comprises: acquiring training data that includes normal human voice data and dysphagia patient voice data; preprocessing the training data in the time domain; extracting multiple groups of features from the preprocessed training data in the frequency domain; training a recognition model on the feature set formed by these groups of features; and recognizing the voice data to be identified with the trained model. The features fed to the classifier for training include at least a frequency-domain energy distribution difference feature and a speech prosody difference feature. These feature parameters reflect the energy distribution and prosodic characteristics of the voice signal from different angles and better represent the differences in speech between dysphagia patients and normal subjects.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flowchart of a swallowing disorder patient identification method based on time-frequency resolution according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating calculation of a third classification characteristic parameter in a swallowing disorder patient identification method based on time-frequency resolution according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a swallowing disorder patient identification device based on time-frequency resolution according to an embodiment of the present application.
Reference numerals: a processor-21; a memory-22.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
A swallowing disorder patient identification method based on time-frequency resolution, referring to fig. 1, includes:
s11: acquiring training data, wherein the training data comprises normal human voice data and dysphagia patient voice data;
preferably, in this embodiment, the training data includes a set of normal human voice data and two sets of dysphagia patient voice data; wherein a first set of dysphagia patient voice data is used to extract a plurality of sets of features and a second set of dysphagia patient voice data is used to validate the recognition model.
Preferably, in the present embodiment, the group of normal human voice data contains 40 cases, and the two groups of dysphagia patient voice data contain 92 cases in total, 46 cases per group.
S12: preprocessing training data based on a time domain;
the voice data generally has mute sections with different lengths, and before and after mute sections generated when the voice data is recorded and a longer mute section between two sentences in the voice data need to be removed before feature extraction. The speech data is filtered after removing the silence segments. The signal filtering adopts a Butterworth high-pass filter to filter out components with frequency components lower than 500Hz, and the order of the filter is 10.
Specifically, the preprocessing of the training data based on the time domain includes:
performing high bit clipping on training data based on a time domain, comprising:
taking an absolute value for each data point of the training data;
calculating the average value of all data points of the training data after the absolute value is taken;
obtaining a high clipping adaptive threshold value based on a preset high clipping coefficient and the mean value of all data points;
traversing each data point in the training data, and retaining the data point when its absolute value is not higher than the high clipping adaptive threshold; when the absolute value of the data point is higher than the high clipping adaptive threshold, replacing the data value of the data point with 0;
and outputting the training data after the high-order clipping.
In this embodiment, the idea of performing high-order clipping on the training data in the time domain is introduced: points with very large amplitudes in the training data are removed adaptively, so that feature extraction concentrates on the more subtle differences in the distribution of the training data. For each speech signal in the training data, the adaptive high-order clipping is computed as follows:
1) taking an absolute value of each data point of the training data;
2) calculating the mean value m of all data points of the training data after the absolute value is taken;
3) based on a preset high clipping coefficient r (e.g., preset to 0.6) and the mean m of all data points, a high clipping adaptive threshold T1 = r × m is obtained.
4) Traversing each data point in the training data: the data point is retained when its absolute value is not higher than the high clipping adaptive threshold; the data value of the data point is replaced with 0 when its absolute value is above the threshold.
The training data that is finally output is training data that is subjected to high-order clipping.
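A minimal sketch of this adaptive high-order clipping (the clipping coefficient r = 0.6 is the example value given above; the function name is illustrative):

```python
import numpy as np

def high_clip(signal: np.ndarray, r: float = 0.6) -> np.ndarray:
    """Adaptively zero out samples whose magnitude exceeds r times the mean absolute amplitude."""
    abs_sig = np.abs(signal)      # 1) absolute value of each data point
    m = abs_sig.mean()            # 2) mean of all absolute values
    t1 = r * m                    # 3) adaptive threshold T1 = r * m
    clipped = signal.copy()
    clipped[abs_sig > t1] = 0.0   # 4) points above T1 are replaced with 0, the rest are retained
    return clipped
```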
S13: extracting a plurality of groups of features from the preprocessed training data based on the frequency domain; the features at least include: a distribution difference characteristic in frequency domain energy and a voice prosody difference characteristic;
the features in this embodiment mainly include four types of speech features, which are fast fourier transform coefficients (FFT-8000), Normalized Spectral areas (NS-area), Improved Log Symmetric Spectral Difference coefficients (ILOG-SSDL), and Dynamic prosodic Difference feature Sets (DRDs), respectively, that characterize the energy distribution characteristics of the key band, and these feature parameters reflect the energy distribution characteristics and prosodic characteristics of the speech signal from different angles.
Specifically, extracting multiple groups of features from the preprocessed training data based on the frequency domain includes:
1) carrying out amplitude normalization on the preprocessed training data;
performing 2048-point Fourier transform on each data point of the training data after the amplitude normalization, and taking the first 1024 points as energy coefficients;
taking the first 200 points of the Fourier transform coefficients of each data point for a significance difference test, and obtaining, based on a preset confidence level, a frequency band in which the energy distributions of the normal human voice data and the dysphagia patient voice data differ significantly, wherein the frequency band is used as the first classification feature.
FFT-8000 is dedicated to exploring differences in the energy distribution over the main frequency bands between dysphagia patient speech and normal human speech signals. For each sentence in the processed training data, amplitude normalization is carried out first to remove the influence of volume; a 2048-point Fourier transform is then computed for each data point, and because of the symmetry of the spectrum the first 1024 points are taken as energy coefficients, each point representing a bandwidth of about 43 Hz. Since the frequency of the human voice is low, the first 8000 Hz of the speech signal already contains most of the useful information, so in this embodiment the first 200 points (200 × 43 = 8600 Hz) of the Fourier transform coefficients of each speech signal are tested for significant differences, searching for frequency components that differ significantly between groups; in total, 200 frequency components are tested between groups.
The significance test used is the T test with a confidence level of 99.5%. The test shows that the frequency components with inter-group differences are concentrated between the 100th and the 160th component, i.e. the dysphagia patient speech data and the normal human speech data show a significant difference in energy distribution in the 4000 Hz to 6400 Hz band, and this combination of significantly different frequency features is used as the first classification feature.
A significance test first makes an assumption about the parameters of the population (random variable) or about the form of its distribution, and then uses sample information to judge whether this hypothesis is reasonable, i.e. whether the true situation of the population differs significantly from the original hypothesis. In other words, the significance test determines whether the difference between the sample and the hypothesis made about the population is purely random variation or is caused by a mismatch between the hypothesis and the true situation of the population. Significance testing tests a hypothesis about the population, and the hypothesis is accepted or rejected according to the principle that small-probability events are practically impossible.
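A sketch of this bin-wise selection, under the assumption that the per-utterance spectra have already been stacked into two matrices (rows are utterances, columns are the first 200 energy coefficients); names and shapes are illustrative:

```python
import numpy as np
from scipy.stats import ttest_ind

def select_significant_bins(spec_normal: np.ndarray, spec_patient: np.ndarray,
                            alpha: float = 0.005) -> np.ndarray:
    """Indices of FFT bins (about 43 Hz each) whose energy differs significantly between groups.

    alpha = 0.005 corresponds to the 99.5% confidence level used above.
    """
    _, p_values = ttest_ind(spec_normal, spec_patient, axis=0)  # one T test per frequency bin
    return np.where(p_values < alpha)[0]
```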
Extracting a plurality of groups of features from the preprocessed training data based on the frequency domain, and further comprising:
2) respectively calculating normalized spectral coefficient envelope areas of the voice data of the normal person and the voice data of the dysphagia patient on the basis of the energy coefficients; the ordinate of the normalized spectral coefficient envelope area is an energy coefficient, and the abscissa is a frequency component corresponding to each energy coefficient;
and taking the corresponding relation between each group of energy coefficients and the frequency components as a second classification characteristic.
The NS-area reflects the overall energy distribution of the speech signal. In this embodiment, the normalized spectral coefficient envelope area is calculated from the 1024 energy coefficients obtained in the previous step: the area under the spectral coefficient curve is computed by trapezoidal numerical integration, with the frequency component corresponding to each energy coefficient as the abscissa and the energy coefficient itself as the ordinate. After integration, one second classification feature NS-area is obtained for each set of energy coefficients.
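A sketch of the NS-area computation with NumPy's trapezoidal rule, assuming the 1024 energy coefficients of one utterance and a bin width of about 43 Hz (names are illustrative):

```python
import numpy as np

def ns_area(energy_coeffs: np.ndarray, bin_hz: float = 43.0) -> float:
    """Area under the normalized spectral-coefficient envelope via trapezoidal integration."""
    freqs = np.arange(len(energy_coeffs)) * bin_hz  # abscissa: frequency of each energy coefficient
    return float(np.trapz(energy_coeffs, freqs))    # ordinate: the energy coefficients themselves
```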
Extracting a plurality of groups of features from the preprocessed training data based on the frequency domain, and further comprising:
3) calculating, based on a preset algorithm, the distribution difference of the normal human voice data and the dysphagia patient voice data in different frequency bands of the spectrum, which serves as the third classification feature.
The preset algorithm comprises the following steps:
determining an index A_total for evaluating the amplitude variation of each frequency component based on equation (1);
wherein fs represents the sampling frequency; S represents the spectral coefficients obtained by Fourier transform of the current voice data; f represents the corresponding frequency index; and d determines the center of symmetry of the frequency region and is an integer;
introducing a weight factor W, where W is the base-2 logarithm of the corresponding frequency coordinate scale;
calculating the third classification feature ILOG-SSDL based on equation (2).
the spectrally-related characteristics of dysphagia patient speech data may differ from those of normal human speech data. These differences include the distribution of frequency components and their corresponding amplitudes in the speech spectrum. An algorithm is proposed in this embodiment to emphasize these differences in the speech and normal speech spectra of dysphagia patients, taking into account the variation in frequency content. In general, the difference of the speech spectrum is reflected on the distribution of the respective frequency components, which can be determined by their positions and corresponding amplitudes. In consideration of the amplitude of energy, there is proposed a method of determining an index A for evaluating the amplitude variation of each frequency componenttotal。
d is set to 2 in the experiment, i.e. the range involved in the calculation is the whole spectrum, since (fs/2) × 2 = fs; if d is set to 4, then (fs/4) × 2 = fs/2 and the frequency range involved in the calculation is the first 1/2 of the frequency components, and so on. Fig. 2 shows an example where d equals 2: the axis of symmetry is fs/2, and the signal values at equal distances from the axis of symmetry form a pair of symmetric sequences.
The symmetric spectral difference of the amplitude of each frequency component is obtained through formula (1). Since spectral changes relate not only to the amplitudes but also to the positions of the amplitude distribution, a weighting factor is introduced into the calculation of A_total. In this embodiment the weighting factor of SSDL is improved: the introduced weighting factor is the base-2 logarithm of the corresponding frequency coordinate scale, and since the symmetry axis is fs/2, the coordinate weighting matrix is adjusted to emphasize the distribution difference between low-frequency and high-frequency components. Differences occurring in higher frequency regions receive higher weights, as shown in formula (2).
The final third classification feature ILOG-SSDL is obtained by calculation through formula (2). This feature combines the differences of the speech frequency components in both the amplitude distribution and the corresponding position distribution, and through the improved weighting factor it emphasizes the distribution differences between normal human voice data and dysphagia patient voice data in different frequency bands of the spectrum.
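Equations (1) and (2) appear only in the original figures and are not reproduced here, so the following is merely an assumed, illustrative reading of the described idea: a symmetric spectral difference about the fs/d axis, weighted by a base-2 logarithmic frequency factor. It is a sketch under those assumptions, not the patent's exact formula:

```python
import numpy as np

def ilog_ssdl(spectrum: np.ndarray, d: int = 2) -> float:
    """Assumed form: log2-weighted symmetric spectral difference about the fs/d symmetry axis."""
    center = len(spectrum) // d                       # index of the assumed symmetry center (fs/d)
    k = np.arange(1, center)                          # distances from the symmetry axis
    diffs = np.abs(spectrum[center - k] - spectrum[center + k])  # pairwise symmetric differences
    weights = np.log2(k + 1)                          # assumed base-2 log of the frequency coordinate scale
    return float(np.sum(weights * diffs))
```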
Extracting a plurality of groups of features from the preprocessed training data based on the frequency domain, and further comprising:
4) carrying out short framing on the voice data of a normal person and the voice data of a dysphagia patient;
grouping the corresponding frame signals of the voice data of the normal human and the voice data of the dysphagia patient;
and performing significance difference test on the extracted features of each group of frame signals, determining a frame signal sequence with significance difference between normal human voice data and the voice data of the dysphagia patient based on preset confidence, and taking the voice features corresponding to the frame sequence as fourth classification features.
The three features above mainly explore the differences between dysphagia patient speech data and normal human speech data in detailed energy bands and in the overall energy distribution. DRDs differ from the previous features in that they improve the time resolution through short framing (each speech frame is about 5-15 ms long), compute a spectrogram based on these short frames, and clearly identify the locations of voiced and unvoiced segments on the spectrogram, so that the prosody change characteristics of speech can be reflected through the spectrogram.
With this short framing, each frame of voice data is 1/1000 of the length of the signal itself, the frame shift is 1/4000 of the signal length, and the mean of the spectrum of each frame in the first 1300 frames is calculated.
In this embodiment, in order to locate where the speech data produces prosodic differences, the corresponding frame signals of dysphagia patient speech and normal human speech are grouped, inter-group T tests are carried out on the spectral means of the 1300 frame groups with the confidence level set to 95%, and the frame sequences with significant inter-group differences are located; this reveals the characteristic differences between dysphagia patients and normal speakers when uttering a sentence with the same content. In this embodiment, voice data were collected from dysphagia patients and normal speakers reading the tongue-twister sentence about eating grapes without spitting out the grape skins and spitting out grape skins without eating grapes. The results show that when dysphagia patients read this sentence, several feature groups differ significantly from normal speakers at the initial pronunciation position and at the end of the sentence, and the feature groups corresponding to these significantly different frame signal sequences are used as the fourth classification features.
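An illustrative sketch of the framing and frame-wise comparison, assuming each recording is long enough to yield at least 1300 frames and that the per-frame spectral means of each group are stacked into matrices of shape (n_utterances, 1300); all names are placeholders:

```python
import numpy as np
from scipy.stats import ttest_ind

def frame_spectral_means(signal: np.ndarray, n_frames: int = 1300) -> np.ndarray:
    """Per-frame spectral means: frame length = len/1000, frame shift = len/4000."""
    frame_len = max(len(signal) // 1000, 1)
    hop = max(len(signal) // 4000, 1)
    means = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        means.append(np.abs(np.fft.rfft(frame)).mean())  # mean of the frame's magnitude spectrum
    return np.array(means)

def significant_frames(means_normal: np.ndarray, means_patient: np.ndarray,
                       alpha: float = 0.05) -> np.ndarray:
    """Frame positions whose spectral mean differs significantly between groups (95% confidence)."""
    _, p_values = ttest_ind(means_normal, means_patient, axis=0)  # one T test per frame position
    return np.where(p_values < alpha)[0]
```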
S14: training an identification model according to a feature set consisting of a plurality of groups of features;
The feature set is formed from the four types of features extracted above. The model may be, but is not limited to, an SVM (support vector machine) classifier; the optimal classification surface in the SVM algorithm is derived under the linearly separable condition. The optimal classification surface must not only separate the two classes of sample points with as few errors as possible but also maximize the margin between the two classes; the SVM therefore has outstanding advantages in binary classification and is a typical classifier for two-class recognition. A Gaussian kernel is used for the SVM classifier in this embodiment. Verified on the speech data of the second group of dysphagia patients, the classification accuracy was 81.4%, the sensitivity 85%, and the specificity 80%. The verification results show that the method provided in this embodiment can correctly classify dysphagia patients and normal subjects from voice data, and achieves better classification performance than the prior art in terms of classification accuracy and feature pertinence.
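A minimal training sketch with scikit-learn, assuming a feature matrix X assembled from the four feature groups above and binary labels y (patient vs. normal); the names and the scaling step are illustrative choices, not specified by the patent:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_recognition_model(X_train: np.ndarray, y_train: np.ndarray):
    """Train an SVM with a Gaussian (RBF) kernel on the combined feature set."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # Gaussian kernel SVM
    model.fit(X_train, y_train)
    return model

# Hypothetical usage: evaluate on the held-out second group of dysphagia patient data.
# model = train_recognition_model(X_train, y_train)
# accuracy = model.score(X_validation, y_validation)
```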
S15: and recognizing the voice data to be recognized based on the recognition model.
The time-frequency-resolution-based dysphagia patient identification method in this embodiment comprises: acquiring training data that includes normal human voice data and dysphagia patient voice data; preprocessing the training data in the time domain; extracting multiple groups of features from the preprocessed training data in the frequency domain; training a recognition model on the feature set formed by these groups of features; and recognizing the voice data to be identified with the trained model. The features fed to the classifier in this embodiment include at least a frequency-domain energy distribution difference feature and a speech prosody difference feature. These feature parameters reflect the energy distribution and prosodic characteristics of the voice signal from different angles and better represent the differences in speech between dysphagia patients and normal subjects.
A swallowing disorder patient identification device based on time-frequency resolution, referring to fig. 3, comprising:
a processor 21 and a memory 22;
the processor 21 is connected to the memory 22 by a communication bus:
the processor 21 is configured to call and execute a program stored in the memory 22;
a memory 22 for storing a program for performing at least the time-frequency resolution based dysphagia patient identification method in the above embodiment.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description of the present specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are exemplary and should not be construed as limiting the present application and that changes, modifications, substitutions and alterations in the above embodiments may be made by those of ordinary skill in the art within the scope of the present application.
Claims (10)
1. A swallowing disorder patient identification method based on time-frequency resolution is characterized by comprising the following steps:
acquiring training data, wherein the training data comprises normal human voice data and dysphagia patient voice data;
preprocessing the training data based on a time domain;
extracting a plurality of groups of features from the preprocessed training data based on the frequency domain; the features at least include: distribution difference characteristics and voice rhythm difference characteristics on frequency domain energy;
training an identification model according to a feature set formed by a plurality of groups of features;
and recognizing the voice data to be recognized based on the recognition model.
2. The method of claim 1, wherein the pre-processing the training data based on the time domain comprises:
performing high bit clipping on the training data based on a time domain.
3. The method of claim 2, wherein the high bit clipping the training data based on the time domain comprises:
taking an absolute value for each data point of the training data;
calculating the average value of all data points of the training data after the absolute value is taken;
obtaining a high clipping adaptive threshold value based on a preset high clipping coefficient and the mean value of all the data points;
traversing each data point in the training data, the data point being retained when an absolute value of the data point is not above the high clipping adaptive threshold; replacing the data value of the data point with 0 when the absolute value of the data point is above the high clip adaptive threshold;
and outputting the training data after the high-order clipping.
4. The method of claim 1, wherein extracting sets of features from the pre-processed training data based on the frequency domain comprises:
carrying out amplitude normalization on the preprocessed training data;
performing 2048-point Fourier transform on each data point of the training data after the amplitude normalization, and taking the first 1024 points as energy coefficients;
taking the first 200 points of the Fourier transform coefficients of each data point for a significance difference test, and obtaining, based on a preset confidence level, a frequency band in which the energy distributions of the normal human voice data and the dysphagia patient voice data differ significantly, wherein the frequency band is used as the first classification feature.
5. The method of claim 4, wherein extracting sets of features from the pre-processed training data based on the frequency domain further comprises:
calculating normalized spectral coefficient envelope areas of the normal human voice data and the dysphagia patient voice data, respectively, based on the energy coefficients; the ordinate of the normalized spectrum coefficient envelope area is an energy coefficient, and the abscissa is a frequency component corresponding to each energy coefficient;
and taking the corresponding relation between each group of energy coefficients and the frequency components as a second classification characteristic.
6. The method of claim 1, wherein extracting sets of features from the pre-processed training data based on the frequency domain further comprises:
and calculating the distribution difference of the normal human voice data and the dysphagia patient voice data in different frequency bands in a frequency spectrum based on a preset algorithm to serve as a third classification characteristic.
7. The method of claim 1, wherein extracting sets of features from the pre-processed training data based on the frequency domain further comprises:
performing short framing on the normal human voice data and the dysphagia patient voice data;
grouping corresponding frame signals of the normal human voice data and the dysphagia patient voice data;
and performing significance difference test on the extracted features of each group of frame signals, determining a frame signal sequence with significance difference between the normal human voice data and the dysphagia patient voice data based on a preset confidence coefficient, and taking a feature group corresponding to a frame sequence as a fourth classification feature.
8. The method of claim 6, wherein the predetermined algorithm comprises:
determining an index A_total for evaluating the amplitude variation of each frequency component;
wherein fs represents the sampling frequency; S represents the spectral coefficients obtained by Fourier transform of the current voice data; f represents the corresponding frequency index; and d determines the center of symmetry of the frequency region and is an integer;
introducing a weight factor W, wherein the weight factor W is the base-2 logarithm of the corresponding frequency coordinate scale;
calculating the third classification feature ILOG-SSDL.
9. the method of claim 1, wherein the training data comprises a set of normal human voice data and two sets of dysphagia patient voice data; wherein a first set of the dysphagia patient voice data is used to extract a plurality of sets of features and a second set of the dysphagia patient voice data is used to validate the recognition model.
10. A time-frequency resolution based dysphagia patient identification device, comprising:
a processor and a memory;
the processor and the memory are connected through a communication bus:
the processor is used for calling and executing the program stored in the memory;
the memory for storing a program for performing at least a time-frequency resolution based dysphagia patient identification method of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210097719.6A CN114446326B (en) | 2022-01-27 | 2022-01-27 | Dysphagia patient identification method and device based on time-frequency resolution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210097719.6A CN114446326B (en) | 2022-01-27 | 2022-01-27 | Dysphagia patient identification method and device based on time-frequency resolution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114446326A true CN114446326A (en) | 2022-05-06 |
CN114446326B CN114446326B (en) | 2023-07-04 |
Family
ID=81369470
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210097719.6A Active CN114446326B (en) | 2022-01-27 | 2022-01-27 | Dysphagia patient identification method and device based on time-frequency resolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114446326B (en) |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6424635B1 (en) * | 1998-11-10 | 2002-07-23 | Nortel Networks Limited | Adaptive nonlinear processor for echo cancellation |
CN101188107A (en) * | 2007-09-28 | 2008-05-28 | 中国民航大学 | A voice recognition method based on wavelet decomposition and mixed Gauss model estimation |
KR20140134443A (en) * | 2013-05-14 | 2014-11-24 | 울산대학교 산학협력단 | Method for determine dysphagia using the feature vector of speech signal |
CN105982641A (en) * | 2015-01-30 | 2016-10-05 | 上海泰亿格康复医疗科技股份有限公司 | Speech and language hypoacousie multi-parameter diagnosis and rehabilitation apparatus and cloud rehabilitation system |
CN106875956A (en) * | 2017-02-15 | 2017-06-20 | 太原理工大学 | A kind of method of the hearing impairment degree for judging deaf and dumb patient |
US20180289308A1 (en) * | 2017-04-05 | 2018-10-11 | The Curators Of The University Of Missouri | Quantification of bulbar function |
CN107274888A (en) * | 2017-06-14 | 2017-10-20 | 大连海事大学 | A kind of Emotional speech recognition method based on octave signal intensity and differentiation character subset |
CN108198576A (en) * | 2018-02-11 | 2018-06-22 | 华南理工大学 | A kind of Alzheimer's disease prescreening method based on phonetic feature Non-negative Matrix Factorization |
CN111867672A (en) * | 2018-02-16 | 2020-10-30 | 西北大学 | Wireless medical sensor and method |
JP2019164106A (en) * | 2018-03-20 | 2019-09-26 | 本田技研工業株式会社 | Abnormal noise detection device and detection method |
KR102216160B1 (en) * | 2020-03-05 | 2021-02-16 | 가톨릭대학교 산학협력단 | Apparatus and method for diagnosing disease that causes voice and swallowing disorders |
CN111613248A (en) * | 2020-05-07 | 2020-09-01 | 北京声智科技有限公司 | Pickup testing method, device and system |
CN113223498A (en) * | 2021-05-20 | 2021-08-06 | 四川大学华西医院 | Swallowing disorder identification method, device and apparatus based on throat voice information |
CN113724712A (en) * | 2021-08-10 | 2021-11-30 | 南京信息工程大学 | Bird sound identification method based on multi-feature fusion and combination model |
Non-Patent Citations (2)
Title |
---|
- 付方玲 et al.: "Automatic recognition of hypernasality grades in cleft palate speech combined with an auditory model" *
- 朱明星: "Research on swallowing and articulation function assessment methods based on neuromuscular physiological information" *
Also Published As
Publication number | Publication date |
---|---|
CN114446326B (en) | 2023-07-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |