CN114446326A - Swallowing disorder patient identification method and device based on time-frequency resolution - Google Patents
Swallowing disorder patient identification method and device based on time-frequency resolution
- Publication number
- CN114446326A CN114446326A CN202210097719.6A CN202210097719A CN114446326A CN 114446326 A CN114446326 A CN 114446326A CN 202210097719 A CN202210097719 A CN 202210097719A CN 114446326 A CN114446326 A CN 114446326A
- Authority
- CN
- China
- Prior art keywords
- data
- voice data
- frequency
- training data
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/26—Pre-filtering or post-filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The application relates to a swallowing disorder patient identification method and device based on time-frequency resolution, wherein the method comprises the following steps: acquiring training data that includes normal human voice data and dysphagia patient voice data; preprocessing the training data in the time domain; extracting multiple groups of features from the preprocessed training data in the frequency domain; training a recognition model on the feature set formed by these groups of features; and recognizing the voice data to be identified with the trained model. The features fed to the classifier for training include at least a frequency-domain energy distribution difference feature and a speech prosody difference feature. These feature parameters reflect the energy distribution and prosodic characteristics of the voice signal from different angles and better represent the differences in speech between dysphagia patients and normal subjects.
Description
Technical Field
The application relates to the technical field of dysphagia classification, in particular to a dysphagia patient identification method and equipment based on time-frequency resolution.
Background
When a dysphagia patient speaks, the impaired swallowing function causes the patient's speech signal to differ in distribution from that of a normal speaker, for example through a shift of the frequency band in which the signal energy is concentrated, an increase in the noise component, and a change in speaking rhythm. Significant differences in the fundamental frequency distribution and the harmonic-to-noise ratio distribution of the speech signals of dysphagia patients have been demonstrated in the prior art. A spectrogram is the combination of the per-frame spectra of a speech signal and contains many characteristics of the signal, such as the fundamental frequency, the energy distribution over the frequency bands, and the formants; it can indicate whether a segment is silent or voiced and reflect changes in articulation position. In the prior art, classification tests rely on the classical speech features commonly used in dysphagia classification, such as MFCC parameters and HNR: a series of classical speech features are used as classifier inputs for classification experiments, and the key features that characterize the speech changes of dysphagia patients are not explored.
Disclosure of Invention
In order to overcome the problem in the related art that only existing classical speech features are used for classification tests while the key features that characterize the speech changes of dysphagia patients remain unexplored, the present application provides a swallowing disorder patient identification method based on time-frequency resolution.
The scheme of the application is as follows:
according to a first aspect of the embodiments of the present application, there is provided a swallowing disorder patient identification method based on time-frequency resolution, including:
acquiring training data, wherein the training data comprises normal human voice data and dysphagia patient voice data;
preprocessing the training data based on a time domain;
extracting a plurality of groups of features from the preprocessed training data based on the frequency domain; the features at least include: a distribution difference characteristic and a voice prosody difference characteristic on frequency domain energy;
training an identification model according to a feature set formed by a plurality of groups of features;
and recognizing the voice data to be recognized based on the recognition model.
Preferably, in an implementable manner of the present application, the preprocessing the training data based on the time domain includes:
performing high bit clipping on the training data based on a time domain.
Preferably, in an implementable manner of the present application, the performing high bit clipping on the training data based on the time domain includes:
taking an absolute value for each data point of the training data;
calculating the average value of all data points of the training data after the absolute value is taken;
obtaining a high clipping self-adaptive threshold value based on a preset high clipping coefficient and the average value of all data points;
traversing each data point in the training data, the data point being retained when an absolute value of the data point is not above the high clipping adaptive threshold; replacing the data value of the data point with 0 when the absolute value of the data point is above the high clip adaptive threshold;
and outputting the training data after the high-order clipping.
Preferably, in an implementation manner of the present application, the extracting multiple sets of features from the preprocessed training data based on the frequency domain includes:
carrying out amplitude normalization on the preprocessed training data;
performing 2048-point Fourier transform on each data point of the training data after the amplitude normalization, and taking the first 1024 points as energy coefficients;
taking the first 200 points of the Fourier transform coefficients of each data point for a significance difference test, and obtaining, based on a preset confidence level, a frequency band in which the energy distributions of the normal human voice data and the dysphagia patient voice data differ significantly, wherein the frequency band is used as the first classification feature.
Preferably, in an implementation manner of the present application, the extracting multiple sets of features from the preprocessed training data based on the frequency domain further includes:
calculating normalized spectral coefficient envelope areas of the normal human voice data and the dysphagia patient voice data, respectively, based on the energy coefficients; the ordinate of the normalized spectral coefficient envelope area is an energy coefficient, and the abscissa is a frequency component corresponding to each energy coefficient;
and taking the corresponding relation between each group of energy coefficients and the frequency components as a second classification characteristic.
Preferably, in an implementation manner of the present application, the extracting multiple sets of features from the preprocessed training data based on the frequency domain further includes:
and calculating the distribution difference of the normal human voice data and the dysphagia patient voice data in different frequency bands in a frequency spectrum based on a preset algorithm to serve as a third classification characteristic.
Preferably, in an implementation manner of the present application, the extracting multiple sets of features from the preprocessed training data based on the frequency domain further includes:
performing short framing on the normal human voice data and the dysphagia patient voice data;
grouping corresponding frame signals of the normal human voice data and the dysphagia patient voice data;
and performing significance difference test on the extracted features of each group of frame signals, determining a frame signal sequence with significance difference between the normal human voice data and the dysphagia patient voice data based on a preset confidence coefficient, and taking the voice features corresponding to the frame sequence as fourth classification features.
Preferably, in an implementable manner of the present application, the preset algorithm includes:
determining an index A_total for evaluating the amplitude variation of each frequency component;
wherein fs represents the sampling frequency; S represents the spectral coefficients obtained by Fourier transform of the current voice data; f represents the corresponding frequency index; and d determines the center of symmetry of the frequency region and is an integer;
introducing a weight factor W, where W is the base-2 logarithm of the corresponding frequency coordinate scale;
calculating the third classification feature ILOG-SSDL.
preferably, in an implementable manner of the present application, the training data comprises a set of normal human voice data and two sets of dysphagia patient voice data; wherein a first set of the dysphagia patient voice data is used to extract a plurality of sets of features and a second set of the dysphagia patient voice data is used to validate the recognition model.
According to a second aspect of embodiments of the present application, there is provided a swallowing disorder patient identification device based on time-frequency resolution, comprising:
a processor and a memory;
the processor and the memory are connected through a communication bus:
the processor is used for calling and executing the program stored in the memory;
the memory for storing a program for at least performing a time-frequency resolution based dysphagia patient identification method as claimed in any of the above.
The technical scheme provided by the application can have the following beneficial effects. The swallowing disorder patient identification method based on time-frequency resolution comprises: acquiring training data that includes normal human voice data and dysphagia patient voice data; preprocessing the training data in the time domain; extracting multiple groups of features from the preprocessed training data in the frequency domain; training a recognition model on the feature set formed by these groups of features; and recognizing the voice data to be identified with the trained model. The features fed to the classifier for training include at least a frequency-domain energy distribution difference feature and a speech prosody difference feature. These feature parameters reflect the energy distribution and prosodic characteristics of the voice signal from different angles and better represent the differences in speech between dysphagia patients and normal subjects.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flowchart of a swallowing disorder patient identification method based on time-frequency resolution according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating calculation of a third classification characteristic parameter in a swallowing disorder patient identification method based on time-frequency resolution according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a swallowing disorder patient identification device based on time-frequency resolution according to an embodiment of the present application.
Reference numerals: a processor-21; a memory-22.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
A swallowing disorder patient identification method based on time-frequency resolution, referring to fig. 1, includes:
s11: acquiring training data, wherein the training data comprises normal human voice data and dysphagia patient voice data;
preferably, in this embodiment, the training data includes a set of normal human voice data and two sets of dysphagia patient voice data; wherein a first set of dysphagia patient voice data is used to extract a plurality of sets of features and a second set of dysphagia patient voice data is used to validate the recognition model.
Preferably, in the present embodiment, the group of normal human voice data contains 40 cases, and the two groups of dysphagia patient voice data contain 92 cases in total, 46 cases per group.
S12: preprocessing training data based on a time domain;
the voice data generally has mute sections with different lengths, and before and after mute sections generated when the voice data is recorded and a longer mute section between two sentences in the voice data need to be removed before feature extraction. The speech data is filtered after removing the silence segments. The signal filtering adopts a Butterworth high-pass filter to filter out components with frequency components lower than 500Hz, and the order of the filter is 10.
Specifically, the preprocessing of the training data based on the time domain includes:
performing high bit clipping on training data based on a time domain, comprising:
taking an absolute value for each data point of the training data;
calculating the average value of all data points of the training data after the absolute value is taken;
obtaining a high clipping adaptive threshold value based on a preset high clipping coefficient and the mean value of all data points;
traversing each data point in the training data, and retaining the data point when its absolute value is not higher than the high clipping adaptive threshold; when the absolute value of the data point is higher than the high clipping adaptive threshold, replacing the data value of the data point with 0;
and outputting the training data after the high-order clipping.
In this embodiment, the idea of performing high-order clipping on the training data in the time domain is introduced: points with very large amplitudes in the training data are removed adaptively, so that feature extraction concentrates on the more subtle differences in the distribution of the training data. For each speech signal in the training data, the adaptive high-order clipping is computed as follows:
1) taking an absolute value of each data point of the training data;
2) calculating the mean value m of all data points of the training data after the absolute value is taken;
3) based on a preset high clipping coefficient r (e.g., preset to 0.6) and the mean m of all data points, a high clipping adaptive threshold T1 = r × m is obtained.
4) Traversing each data point in the training data: the data point is retained when its absolute value is not higher than the high clipping adaptive threshold; the data value of the data point is replaced with 0 when its absolute value is above the threshold.
The training data that is finally output is training data that is subjected to high-order clipping.
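A minimal sketch of this adaptive high-order clipping (the clipping coefficient r = 0.6 is the example value given above; the function name is illustrative):

```python
import numpy as np

def high_clip(signal: np.ndarray, r: float = 0.6) -> np.ndarray:
    """Adaptively zero out samples whose magnitude exceeds r times the mean absolute amplitude."""
    abs_sig = np.abs(signal)      # 1) absolute value of each data point
    m = abs_sig.mean()            # 2) mean of all absolute values
    t1 = r * m                    # 3) adaptive threshold T1 = r * m
    clipped = signal.copy()
    clipped[abs_sig > t1] = 0.0   # 4) points above T1 are replaced with 0, the rest are retained
    return clipped
```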
S13: extracting a plurality of groups of features from the preprocessed training data based on the frequency domain; the features at least include: a distribution difference characteristic in frequency domain energy and a voice prosody difference characteristic;
the features in this embodiment mainly include four types of speech features, which are fast fourier transform coefficients (FFT-8000), Normalized Spectral areas (NS-area), Improved Log Symmetric Spectral Difference coefficients (ILOG-SSDL), and Dynamic prosodic Difference feature Sets (DRDs), respectively, that characterize the energy distribution characteristics of the key band, and these feature parameters reflect the energy distribution characteristics and prosodic characteristics of the speech signal from different angles.
Specifically, extracting multiple groups of features from the preprocessed training data based on the frequency domain includes:
1) carrying out amplitude normalization on the preprocessed training data;
performing 2048-point Fourier transform on each data point of the training data after the amplitude normalization, and taking the first 1024 points as energy coefficients;
taking the first 200 points of the Fourier transform coefficients of each data point for a significance difference test, and obtaining, based on a preset confidence level, a frequency band in which the energy distributions of the normal human voice data and the dysphagia patient voice data differ significantly, wherein the frequency band is used as the first classification feature.
FFT-8000 is dedicated to exploring differences in the energy distribution over the main frequency bands between dysphagia patient speech and normal human speech signals. For each sentence in the processed training data, amplitude normalization is carried out first to remove the influence of volume; a 2048-point Fourier transform is then computed for each data point, and because of the symmetry of the spectrum the first 1024 points are taken as energy coefficients, each point representing a bandwidth of about 43 Hz. Since the frequency of the human voice is low, the first 8000 Hz of the speech signal already contains most of the useful information, so in this embodiment the first 200 points (200 × 43 = 8600 Hz) of the Fourier transform coefficients of each speech signal are tested for significant differences, searching for frequency components that differ significantly between groups; in total, 200 frequency components are tested between groups.
The significance test used is the T test with a confidence level of 99.5%. The test shows that the frequency components with inter-group differences are concentrated between the 100th and the 160th component, i.e. the dysphagia patient speech data and the normal human speech data show a significant difference in energy distribution in the 4000 Hz to 6400 Hz band, and this combination of significantly different frequency features is used as the first classification feature.
A significance test first makes an assumption about the parameters of the population (random variable) or about the form of its distribution, and then uses sample information to judge whether this hypothesis is reasonable, i.e. whether the true situation of the population differs significantly from the original hypothesis. In other words, the significance test determines whether the difference between the sample and the hypothesis made about the population is purely random variation or is caused by a mismatch between the hypothesis and the true situation of the population. Significance testing tests a hypothesis about the population, and the hypothesis is accepted or rejected according to the principle that small-probability events are practically impossible.
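A sketch of this bin-wise selection, under the assumption that the per-utterance spectra have already been stacked into two matrices (rows are utterances, columns are the first 200 energy coefficients); names and shapes are illustrative:

```python
import numpy as np
from scipy.stats import ttest_ind

def select_significant_bins(spec_normal: np.ndarray, spec_patient: np.ndarray,
                            alpha: float = 0.005) -> np.ndarray:
    """Indices of FFT bins (about 43 Hz each) whose energy differs significantly between groups.

    alpha = 0.005 corresponds to the 99.5% confidence level used above.
    """
    _, p_values = ttest_ind(spec_normal, spec_patient, axis=0)  # one T test per frequency bin
    return np.where(p_values < alpha)[0]
```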
Extracting a plurality of groups of features from the preprocessed training data based on the frequency domain, and further comprising:
2) respectively calculating normalized spectral coefficient envelope areas of the voice data of the normal person and the voice data of the dysphagia patient on the basis of the energy coefficients; the ordinate of the normalized spectral coefficient envelope area is an energy coefficient, and the abscissa is a frequency component corresponding to each energy coefficient;
and taking the corresponding relation between each group of energy coefficients and the frequency components as a second classification characteristic.
The NS-area reflects the overall energy distribution of the speech signal. In this embodiment, the normalized spectral coefficient envelope area is calculated from the 1024 energy coefficients obtained in the previous step: the area under the spectral coefficient curve is computed by trapezoidal numerical integration, with the frequency component corresponding to each energy coefficient as the abscissa and the energy coefficient itself as the ordinate. After integration, one second classification feature NS-area is obtained for each set of energy coefficients.
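A sketch of the NS-area computation with NumPy's trapezoidal rule, assuming the 1024 energy coefficients of one utterance and a bin width of about 43 Hz (names are illustrative):

```python
import numpy as np

def ns_area(energy_coeffs: np.ndarray, bin_hz: float = 43.0) -> float:
    """Area under the normalized spectral-coefficient envelope via trapezoidal integration."""
    freqs = np.arange(len(energy_coeffs)) * bin_hz  # abscissa: frequency of each energy coefficient
    return float(np.trapz(energy_coeffs, freqs))    # ordinate: the energy coefficients themselves
```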
Extracting a plurality of groups of features from the preprocessed training data based on the frequency domain, and further comprising:
3) calculating, based on a preset algorithm, the distribution difference of the normal human voice data and the dysphagia patient voice data in different frequency bands of the spectrum, which serves as the third classification feature.
The preset algorithm comprises the following steps:
determining an index A_total for evaluating the amplitude variation of each frequency component based on equation (1);
wherein fs represents the sampling frequency; S represents the spectral coefficients obtained by Fourier transform of the current voice data; f represents the corresponding frequency index; and d determines the center of symmetry of the frequency region and is an integer;
introducing a weight factor W, where W is the base-2 logarithm of the corresponding frequency coordinate scale;
calculating the third classification feature ILOG-SSDL based on equation (2).
the spectrally-related characteristics of dysphagia patient speech data may differ from those of normal human speech data. These differences include the distribution of frequency components and their corresponding amplitudes in the speech spectrum. An algorithm is proposed in this embodiment to emphasize these differences in the speech and normal speech spectra of dysphagia patients, taking into account the variation in frequency content. In general, the difference of the speech spectrum is reflected on the distribution of the respective frequency components, which can be determined by their positions and corresponding amplitudes. In consideration of the amplitude of energy, there is proposed a method of determining an index A for evaluating the amplitude variation of each frequency componenttotal。
d is set to 2 in the experiment, i.e. the range involved in the calculation is the whole spectrum, since (fs/2) × 2 = fs; if d is set to 4, then (fs/4) × 2 = fs/2 and the frequency range involved in the calculation is the first 1/2 of the frequency components, and so on. Fig. 2 shows an example where d equals 2: the axis of symmetry is fs/2, and the signal values at equal distances from the axis of symmetry form a pair of symmetric sequences.
The symmetric spectral difference of the amplitude of each frequency component is obtained through formula (1). Since spectral changes relate not only to the amplitudes but also to the positions of the amplitude distribution, a weighting factor is introduced into the calculation of A_total. In this embodiment the weighting factor of SSDL is improved: the introduced weighting factor is the base-2 logarithm of the corresponding frequency coordinate scale, and since the symmetry axis is fs/2, the coordinate weighting matrix is adjusted to emphasize the distribution difference between low-frequency and high-frequency components. Differences occurring in higher frequency regions receive higher weights, as shown in formula (2).
The final third classification feature ILOG-SSDL is obtained by calculation through formula (2). This feature combines the differences of the speech frequency components in both the amplitude distribution and the corresponding position distribution, and through the improved weighting factor it emphasizes the distribution differences between normal human voice data and dysphagia patient voice data in different frequency bands of the spectrum.
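Equations (1) and (2) appear only in the original figures and are not reproduced here, so the following is merely an assumed, illustrative reading of the described idea: a symmetric spectral difference about the fs/d axis, weighted by a base-2 logarithmic frequency factor. It is a sketch under those assumptions, not the patent's exact formula:

```python
import numpy as np

def ilog_ssdl(spectrum: np.ndarray, d: int = 2) -> float:
    """Assumed form: log2-weighted symmetric spectral difference about the fs/d symmetry axis."""
    center = len(spectrum) // d                       # index of the assumed symmetry center (fs/d)
    k = np.arange(1, center)                          # distances from the symmetry axis
    diffs = np.abs(spectrum[center - k] - spectrum[center + k])  # pairwise symmetric differences
    weights = np.log2(k + 1)                          # assumed base-2 log of the frequency coordinate scale
    return float(np.sum(weights * diffs))
```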
Extracting a plurality of groups of features from the preprocessed training data based on the frequency domain, and further comprising:
4) carrying out short framing on the voice data of a normal person and the voice data of a dysphagia patient;
grouping the corresponding frame signals of the voice data of the normal human and the voice data of the dysphagia patient;
and performing significance difference test on the extracted features of each group of frame signals, determining a frame signal sequence with significance difference between normal human voice data and the voice data of the dysphagia patient based on preset confidence, and taking the voice features corresponding to the frame sequence as fourth classification features.
The three features above mainly explore the differences between dysphagia patient speech data and normal human speech data in detailed energy bands and in the overall energy distribution. DRDs differ from the previous features in that they improve the time resolution through short framing (each speech frame is about 5-15 ms long), compute a spectrogram based on these short frames, and clearly identify the locations of voiced and unvoiced segments on the spectrogram, so that the prosody change characteristics of speech can be reflected through the spectrogram.
With this short framing, each frame of voice data is 1/1000 of the length of the signal itself, the frame shift is 1/4000 of the signal length, and the mean of the spectrum of each frame in the first 1300 frames is calculated.
In this embodiment, in order to locate where the speech data produces prosodic differences, the corresponding frame signals of dysphagia patient speech and normal human speech are grouped, inter-group T tests are carried out on the spectral means of the 1300 frame groups with the confidence level set to 95%, and the frame sequences with significant inter-group differences are located; this reveals the characteristic differences between dysphagia patients and normal speakers when uttering a sentence with the same content. In this embodiment, voice data were collected from dysphagia patients and normal speakers reading the tongue-twister sentence about eating grapes without spitting out the grape skins and spitting out grape skins without eating grapes. The results show that when dysphagia patients read this sentence, several feature groups differ significantly from normal speakers at the initial pronunciation position and at the end of the sentence, and the feature groups corresponding to these significantly different frame signal sequences are used as the fourth classification features.
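An illustrative sketch of the framing and frame-wise comparison, assuming each recording is long enough to yield at least 1300 frames and that the per-frame spectral means of each group are stacked into matrices of shape (n_utterances, 1300); all names are placeholders:

```python
import numpy as np
from scipy.stats import ttest_ind

def frame_spectral_means(signal: np.ndarray, n_frames: int = 1300) -> np.ndarray:
    """Per-frame spectral means: frame length = len/1000, frame shift = len/4000."""
    frame_len = max(len(signal) // 1000, 1)
    hop = max(len(signal) // 4000, 1)
    means = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        means.append(np.abs(np.fft.rfft(frame)).mean())  # mean of the frame's magnitude spectrum
    return np.array(means)

def significant_frames(means_normal: np.ndarray, means_patient: np.ndarray,
                       alpha: float = 0.05) -> np.ndarray:
    """Frame positions whose spectral mean differs significantly between groups (95% confidence)."""
    _, p_values = ttest_ind(means_normal, means_patient, axis=0)  # one T test per frame position
    return np.where(p_values < alpha)[0]
```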
S14: training an identification model according to a feature set consisting of a plurality of groups of features;
The feature set is formed from the four types of features extracted above. The model may be, but is not limited to, an SVM (support vector machine) classifier; the optimal classification surface in the SVM algorithm is derived under the linearly separable condition. The optimal classification surface must not only separate the two classes of sample points with as few errors as possible but also maximize the margin between the two classes; the SVM therefore has outstanding advantages in binary classification and is a typical classifier for two-class recognition. A Gaussian kernel is used for the SVM classifier in this embodiment. Verified on the speech data of the second group of dysphagia patients, the classification accuracy was 81.4%, the sensitivity 85%, and the specificity 80%. The verification results show that the method provided in this embodiment can correctly classify dysphagia patients and normal subjects from voice data, and achieves better classification performance than the prior art in terms of classification accuracy and feature pertinence.
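A minimal training sketch with scikit-learn, assuming a feature matrix X assembled from the four feature groups above and binary labels y (patient vs. normal); the names and the scaling step are illustrative choices, not specified by the patent:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_recognition_model(X_train: np.ndarray, y_train: np.ndarray):
    """Train an SVM with a Gaussian (RBF) kernel on the combined feature set."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # Gaussian kernel SVM
    model.fit(X_train, y_train)
    return model

# Hypothetical usage: evaluate on the held-out second group of dysphagia patient data.
# model = train_recognition_model(X_train, y_train)
# accuracy = model.score(X_validation, y_validation)
```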
S15: and recognizing the voice data to be recognized based on the recognition model.
The time-frequency-resolution-based dysphagia patient identification method in this embodiment comprises: acquiring training data that includes normal human voice data and dysphagia patient voice data; preprocessing the training data in the time domain; extracting multiple groups of features from the preprocessed training data in the frequency domain; training a recognition model on the feature set formed by these groups of features; and recognizing the voice data to be identified with the trained model. The features fed to the classifier in this embodiment include at least a frequency-domain energy distribution difference feature and a speech prosody difference feature. These feature parameters reflect the energy distribution and prosodic characteristics of the voice signal from different angles and better represent the differences in speech between dysphagia patients and normal subjects.
A swallowing disorder patient identification device based on time-frequency resolution, referring to fig. 3, comprising:
a processor 21 and a memory 22;
the processor 21 is connected to the memory 22 by a communication bus:
the processor 21 is configured to call and execute a program stored in the memory 22;
a memory 22 for storing a program for performing at least the time-frequency resolution based dysphagia patient identification method in the above embodiment.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description of the present specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are exemplary and should not be construed as limiting the present application and that changes, modifications, substitutions and alterations in the above embodiments may be made by those of ordinary skill in the art within the scope of the present application.
Claims (10)
1. A swallowing disorder patient identification method based on time-frequency resolution is characterized by comprising the following steps:
acquiring training data, wherein the training data comprises normal human voice data and dysphagia patient voice data;
preprocessing the training data based on a time domain;
extracting a plurality of groups of features from the preprocessed training data based on the frequency domain; the features at least include: distribution difference characteristics and voice rhythm difference characteristics on frequency domain energy;
training an identification model according to a feature set formed by a plurality of groups of features;
and recognizing the voice data to be recognized based on the recognition model.
2. The method of claim 1, wherein the pre-processing the training data based on the time domain comprises:
performing high bit clipping on the training data based on a time domain.
3. The method of claim 2, wherein the high bit clipping the training data based on the time domain comprises:
taking an absolute value for each data point of the training data;
calculating the average value of all data points of the training data after the absolute value is taken;
obtaining a high clipping adaptive threshold value based on a preset high clipping coefficient and the mean value of all the data points;
traversing each data point in the training data, the data point being retained when an absolute value of the data point is not above the high clipping adaptive threshold; replacing the data value of the data point with 0 when the absolute value of the data point is above the high clip adaptive threshold;
and outputting the training data after the high-order clipping.
4. The method of claim 1, wherein extracting sets of features from the pre-processed training data based on the frequency domain comprises:
carrying out amplitude normalization on the preprocessed training data;
performing 2048-point Fourier transform on each data point of the training data after the amplitude normalization, and taking the first 1024 points as energy coefficients;
taking the first 200 points of the Fourier transform coefficients of each data point for a significance difference test, and obtaining, based on a preset confidence level, a frequency band in which the energy distributions of the normal human voice data and the dysphagia patient voice data differ significantly, wherein the frequency band is used as the first classification feature.
5. The method of claim 4, wherein extracting sets of features from the pre-processed training data based on the frequency domain further comprises:
calculating normalized spectral coefficient envelope areas of the normal human voice data and the dysphagia patient voice data, respectively, based on the energy coefficients; the ordinate of the normalized spectrum coefficient envelope area is an energy coefficient, and the abscissa is a frequency component corresponding to each energy coefficient;
and taking the corresponding relation between each group of energy coefficients and the frequency components as a second classification characteristic.
6. The method of claim 1, wherein extracting sets of features from the pre-processed training data based on the frequency domain further comprises:
and calculating the distribution difference of the normal human voice data and the dysphagia patient voice data in different frequency bands in a frequency spectrum based on a preset algorithm to serve as a third classification characteristic.
7. The method of claim 1, wherein extracting sets of features from the pre-processed training data based on the frequency domain further comprises:
performing short framing on the normal human voice data and the dysphagia patient voice data;
grouping corresponding frame signals of the normal human voice data and the dysphagia patient voice data;
and performing significance difference test on the extracted features of each group of frame signals, determining a frame signal sequence with significance difference between the normal human voice data and the dysphagia patient voice data based on a preset confidence coefficient, and taking a feature group corresponding to a frame sequence as a fourth classification feature.
8. The method of claim 6, wherein the predetermined algorithm comprises:
determining an index A_total for evaluating the amplitude variation of each frequency component;
wherein fs represents the sampling frequency; S represents the spectral coefficients obtained by Fourier transform of the current voice data; f represents the corresponding frequency index; and d determines the center of symmetry of the frequency region and is an integer;
introducing a weight factor W, wherein the weight factor W is the base-2 logarithm of the corresponding frequency coordinate scale;
calculating the third classification feature ILOG-SSDL.
9. the method of claim 1, wherein the training data comprises a set of normal human voice data and two sets of dysphagia patient voice data; wherein a first set of the dysphagia patient voice data is used to extract a plurality of sets of features and a second set of the dysphagia patient voice data is used to validate the recognition model.
10. A time-frequency resolution based dysphagia patient identification device, comprising:
a processor and a memory;
the processor and the memory are connected through a communication bus:
the processor is used for calling and executing the program stored in the memory;
the memory for storing a program for performing at least a time-frequency resolution based dysphagia patient identification method of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210097719.6A CN114446326B (en) | 2022-01-27 | 2022-01-27 | Dysphagia patient identification method and device based on time-frequency resolution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210097719.6A CN114446326B (en) | 2022-01-27 | 2022-01-27 | Dysphagia patient identification method and device based on time-frequency resolution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114446326A true CN114446326A (en) | 2022-05-06 |
CN114446326B CN114446326B (en) | 2023-07-04 |
Family
ID=81369470
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210097719.6A Active CN114446326B (en) | 2022-01-27 | 2022-01-27 | Dysphagia patient identification method and device based on time-frequency resolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114446326B (en) |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6424635B1 (en) * | 1998-11-10 | 2002-07-23 | Nortel Networks Limited | Adaptive nonlinear processor for echo cancellation |
CN101188107A (en) * | 2007-09-28 | 2008-05-28 | 中国民航大学 | A voice recognition method based on wavelet decomposition and mixed Gauss model estimation |
KR20140134443A (en) * | 2013-05-14 | 2014-11-24 | 울산대학교 산학협력단 | Method for determine dysphagia using the feature vector of speech signal |
CN105982641A (en) * | 2015-01-30 | 2016-10-05 | 上海泰亿格康复医疗科技股份有限公司 | Speech and language hypoacousie multi-parameter diagnosis and rehabilitation apparatus and cloud rehabilitation system |
CN106875956A (en) * | 2017-02-15 | 2017-06-20 | 太原理工大学 | A kind of method of the hearing impairment degree for judging deaf and dumb patient |
US20180289308A1 (en) * | 2017-04-05 | 2018-10-11 | The Curators Of The University Of Missouri | Quantification of bulbar function |
CN107274888A (en) * | 2017-06-14 | 2017-10-20 | 大连海事大学 | A kind of Emotional speech recognition method based on octave signal intensity and differentiation character subset |
CN108198576A (en) * | 2018-02-11 | 2018-06-22 | 华南理工大学 | A kind of Alzheimer's disease prescreening method based on phonetic feature Non-negative Matrix Factorization |
CN111867672A (en) * | 2018-02-16 | 2020-10-30 | 西北大学 | Wireless medical sensor and method |
JP2019164106A (en) * | 2018-03-20 | 2019-09-26 | 本田技研工業株式会社 | Abnormal noise detection device and detection method |
KR102216160B1 (en) * | 2020-03-05 | 2021-02-16 | 가톨릭대학교 산학협력단 | Apparatus and method for diagnosing disease that causes voice and swallowing disorders |
CN111613248A (en) * | 2020-05-07 | 2020-09-01 | 北京声智科技有限公司 | Pickup testing method, device and system |
CN113223498A (en) * | 2021-05-20 | 2021-08-06 | 四川大学华西医院 | Swallowing disorder identification method, device and apparatus based on throat voice information |
CN113724712A (en) * | 2021-08-10 | 2021-11-30 | 南京信息工程大学 | Bird sound identification method based on multi-feature fusion and combination model |
Non-Patent Citations (2)
Title |
---|
- 付方玲 et al.: "Automatic recognition of hypernasality grades in cleft palate speech combined with an auditory model" *
- 朱明星: "Research on swallowing and articulation function assessment methods based on neuromuscular physiological information" *
Also Published As
Publication number | Publication date |
---|---|
CN114446326B (en) | 2023-07-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |