
CN118155663A - Big data cleaning method based on artificial intelligence - Google Patents

Info

Publication number
CN118155663A
Authority
CN
China
Prior art keywords
text
voice
voice information
preset
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410564809.0A
Other languages
Chinese (zh)
Other versions
CN118155663B (en
Inventor
杨光春
潘颖年
舒立宏
时晓前
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bocheng Jingwei Software Technology Co ltd
Original Assignee
Bocheng Jingwei Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bocheng Jingwei Software Technology Co ltd
Priority to CN202410564809.0A
Publication of CN118155663A
Application granted
Publication of CN118155663B
Legal status: Active
Anticipated expiration

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the technical field of data cleaning, and in particular to a big data cleaning method based on artificial intelligence. The method comprises: collecting voice information during a voice interaction process and performing an audio quality check on the voice information to determine whether it is qualified; performing voice endpoint detection on the qualified voice information to determine the start and end of the voice data, so as to convert the voice data into text; performing a qualification analysis on the converted text, in which the qualification of the text is determined according to the length of the converted text; performing sensitive word detection on the preliminarily qualified text to determine a cleaning mode for the text; and performing a quality evaluation on the cleaned text and determining, according to the evaluation result, an adjustment to the text cleaning process. By improving the accuracy of the voice-to-text conversion, the method improves the control accuracy of the big data cleaning process and yields high-accuracy cleaned text.

Description

Big data cleaning method based on artificial intelligence
Technical Field
The invention relates to the technical field of data cleaning, in particular to a big data cleaning method based on artificial intelligence.
Background
As big data technology matures, big data in the Internet sector is characterized by huge volume, high complexity, and a high degree of association. To obtain high-quality data, data quality must be improved at the data cleaning stage. In particular, big data cleaning techniques applied to voice interaction suffer from inaccurate cleaned data.
Chinese patent application publication No. CN115687321A discloses a big data cleaning method and system based on artificial intelligence, which comprises an initial data importing module, a data classifying module, a filtering information importing module, a filtering selecting module, a first filtering module, a second filtering module, a third filtering module, a fourth filtering module, a secondary filtering module, a result outputting module and a comprehensive evaluating module. The initial data importing module is used by a user to import the data to be cleaned, which is sent to the data classifying module. The data classifying module processes the data to be cleaned to obtain data classification information, which comprises single classification data and mixed classification data; the data types of the single classification data and the mixed classification data comprise video data, audio data, text data and picture data. That invention is stated to filter and clean data more quickly and accurately.
It can be seen that, in the prior art, the low accuracy of the voice-to-text conversion process leads to low control accuracy of the big data cleaning process, and therefore to low accuracy of the final cleaned text.
Disclosure of Invention
Therefore, the invention provides a big data cleaning method based on artificial intelligence to solve the prior-art problem that the low accuracy of the voice-to-text conversion process leads to low control accuracy of the big data cleaning process, and therefore to low accuracy of the final cleaned text.
In order to achieve the above object, the present invention provides an artificial intelligence based big data cleaning method, comprising:
collecting voice information during a voice interaction process, and performing an audio quality check on the voice information to determine the qualification of the voice information;
performing voice endpoint detection on the qualified voice information, and determining the start and end of the voice data through the voice endpoint detection so as to convert the voice data into text;
performing a qualification analysis on the converted text, and determining the qualification of the text according to the length of the converted text;
performing sensitive word detection on the preliminarily qualified text to determine a cleaning mode of the text; and
performing a quality evaluation on the cleaned text, and determining an adjustment to the text cleaning process according to the evaluation result.
Further, determining the qualification of the voice information includes determining that the voice information is unqualified when the average volume of the voice information is not within a preset volume range, and determining to perform a secondary judgment on the qualification of the voice information when the average volume is within the preset volume range.
Further, when the secondary judgment is performed, the voice information is determined to be qualified if its noise level is less than or equal to a preset noise level, and unqualified if its noise level is greater than the preset noise level.
Further, under the condition that the voice information is unqualified, determining to perform volume standardization processing and noise reduction processing on the voice information, under the condition that the voice information is qualified, determining to perform voice endpoint detection on the qualified voice information, and determining the starting and ending processes of voice data through voice endpoint detection so as to convert the voice data into text.
Further, under the condition that the eligibility analysis is carried out on the converted text, determining that the text is unqualified according to the comparison result that the converted text length is not in the preset length range, and determining that the text is qualified according to the comparison result that the converted text length is in the preset length range.
Further, under the condition that the text is unqualified, the number of devices for collecting voice information in the voice interaction process is increased by a first adjustment coefficient according to the comparison result that the absolute value of the difference value of the text length and the preset length range is smaller than or equal to a first preset difference value.
Further, under the condition that the text is qualified, determining to delete punctuation marks, blank characters, stop words and repeated speech segments of the text according to the comparison result that the number of the sensitive words in the text is smaller than the preset number, determining to delete punctuation marks, blank characters, stop words and repeated speech segments of the text according to the comparison result that the number of the sensitive words in the text is larger than or equal to the preset number, and simultaneously deleting all the content parts containing the sensitive words.
Further, under the condition that the quality evaluation is carried out on the cleaned text, the adjustment of the text cleaning process is determined according to the comparison result that the accuracy evaluation value of the cleaned text is smaller than the preset accuracy evaluation value.
Further, under the condition of adjusting the text cleaning process, the preset number is determined to be adjusted by a second adjustment coefficient according to a comparison result that the difference value between the accuracy degree evaluation value and the preset accuracy degree evaluation value is smaller than or equal to a second preset difference value.
Further, under the condition of adjusting the text cleaning process, the preset length range and the preset volume range are determined to be adjusted by a third adjustment coefficient according to the comparison result that the difference value between the accuracy evaluation value and the preset accuracy evaluation value is larger than the second preset difference value.
Compared with the prior art, the method has the advantages that the qualification of the voice information is determined through the audio quality inspection of the collected voice information, so that the effectiveness of subsequent cleaning is guaranteed, the control accuracy of the big data cleaning process is improved, meanwhile, the qualification of the voice information is primarily determined according to the volume of the voice information, the phenomenon that the text converted by voice is inaccurate due to the fact that the volume of the voice information is smaller than or larger than a preset volume range is avoided, and the control accuracy of the big data cleaning process based on artificial intelligence is improved.
Further, the voice information with qualified volume is analyzed by noise, the qualification of the voice information is determined according to the noise level of the voice information and the preset noise level, the accuracy of the voice-to-text process is improved, and the control accuracy of the big data cleaning process is further improved.
Furthermore, the invention obtains qualified voice information by applying volume standardization and noise reduction to unqualified voice information, which avoids wasted effort in later processing of unqualified voice information and reduces resource waste. By performing voice endpoint detection on the qualified voice information, the voice information is split into paragraphs and converted segment by segment, which improves the accuracy of the voice-to-text conversion, improves the control accuracy of the big data cleaning process, and yields high-accuracy cleaned text.
Further, the invention determines the accuracy of the voice-to-text conversion by analyzing the length of the converted text. If the converted text length is not within the preset length range, the converted text has lost data or contains data introduced by noise interference, and the voice-to-text conversion is determined to be unqualified. This improves the control accuracy of the voice-to-text conversion and, in turn, the accuracy of the big data cleaning process, so that high-accuracy text is obtained.
Furthermore, the invention adjusts the voice endpoint detection process to improve the accuracy of the voice-to-text conversion, which improves the accuracy of the text to be cleaned and therefore of the cleaned text. Increasing the number of devices that collect voice information during the voice interaction by the first adjustment coefficient allows the voice information to be captured more clearly and completely, which improves the accuracy of voice endpoint detection. Post-processing and optimizing the voice paragraphs after voice endpoint detection improves the integrity of the voice and thereby the accuracy of the voice-to-text conversion.
Further, the invention determines the cleaning mode of the text by analyzing the number of sensitive words in the text. If the number of sensitive words is smaller than the preset number, punctuation marks, blank characters, stop words and repeated speech segments are deleted from the text (the first cleaning mode). If the number of sensitive words is greater than or equal to the preset number, all content containing sensitive words is additionally deleted on top of the first cleaning mode. In this way, cleaned text of qualified quality is obtained.
Further, the invention determines the accuracy evaluation value of the cleaned text by analyzing the cleaned text against the original text, and determines whether to adjust the text cleaning process by comparing this value with the preset accuracy evaluation value. If the accuracy evaluation value is smaller than the preset accuracy evaluation value, the cleaned text differs greatly from the original text and the text cleaning process needs to be adjusted. This improves the control accuracy of the big data cleaning process, so that high-accuracy text is obtained.
Further, the invention determines the adjustment mode by comparing the difference between the accuracy evaluation value and the preset accuracy evaluation value with the second preset difference. If the difference is smaller than or equal to the second preset difference, the preset number of sensitive words is adjusted by the second adjustment coefficient, which reduces the proportion of text deleted and increases the integrity of the text. If the difference is larger than the second preset difference, the qualification parameters of the voice information are adjusted by the third adjustment coefficient, which improves the accuracy of the voice information. This further improves the control accuracy of the big data cleaning process, so that high-accuracy cleaned text is obtained.
Drawings
FIG. 1 is a workflow diagram of an artificial intelligence based big data cleansing method in accordance with an embodiment of the present invention;
FIG. 2 is a workflow diagram of determining eligibility of converted text based on an artificial intelligence based big data cleansing method in accordance with an embodiment of the present invention;
FIG. 3 is a flowchart of an embodiment of the invention for determining adjustments to a text cleaning process based on an artificial intelligence based big data cleaning method.
Detailed Description
In order that the objects and advantages of the invention will become more apparent, the invention will be further described with reference to the following examples; it should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.
It should be noted that, in the description of the present invention, terms such as "upper," "lower," "left," "right," "inner," "outer," and the like indicate directions or positional relationships based on the directions or positional relationships shown in the drawings, which are merely for convenience of description, and do not indicate or imply that the apparatus or elements must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Furthermore, it should be noted that, in the description of the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
Referring to fig. 1-3, fig. 1 is a workflow diagram of an artificial intelligence based big data cleaning method according to an embodiment of the present invention; FIG. 2 is a workflow diagram of determining eligibility of converted text based on an artificial intelligence based big data cleansing method in accordance with an embodiment of the present invention; FIG. 3 is a flowchart of an embodiment of the invention for determining adjustments to a text cleaning process based on an artificial intelligence based big data cleaning method.
The big data cleaning method based on artificial intelligence provided by the embodiment of the invention comprises the following steps:
S1, collecting voice information during a voice interaction process, and performing an audio quality check on the voice information to determine the qualification of the voice information;
S2, performing voice endpoint detection on the qualified voice information, and determining the start and end of the voice data through the voice endpoint detection so as to convert the voice data into text;
S3, performing a qualification analysis on the converted text, and determining the qualification of the text according to the length of the converted text;
S4, performing sensitive word detection on the preliminarily qualified text to determine a cleaning mode of the text;
S5, performing a quality evaluation on the cleaned text, and determining an adjustment to the text cleaning process according to the evaluation result.
Specifically, determining the eligibility of the voice information includes determining the eligibility of the voice information according to a comparison result of the average volume D of the voice information and a preset volume range D0;
if D is not in D0, determining that the voice information is unqualified;
If D is in D0, determining to secondarily judge the qualification of the voice information;
The lower and upper bounds of the preset volume range D0 are one third and nine tenths, respectively, of the historical maximum of the volume broadcast in the automobile, but the values are not limited thereto, and a person skilled in the art can adjust them according to actual needs.
Specifically, the invention determines the qualification of the voice information by carrying out audio quality inspection on the collected voice information so as to ensure the effectiveness of subsequent cleaning, improves the control accuracy of the big data cleaning process, simultaneously, preliminarily determines the qualification of the voice information according to the volume of the voice information, avoids the phenomenon of inaccurate text caused by voice conversion because the volume of the voice information is smaller than or larger than a preset volume range, and improves the control accuracy of the big data cleaning process based on artificial intelligence.
Specifically, under the condition of determining the qualification of the voice information for secondary judgment, determining the qualification of the voice information according to the comparison result of the noise level R of the voice information and the preset noise level R0;
if R is less than or equal to R0, determining that the voice information is qualified;
if R is more than R0, determining that the voice information is unqualified;
The preset noise level R0 is -50, but the value is not limited thereto, and a person skilled in the art can adjust it according to actual needs.
In the embodiment of the invention, the noise level is determined from the signal-to-noise ratio of the audio, and the preset noise level follows the general requirement in voice conferencing that the noise level of the audio be lower than -50 dBus.
Specifically, the invention determines the qualification of the voice information according to the noise level of the voice information and the preset noise level by carrying out noise analysis on the voice information with qualified volume, improves the accuracy of the voice conversion text process, and further improves the control accuracy of the big data cleaning process.
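The two-stage audio quality check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the inclusive treatment of the range bounds, and the example numbers are assumptions; only the thresholds (one third and nine tenths of the historical maximum volume, and a preset noise level of -50) come from the description.

```python
def preset_volume_range(historical_max_volume):
    """Preset volume range D0: one third to nine tenths of the historical maximum."""
    return (historical_max_volume / 3, historical_max_volume * 9 / 10)


def voice_is_qualified(avg_volume, noise_level, volume_range, preset_noise=-50.0):
    """Return True if the voice information passes both checks.

    Stage 1: the average volume D must lie inside the preset range D0
             (bounds treated as inclusive, which is an assumption).
    Stage 2: the noise level R must satisfy R <= R0.
    """
    low, high = volume_range
    if not (low <= avg_volume <= high):
        return False  # unqualified on volume alone, no secondary judgment
    return noise_level <= preset_noise


d0 = preset_volume_range(90)           # hypothetical historical maximum of 90
print(voice_is_qualified(60, -55, d0))  # volume in range, noise low enough
print(voice_is_qualified(60, -40, d0))  # volume in range, noise too high
```

Voice information that fails either stage would then be routed to volume standardization and noise reduction rather than to voice endpoint detection.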
Specifically, under the condition that the voice information is unqualified, the volume normalization processing and the noise reduction processing are determined to be performed on the voice information.
The volume normalization process in the embodiment of the invention includes, but is not limited to, normalizing the volume of the audio using audio processing software or a library. The noise reduction process includes, but is not limited to, filter-based noise reduction, spectral subtraction, and statistical approaches that estimate the relationship between the audio signal and the noise signal with a statistical model and a probabilistic algorithm and then reduce noise by maximum likelihood estimation or Bayesian inference.
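As one possible reading of the volume normalization step, the sketch below scales audio samples to a target RMS level. The description does not name a specific normalization technique, so RMS normalization, the target level, the clipping limit, and all function names here are assumptions chosen for illustration.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_volume(samples, target_rms=0.1, peak_limit=1.0):
    """Scale samples so their RMS matches target_rms, clipping at +/- peak_limit."""
    current = rms(samples)
    if current == 0:
        return list(samples)  # silent input: nothing to scale
    gain = target_rms / current
    return [max(-peak_limit, min(peak_limit, s * gain)) for s in samples]


loud = [0.5, -0.5, 0.5, -0.5]
print(rms(normalize_volume(loud)))  # close to the 0.1 target
```

In practice an audio library would usually perform this step; the pure-Python version is only meant to show the arithmetic.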
Specifically, under the condition that the voice information is qualified, determining to perform voice endpoint detection on the qualified voice information, and converting the voice data into text according to the starting and ending processes of the voice data determined by the voice endpoint detection.
Specifically, the invention obtains qualified voice information by applying volume standardization and noise reduction to unqualified voice information, which avoids wasted effort in later processing of unqualified voice information and reduces resource waste. By performing voice endpoint detection on the qualified voice information, the voice information is split into paragraphs and converted segment by segment, which improves the accuracy of the voice-to-text conversion, improves the control accuracy of the big data cleaning process, and yields high-accuracy cleaned text.
Specifically, under the condition of determining that the converted text is subjected to qualification analysis, determining the preliminary qualification of the text according to the comparison result of the converted text length L and the preset length range L0;
If L is not in L0, determining that the text is unqualified;
And if L is in L0, determining that the text is qualified.
In the embodiment of the present invention, the value range of the preset length range L0 is determined by the voice information length in voice interaction, the voice information length is calculated according to the average speech speed and the voice duration after voice endpoint detection, the minimum value of the preset length range L0 is four fifths of the voice information length, and the maximum value of the preset length range L0 is six fifths of the voice information length, but the value is not limited thereto, and the person skilled in the art can also adjust the value according to actual needs.
Specifically, the invention determines the accuracy of the voice-to-text conversion by analyzing the length of the converted text. If the converted text length is not within the preset length range, the converted text has lost data or contains data introduced by noise interference, and the voice-to-text conversion is determined to be unqualified. This improves the control accuracy of the voice-to-text conversion and, in turn, the accuracy of the big data cleaning process, so that high-accuracy text is obtained.
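The length-based qualification check can be sketched as below. The expected length is taken as average speech rate multiplied by voice duration, and the range as four fifths to six fifths of that value, as stated in the description. The function names, the unit of length (characters or words per second), and the inclusive bounds are assumptions.

```python
def preset_length_range(avg_speech_rate, voice_duration):
    """Preset length range L0 derived from the expected voice information length.

    avg_speech_rate: text units per second (unit is an assumption)
    voice_duration:  seconds of speech after voice endpoint detection
    """
    expected_length = avg_speech_rate * voice_duration
    return (expected_length * 4 / 5, expected_length * 6 / 5)


def text_is_qualified(text_length, length_range):
    """Return True if the converted text length L lies inside L0."""
    low, high = length_range
    return low <= text_length <= high


l0 = preset_length_range(avg_speech_rate=4, voice_duration=25)
print(l0)                          # (80.0, 120.0)
print(text_is_qualified(95, l0))   # True
print(text_is_qualified(70, l0))   # False: likely lost data
```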
Specifically, an adjustment to the speech endpoint detection process is determined under conditions in which the text fails.
Specifically, under the condition of determining the adjustment of the voice endpoint detection process, determining an adjustment mode according to the comparison result of the absolute value ΔL of the difference between the text length and the preset length range and the first preset difference ΔL0;
If ΔL ≤ ΔL0, adjusting the voice endpoint detection process in a first adjustment mode;
If ΔL > ΔL0, adjusting the voice endpoint detection process in a second adjustment mode;
The value of the first preset difference ΔL0 is one twentieth of the minimum value of the preset length range, but the value is not limited thereto, and a person skilled in the art can adjust the value according to actual needs.
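A minimal sketch of this mode selection follows. The description does not say how the difference between a length and a range is measured, so the sketch assumes it is the distance from the text length to the nearest bound of the range; the function name and the return labels are likewise illustrative.

```python
def choose_vad_adjustment(text_length, length_range):
    """Pick the voice endpoint detection adjustment mode for an unqualified text.

    Returns None if the text is qualified, "first" if the deviation is at most
    one twentieth of the range minimum, and "second" otherwise.
    """
    low, high = length_range
    if low <= text_length <= high:
        return None  # qualified text: no adjustment needed
    # Assumption: the deviation is the distance to the nearest bound of the range.
    deviation = low - text_length if text_length < low else text_length - high
    first_preset_difference = low / 20
    return "first" if deviation <= first_preset_difference else "second"


print(choose_vad_adjustment(77, (80, 120)))  # first  (deviation 3 <= 4)
print(choose_vad_adjustment(70, (80, 120)))  # second (deviation 10 > 4)
```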
Specifically, under the condition that the voice endpoint detection process is adjusted in a first adjustment mode, increasing the number X of devices for collecting voice information in the voice interaction process by a first adjustment coefficient K1;
and under the condition that the voice endpoint detection process is adjusted in a second adjustment mode, post-processing and optimizing are carried out on the voice paragraphs after voice endpoint detection.
In the embodiment of the invention, the devices for collecting voice information in the voice interaction process comprise microphones and sensors, and the post-processing and optimization comprises applying rules such as voice activity, duration limitation and continuity judgment to filter and optimize the detection results.
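The rule-based post-processing mentioned above might look like the sketch below, which merges detected speech segments separated by a short pause (a continuity rule) and discards segments that are too short (a duration rule). The description names the kinds of rules but gives no thresholds or procedure, so the function name, the 0.3 s minimum duration, and the 0.2 s maximum gap are all assumptions.

```python
def postprocess_segments(segments, min_duration=0.3, max_gap=0.2):
    """Filter and merge voice endpoint detection results.

    segments: list of (start, end) times in seconds, sorted by start time.
    """
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] <= max_gap:
            # Continuity rule: a short pause does not split a speech paragraph.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Duration rule: drop segments too short to be real speech.
    return [(s, e) for s, e in merged if e - s >= min_duration]


detected = [(0.0, 1.0), (1.1, 2.0), (3.0, 3.1), (5.0, 6.0)]
print(postprocess_segments(detected))  # [(0.0, 2.0), (5.0, 6.0)]
```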
Specifically, the first adjustment coefficient K1 is calculated by the following formula, and is set:
The adjusted number of devices for collecting voice information in the voice interaction process is set to X' = K1 × X.
Specifically, the invention adjusts the voice endpoint detection process to improve the accuracy of the voice-to-text conversion, which improves the accuracy of the text to be cleaned and therefore of the cleaned text. Increasing the number of devices that collect voice information during the voice interaction by the first adjustment coefficient allows the voice information to be captured more clearly and completely, which improves the accuracy of voice endpoint detection. Post-processing and optimizing the voice paragraphs after voice endpoint detection improves the integrity of the voice and thereby the accuracy of the voice-to-text conversion.
Specifically, under the condition that the text is qualified, determining a cleaning mode of the text according to the comparison result of the number Y of the sensitive words in the text and the preset number Y0;
If Y is less than Y0, determining to wash the text in a first washing mode;
If Y is more than or equal to Y0, determining to clean the text in a second cleaning mode;
The value of the preset number Y0 is one percent of the text length, but the value is not limited thereto, and the person skilled in the art can adjust the value according to actual needs.
Specifically, under the condition that the text is cleaned in the first cleaning mode, punctuation marks, blank characters, stop words and repeated speech segments are deleted from the text;
Under the condition that the text is cleaned in the second cleaning mode, punctuation marks, blank characters, stop words and repeated speech segments are deleted from the content that does not contain sensitive words, and all content containing sensitive words is deleted.
Specifically, the invention determines the cleaning mode of the text by analyzing the number of sensitive words in the text. If the number of sensitive words is smaller than the preset number, punctuation marks, blank characters, stop words and repeated speech segments are deleted from the text (the first cleaning mode). If the number of sensitive words is greater than or equal to the preset number, all content containing sensitive words is additionally deleted on top of the first cleaning mode. In this way, cleaned text of qualified quality is obtained.
The sensitive words in the embodiment of the invention include, but are not limited to, profanity, state secrets and enterprise confidential information.
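The two cleaning modes can be illustrated with the sketch below. It is an assumption-laden example for space-delimited text: text length is counted in words, "content containing sensitive words" is interpreted as whole sentences, repeated speech segments are treated as repeated sentences, and blank characters are collapsed to single separators rather than removed outright. The function names and sample word lists are hypothetical.

```python
import re
import string


def choose_cleaning_mode(text, sensitive_words):
    """Return "first" if the sensitive word count Y < Y0, otherwise "second"."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    count = sum(1 for w in words if w in sensitive_words)
    preset_number = len(words) / 100  # Y0: one percent of the text length
    return "first" if count < preset_number else "second"


def clean_text(text, sensitive_words, stop_words, mode):
    """Apply the first or second cleaning mode to the text."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    cleaned, seen = [], set()
    for sentence in sentences:
        # Delete punctuation marks and blank characters around each token.
        tokens = [t.strip(string.punctuation) for t in sentence.split()]
        tokens = [t for t in tokens if t]
        if mode == "second" and any(t.lower() in sensitive_words for t in tokens):
            continue  # second mode: drop content containing sensitive words
        tokens = [t for t in tokens if t.lower() not in stop_words]
        segment = " ".join(tokens)
        if segment and segment not in seen:  # delete repeated speech segments
            seen.add(segment)
            cleaned.append(segment)
    return " ".join(cleaned)


text = "Hello, the world. Hello, the world. This is secret stuff. Fine day!"
mode = choose_cleaning_mode(text, {"secret"})
print(mode)                                             # second
print(clean_text(text, {"secret"}, {"the", "is"}, mode))  # Hello world Fine day
```

For Chinese text, tokenization would need a word segmenter instead of whitespace splitting; that step is outside this sketch.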
Specifically, under the condition of carrying out quality evaluation on the cleaned text, determining adjustment on the text cleaning process according to the comparison result of the accuracy degree evaluation value P of the cleaned text and the preset accuracy degree evaluation value P0;
If P is more than or equal to P0, not adjusting the text cleaning process;
if P is less than P0, adjusting the text cleaning process;
The value of the preset accuracy evaluation value P0 is a historical average value of the accuracy evaluation value of the washing text, but the value is not limited thereto, and a person skilled in the art can adjust the value according to actual needs.
Specifically, the accuracy evaluation value P is calculated by the following formula, and is set:
wherein G represents the cleaned text data amount, G0 represents the preset text data amount, the value of the preset text data amount is the historical average value of the cleaned text data amount within the same voice duration range, H represents the semantic similarity between the cleaned text and the original text, H0 represents the preset semantic similarity, and the value of the preset semantic similarity is the historical average value of the semantic similarity between the cleaned text and the original text.
In the embodiment of the invention, the original text is the text converted from the voice data after voice endpoint detection, and the same voice duration range is nine tenths to eleven tenths of the voice duration of the text whose accuracy evaluation value is to be determined, but the value is not limited thereto, and a person skilled in the art can adjust it according to actual needs.
Specifically, the invention determines the accuracy evaluation value of the cleaned text by analyzing the cleaned text against the original text, and determines whether to adjust the text cleaning process by comparing this value with the preset accuracy evaluation value. If the accuracy evaluation value is smaller than the preset accuracy evaluation value, the cleaned text differs greatly from the original text and the text cleaning process needs to be adjusted. This improves the control accuracy of the big data cleaning process, so that high-accuracy text is obtained.
Specifically, when the text cleaning process is to be adjusted, the adjustment mode is determined by comparing the difference ΔP between the accuracy evaluation value and the preset accuracy evaluation value with the second preset difference ΔP0;
if ΔP ≤ ΔP0, adjusting the text cleaning process in a third adjustment mode;
if ΔP > ΔP0, adjusting the text cleaning process in a fourth adjustment mode;
the second preset difference ΔP0 is one eighth of the preset accuracy evaluation value; this value is not limiting, and a person skilled in the art can adjust it according to actual needs.
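The mode selection above can be sketched as follows. This is a hedged illustration under stated assumptions: the function name `select_adjustment_mode` and the string labels are hypothetical, and ΔP is taken here as P0 − P (a positive shortfall), which the text implies but does not state explicitly. Only the one-eighth ratio and the two comparisons come from the description.

```python
def select_adjustment_mode(p, p0, ratio=1 / 8):
    """Pick the adjustment mode for the text cleaning process.

    Returns None when P >= P0 (no adjustment is triggered), "third" when the
    shortfall is at most the second preset difference, and "fourth" otherwise.
    """
    if p >= p0:
        return None
    delta_p = p0 - p       # assumption: the difference is measured as P0 - P
    delta_p0 = ratio * p0  # second preset difference: one eighth of P0
    return "third" if delta_p <= delta_p0 else "fourth"


mode_small_gap = select_adjustment_mode(p=0.75, p0=0.8)  # shortfall within P0 / 8
mode_large_gap = select_adjustment_mode(p=0.5, p0=0.8)   # shortfall above P0 / 8
```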
Specifically, under the condition that the text cleaning process is adjusted in a third adjustment manner, adjusting the preset number by a second adjustment coefficient K2;
and under the condition that the text cleaning process is adjusted in a fourth adjustment mode, adjusting a preset length range and a preset volume range by a third adjustment coefficient K3.
Specifically, the second adjustment coefficient K2 is calculated by the following formula, and is set:
the third adjustment coefficient K3 is calculated by the following formula, and is set:
Setting the adjusted preset number to Y0' = K2 × Y0;
setting the adjusted preset length range to L0' = K3 × L0;
setting the adjusted preset volume range to D0' = K3 × D0.
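The three update rules can be sketched as below. The formulas for K2 and K3 are not reproduced in this text, so the coefficients are passed in as plain numbers here. The `CleaningParams` container, the function name, and the choice to scale both endpoints of each range by K3 are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CleaningParams:
    preset_number: float  # Y0: sensitive-word count threshold
    length_range: tuple   # L0: (min, max) acceptable text length
    volume_range: tuple   # D0: (min, max) acceptable average volume


def scale_range(rng, k):
    """Scale both endpoints of a range by k (an assumption of this sketch)."""
    return tuple(k * x for x in rng)


def apply_adjustment(params, mode, k2=1.0, k3=1.0):
    """Return new parameters after a third- or fourth-mode adjustment."""
    if mode == "third":
        # Y0' = K2 * Y0
        return replace(params, preset_number=k2 * params.preset_number)
    if mode == "fourth":
        # L0' = K3 * L0 and D0' = K3 * D0
        return replace(params,
                       length_range=scale_range(params.length_range, k3),
                       volume_range=scale_range(params.volume_range, k3))
    raise ValueError(f"unknown adjustment mode: {mode!r}")


# Made-up starting values and coefficients, for illustration only
base = CleaningParams(preset_number=10, length_range=(20, 200), volume_range=(40.0, 80.0))
after_third = apply_adjustment(base, "third", k2=1.5)
after_fourth = apply_adjustment(base, "fourth", k3=1.5)
```

Returning a new frozen object rather than mutating in place keeps the previous presets available, which may be convenient if historical averages are recomputed later.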
Specifically, the invention determines the adjustment mode by comparing the difference between the accuracy evaluation value and the preset accuracy evaluation value with the second preset difference. If the difference is smaller than or equal to the second preset difference, the preset number of sensitive words is adjusted by the second adjustment coefficient, which reduces the proportion of text deleted and so increases the integrity of the text. If the difference is larger than the second preset difference, the qualification judgment parameters of the voice information are adjusted by the third adjustment coefficient, which improves the accuracy of the voice information. In this way, the control accuracy of the big data cleaning process is improved and high-accuracy cleaned text is obtained.
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.
The foregoing description is only of the preferred embodiments of the invention and is not intended to limit the invention; various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A big data cleaning method based on artificial intelligence, characterized by comprising the following steps:
collecting voice information in a voice interaction process, and performing an audio quality check on the voice information to determine the qualification of the voice information;
performing voice endpoint detection on qualified voice information, and determining a starting process and an ending process of voice data through the voice endpoint detection so as to convert the voice data into text;
performing qualification analysis on the converted text, and determining the qualification of the text according to the length of the converted text;
performing sensitive word detection on the preliminarily qualified text so as to determine a cleaning mode of the text;
performing quality evaluation on the cleaned text, and determining adjustment of the text cleaning process according to the evaluation result.
2. The artificial intelligence based big data cleaning method of claim 1, wherein determining the qualification of the voice information includes determining that the voice information is unqualified according to a comparison result that the average volume of the voice information is not within a preset volume range, and determining that a secondary determination of the qualification of the voice information is to be performed according to a comparison result that the average volume of the voice information is within the preset volume range.
3. The artificial intelligence based big data cleaning method of claim 2, wherein, under the condition that the secondary determination of the qualification of the voice information is performed, the voice information is determined to be qualified according to a comparison result that the noise level of the voice information is less than or equal to a preset noise level, and the voice information is determined to be unqualified according to a comparison result that the noise level of the voice information is greater than the preset noise level.
4. The artificial intelligence based big data cleaning method according to claim 2 or 3, wherein, under the condition that the voice information is unqualified, volume normalization processing and noise reduction processing are performed on the voice information; and under the condition that the voice information is qualified, voice endpoint detection is performed on the voice information, and a starting process and an ending process of the voice data are determined through the voice endpoint detection so as to convert the voice data into text.
5. The artificial intelligence based big data cleaning method of claim 4, wherein under the condition of determining that the converted text is subjected to qualification analysis, determining that the text is unqualified according to a comparison result that the converted text length is not in a preset length range, and determining that the text is qualified according to a comparison result that the converted text length is in the preset length range.
6. The artificial intelligence based big data cleaning method of claim 5, wherein the number of devices for collecting the voice information in the voice interaction process is increased by the first adjustment coefficient according to the comparison result that the absolute value of the difference between the text length and the preset length range is smaller than or equal to the first preset difference under the condition that the text is unqualified.
7. The big data cleaning method based on artificial intelligence according to claim 5, wherein under the condition that the text is qualified, determining to delete punctuation marks, blank characters, stop words and repeated speech segments of the text according to the comparison result that the number of the sensitive words in the text is smaller than the preset number, and determining to delete punctuation marks, blank characters, stop words and repeated speech segments of the text according to the comparison result that the number of the sensitive words in the text is larger than or equal to the preset number, and simultaneously deleting all content parts containing the sensitive words.
8. The artificial intelligence based big data cleaning method according to claim 1, wherein the text cleaning process adjustment is determined according to a comparison result that the accuracy evaluation value of the cleaned text is smaller than a preset accuracy evaluation value under the condition that the quality evaluation of the cleaned text is determined.
9. The artificial intelligence based big data cleaning method of claim 8, wherein the preset number is determined to be adjusted by a second adjustment coefficient according to a comparison result that a difference between the accuracy evaluation value and the preset accuracy evaluation value is less than or equal to a second preset difference value under the condition of adjusting the text cleaning process.
10. The artificial intelligence based big data cleaning method of claim 8, wherein the third adjustment coefficient is used to adjust the preset length range and the preset volume range according to a comparison result that the difference between the accuracy evaluation value and the preset accuracy evaluation value is greater than the second preset difference value under the condition of adjusting the text cleaning process.
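
For illustration only, the decision logic recited in claims 2, 3 and 7 can be sketched as follows. This is not the claimed implementation: the function names, the sentence-level notion of a "content part", the regex-based tokenization, and all sample values are assumptions. Sensitive words and stop words are assumed to be lowercase, and tokens within a segment are rejoined with a single space for readability.

```python
import re


def voice_is_qualified(avg_volume, noise_level, volume_range, max_noise):
    """Two-stage audio check: volume range first, then noise level."""
    low, high = volume_range
    if not (low <= avg_volume <= high):
        return False                   # unqualified on volume (claim 2)
    return noise_level <= max_noise    # secondary determination (claim 3)


def clean_text(text, sensitive_words, stop_words, preset_number):
    """Clean text according to the sensitive-word count (claim 7).

    Always drops punctuation, stop words and repeated segments; when the
    count reaches `preset_number`, also drops every sentence that contains
    a sensitive word.
    """
    lowered = text.lower()
    sentences = [s for s in re.split(r"[.!?]+", lowered) if s.strip()]
    hits = sum(lowered.count(w) for w in sensitive_words)
    if hits >= preset_number:
        sentences = [s for s in sentences
                     if not any(w in s for w in sensitive_words)]
    cleaned = []
    for sentence in sentences:
        tokens = [t for t in re.findall(r"\w+", sentence) if t not in stop_words]
        segment = " ".join(tokens)
        if segment and segment not in cleaned:  # skip repeated segments
            cleaned.append(segment)
    return cleaned


sample = "Hello, hello world. Hello, hello world! The secret code is 42. Fine."
strict = clean_text(sample, ["secret"], {"the", "is"}, preset_number=1)
lenient = clean_text(sample, ["secret"], {"the", "is"}, preset_number=2)
```

With the lower threshold, the sentence containing the sensitive word is removed entirely; with the higher threshold it is kept and only lightly cleaned.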
CN202410564809.0A 2024-05-09 2024-05-09 Big data cleaning method based on artificial intelligence Active CN118155663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410564809.0A CN118155663B (en) 2024-05-09 2024-05-09 Big data cleaning method based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410564809.0A CN118155663B (en) 2024-05-09 2024-05-09 Big data cleaning method based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN118155663A true CN118155663A (en) 2024-06-07
CN118155663B CN118155663B (en) 2024-08-09

Family

ID=91298931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410564809.0A Active CN118155663B (en) 2024-05-09 2024-05-09 Big data cleaning method based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN118155663B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140229456A1 (en) * 2013-02-12 2014-08-14 International Business Machines Corporation Data quality assessment
CN105513588A (en) * 2014-09-22 2016-04-20 联想(北京)有限公司 Information processing method and electronic equipment
CN107799124A (en) * 2017-10-12 2018-03-13 安徽咪鼠科技有限公司 A kind of VAD detection methods applied to intelligent sound mouse
CN113722416A (en) * 2021-08-24 2021-11-30 苏州浪潮智能科技有限公司 Data cleaning method, device and equipment and readable storage medium
CN114171029A (en) * 2021-12-07 2022-03-11 广州虎牙科技有限公司 Audio recognition method and device, electronic equipment and readable storage medium
CN115687321A (en) * 2022-10-29 2023-02-03 慕学星凡(成都)科技有限公司 Big data cleaning method and system based on artificial intelligence
CN115831155A (en) * 2021-09-16 2023-03-21 腾讯科技(深圳)有限公司 Audio signal processing method and device, electronic equipment and storage medium
US20230260535A1 (en) * 2020-07-13 2023-08-17 Zoundream Ag A computer-implemented method of providing data for an automated baby cry assessment
CN117992439A (en) * 2024-01-24 2024-05-07 中国科学院香港创新研究院人工智能与机器人创新中心有限公司 Text cleaning method, device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵斐, 徐勇, 成立新: "PESQ and Its Applications" (PESQ及其应用), 电子设计应用, no. 03, 1 March 2003 (2003-03-01) *

Also Published As

Publication number Publication date
CN118155663B (en) 2024-08-09

Similar Documents

Publication Publication Date Title
US6959276B2 (en) Including the category of environmental noise when processing speech signals
CN107273585B (en) On-load tap-changer fault detection method and device
US8239203B2 (en) Adaptive confidence thresholds for speech recognition
US20070129941A1 (en) Preprocessing system and method for reducing FRR in speaking recognition
CN107305774A (en) Speech detection method and device
CN112669851A (en) Voice recognition method and device, electronic equipment and readable storage medium
CN109036412A (en) voice awakening method and system
CN113628627B (en) Electric power industry customer service quality inspection system based on structured voice analysis
US20060200346A1 (en) Speech quality measurement based on classification estimation
CN113823293A (en) Speaker recognition method and system based on voice enhancement
CN106483550A (en) A kind of simulation spectrum curve emulation mode
CN118155663B (en) Big data cleaning method based on artificial intelligence
CN113782036B (en) Audio quality assessment method, device, electronic equipment and storage medium
CN115394318A (en) Audio detection method and device
CN110299133B (en) Method for judging illegal broadcast based on keyword
CN117457017B (en) Voice data cleaning method and electronic equipment
CN118101998A (en) Live broadcast risk behavior monitoring and early warning system and method
CN113782051B (en) Broadcast effect classification method and system, electronic equipment and storage medium
CN115937040A (en) Image denoising method for assisting file digitization
CN118410201B (en) Voice data classified storage method and system based on Internet of things platform
CN114915845A (en) System and method for predicting IPTV user declaration
CN116192815B (en) Online live broadcast and voice interaction job conference management method for staff members
CN114070601B (en) LDoS attack detection method based on EMDR-WE algorithm
CN112019786B (en) Intelligent teaching screen recording method and system
CN117457016B (en) Method and system for filtering invalid voice recognition data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant