CN112489619A

CN112489619A - Voice processing method, terminal device and storage medium

Info

Publication number: CN112489619A
Application number: CN202011334384.2A
Authority: CN
Inventors: 刘沙沙
Original assignee: Shanghai Chuanying Information Technology Co Ltd
Current assignee: Shanghai Chuanying Information Technology Co Ltd
Priority date: 2020-11-24
Filing date: 2020-11-24
Publication date: 2021-03-12

Abstract

The application discloses a voice processing method, a terminal device and a storage medium. The voice processing method comprises the following steps: S11, acquiring preset characteristic information of a user; and S12, determining the target voice style during voice broadcasting according to the preset characteristic information. With this method, a suitable voice style can be switched to automatically according to the current state of the user, rich voice styles can be provided, and the voice playing quality can be improved.

Description

Voice processing method, terminal device and storage medium

Technical Field

The present application relates to the field of speech processing and synthesis technologies, and in particular, to a speech processing method, and a terminal device and a readable storage medium based on the speech processing method.

Background

In recent years, with the continuous development of the online audio and video market, speech processing and speech synthesis technologies have been widely applied in people's daily lives, for example in online children's reading, online novels, online review, and online news. However, most current voice playback based on speech processing and synthesis technologies is mechanical, emotionless, flat, and monotonous. It is limited to converting text into speech that people can understand, lacks rich voice styles, and cannot automatically switch to an appropriate voice style according to the current state of the user, so the quality of voice playback cannot be further improved.

The foregoing description is provided for general background information and is not admitted to be prior art.

Disclosure of Invention

In view of this, the present application provides a voice processing method, a terminal device and a storage medium to solve the problem that an appropriate voice style cannot be selected for voice playing according to the state of the user.

The application provides a voice processing method, which comprises the following steps:

S11, acquiring preset characteristic information of a user;

and S12, determining the target voice style during voice broadcasting according to the preset characteristic information of the user.

Optionally, the preset feature information includes at least one of: work and rest information, situation information, emotional characteristics, character characteristics, gender and age.

Optionally, the step of S11 includes at least one of:

acquiring preset characteristic information according to the selection operation and/or the input operation;

acquiring preset characteristic information according to historical habits and/or sensors;

and acquiring voice data of a user, and acquiring preset characteristic information according to the voice data.

Optionally, before the step S12, the method further includes: selecting a matched document to be played according to the target voice style, identifying and extracting text content in the document to be played, and/or performing voice synthesis on the text content to generate a voice document with the target voice style.

Optionally, the method further comprises: recognizing and extracting text contents of resources to be played;

after the step of S12, the method includes: and performing voice synthesis on the text content to generate a voice document with a target voice style.

Optionally, after the step of S12, the method further includes:

selecting an adaptive document to be played according to preset characteristic information;

recognizing and extracting text contents in a document to be played; and/or performing voice synthesis on the text content to generate a voice document with a target voice style.

Optionally, before the step S12, the method further includes: acquiring preset characteristic information of a document to be played;

judging whether the preset characteristic information of the document to be played conflicts with the preset characteristic information of the user or not;

if not, executing the step S12; and/or,

and if so, executing a preset strategy.

Optionally, the obtaining of the preset feature information of the document to be played includes at least one of:

acquiring preset characteristic information of a document to be played according to a preset classification label of the document;

and determining preset characteristic information according to the text content of the document to be played.

Optionally, the preset policy includes at least one of:

executing the step of S12;

determining a target voice style according to the selection instruction;

determining a target voice style according to preset characteristic information of a document to be played;

and taking the default voice style as a target voice style during voice playing.
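The conflict check and the preset policy described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the application does not define how a conflict is detected, so the rule used here (a shared dimension carrying different values) is an assumption, and the names `Policy`, `conflicts`, `resolve_style` and `style_of` are hypothetical.

```python
from enum import Enum
from typing import Callable, Dict, Optional

Features = Dict[str, str]


class Policy(Enum):
    """The four preset policy options listed above (names are hypothetical)."""
    EXECUTE_S12 = "execute S12 with the user's features"
    FOLLOW_SELECTION = "use the style named in a selection instruction"
    FOLLOW_DOCUMENT = "use the document's features"
    USE_DEFAULT = "use the default voice style"


def conflicts(user: Features, doc: Features) -> bool:
    """Assumed conflict rule: a dimension present in both has different values."""
    shared = user.keys() & doc.keys()
    return any(user[d] != doc[d] for d in shared)


def resolve_style(
    user: Features,
    doc: Features,
    style_of: Callable[[Features], str],
    policy: Policy,
    selected: Optional[str] = None,
    default: str = "neutral",
) -> str:
    """Return the target voice style, applying the policy only on conflict."""
    if not conflicts(user, doc):
        return style_of(user)  # no conflict: execute S12
    if policy is Policy.EXECUTE_S12:
        return style_of(user)
    if policy is Policy.FOLLOW_SELECTION and selected is not None:
        return selected
    if policy is Policy.FOLLOW_DOCUMENT:
        return style_of(doc)
    return default


def style_of(features: Features) -> str:
    """Hypothetical feature-to-style mapping used only for this example."""
    table = {"excited": "lively", "sad": "gentle"}
    return table.get(features.get("emotion", ""), "neutral")


user = {"emotion": "excited"}
doc = {"emotion": "sad"}
print(resolve_style(user, doc, style_of, Policy.FOLLOW_DOCUMENT))  # gentle
```

Keeping the policy as an explicit value makes it easy to let either the user or the terminal device choose which of the listed options applies.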

The application further provides a voice processing method, which comprises the following steps:

S21, acquiring preset characteristic information of the document to be played;

and S22, determining the target voice style during voice broadcasting according to the preset characteristic information of the document to be played.

Optionally, the step of S21 includes at least one of:

Optionally, before the step S22, the method further includes: acquiring preset characteristic information of a user;

judging whether the document to be played conflicts with preset characteristic information of a user or not;

if not, executing the step S22; and/or,

and if so, executing a preset strategy.

Optionally, the preset policy includes at least one of:

executing the step of S22;

determining a target voice style according to the selection instruction;

determining a target voice style according to preset characteristic information of a user;

The terminal device provided by the application comprises a memory and a processor, wherein the memory stores a voice processing program, and the voice processing program, when executed by the processor, implements the steps of any one of the voice processing methods described above.

The present application provides a readable storage medium storing a computer program for implementing the steps of any one of the above-mentioned speech processing methods when executed by a processor.

As described above, the voice processing method, the terminal device and the storage medium of the present application determine the target voice style during voice broadcasting according to the preset feature information of the user, and the preset feature information identifies the current state of the user, so that not only can the adaptive voice style be automatically switched according to the current state of the user, but also rich voice styles can be provided, which is beneficial to improving the quality of voice broadcasting.
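As a rough illustration of how steps S11 and S12 might fit together, the following Python sketch maps a user's preset feature information to a target voice style through a small rule table. The application does not prescribe a particular mapping, so the rule table, the style names and the function names (`STYLE_RULES`, `acquire_preset_features`, `determine_target_style`) are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical rule table: (dimension, value) -> voice style. Earlier rules win.
STYLE_RULES: List[Tuple[Tuple[str, str], str]] = [
    (("emotion", "depressed"), "soothing"),
    (("routine", "sleeping"), "soft"),
    (("routine", "exercising"), "energetic"),
]
DEFAULT_STYLE = "neutral"


def acquire_preset_features(selected: Dict[str, str]) -> Dict[str, str]:
    """S11 (sketch): normalise the feature values supplied for the user."""
    return {dim.strip().lower(): val.strip().lower() for dim, val in selected.items()}


def determine_target_style(features: Dict[str, str]) -> str:
    """S12 (sketch): return the style of the first rule matching the features."""
    for (dim, val), style in STYLE_RULES:
        if features.get(dim) == val:
            return style
    return DEFAULT_STYLE


features = acquire_preset_features({"Routine": "Exercising", "Emotion": "excited"})
print(determine_target_style(features))  # energetic
```

A rule table is only one possible design; a learned classifier or a user-editable profile could fill the same role.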

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic diagram of a hardware structure of a terminal device for implementing various embodiments of the present application;

FIG. 2 is a communication network system architecture diagram according to an embodiment of the present application;

FIG. 3 is a flowchart illustrating a speech processing method according to a first embodiment of the present application;

FIG. 4 is a schematic view of an operation interface according to an embodiment of obtaining preset feature information of a user;

FIG. 5 is a schematic view of an operation interface of another embodiment for acquiring preset feature information of a user according to the present application;

FIG. 6 is a schematic view of an operation interface of another embodiment of the present application for acquiring preset feature information of a user;

FIG. 7 is a schematic diagram of an exemplary embodiment of an interface for determining a target speech style;

FIG. 8 is a flowchart illustrating a speech processing method according to a second embodiment of the present application;

FIG. 9 is a flowchart illustrating a speech processing method according to a third embodiment of the present application;

FIG. 10 is a flowchart illustrating a speech processing method according to a fourth embodiment of the present application;

FIG. 11 is a flowchart illustrating a speech processing method according to a fifth embodiment of the present application;

FIG. 12 is a flowchart illustrating a speech processing method according to a sixth embodiment of the present application.

With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element. Further, similarly named components, features, or elements in different embodiments of the disclosure may have the same meaning or different meanings; the particular meaning should be determined by its interpretation in the embodiment or further by the context of the embodiment.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope herein. The word "if" as used herein may be interpreted as "at …" or "when …" or "in response to a determination", depending on the context. Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used in this specification, specify the presence of stated features, steps, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, items, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition will occur only when a combination of elements, functions, steps or operations is inherently mutually exclusive in some way.

It should be understood that, although the steps in the flowcharts in the embodiments of the present application are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or multiple stages that are not necessarily performed at the same time but may be performed at different times and in different orders, and that may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

It should be noted that step numbers such as S11 and S12 are used herein for the purpose of more clearly and briefly describing the corresponding content, and do not constitute a substantial limitation on the sequence, and those skilled in the art may perform S12 first and then S11 in specific implementation, which should be within the scope of the present application.

It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for the convenience of description of the present application, and have no specific meaning in themselves. Thus, "module", "component" or "unit" may be used mixedly.

The terminal device may be implemented in various forms. For example, the terminal devices described in the present application may include mobile terminals such as a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a Personal Digital Assistant (PDA), a Portable Media Player (PMP), a navigation device, a wearable device, a smart band, a pedometer, and the like, and fixed terminals such as a Digital TV, a desktop computer, and the like.

The following description will be given taking a mobile terminal as an example, and it will be understood by those skilled in the art that the configuration according to the embodiment of the present application can be applied to a fixed type terminal in addition to elements particularly used for mobile purposes.

Referring to fig. 1, which is a schematic diagram of a hardware structure of a mobile terminal for implementing various embodiments of the present application, the mobile terminal 100 may include: an RF (Radio Frequency) unit 101, a WiFi module 102, an audio output unit 103, an A/V (audio/video) input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, a processor 110, and a power supply 111. Those skilled in the art will appreciate that the mobile terminal structure shown in fig. 1 does not limit the mobile terminal, which may include more or fewer components than those shown, may combine some components, or may use a different arrangement of components.

The following describes each component of the mobile terminal in detail with reference to fig. 1:

The radio frequency unit 101 may be configured to receive and transmit signals during information transmission and reception or during a call. Specifically, it receives downlink information from a base station and forwards it to the processor 110 for processing, and it transmits uplink data to the base station. Typically, the radio frequency unit 101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 101 can also communicate with a network and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System for Mobile communications), GPRS (General Packet Radio Service), CDMA2000 (Code Division Multiple Access 2000), WCDMA (Wideband Code Division Multiple Access), TD-SCDMA (Time Division-Synchronous Code Division Multiple Access), FDD-LTE (Frequency Division Duplex Long Term Evolution), and TDD-LTE (Time Division Duplex Long Term Evolution).

WiFi belongs to short-distance wireless transmission technology, and the mobile terminal can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 102, and provides wireless broadband internet access for the user. Although fig. 1 shows the WiFi module 102, it is understood that it does not belong to the essential constitution of the mobile terminal, and may be omitted entirely as needed within the scope not changing the essence of the invention.

The audio output unit 103 may convert audio data received by the radio frequency unit 101 or the WiFi module 102 or stored in the memory 109 into an audio signal and output as sound when the mobile terminal 100 is in a call signal reception mode, a call mode, a recording mode, a voice recognition mode, a broadcast reception mode, or the like. Also, the audio output unit 103 may also provide audio output related to a specific function performed by the mobile terminal 100 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 103 may include a speaker, a buzzer, and the like.

The A/V input unit 104 is used to receive audio or video signals. The A/V input unit 104 may include a Graphics Processing Unit (GPU) 1041 and a microphone 1042. The graphics processor 1041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 106. The image frames processed by the graphics processor 1041 may be stored in the memory 109 (or other storage medium) or transmitted via the radio frequency unit 101 or the WiFi module 102. The microphone 1042 may receive sound in a phone call mode, a recording mode, a voice recognition mode, or the like, and may process such sound into audio data. In the phone call mode, the processed audio (voice) data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 101 and then output. The microphone 1042 may implement various types of noise cancellation (or suppression) algorithms to cancel (or suppress) noise or interference generated in the course of receiving and transmitting audio signals.

The mobile terminal 100 also includes at least one sensor 105, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor that may optionally adjust the brightness of the display panel 1061 according to the brightness of ambient light, and a proximity sensor that may turn off the display panel 1061 and/or the backlight when the mobile terminal 100 is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.

The display unit 106 is used to display information input by a user or information provided to the user. The Display unit 106 may include a Display panel 1061, and the Display panel 1061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.

The user input unit 107 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile terminal. Specifically, the user input unit 107 may include a touch panel 1071 and other input devices 1072. The touch panel 1071, also referred to as a touch screen, may collect a touch operation performed by a user on or near the touch panel 1071 (e.g., an operation performed by the user on or near the touch panel 1071 using a finger, a stylus, or any other suitable object or accessory), and drive a corresponding connection device according to a predetermined program. The touch panel 1071 may include two parts of a touch detection device and a touch controller. Optionally, the touch detection device detects a touch orientation of a user, detects a signal caused by a touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 110, and can receive and execute commands sent by the processor 110. In addition, the touch panel 1071 may be implemented in various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. In addition to the touch panel 1071, the user input unit 107 may include other input devices 1072. In particular, other input devices 1072 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like, and are not limited to these specific examples.

Further, the touch panel 1071 may cover the display panel 1061, and when the touch panel 1071 detects a touch operation thereon or nearby, the touch panel 1071 transmits the touch operation to the processor 110 to determine the type of the touch event, and then the processor 110 provides a corresponding visual output on the display panel 1061 according to the type of the touch event. Although the touch panel 1071 and the display panel 1061 are shown in fig. 1 as two separate components to implement the input and output functions of the mobile terminal, in some embodiments, the touch panel 1071 and the display panel 1061 may be integrated to implement the input and output functions of the mobile terminal, and is not limited herein.

The interface unit 108 serves as an interface through which at least one external device is connected to the mobile terminal 100. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 108 may be used to receive input (e.g., data information, power, etc.) from external devices and transmit the received input to one or more elements within the mobile terminal 100 or may be used to transmit data between the mobile terminal 100 and external devices.

The memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a program storage area and a data storage area, and optionally, the program storage area may store an operating system, an application program (such as a sound playing function, an image playing function, and the like) required by at least one function, and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 109 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.

The processor 110 is the control center of the mobile terminal. It connects the various parts of the entire mobile terminal using various interfaces and lines, and performs the various functions of the mobile terminal and processes data by running or executing software programs and/or modules stored in the memory 109 and calling data stored in the memory 109, thereby monitoring the mobile terminal as a whole. The processor 110 may include one or more processing units. Preferably, the processor 110 may integrate an application processor and a modem processor; optionally, the application processor mainly handles the operating system, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor may also not be integrated into the processor 110.

The mobile terminal 100 may further include a power supply 111 (e.g., a battery) for supplying power to the various components. Preferably, the power supply 111 may be logically connected to the processor 110 via a power management system, so that charging, discharging, and power consumption are managed through the power management system.

Although not shown in fig. 1, the mobile terminal 100 may further include a bluetooth module or the like, which is not described in detail herein.

In order to facilitate understanding of the embodiments of the present application, a communication network system on which the mobile terminal of the present application is based is described below.

Referring to fig. 2, fig. 2 is an architecture diagram of a communication Network system according to an embodiment of the present disclosure, where the communication Network system is an LTE system of a universal mobile telecommunications technology, and the LTE system includes a UE (User Equipment) 201, an E-UTRAN (Evolved UMTS Terrestrial Radio Access Network) 202, an EPC (Evolved Packet Core) 203, and an IP service 204 of an operator, which are in communication connection in sequence.

Specifically, the UE201 may be the terminal 100 described above, and is not described herein again.

The E-UTRAN202 includes an eNodeB2021, other eNodeBs 2022, and the like. Optionally, the eNodeB2021 may be connected with the other eNodeBs 2022 through a backhaul (e.g., an X2 interface), the eNodeB2021 is connected to the EPC203, and the eNodeB2021 may provide the UE201 with access to the EPC203.

The EPC203 may include an MME (Mobility Management Entity) 2031, an HSS (Home Subscriber Server) 2032, other MMEs 2033, an SGW (Serving Gateway) 2034, a PGW (PDN Gateway) 2035, a PCRF (Policy and Charging Rules Function) 2036, and the like. Optionally, the MME2031 is a control node that handles signaling between the UE201 and the EPC203, providing bearer and connection management. The HSS2032 is used to provide registers to manage functions such as a home location register (not shown) and holds subscriber-specific information about service characteristics, data rates, etc. All user data may be sent through the SGW2034, the PGW2035 may provide IP address assignment for the UE201 and other functions, and the PCRF2036 is a policy and charging control policy decision point for traffic data flows and IP bearer resources, which selects and provides available policy and charging control decisions for a policy and charging enforcement function (not shown).

The IP services 204 may include the internet, intranets, IMS (IP Multimedia Subsystem), or other IP services, among others.

Although the LTE system is described as an example, it should be understood by those skilled in the art that the present application is not limited to the LTE system, but may also be applied to other wireless communication systems, such as GSM, CDMA2000, WCDMA, TD-SCDMA, and future new network systems.

Based on the above mobile terminal hardware structure and communication network system, various embodiments of the present application are provided.

Please refer to fig. 3, which is a flowchart illustrating a speech processing method according to an embodiment of the present application. The method may be executed by the aforementioned terminal device 100 and may include the following steps S11 and S12.

S11, acquiring preset characteristic information of the user.

The preset feature information may be regarded as information that can indicate the state of the user. The type of the preset feature information may be set by the user according to actual use, or may be set by default by the terminal device. The reference dimensions, whether set by the user or by default by the terminal device, include but are not limited to at least one of work and rest information, situation information, emotional characteristics, character characteristics, gender and age.

The work and rest information may indicate the times at which the user performs activities such as sleeping, exercising, eating meals, and working. The context information may indicate the current behavior of the user and the environmental conditions under which that behavior occurs, such as indoors, outdoors, weather, running, and walking. The emotional characteristics indicate the user's current mood, such as excited, depressed, or angry. The personality characteristics indicate the personality of the user, including but not limited to active, steady, humorous, lovely, and earnest. Because the voice styles of people in different age groups differ greatly, age may be divided into four stages: babies, young children, middle-aged people, and old people.
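The dimensions described above could be held in a simple record in which every field is optional, reflecting that the preset feature information includes "at least one of" the dimensions. The following Python sketch is only an assumption about one possible representation; the class name `PresetFeatures` and its field names are hypothetical and do not come from the application.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class PresetFeatures:
    """One optional field per dimension named in the description."""
    routine: Optional[str] = None      # work and rest information, e.g. "sleeping"
    context: Optional[str] = None      # situation information, e.g. "outdoors"
    emotion: Optional[str] = None      # emotional characteristic, e.g. "excited"
    personality: Optional[str] = None  # character characteristic, e.g. "active"
    gender: Optional[str] = None
    age_stage: Optional[str] = None    # one of the age stages

    def provided(self) -> Dict[str, str]:
        """Return only the dimensions that have actually been set."""
        return {k: v for k, v in asdict(self).items() if v is not None}

    def is_valid(self) -> bool:
        """At least one dimension must be present."""
        return len(self.provided()) >= 1


profile = PresetFeatures(context="outdoors", emotion="excited")
print(profile.provided())  # {'context': 'outdoors', 'emotion': 'excited'}
```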

The manner of obtaining the preset feature information includes, but is not limited to, at least one of the following:

First, preset feature information is acquired according to a selection operation and/or an input operation. That is, the terminal device acquires the preset feature information according to a manual operation by the user.

In one embodiment, the terminal device may provide an operation interface, as shown in fig. 4, where the operation interface displays several dimensional categories, such as the context information, the character feature and the age shown in the figure.

When there are too many dimension types to display completely on a single screen, the terminal device may use multiple operation interfaces, each of which displays one dimension type and its specific options. Referring to fig. 5, work and rest information, context information, emotional characteristics, and personality characteristics, together with their specific options, are each displayed on a separate operation interface. After the user completes the selection for one dimension type, the terminal device switches to the operation interface of the next dimension type and its specific options.

The preset characteristic information of the user is obtained once the user clicks to select specific options for some or all of the dimension types. For example, in the specific scenario shown in fig. 5, if the user selects sports as the work and rest information, outdoor as the context information, excited as the emotional characteristic, and active as the personality characteristic, the preset characteristic information recorded by the terminal device is: sports, outdoor, excited, active.
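The screen-by-screen selection flow can be sketched as a loop over dimension types, where each iteration stands for one operation interface. This is a minimal sketch, not the application's implementation: the option lists are illustrative, and the `choose` callback is a hypothetical stand-in for the user's click on the touch screen.

```python
from typing import Callable, Dict, List, Optional

# Illustrative options per dimension type; one entry corresponds to one screen.
SCREENS: Dict[str, List[str]] = {
    "work and rest information": ["sleeping", "sports", "working"],
    "context information": ["indoor", "outdoor"],
    "emotional characteristics": ["excited", "depressed", "angry"],
    "personality characteristics": ["active", "steady", "humorous"],
}


def run_selection(
    screens: Dict[str, List[str]],
    choose: Callable[[str, List[str]], Optional[str]],
) -> List[str]:
    """Show one screen per dimension type and record the option picked on each."""
    selected: List[str] = []
    for dimension, options in screens.items():
        choice = choose(dimension, options)  # None means the user skipped it
        if choice is None:
            continue
        if choice not in options:
            raise ValueError(f"{choice!r} is not an option for {dimension}")
        selected.append(choice)
    return selected


# Scripted choices reproducing the example scenario in the text.
scripted = {
    "work and rest information": "sports",
    "context information": "outdoor",
    "emotional characteristics": "excited",
    "personality characteristics": "active",
}
print(run_selection(SCREENS, lambda dim, opts: scripted.get(dim)))
# ['sports', 'outdoor', 'excited', 'active']
```

Allowing `choose` to return `None` covers the case where the user selects options for only some of the dimension types.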

In another embodiment, referring to fig. 6, a user-defined input option may be displayed on an operation interface provided by the terminal device. After the user clicks the user-defined input option, the terminal device switches to another interface that displays an input box, so as to obtain the preset feature information according to the input operation of the user.

Preferably, text descriptions of various dimensional categories, such as "context information", "emotional characteristics", "character characteristics", "sex", "age", shown in fig. 6, may be displayed in the input box to prompt the user.

The input mode of the user is not limited in the embodiments of the present application, and may be, for example, a touch keypad input (including but not limited to pinyin input, stroke input), or a voice input.

Voice input can facilitate operation by users such as visually impaired people. Further, before each dimension category is input, the terminal device can play the text description of that dimension category by voice. For example, when switching to the input box interface shown in fig. 6, the terminal device plays audio containing "please input the current context information", then collects the voice instruction with which the user responds and recognizes the information in it.

Of course, voice playback and voice recognition may also be applied to the foregoing selection operation. For example, when switching to the operation interface shown in fig. 4 or fig. 5, the terminal device plays audio containing "please select the current context information", then reads out the specific options included in the dimension category, then collects the voice instruction with which the user responds and recognizes the specific option selected by the user.

It should be understood that the terminal device may also obtain the preset feature information of the user by combining the aforementioned selection operation and input operation, and the specific implementation manner may be described with reference to fig. 4 to 6.

Secondly, the preset feature information is acquired according to historical habits and/or sensors.

The work and rest information of the user is obtained by applying AI technology to the user's work and rest history data (historical habits), for example the periods within the 24 hours of a day in which the user most frequently sleeps, exercises, eats meals, works, and so on.
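The application does not specify the AI technology used. As a deliberately simple stand-in, the sketch below just counts, for each hour of the day, which activity was logged most often. The `(hour, activity)` log format and the function name `typical_schedule` are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def typical_schedule(history):
    """Map each logged hour (0-23) to the activity most often seen in it.

    `history` is an iterable of (hour, activity) pairs; this log format is
    a hypothetical one chosen for the sketch.
    """
    counts = defaultdict(Counter)
    for hour, activity in history:
        counts[hour][activity] += 1
    return {hour: c.most_common(1)[0][0] for hour, c in counts.items()}


history = [(23, "sleep"), (23, "sleep"), (23, "work"),
           (7, "exercise"), (7, "exercise"), (12, "meal")]
schedule = typical_schedule(history)
print(schedule[23])
```

A real system would likely need to handle sparse or noisy logs; frequency counting is used here only because it is the most direct reading of "the most frequent periods".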

The sensors include but are not limited to at least one of a temperature sensor, a gravity sensor, a camera, a 3D face recognition sensor, a microphone, and the like, and the terminal device may collect the corresponding preset feature information of the user through one or more sensors. For example, the environmental condition of the user is identified through a gravity sensor and a camera, so as to obtain the context information of the user; the gender and age of the user are identified through the camera; facial features of the user are acquired through a 3D face recognition sensor to determine emotional characteristics; and voice data of the user is collected through a microphone, and emotional characteristics, personality characteristics, gender and age are identified from the voice data.

The terminal device can acquire the preset feature information of the user by combining historical habits and sensors. When the recognition results of the two manners conflict, the recognition result of one manner may be taken as authoritative; for example, the recognition result of the sensor is preferably used. A conflict means that the recognition results for the same dimension category are completely different. For example, if learning from historical habits indicates that the user is asleep during the current period while the sensor indicates that the user is exercising, the recognition results for the dimension category "work and rest information" conflict.

When the recognition results of the two manners do not conflict, they can be integrated. For example, learning from historical habits indicates that the user is eating dinner, and the sensor recognition result is that the user is an adolescent girl who is indoors and in a depressed mood; the preset feature information finally obtained for the user is then: an adolescent girl, eating dinner indoors, in a depressed mood. In this way, the current state of the user can be reflected more accurately.
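The conflict rule and the integration rule described above can be sketched as a dictionary merge keyed by dimension category. The function names and the dictionary representation are assumptions for illustration; the sensor is preferred by default, following the stated preference.

```python
def find_conflicts(habit, sensor):
    """Dimension categories that both results name with different values."""
    return sorted(k for k in habit.keys() & sensor.keys()
                  if habit[k] != sensor[k])


def merge_features(habit, sensor, prefer="sensor"):
    """Integrate two recognition results keyed by dimension category.

    Categories present in only one result are kept as they are. When both
    results give a different value for the same category, the value from
    the preferred source wins.
    """
    if prefer not in ("sensor", "habit"):
        raise ValueError("prefer must be 'sensor' or 'habit'")
    first, second = (habit, sensor) if prefer == "sensor" else (sensor, habit)
    merged = dict(first)
    merged.update(second)  # the preferred source overwrites on conflict
    return merged


# Conflict: habits say "sleep", the sensor says "exercise".
habit = {"work_rest": "sleep"}
sensor = {"work_rest": "exercise", "context": "indoors"}
print(find_conflicts(habit, sensor), merge_features(habit, sensor))
```

When there is no conflict, as in the dinner example, the same merge simply takes the union of the two results.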

Thirdly, voice data of the user is acquired, and the preset feature information is acquired according to the voice data.

The method for acquiring the voice data of the user by the terminal device includes, but is not limited to, any one of the following: collected by a sensor such as a microphone, received from other devices, and downloaded from the cloud.

For example, the terminal device collects the user's voice during a man-machine conversation with the user or when the user issues a voice command, obtains the voice data of the user from it, inputs the voice data into a voice style prediction model, predicts the current state of the user, and outputs the preset feature information of the user.

It should be understood that the terminal device may provide any two or all of the above three manners. When the recognition results for the same dimension category conflict, the recognition result of one manner may be taken as authoritative; for example, the recognition result of the sensor is preferably used. When the recognition results do not conflict, they can be integrated so that the current state of the user is reflected more accurately.

In addition, the three acquisition manners are only exemplary. Other embodiments of the present application may provide different acquisition manners and/or more than three kinds of acquisition manners.

The voice style is expressed through voice elements; that is, different voice styles differ in at least one voice element (tone, intonation and timbre). The terminal device can adjust at least one of the tone, the intonation and the timbre to obtain the target voice style. For convenience of description, the embodiments of the present application divide voice styles into at least one of the following: a steady voice style, an active voice style, a humorous voice style, a lovely voice style, and an impassioned voice style.

In a specific scenario, the terminal device may notify the user of the target voice style that it has automatically obtained according to the preset feature information. Referring to fig. 7, the terminal device pops up a dialog box that displays the currently determined voice style, for example "steady voice style"; a terminal used by visually impaired people may announce it by voice instead. The user may reply by voice to confirm whether the currently determined voice style is the target voice style, or may tap the "yes" option to confirm it, or tap the "no" option to reject it, in which case the voice style is determined again according to the preset feature information.

The preset feature information represents the current state of the user, and each user state may correspond to a voice style. In a given user state, the terminal device can switch to and apply the corresponding voice style according to the preset correspondence; the available voice styles are rich, which helps to improve voice playback quality.
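One possible reading of the preset correspondence is a lookup table from a user state to a voice style. The entries in `STYLE_TABLE` below, the first-match rule and the fallback style are all assumptions chosen for this sketch; the application does not fix a particular table.

```python
DEFAULT_STYLE = "steady"  # assumed fallback when no table entry applies

# Hypothetical correspondence: (dimension category, value) -> voice style.
# Earlier entries take priority over later ones.
STYLE_TABLE = {
    ("emotion", "depressed"): "humorous",
    ("context", "running"): "active",
    ("personality", "active"): "active",
    ("age", "young children"): "lovely",
}


def target_style(features, table=STYLE_TABLE, default=DEFAULT_STYLE):
    """Return the style of the first table entry that matches the user state."""
    for (dimension, value), style in table.items():
        if features.get(dimension) == value:
            return style
    return default


print(target_style({"emotion": "depressed", "personality": "active"}))
```

Because Python dictionaries preserve insertion order, the order of the table entries doubles as a simple priority order between dimension categories.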

For example, in a scenario in which the emotional characteristics of the user are learned from facial features, the terminal device can automatically switch to a personalized tone and timbre for reading aloud. When the user is happy, it can automatically switch to a relaxed, cheerful tone and timbre and read material such as travel writing with feeling; when the user is depressed, it can automatically switch to a light, relaxed tone and timbre and read light, humorous material with feeling.

For another example, when it is recognized that the user is running outdoors, the terminal device can automatically switch to an impassioned tone and timbre and play audio books that interest the user; when it is recognized that the user is walking by the sea, it can automatically switch to a suitable tone and timbre and play audio books that interest the user.

For another example, when recognizing that the user is falling asleep, the terminal device may automatically ask whether the user wants an audio book to be played and, on receiving a response from the user indicating that one is wanted, select a relaxed tone and timbre to play, with feeling, a sleep-aid audio book that interests the user. Preferably, the terminal device can obtain the user's daily eye-use time from the recorded work and rest information and, when the daily eye-use time exceeds the healthy eye-use time, automatically ask whether the user wants an audio book to be played, so as to protect eyesight.

Referring to fig. 8, for a voice playing scenario, based on the method described in fig. 3, the voice processing method of the present application may further include steps S13 and S14.

And S13, recognizing and extracting the text content of the resource to be played.

And S14, performing voice synthesis on the text content to generate a voice document with a target voice style.

The resource to be played comprises at least one of the following: books, pictures, web pages, websites. In summary, resources to be played fall into two types: text documents and picture documents. For a picture document, the characters in the picture can be recognized and converted into a text document following their display order in the picture.

The text content of the resource to be played can be a character document, a voice document or a combination of the two.

The text content of the resource to be played is converted into a voice document (called as an initial voice document) by adopting a voice processing technology, and then the voice document with a target voice style (called as an output voice document) is generated by adopting a voice synthesis technology. Therefore, the voice playing style is rich, and the playing quality is high.
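The two stages described above (text to initial voice document, then initial voice document to output voice document in the target style) can be sketched as a small pipeline. No real speech engine is called: `VoiceDocument`, the `ocr_text` field and every function name here are hypothetical placeholders that only track which text and style a voice document carries.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceDocument:
    text: str
    style: str  # "neutral" marks the initial voice document


def extract_text(resource: dict) -> str:
    """S13: recognize and extract text from a text or picture document."""
    if resource["kind"] == "text":
        return resource["content"]
    if resource["kind"] == "picture":
        # Stand-in for character recognition in display order.
        return resource["ocr_text"]
    raise ValueError(f"unsupported resource kind: {resource['kind']}")


def to_initial_voice(text: str) -> VoiceDocument:
    """Voice processing stage: text becomes the initial voice document."""
    return VoiceDocument(text=text, style="neutral")


def apply_style(doc: VoiceDocument, target_style: str) -> VoiceDocument:
    """S14: voice synthesis stage that yields the output voice document."""
    return replace(doc, style=target_style)


def synthesize(resource: dict, target_style: str) -> VoiceDocument:
    return apply_style(to_initial_voice(extract_text(resource)), target_style)


output = synthesize({"kind": "text", "content": "Once upon a time"}, "lovely")
print(output)
```

Keeping the initial and output voice documents as separate values mirrors the distinction the text draws between them.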

For example, when visually impaired people listen to books, a large number of books can be read to them in a style suited to their own state; the rich voice styles help visually impaired people to understand the world more emotionally and comprehensively and improve their sense of well-being. For example, for technical and professional books, the target voice style may be a steady voice style; for entertainment books, an active voice style; for joke books, a humorous voice style; and for children's books, a lovely voice style.

Fig. 9 is a flowchart illustrating a speech processing method according to a third embodiment of the present application. Referring to fig. 9, the speech processing method of the present embodiment may include the following steps S11 to S14.

And S11, acquiring preset characteristic information of the user.

And S121, selecting the adaptive document to be played according to the target voice style.

And S13, recognizing and extracting the text content in the document to be played.

The foregoing embodiments described with reference to fig. 3 and fig. 8 need not consider whether the document to be played suits the current state of the user (i.e. the preset feature information). In this embodiment, by contrast, after the target voice style is determined, an adapted document to be played is selected according to the current state of the user. Since the current state of the user identifies the target voice style, this can be regarded as selecting the adapted document according to the target voice style. The document to be played is then converted into a voice document with the target voice style by a voice synthesis technology.

The implementation manner of step S121 includes at least one of the following two manners:

In one manner, the preset feature information of a document is obtained according to the preset classification label of the document, and a document matching the target voice style obtained in step S12 is determined as the document to be played.

In a specific scenario of a conventional electronic publication, the electronic publication generally classifies all documents thereof, and sets classification tags, such as emotion, entertainment, fun, and horror. Accordingly, the matched preset feature information can be determined according to the classification label.

For the document to be played without the classification tag, the terminal device may automatically determine the preset feature information according to the text content of the document, and then determine the adapted voice style, that is, another manner described below.

In the other manner, preset feature information is determined according to the text content of the document, an adapted voice style is determined according to the preset feature information, and a document whose voice style is the same as the target voice style is determined as the document to be played.

For techniques for classifying text content and generating classification tags, reference may be made to existing techniques. This manner can be regarded as the terminal device automatically generating the classification label itself.
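The tag-based manner of selecting a document can be sketched as a filter over a library of tagged documents. The tag names and the `TAG_TO_STYLE` table are assumptions; the pairings follow the examples given earlier in the description (technical books with a steady style, entertainment with an active style, jokes with a humorous style, children's books with a lovely style).

```python
# Hypothetical mapping from a classification tag to its adapted voice style.
TAG_TO_STYLE = {
    "technical": "steady",
    "entertainment": "active",
    "jokes": "humorous",
    "children": "lovely",
}


def select_documents(library, target_style, tag_to_style=TAG_TO_STYLE):
    """Return titles of documents whose tag maps to the target voice style.

    Documents without a tag, or with an unknown tag, are skipped; they
    would need the text-content manner instead.
    """
    return [doc["title"] for doc in library
            if tag_to_style.get(doc.get("tag")) == target_style]


library = [
    {"title": "Signal Processing Basics", "tag": "technical"},
    {"title": "Joke Collection", "tag": "jokes"},
    {"title": "Bedtime Stories", "tag": "children"},
    {"title": "Untitled Draft"},  # no classification tag
]
print(select_documents(library, "humorous"))
```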

Fig. 10 is a flowchart illustrating a speech processing method according to a fourth embodiment of the present application. Referring to fig. 10, the speech processing method of the present embodiment may include the following steps S11, S111, S112, and S12.

And S11, acquiring preset characteristic information of the user.

And S111, acquiring preset characteristic information of the document to be played.

And S112, judging whether the preset characteristic information of the document to be played conflicts with the preset characteristic information of the user.

If the preset feature information of the document to be played does not conflict with the preset feature information of the user, the two are considered to match and step S12 is executed. If they conflict, step S113 is executed.

And S113, executing a preset strategy.

The embodiment described with reference to fig. 9 selects the adapted document to be played according to the target voice style; in contrast, this embodiment selects the adapted document to be played according to the preset feature information.

In step S111, the obtaining of the preset feature information of the document to be played includes at least one of the following:

In one manner, the preset feature information of a document is obtained according to the preset classification label of the document, and a document whose preset feature information is the same as or similar to that obtained in step S11 is determined as the document to be played.

For the document to be played without the classification tag, the terminal device may automatically determine the preset feature information of the document according to the text content of the document, that is, in another manner as described below.

In the other manner, the preset feature information of the document is determined according to the text content of the document.

For techniques for classifying according to preset feature information and generating classification tags, reference may be made to existing techniques. This manner can be regarded as the terminal device automatically generating the classification label itself.

In the step S113, the preset policy includes at least one of the following four:

First, step S12 is performed.

Secondly, a target voice style is determined according to a selection instruction.

When the preset feature information of the document to be played conflicts with the preset feature information of the user, the voice style of the document to be played is considered not to match the current state of the user. The user can then issue a selection instruction that instructs one of the following: playing the document in the voice style of the document itself; playing the document in the voice style matched with the current state of the user (namely, the target voice style); or re-executing steps S111 and S112, that is, reselecting a document until the preset feature information of the selected document is the same as or similar to the preset feature information of the user.

Thirdly, the target voice style is determined according to the preset feature information of the document to be played.

That is, the voice style of the document to be played itself is taken as the target voice style in voice playing.

Fourthly, the default voice style is taken as the target voice style during voice playing.

The default voice style can be regarded as a voice style preset as the default by the terminal device; it is unrelated both to the voice style of the document to be played and to the target voice style determined according to the current state of the user.

In one implementation, the default voice style may be a conventional voice style whose voice elements do not vary, or a preset voice style that is not associated with the user state.
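The conflict check of step S112 and the four preset strategies of step S113 can be sketched as a single dispatch function. The strategy names (`"user"`, `"ask"`, `"document"`, `"default"`), the `style_of` callback that maps feature information to a voice style, and the `ask` callback standing in for the user's selection instruction are all hypothetical.

```python
def has_conflict(user, doc):
    """S112: any dimension category named by both sides with different values."""
    return any(user[k] != doc[k] for k in user.keys() & doc.keys())


def resolve_style(user, doc, style_of, strategy="user", ask=None,
                  default_style="default"):
    """Pick the target voice style, applying a preset strategy on conflict."""
    if not has_conflict(user, doc) or strategy == "user":
        return style_of(user)                 # step S12
    if strategy == "ask":                     # selection instruction
        if ask is None:
            raise ValueError("strategy 'ask' needs an ask callback")
        return ask(style_of(user), style_of(doc))
    if strategy == "document":                # style of the document itself
        return style_of(doc)
    if strategy == "default":                 # preset default style
        return default_style
    raise ValueError(f"unknown strategy: {strategy}")


def style_of(features):
    # Toy mapping for the demonstration only.
    return {"excited": "active", "calm": "steady"}[features["emotion"]]


user = {"emotion": "excited"}
doc = {"emotion": "calm"}
print(resolve_style(user, doc, style_of, strategy="document"))
```

The reselection branch of the selection instruction (re-executing S111 and S112 with another document) is left out of the sketch; it would wrap this function in a loop over candidate documents.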

Fig. 11 is a flowchart illustrating a speech processing method according to a fifth embodiment of the present application. Referring to fig. 11, the speech processing method of the present embodiment may include the following steps S21 and S22.

And S21, acquiring preset characteristic information of the document to be played.

The preset feature information of a document may be regarded as information that indicates the document type and can be used to identify the user states to which the document is suited. For example, the preset feature information of the document to be played includes, but is not limited to, at least one of work and rest information, context information, emotional features, character features, gender and age.

The specific contents of the preset feature information of the document to be played and the manner of obtaining the preset feature information may refer to the description of the foregoing embodiment, and are not described herein again.

Please refer to the foregoing embodiments for the principle and description of determining a target speech style according to preset feature information.

The difference from the previous embodiment is that the present embodiment determines the target voice style according to the preset feature information of the document to be played, rather than the preset feature information of the user. The user can first select one or more documents as the documents to be played, and the preset feature information is obtained from the text content of those documents; that is, the terminal device determines the target voice style according to the text content of the documents to be played.
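The application does not say how text content is turned into a voice style. Purely as an illustrative stand-in, the sketch below scores a text against small keyword sets and picks the style with the highest score. The keyword lists, the function name `style_from_text` and the fallback style are assumptions, and a real system would more plausibly use a trained text classifier.

```python
import re
from collections import Counter

# Hypothetical keyword sets; chosen only to make the example concrete.
KEYWORDS = {
    "humorous": {"joke", "laugh", "funny"},
    "lovely": {"bunny", "fairy", "kitten"},
    "steady": {"theorem", "algorithm", "protocol"},
}


def style_from_text(text, keywords=KEYWORDS, default="steady"):
    """Return the style whose keywords occur most often in the text."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    scores = {style: sum(words[w] for w in kws)
              for style, kws in keywords.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default style when no keyword occurs at all.
    return best if scores[best] > 0 else default


print(style_from_text("The fairy told the bunny a story"))
```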

Fig. 12 is a flowchart illustrating a speech processing method according to a sixth embodiment of the present application. Referring to fig. 12, the speech processing method of the present embodiment may include the following steps S21 to S23.

S211, acquiring preset characteristic information of a user.

S212, judging whether the preset characteristic information of the document to be played conflicts with the preset characteristic information of the user.

If the preset feature information of the document to be played does not conflict with the preset feature information of the user, the two are considered to match and step S22 is executed. If they conflict, step S23 is executed.

And S23, executing a preset strategy.

The preset characteristic information of the user is used for identifying the current state of the user, and the preset characteristic information of the document to be played is used for identifying the voice style applicable to the document to be played.

In the step S23, the preset strategy includes at least one of the following four strategies:

First, step S22 is performed.

When the preset feature information of the document to be played conflicts with the preset feature information of the user, the voice style of the document to be played is considered not to match the current state of the user. The user can then issue a selection instruction that instructs one of the following: playing the document in the voice style of the document itself; playing the document in the voice style matched with the current state of the user (namely, the target voice style); or re-executing step S21, that is, reselecting a document matched with the current state of the user until the preset feature information of the selected document is the same as or similar to the preset feature information of the user.

Thirdly, a target voice style is determined according to the preset feature information of the user.

That is, the voice style matched with the current state of the user is taken as the target voice style during voice playing.

The default voice style can be regarded as a voice style preset as the default by the terminal device; it is unrelated both to the voice style of the document to be played and to the voice style determined according to the current state of the user.

The present application further provides a mobile terminal device, which includes a memory, a processor, and a speech processing program stored in the memory and capable of running on the processor, wherein the speech processing program implements the steps of the method in any of the above embodiments when executed by the processor.

The present application further provides a computer-readable storage medium, on which a voice processing program is stored, and the voice processing program, when executed by a processor, implements the steps of the voice processing method in any of the above embodiments.

In the embodiments of the mobile terminal device and the computer-readable storage medium provided in the present application, all technical features of the embodiments of the method are included, and the expanding and explaining contents of the specification are basically the same as those of the embodiments of the method, and are not described herein again.

Embodiments of the present application also provide a computer program product, which includes computer program code that, when run on a computer, causes the computer to execute the method described in the above various possible embodiments.

Embodiments of the present application further provide a chip, which includes a memory for storing a computer program and a processor for calling and executing the computer program from the memory, so that a device in which the chip is installed performs the method in the above various possible embodiments.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

In the present application, the same or similar terms, concepts, technical solutions and/or application scenarios are generally described in detail only at their first occurrence and, for brevity, are not described in detail again when they recur. When reading the technical solutions of the present application, reference may be made to the earlier detailed description for any such item that is not described in detail later.

In the present application, each embodiment is described with emphasis, and reference may be made to the description of other embodiments for parts that are not described or illustrated in any embodiment.

The technical features of the technical solutions of the present application may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the embodiments are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope described in the present application.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, a controlled terminal, or a network device) to execute the method of each embodiment of the present application.

The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims

1. A speech processing method, characterized in that the speech processing method comprises:

S11, acquiring preset characteristic information of a user;

2. The method of claim 1, comprising at least one of:

the preset feature information includes at least one of: work and rest information, situation information, emotional characteristics, character characteristics, gender and age;

the step of S11 includes at least one of: the method comprises the steps of obtaining preset characteristic information according to selection operation and/or input operation, obtaining the preset characteristic information according to historical habits and/or sensors, obtaining voice data of a user, and obtaining the preset characteristic information according to the voice data.

3. The method of claim 1, comprising at least one of:

before the step of S12, the method further includes: selecting a matched document to be played according to the target voice style, identifying and extracting text content in the document to be played, and/or performing voice synthesis on the text content to generate a voice document with the target voice style;

the method further comprises the following steps: identifying and extracting text content of the resource to be played, wherein after the step of S12, the method includes: and carrying out voice synthesis on the text content to generate a voice document with the target voice style.

4. The method according to any one of claims 1 to 3, wherein the step of S12 is preceded by the steps of: acquiring preset characteristic information of a document to be played;

if not, executing the step S12; and/or if so, executing a preset strategy.

5. The method of claim 4, comprising at least one of:

the acquiring of the preset feature information of the document to be played includes: acquiring preset characteristic information of the document to be played according to a preset classification label of the document to be played, and/or determining the preset characteristic information of the document to be played according to text content of the document to be played;

the preset strategy comprises at least one of the following: and executing the step S12, determining a target voice style according to a selection instruction, determining the target voice style according to the preset characteristic information of the document to be played, and taking the default voice style as the target voice style during voice playing.

6. A speech processing method, characterized in that the speech processing method comprises:

S21, acquiring preset characteristic information of the document to be played;

7. The method of claim 6, comprising at least one of:

the step of S21 includes: acquiring preset characteristic information of the document to be played according to a preset classification label of the document to be played, and/or determining the preset characteristic information of the document to be played according to text content of the document to be played;

before the step of S22, the method further includes: acquiring preset characteristic information of a user, judging whether the document to be played conflicts with the preset characteristic information of the user, if not, executing the step S22, and/or if so, executing a preset strategy.

8. The speech processing method of claim 7, wherein the predetermined policy comprises at least one of:

executing the step of S22;

determining a target voice style according to the selection instruction;

9. A terminal device, characterized in that the terminal device comprises a memory and a processor, the memory storing a speech processing program for implementing the steps of the speech processing method according to any one of claims 1 to 8 when executed by the processor.

10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program for implementing the steps of the speech processing method according to any of claims 1 to 8 when being executed by a processor.