CN117891918B

CN117891918B - Interactive data management system of Chinese text vectorization model based on AI PaaS platform

Info

Publication number: CN117891918B
Application number: CN202410070601.3A
Authority: CN
Inventors: 赵隽隽; 潘斌; 赵剑飞; 欧阳禄萍; 张怀仁; 范喆一
Original assignee: Zhixueyun Beijing Technology Co ltd
Current assignee: Zhixueyun Beijing Technology Co ltd
Priority date: 2024-01-17
Filing date: 2024-01-17
Publication date: 2024-09-03
Anticipated expiration: 2044-01-17
Also published as: CN117891918A

Abstract

The invention relates to the technical field of intelligent office, in particular to an interactive data management system of a Chinese text vectorization model based on an AI PaaS platform. The system comprises an AI PaaS platform module, a text vectorization module and an interactive data management module; the AI PaaS platform module is used for debugging the AI PaaS platform, dividing the performance of the AI PaaS platform, displaying the performance index of the AI PaaS platform, and integrating an interactive data management function in the AI PaaS platform; the text vectorization module is used for constructing a vocabulary, and vectorizing and storing text information to be stored according to the vocabulary; the interactive data management module is used for constructing an interactive data management function, and a user searches and deletes the text data stored by the interactive data management module.

Description

Interactive data management system of Chinese text vectorization model based on AI PaaS platform

Technical Field

The invention relates to the technical field of intelligent office, in particular to an interactive data management system of a Chinese text vectorization model based on an AI PaaS platform.

Background

AI PaaS (artificial intelligence platform as a service) is a cloud computing service model that provides a series of tools and services for building, training and deploying artificial intelligence models. The AI PaaS platform aims to simplify the development process of artificial intelligence applications, enabling developers to more easily utilize advanced machine learning and deep learning techniques.

Text vectorization is the process of converting text data into numeric vectors so that a computer can better understand and process text information. The goal of the text vectorization model is to map semantic information in text into a high-dimensional vector space for machine learning and natural language processing tasks.

The prior art does not provide a technique for building a text vectorization model on an AI PaaS platform and constructing an interactive data management system around it.

In view of this, the invention proposes an interactive data management system of a Chinese text vectorization model based on an AI PaaS platform.

Disclosure of Invention

This section is intended to outline some aspects of embodiments of the application and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description of the application and in the title of the application, which may not be used to limit the scope of the application.

The present invention has been made in view of the above-described problems.

Therefore, the technical problem solved by the invention is how to construct a text vectorization model on an AI PaaS platform and manage the vectorized text data.

In order to solve the technical problems, the invention provides the following technical scheme:

The interactive data management system based on the Chinese text vectorization model of the AI PaaS platform comprises an AI PaaS platform module, a text vectorization module and an interactive data management module; the AI PaaS platform module comprises a debugging unit, a function integration unit, a data transmission unit and a data storage unit; the text vectorization module comprises a vocabulary unit, a word segmentation unit and a vectorization unit; the interactive data management module comprises a data searching unit and a data adjusting unit.

Preferably, the debugging unit is used for creating a visual interface, distributing the acquired performance of the AI PaaS platform, and displaying the performance index of the AI PaaS platform in real time through the visual interface;

The function integration unit is used for integrating a required function model on the AI PaaS platform; importing the created interactive data management system into an AI PaaS platform through a function integration unit;

The data transmission unit is used for receiving a request instruction sent by the user side to the AI PaaS platform and transmitting feedback data sent by the AI PaaS platform to the user side;

The data storage unit is used for storing vectorized text data transmitted to the AI PaaS platform by a user, a transmitted request instruction and log data generated during the operation of the associated function of the AI PaaS platform.

Preferably, the vocabulary unit is used for constructing a vector table of Chinese vocabulary; a unique corresponding vector value is generated for each Chinese vocabulary item to be stored, and all vocabulary vector values are integrated into a vocabulary; the vocabulary V is expressed as: $V = \{W_1, W_2, \ldots, W_n\}$;

The word segmentation unit is used for segmenting the input text; word segmentation divides an input text into characters and words, which are converted into data form according to the vector values of the vocabulary and stored; for the characters of the input text, the result is expressed as: $T_{fi} = [C_1, C_2, \ldots, C_n]$; for the words of the input text, the result is expressed as: $T_{ci} = [Z_1, Z_2, \ldots, Z_n]$; a vector space dimension is added to the vector value of each character and word to represent the position of that character or word in the text.
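As an illustration only, the character sequence and word sequence described above can be sketched in Python. The patent does not specify the segmentation algorithm or the form of a vector value, so this sketch assumes integer vector values, forward maximum matching for word segmentation, and a position stored as an extra component next to each value. The names `VOCAB`, `segment_chars`, and `segment_words` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary: each entry maps to a unique integer "vector value".
VOCAB: Dict[str, int] = {"智": 1, "能": 2, "办": 3, "公": 4, "智能": 5, "办公": 6}


def segment_chars(text: str, vocab: Dict[str, int]) -> List[Tuple[int, int]]:
    """Character-level sequence: one (vector value, position) pair per character."""
    return [(vocab[ch], pos) for pos, ch in enumerate(text)]


def segment_words(text: str, vocab: Dict[str, int], max_len: int = 4) -> List[Tuple[int, int]]:
    """Word-level sequence via forward maximum matching (an assumed algorithm)."""
    result: List[Tuple[int, int]] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                result.append((vocab[piece], i))  # position = start offset in the text
                i += size
                break
        else:
            raise KeyError(f"'{text[i]}' is not in the vocabulary")
    return result


t_f = segment_chars("智能办公", VOCAB)  # character sequence with positions
t_c = segment_words("智能办公", VOCAB)  # word sequence with positions
```

Storing the position alongside each value is only one possible reading of "adding a vector space dimension"; a real embedding-based system might instead add a positional encoding to a dense vector.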

Preferably, the vectorization unit is configured to vectorize and store a text, and to generate an initial identification vector S when a new text is input; when the text input is completed, an end identification vector E is generated; the stored vector data for a single text is represented as: $X_i = [S, T_{fi}, T_{ci}, E]$;

the initial identification vector S stores name information of the vectorized text, storage position information of the vectorized text and time information for beginning to store the vectorized text; and the end identification vector E stores the time information for ending storage of the vectorized text, the space size occupation information of the vectorized text and the number statistical information of characters and words of the vectorized text.
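The stored record described above can be pictured as a small data structure. The following Python sketch is an assumption-laden illustration, not the patented storage format: the class names `StartMarker`, `EndMarker`, `StoredText`, the helper `store_text`, and the 4-bytes-per-value size estimate are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class StartMarker:
    """Initial identification vector S: name, storage location, start time."""
    name: str
    storage_location: str
    started_at: datetime


@dataclass
class EndMarker:
    """End identification vector E: end time, occupied size, character and word counts."""
    finished_at: datetime
    size_bytes: int
    char_count: int
    word_count: int


@dataclass
class StoredText:
    """One stored text, laid out as [S, T_f, T_c, E]."""
    start: StartMarker
    t_f: List[int]
    t_c: List[int]
    end: EndMarker


def store_text(name: str, location: str, t_f: List[int], t_c: List[int]) -> StoredText:
    start = StartMarker(name, location, datetime.now(timezone.utc))
    # Assumption for illustration: every vector value occupies 4 bytes.
    size = 4 * (len(t_f) + len(t_c))
    end = EndMarker(datetime.now(timezone.utc), size, len(t_f), len(t_c))
    return StoredText(start, t_f, t_c, end)


record = store_text("memo-001", "/texts/memo-001", [1, 2, 3, 4], [5, 6])
```

Keeping the counts and size in the end marker means summary information about a text can be read without scanning its full character and word sequences.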

Preferably, the data searching unit adopts a searching algorithm for searching the stored vectorized text; the search algorithm comprises a vocabulary search algorithm and a sentence search algorithm;

The steps of the constructed vocabulary searching algorithm are as follows:

For the vectorized text data, the word frequency of each vocabulary item in each text is calculated, and a word-text matrix is constructed: $\mathrm{Matrix}(X_i, t_j)$, where each entry represents the frequency of occurrence of the vocabulary $t_j$ in the vectorized text $X_i$;

For each vocabulary $t_j$, a text list $\mathrm{In}(t_j)$ containing the vocabulary $t_j$ is created, expressed as: $\mathrm{In}(t_j) = \{X_i, \ldots\}$, meaning that the vocabulary $t_j$ appears in each text of the list;

For each text $X_i$, the text weight is calculated using the TF-IDF method:

$\text{TF-IDF}(t_j, X_i) = \mathrm{TF}(t_j, X_i) \times \mathrm{IDF}(t_j)$

Wherein $\mathrm{TF}(t_j, X_i)$ represents the word frequency of the vocabulary $t_j$ in the text $X_i$, and $\mathrm{IDF}(t_j)$ represents the inverse text frequency, which is calculated from $N$ and $n$, wherein $N$ is the total number of texts and $n$ is the number of texts containing the vocabulary $t_j$;

For a queried vocabulary $t_j$, the text list containing $t_j$ is found through the inverted index, sorted according to the text weight, and output.
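The vocabulary search steps above can be sketched as follows. The exact inverse text frequency formula is not reproduced in this text, so the sketch assumes the common form $\mathrm{IDF}(t_j) = \log(N / n)$ and takes the word frequency as a count normalized by text length; both are assumptions. The corpus `TEXTS` and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical corpus: text id -> segmented vocabulary items.
TEXTS: Dict[str, List[str]] = {
    "X1": ["平台", "数据", "数据"],
    "X2": ["平台", "文本"],
    "X3": ["文本", "向量"],
}


def build_index(texts: Dict[str, List[str]]) -> Tuple[Dict[str, Counter], Dict[str, Set[str]]]:
    """Return the word-text frequency matrix and the inverted index."""
    matrix = {tid: Counter(words) for tid, words in texts.items()}
    inverted: Dict[str, Set[str]] = defaultdict(set)
    for tid, counts in matrix.items():
        for word in counts:
            inverted[word].add(tid)
    return matrix, inverted


def tf_idf(word: str, tid: str, matrix: Dict[str, Counter], inverted: Dict[str, Set[str]]) -> float:
    tf = matrix[tid][word] / sum(matrix[tid].values())  # assumed: normalized count
    idf = math.log(len(matrix) / len(inverted[word]))   # assumed: log(N / n)
    return tf * idf


def search(word: str, matrix: Dict[str, Counter], inverted: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Look the word up in the inverted index and rank the hits by weight."""
    hits = inverted.get(word, set())
    scored = [(tid, tf_idf(word, tid, matrix, inverted)) for tid in hits]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


matrix, inverted = build_index(TEXTS)
results = search("数据", matrix, inverted)
```

With the unsmoothed form assumed here, a vocabulary item that appears in every text receives a weight of zero; smoothed variants of the formula avoid this, and the patent may use a different form.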

Preferably, the constructed sentence searching algorithm comprises the following steps:

designing a plurality of hash functions, and mapping the text vector into different hash buckets through the plurality of hash functions;

For an input query sentence, vectorizing the input sentence according to vocabulary data to generate a query vector;

For the query vector, mapping the query vector into a corresponding hash bucket using the constructed hash function;

Searching similar text vectors in a hash bucket mapped with the hash value of the query vector;

calculating the distance value between each similar text vector and the query vector;

and selecting the text vector with the nearest distance value for output.
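The sentence search steps above describe a form of locality-sensitive hashing. The patent does not name the hash family or the distance measure, so the following Python sketch assumes random-hyperplane hash functions (one sign bit per function), fixed-length dense text vectors, and Euclidean distance. `VECTORS` and all function names are hypothetical.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Planes = List[List[float]]


def make_hash_functions(dim: int, n_bits: int, seed: int = 0) -> Planes:
    """Each hash function is a random hyperplane that contributes one sign bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def bucket_key(vec: Vector, planes: Planes) -> Tuple[int, ...]:
    """Combine the sign bits of all hash functions into one bucket key."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0.0) for plane in planes)


def build_buckets(vectors: Dict[str, Vector], planes: Planes) -> Dict[Tuple[int, ...], List[str]]:
    buckets: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for tid, vec in vectors.items():
        buckets[bucket_key(vec, planes)].append(tid)
    return buckets


def search(query: Vector, vectors: Dict[str, Vector],
           buckets: Dict[Tuple[int, ...], List[str]], planes: Planes) -> Optional[str]:
    """Return the id of the nearest stored vector in the query's bucket, if any."""
    candidates = buckets.get(bucket_key(query, planes), [])
    if not candidates:
        return None
    return min(candidates, key=lambda tid: math.dist(vectors[tid], query))


# Hypothetical stored text vectors.
VECTORS: Dict[str, Vector] = {
    "X1": [1.0, 0.0, 2.0],
    "X2": [0.5, 1.5, -1.0],
    "X3": [-2.0, 0.3, 0.1],
}
planes = make_hash_functions(dim=3, n_bits=4)
buckets = build_buckets(VECTORS, planes)
nearest = search([0.5, 1.5, -1.0], VECTORS, buckets, planes)
```

Because only one bucket is inspected, this kind of search is approximate: a query can land in an empty bucket even when a similar text exists. Using several independent hash tables is a common way to reduce such misses.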

Preferably, the data adjustment unit is used for adjusting the stored text content, within the user's own authority range, based on the identity and authority of the user;

the adjustment comprises modifying and deleting the stored text content; after the user modifies or deletes text content, new vectorized text data is generated from the new text content;

when a user modifies or deletes text data that the user has stored, a log file recording the modification or deletion is generated, and the log file is stored in the data storage unit of the AI PaaS platform.
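A minimal sketch of the adjustment behaviour described above is given below. It assumes a simple owner-only permission model and an in-memory list standing in for the log kept in the data storage unit; the patent does not describe the permission model or log format in this detail. `TextStore` and its members are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class TextStore:
    vectorize: Callable[[str], List[int]]                      # re-vectorizes text content
    owners: Dict[str, str] = field(default_factory=dict)       # text id -> owning user
    vectors: Dict[str, List[int]] = field(default_factory=dict)
    log: List[Dict[str, str]] = field(default_factory=list)    # stand-in for the log file

    def _authorize(self, user: str, text_id: str) -> None:
        # Assumed rule: only the user who stored a text may adjust it.
        if self.owners.get(text_id) != user:
            raise PermissionError(f"{user} may not adjust {text_id}")

    def _record(self, user: str, action: str, text_id: str) -> None:
        self.log.append({
            "user": user,
            "action": action,
            "text_id": text_id,
            "time": datetime.now(timezone.utc).isoformat(),
        })

    def add(self, user: str, text_id: str, content: str) -> None:
        self.owners[text_id] = user
        self.vectors[text_id] = self.vectorize(content)

    def modify(self, user: str, text_id: str, new_content: str) -> None:
        self._authorize(user, text_id)
        self.vectors[text_id] = self.vectorize(new_content)  # new vectorized data
        self._record(user, "modify", text_id)

    def delete(self, user: str, text_id: str) -> None:
        self._authorize(user, text_id)
        del self.vectors[text_id]
        del self.owners[text_id]
        self._record(user, "delete", text_id)


store = TextStore(vectorize=lambda text: [ord(ch) for ch in text])
store.add("alice", "memo", "初稿")
store.modify("alice", "memo", "终稿")
```

The authorization check runs before any change is made, so a rejected request leaves both the stored vectors and the log untouched.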

An interactive data management method of a Chinese text vectorization model based on an AI PaaS platform, the method comprising:

Debugging an AI PaaS platform, dividing the performance of the AI PaaS platform, displaying the performance index of the AI PaaS platform, and integrating an interactive data management function in the AI PaaS platform;

constructing a vocabulary, and vectorizing and storing the text information to be stored according to the vocabulary;

and constructing an interactive data management function, and searching and deleting the text data stored by the user.

A computer device comprising a memory and a processor, said memory storing a computer program, wherein said processor, when executing said computer program, implements the steps of the AI PaaS platform based Elasticsearch text vectorization search method.

A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the AI PaaS platform based Elasticsearch text vectorization search method.

The beneficial effects of the invention are as follows: taking the AI PaaS platform as its base, the invention develops an interactive data management system of the text vectorization model that allows multiple users to manage text data simultaneously.

Based on cloud service characteristics of the AI PaaS platform, the method greatly reduces the deployment cost and the operation cost of the system, optimizes the interactive function of text data management through a self-built management algorithm, improves the convenience of a user in managing the text data stored by the user, and improves the privacy and the safety of the user in storing and accessing the text data through the function setting of corresponding user authority and data safety.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:

FIG. 1 is a block diagram of an interactive data management system of a Chinese text vectorization model based on an AI PaaS platform;

FIG. 2 is a flow chart of an interactive data management method based on a text vectorization model of an AI PaaS platform;

FIG. 3 is a schematic diagram of an electronic device according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a computer-readable storage medium according to one embodiment of the present application.

Detailed Description

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.

Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

While the embodiments of the present invention have been illustrated and described in detail in the drawings, the cross-sectional view of the device structure is not to scale in the general sense for ease of illustration, and the drawings are merely exemplary and should not be construed as limiting the scope of the invention. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.

Also in the description of the present invention, it should be noted that the orientation or positional relationship indicated by the terms "upper, lower, inner and outer", etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first, second, or third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The terms "mounted, connected, and coupled" should be construed broadly in this disclosure unless otherwise specifically indicated and defined, such as: can be fixed connection, detachable connection or integral connection; it may also be a mechanical connection, an electrical connection, or a direct connection, or may be indirectly connected through an intermediate medium, or may be a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

Example 1

Referring to FIG. 1, a first embodiment of the present invention provides an interactive data management system of a Chinese text vectorization model based on an AI PaaS platform.

Specifically, the system comprises an AI PaaS platform module, a text vectorization module and an interactive data management module.

The AI PaaS platform module comprises a debugging unit, a function integration unit, a data transmission unit and a data storage unit; the AI PaaS platform module is used for debugging the AI PaaS platform, dividing the performance of the AI PaaS platform, displaying the performance index of the AI PaaS platform, and integrating the interactive data management function in the AI PaaS platform.

The text vectorization module comprises a vocabulary unit, a word segmentation unit and a vectorization unit; the text vectorization module is used for constructing a vocabulary, and vectorizing and storing the text information to be stored according to the vocabulary.

The interactive data management module comprises a data searching unit and a data adjusting unit; the interactive data management module is used for constructing an interactive data management function, through which a user searches and deletes the text data that the user has stored.

The debugging unit is used for creating a visual interface, distributing the acquired performance of the AI PaaS platform and displaying the performance index of the AI PaaS platform in real time through the visual interface.

The function integration unit is used for integrating a required function model on the AI PaaS platform; the created interactive data management system is imported into the AI PaaS platform through the function integration unit, and the function integration unit can be used for integrating other functions and expanding the functions of the constructed AI PaaS platform module.

Through the function integration unit, the trained model and the developed system can be deployed in the AI PaaS platform.

The data transmission unit is used for receiving a request instruction sent by the user side to the AI PaaS platform and transmitting feedback data sent by the AI PaaS platform to the user side.

The transmission unit encodes and decodes the data to be transmitted, supports different transmission protocols, selects a corresponding transmission protocol based on actual communication requirements, and encrypts and transmits the transmitted data.

The data storage unit is a database, and the performance of the database is configured according to actual requirements.

The vocabulary unit is used for constructing a vector table of Chinese vocabulary; a unique corresponding vector value is generated for each Chinese vocabulary item to be stored, and all vocabulary vector values are integrated into a vocabulary; the vocabulary V is expressed as: $V = \{W_1, W_2, \ldots, W_n\}$.

The vocabulary is pre-generated according to the range of text data to be stored; when a vocabulary item outside this range appears, a vector value is generated for it immediately and the item is recorded in the vocabulary.
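The pre-generation and on-the-fly extension of the vocabulary can be sketched as below. The sketch assumes that a "vector value" is simply the next unused integer; the patent only requires that each value be unique. The `Vocabulary` class and its `get_or_add` method are hypothetical names.

```python
from typing import Dict, Iterable


class Vocabulary:
    """Maps each vocabulary item to a unique value and grows when new items appear."""

    def __init__(self, initial: Iterable[str] = ()) -> None:
        self._values: Dict[str, int] = {}
        for word in initial:  # pre-generate from the expected text range
            self.get_or_add(word)

    def get_or_add(self, word: str) -> int:
        # An out-of-range item is assigned a value immediately and recorded.
        if word not in self._values:
            self._values[word] = len(self._values) + 1
        return self._values[word]

    def __contains__(self, word: str) -> bool:
        return word in self._values

    def __len__(self) -> int:
        return len(self._values)


vocab = Vocabulary(["平台", "文本"])
known = vocab.get_or_add("文本")   # already in the pre-generated vocabulary
added = vocab.get_or_add("向量")   # outside the range, registered on first use
```

Assigning values by insertion order keeps existing values stable when the vocabulary grows, so previously stored texts do not need to be re-vectorized.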

The vectorization unit is used for vectorizing and storing texts, and generates an initial identification vector S when a new text is input; when the text input is completed, an end identification vector E is generated; the stored vector data for a single text is represented as: $X_i = [S, T_{fi}, T_{ci}, E]$.

The data searching unit adopts a searching algorithm and is used for searching the stored vectorized text; the search algorithm comprises a vocabulary search algorithm and a sentence search algorithm.

The steps of the constructed vocabulary searching algorithm are as follows:

For each vocabulary $t_j$, a text list $\mathrm{In}(t_j)$ containing the vocabulary $t_j$ is created, expressed as: $\mathrm{In}(t_j) = \{X_i, \ldots\}$, meaning that the vocabulary $t_j$ appears in each text of the list;

For each text $X_i$, the text weight is calculated using the TF-IDF method:

$\text{TF-IDF}(t_j, X_i) = \mathrm{TF}(t_j, X_i) \times \mathrm{IDF}(t_j)$

Wherein $\mathrm{TF}(t_j, X_i)$ represents the word frequency of the vocabulary $t_j$ in the text $X_i$, and $\mathrm{IDF}(t_j)$ represents the inverse text frequency;

the calculation formula of the inverse text frequency $\mathrm{IDF}(t_j)$ is:

The constructed sentence searching algorithm comprises the following steps:

and selecting the text vector with the nearest distance value for output.

The data adjustment unit comprises the step of adjusting the stored text content within the authority range of the user based on the identity and the authority of the user.

The adjustment includes modifying and deleting the stored text content and generating new vectorized text data from the new text content after the user modifies and deletes the text content.

Example 2

The second embodiment of the invention provides an interactive data management method based on a text vectorization model of an AI PaaS platform.

S1: debugging the AI PaaS platform, dividing the performance of the AI PaaS platform, displaying the performance index of the AI PaaS platform, and integrating the interactive data management function in the AI PaaS platform.

S101: creating a visual interface, distributing the acquired performance of the AI PaaS platform, and displaying the performance index of the AI PaaS platform in real time through the visual interface.

S102: integrating a required functional model on the AI PaaS platform; and importing the created interactive data management system into the AI PaaS platform through the function integration unit.

S103: receiving a request instruction sent by the user side to the AI PaaS platform, and transmitting feedback data sent by the AI PaaS platform to the user side.

S104: storing the vectorized text data transmitted to the AI PaaS platform by the user, the transmitted request instruction, and the log data generated during the operation of the associated function of the AI PaaS platform.

S2: constructing a vocabulary, and vectorizing and storing the text information to be stored according to the vocabulary.

S201: constructing a vector table of Chinese vocabulary; generating a unique corresponding vector value for each Chinese vocabulary item to be stored, and integrating all the vocabulary vector values into a vocabulary.

S202: dividing the input text into characters and words, and converting them into data form according to the vector values of the vocabulary for storage.

S203: vectorizing and storing the text, and generating an initial identification vector S when a new text is input; when the text input is completed, an end identification vector E is generated.

S3: constructing an interactive data management function, and searching and deleting the text data stored by the user.

S301: employing a search algorithm to search the stored text content.

S302: modifying and deleting the stored text content, and generating new vectorized text data from the new text content after the user modifies or deletes the text content.

Example 3

FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in FIG. 3, an electronic device 500 is also provided according to yet another aspect of the present application. The electronic device 500 may include one or more processors and one or more memories, wherein the memory stores computer readable code which, when executed by the one or more processors, performs the interactive data management method described above.

The method or system according to embodiments of the application may also be implemented by means of the architecture of the electronic device shown in FIG. 3. As shown in FIG. 3, the electronic device 500 may include a bus 501, one or more CPUs 502, a read-only memory (ROM) 503, a random access memory (RAM) 504, a communication port 505 connected to a network, an input/output component 506, a hard disk 507, and the like. A storage device in the electronic device 500, such as the ROM 503 or the hard disk 507, may store the interactive data management method provided by the present application. The method comprises the following steps: debugging an AI PaaS platform, dividing the performance of the AI PaaS platform, displaying the performance index of the AI PaaS platform, and integrating an interactive data management function in the AI PaaS platform; constructing a vocabulary, and vectorizing and storing the text information to be stored according to the vocabulary; and constructing an interactive data management function, and searching and deleting the text data stored by the user.

Further, the electronic device 500 may also include a user interface 508. Of course, the architecture shown in FIG. 3 is merely exemplary, and one or more components of the electronic device shown in FIG. 3 may be omitted as needed when implementing different devices.

Example 4

FIG. 4 is a schematic diagram of a computer-readable storage medium according to one embodiment of the present application. FIG. 4 shows a computer-readable storage medium 600 according to one embodiment of the application. The computer-readable storage medium 600 has computer readable instructions stored thereon. When the computer readable instructions are executed by a processor, the interactive data management method according to the embodiments of the present application described with reference to the above figures may be performed. The storage medium 600 includes, but is not limited to, volatile memory and/or nonvolatile memory. Volatile memory can include, for example, random access memory (RAM), cache memory, and the like. Nonvolatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, and the like.

In addition, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, the present application provides a non-transitory machine-readable storage medium storing machine-readable instructions executable by a processor to perform instructions corresponding to the method steps provided by the present application, such as: debugging an AI PaaS platform, dividing the performance of the AI PaaS platform, displaying the performance index of the AI PaaS platform, and integrating an interactive data management function in the AI PaaS platform; constructing a vocabulary, and vectorizing and storing the text information to be stored according to the vocabulary; and constructing an interactive data management function, and searching and deleting the text data stored by the user.

The methods and apparatus, devices of the present application may be implemented in numerous ways. For example, the methods and apparatus, devices of the present application may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present application are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present application may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present application. Thus, the present application also covers a recording medium storing a program for executing the method according to the present application.

In addition, in the foregoing technical solutions provided in the embodiments of the present application, parts consistent with implementation principles of corresponding technical solutions in the prior art are not described in detail, so that redundant descriptions are avoided.

The purpose, technical scheme and beneficial effects of the invention are further described in detail in the detailed description. It is to be understood that the above description is only of specific embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. The interactive data management system of the Chinese text vectorization model based on the AI PaaS platform is characterized by comprising an AI PaaS platform module, a text vectorization module and an interactive data management module;

The AI PaaS platform module comprises a debugging unit, a function integration unit, a data transmission unit and a data storage unit; the AI PaaS platform module is used for debugging the AI PaaS platform, dividing the performance of the AI PaaS platform, displaying the performance index of the AI PaaS platform, and integrating an interactive data management function in the AI PaaS platform;

The text vectorization module comprises a vocabulary unit, a word segmentation unit and a vectorization unit; the text vectorization module is used for constructing a vocabulary, dividing input text words and vectorizing and storing text information to be stored according to the vocabulary;

The interactive data management module comprises a data searching unit and a data adjusting unit; the interactive data management module is used for constructing an interactive data management function, through which a user searches and deletes the text data that the user has stored;

the debugging unit is used for creating a visual interface, distributing the acquired performance of the AI PaaS platform and displaying the performance index of the AI PaaS platform in real time through the visual interface;

the data storage unit is used for storing vectorized text data transmitted to the AI PaaS platform by a user, a transmitted request instruction and log data generated during the operation of the associated function of the AI PaaS platform;

the vocabulary unit is used for constructing a vector table of Chinese vocabulary; a unique corresponding vector value is generated for each Chinese vocabulary item to be stored, and all vocabulary vector values are integrated into a vocabulary; the vocabulary V is expressed as: $V = \{W_1, W_2, \ldots, W_n\}$;

the word segmentation unit is used for segmenting the input text; word segmentation divides an input text into characters and words, which are converted into data form according to the vector values of the vocabulary and stored; for the characters of the input text, the result is expressed as: $T_{fi} = [C_1, C_2, \ldots, C_n]$; for the words of the input text, the result is expressed as: $T_{ci} = [Z_1, Z_2, \ldots, Z_n]$; a vector space dimension is added to the vector value of each character and word to represent the position of that character or word in the text;

The vectorization unit is used for vectorizing and storing texts, and generates an initial identification vector S when a new text is input; when the text input is completed, an end identification vector E is generated; the stored vector data for a single text is represented as: $X_i = [S, T_{fi}, T_{ci}, E]$;

The initial identification vector S stores name information of the vectorized text, storage position information of the vectorized text and time information for beginning to store the vectorized text; the end identification vector E stores time information for ending storage of the vectorized text, space size occupation information of the vectorized text and quantity statistical information of characters and words of the vectorized text;

the data searching unit adopts a searching algorithm and is used for searching the stored vectorized text; the search algorithm comprises a vocabulary search algorithm and a sentence search algorithm;

The steps of the constructed vocabulary searching algorithm are as follows:

for the vectorized text data, the word frequency of each vocabulary item in each text is calculated, and a word-text matrix is constructed: $\mathrm{Matrix}(X_i, t_j)$, wherein each entry represents the frequency of occurrence of the vocabulary $t_j$ in the vectorized text $X_i$;

for each vocabulary $t_j$, a text list $\mathrm{In}(t_j)$ containing the vocabulary $t_j$ is created, expressed as: $\mathrm{In}(t_j) = \{X_i, \ldots\}$, meaning that the vocabulary $t_j$ appears in each text of the list;

for each text $X_i$, the text weight is calculated using the TF-IDF method:

$\text{TF-IDF}(t_j, X_i) = \mathrm{TF}(t_j, X_i) \times \mathrm{IDF}(t_j)$;

wherein $\mathrm{TF}(t_j, X_i)$ represents the word frequency of the vocabulary $t_j$ in the text $X_i$, and $\mathrm{IDF}(t_j)$ represents the inverse text frequency, which is calculated from $N$ and $n$, wherein $N$ is the total number of texts and $n$ is the number of texts containing the vocabulary $t_j$;

for a queried vocabulary $t_j$, the text list containing $t_j$ is found through the inverted index, sorted according to the text weight, and output;

The constructed sentence searching algorithm comprises the following steps:

and selecting the text vector with the nearest distance value for output.

2. The interactive data management system of the Chinese text vectorization model based on the AI PaaS platform as claimed in claim 1, wherein the data adjustment unit is used for adjusting the stored text content, within the user's own authority range, based on the identity and authority of the user;

3. A method of an interactive data management system based on a Chinese text vectorization model of an AI PaaS platform as claimed in any of claims 1 and 2, wherein the method comprises,

4. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of claim 3 when executing the computer program.

5. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method as claimed in claim 3.