
CN118069812A - Navigation method based on large model - Google Patents

Navigation method based on large model

Info

Publication number
CN118069812A
Authority
CN
China
Prior art keywords
data
vector
text
features
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410315557.8A
Other languages
Chinese (zh)
Other versions
CN118069812B (en)
Inventor
杨利
金海武
郑熳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yuanshu Technology Co ltd
Original Assignee
Hangzhou Yuanshu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yuanshu Technology Co ltd filed Critical Hangzhou Yuanshu Technology Co ltd
Priority to CN202410315557.8A priority Critical patent/CN118069812B/en
Publication of CN118069812A publication Critical patent/CN118069812A/en
Application granted granted Critical
Publication of CN118069812B publication Critical patent/CN118069812B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a navigation method based on a large model, relating to the field of information technology and comprising the following steps: storing vector data in a Milvus vector database to construct a scenic spot knowledge base; acquiring voice data from a user and converting the voice data into text data; acquiring the top K knowledge base contents with the highest similarity to the text data; constructing question prompt data containing the knowledge base information; inputting the constructed question prompt data into a pre-trained large model for reasoning, extracting semantic information, and determining the question type; selecting the API interface corresponding to the question type from preset external API interfaces and acquiring external information from that interface; and fusing the acquired external information with the question type and generating a final answer text with a natural language generation algorithm, based on a preset answer generation template and fusion rules. Aiming at the problem of low answer precision in existing scenic spot navigation services, the application improves the precision of intelligent navigation answers.

Description

Navigation method based on large model
Technical Field
The application relates to the technical field of information, in particular to a navigation method based on a large model.
Background
In recent years, the artificial intelligence technology is widely applied in the intelligent tourism field, and brings new development opportunities for scenic spot tour guide services. Some research institutions and enterprises begin to explore and apply knowledge graph, natural language processing, voice recognition and other technologies to scenic spot navigation, and research and develop novel navigation products such as intelligent tour guide robots, intelligent voice explanation systems and the like. The products realize semantic understanding and question-answering by constructing a scenic spot knowledge base and utilizing a natural language processing technology, so that the information quantity and the interactivity of the navigation service are improved to a certain extent.
However, the existing intelligent scenic spot navigation technology still has problems to be solved. First, construction of a scenic spot knowledge base often depends on manual collection and arrangement; knowledge sources are limited, coverage is insufficient, and the diversified information needs of tourists are difficult to meet. Second, most existing knowledge representation and retrieval methods are based on keyword matching and cannot deeply understand the semantics of tourists' questions, so question-answering accuracy and relevance are low. Furthermore, scenic spot navigation involves heterogeneous data such as text, images, video and audio, and existing methods lack an effective multi-modal knowledge fusion mechanism and fail to fully exploit the complementary information of multi-source data. In addition, owing to the lack of introduction and fusion of external knowledge, existing navigation systems struggle to provide dynamic information and extended services related to scenic spots; their navigation content is monotonous and their functions are limited.
In the related art, for example, Chinese patent document CN117520524A provides an intelligent question-answering method and system for industry: a question text input by a user is obtained; the question text is converted into a query vector by a pre-constructed encoder; the query vector is matched against a pre-constructed industry knowledge base; when a target resource whose matching degree exceeds a threshold exists in the industry knowledge base, the target resource is returned to the user; when no such target resource exists, question answering is performed based on the query vector and a pre-tuned industry large model to obtain intention information, and the resources corresponding to the intention information are returned to the user. That application establishes an industry knowledge base and, during question answering, converts the user's question text into a query vector, first querying the industry knowledge base directly and, when no matching resource exists there, executing the question-answering action through the tuned industry large model to obtain target resources that match the user's intent. However, that scheme constructs a general-purpose industry knowledge base; although it covers common knowledge in the industry field, its coverage of the specific knowledge required for scenic spot navigation is insufficient, so the answer accuracy of scenic spot navigation services still needs to be improved.
Disclosure of Invention
1. Technical problem to be solved
Aiming at the problem of low answer precision of scenic spot navigation services in the prior art, the application provides a navigation method based on a large model that improves the precision of intelligent navigation answers through the fused representation of multi-source heterogeneous knowledge together with a large model.
2. Technical proposal
The aim of the application is achieved by the following technical scheme.
The embodiments of the specification provide a large-model-based navigation method comprising the following steps: collecting knowledge content of a scenic spot or exhibition hall, converting the collected knowledge content into structured vector data, storing the vector data in a Milvus vector database, and constructing a scenic spot knowledge base, wherein the knowledge content contains information about the scenic spot or exhibition hall; acquiring voice data from a user, converting the acquired voice data into text data, and preprocessing the text data; matching the preprocessed text data against the constructed scenic spot knowledge base through an embedded text search algorithm to obtain the top K knowledge base contents with the highest similarity to the text data; splicing the acquired top K knowledge base contents with the input text data to construct question prompt data containing the knowledge base information; inputting the constructed question prompt data into a pre-trained large model for reasoning, extracting semantic information, and determining the question type; selecting the API interface corresponding to the question type from preset external API interfaces and acquiring external information from that interface, wherein the external information comprises real-time scenic spot weather, round-trip ticket information, hotel information or collection information; and fusing the acquired external information with the question type, generating a final answer text with a natural language generation algorithm based on a preset answer generation template and fusion rules, and returning the generated answer text to the user.
The knowledge content refers to the collected text, pictures, videos and other information of a scenic spot or exhibition hall. This information reflects knowledge of the history, culture, landscape and so on of the scenic spot or exhibition hall. The Milvus vector database is a high-performance vector search engine: it can convert unstructured data such as the collected text, pictures and videos into vectors of fixed dimension and store them. In the application, knowledge content refers to the collected text introductions, pictures, videos and similar material of a scenic spot or exhibition hall, covering its basic conditions, history, cultural features, attraction distribution, festival activities and so on. The collected knowledge content is converted into fixed-dimension vectors using natural language processing, computer vision and related techniques, and these vectors are stored in the Milvus vector database, with each knowledge content record corresponding to one vector. In the front-end system, knowledge content of the scenic spot or exhibition hall similar to a user query can be retrieved from Milvus through vector similarity search. In this way, a structured and retrievable scenic spot knowledge base is built on Milvus's vector search capability, which can support various intelligent knowledge acquisition and recommendation applications.
The embedded text search algorithm is a similarity search algorithm based on word vectors. In the application, the text query input by the user is converted into a word vector to obtain a text query vector. In the vector space of the Milvus knowledge base, indexes such as cosine similarity are used to find the top-K knowledge base vectors most similar to the text query vector, and the knowledge base contents semantically closest to the query text are returned in order of similarity. Embedded text search enables semantic retrieval of the knowledge base rather than mere keyword matching. Specifically, Word2Vec, GloVe, BERT or Doc2Vec can be adopted.
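By way of illustration only (this sketch is not part of the original disclosure), the embedded text search step could be prototyped as follows with pymilvus and a sentence-embedding model; the collection name, field names and embedding checkpoint are assumptions.

```python
from pymilvus import connections, Collection
from sentence_transformers import SentenceTransformer

connections.connect(host="localhost", port="19530")
collection = Collection("scenic_spot_kb")     # hypothetical collection name
collection.load()

# Any sentence-embedding model that matches the knowledge base vectors would do.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def search_top_k(query_text: str, k: int = 5):
    """Embed the user query and return the top-K most similar knowledge base entries."""
    query_vec = model.encode([query_text], normalize_embeddings=True).tolist()
    results = collection.search(
        data=query_vec,
        anns_field="embedding",                     # hypothetical vector field
        param={"metric_type": "COSINE", "params": {"ef": 64}},  # use IP on older Milvus
        limit=k,
        output_fields=["content"],                  # hypothetical payload field
    )
    return [(hit.entity.get("content"), hit.distance) for hit in results[0]]
```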
The large model refers to a language representation model obtained by pre-training with deep learning methods on a large-scale text corpus. A large model can learn the semantic and grammatical information of a language and can understand and generate text. In the application, the constructed question prompt text is taken as input to a pre-trained large model, which analyzes the semantics of the text and extracts the semantic information it expresses. Using the language knowledge learned by the large model, the question prompt can be assigned to a question type, achieving automatic classification of question types. Specifically, BERT, GPT, XLNet or similar models may be employed.
The API interface refers to an application programming interface (API), a software interface for interaction and communication between software components. In the application, a group of external API interfaces corresponding to different question types is preset. These API interfaces can acquire external real-time data such as scenic spot weather, tickets and hotels. When a question prompt is judged to belong to a certain type, the matching API interface is called, and external real-time data related to that question is acquired. For example, if the question is judged to be about scenic spot weather, a weather API is called to obtain real-time weather data for the scenic spot; if it is judged to ask about travel arrangements, ticket and hotel APIs are called to obtain the related information. In short, different APIs are called according to the question type to acquire different kinds of external information.
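A minimal, hypothetical sketch (not from the disclosure) of routing a classified question type to its external API interface; the endpoints are placeholders.

```python
import requests

# Hypothetical registry mapping question types to external API endpoints.
API_REGISTRY = {
    "weather": "https://api.example.com/weather",
    "ticket":  "https://api.example.com/tickets",
    "hotel":   "https://api.example.com/hotels",
}

def fetch_external_info(question_type: str, params: dict) -> dict:
    """Call the API that matches the question type and return its JSON payload."""
    url = API_REGISTRY.get(question_type)
    if url is None:
        return {}                      # no external call needed for this type
    resp = requests.get(url, params=params, timeout=5)
    resp.raise_for_status()
    return resp.json()

# e.g. fetch_external_info("weather", {"city": "Hangzhou", "date": "2024-03-20"})
```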
The natural language generation (NLG) algorithm is a text generation technology that automatically converts structured data into natural language text conforming to grammar and semantics. In the present application, the question type and the acquired external API information are taken as input data. Different preset answer templates are selected according to the question type; each template contains variable slots, and the external information is filled into the slots according to predetermined generation rules. A complete natural language answer for the question type, fused with the external information, is finally generated and returned to the user. Specifically, several approaches may be adopted. Template-based methods: manually written text templates with filled-in slots. Planning-based methods: a semantic representation is constructed first and then converted to text. Neural-network-based methods: sequence-to-sequence models, Transformers and the like. Hybrid methods: templates combined with deep learning, which both guarantee quality and add variety.
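Illustrative sketch (not part of the disclosure) of the template-based answer generation described above; the templates and slot names are invented for the example.

```python
# Hypothetical answer templates with variable slots, one per question type.
ANSWER_TEMPLATES = {
    "weather": "The weather at {spot} today is {condition}, {temperature}°C; {advice}.",
    "ticket":  "Tickets for {spot} cost {price} yuan; opening hours are {hours}.",
}

def generate_answer(question_type: str, slots: dict) -> str:
    """Fill the template slots with fused external information (template-based NLG)."""
    template = ANSWER_TEMPLATES.get(question_type, "Sorry, no answer is available.")
    return template.format(**slots)

# generate_answer("weather", {"spot": "West Lake", "condition": "sunny",
#                             "temperature": 22, "advice": "a good day for a boat tour"})
```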
Further, converting the collected knowledge content into structured vector data, comprising: performing word segmentation, part-of-speech tagging and named entity recognition processing on the text data, and extracting text features; performing target detection, image segmentation and key frame extraction processing on the image and video data, and extracting visual features; performing voice recognition and acoustic feature extraction processing on the audio data to extract audio features; and fusing the extracted text features, visual features and audio features to generate a multi-modal fusion feature vector which is used as structured vector data.
Further, extracting text features includes: performing word segmentation processing on text data by adopting a Chinese word segmentation algorithm based on BERT, and segmenting the text data into semantic units; performing part-of-speech tagging on the segmented text by adopting BiLSTM-CRF algorithm, and identifying part-of-speech information of each semantic unit; carrying out named entity recognition by adopting BiLSTM-CRF algorithm combined with an attention mechanism according to the segmented text; and extracting semantic feature vectors of text data as text features by adopting a BERT-based word embedding algorithm according to the word segmentation processing result, the part-of-speech tagging result and the named entity recognition result.
BERT (Bidirectional Encoder Representations from Transformers) is a language representation pre-training model released by Google that can fully learn the semantic information of characters. In the application, the BERT model learns the word semantics of a large Chinese corpus and extracts a semantic feature representation of the text to be segmented. The text representation is fed into the BERT model, which predicts which phrase each character belongs to; according to the BERT predictions, word segmentation of the Chinese text is carried out. Compared with the traditional approach, this segmentation can divide semantic units more accurately instead of relying on dictionary matching alone: semantic Chinese word segmentation is performed using the semantic knowledge learned by BERT. In subsequent processing, the segmented data makes it easier to extract key information and improves downstream task performance.
Wherein BiLSTM-CRF is a typical sequence labeling algorithm for realizing the tasks of part-of-speech labeling and the like. BiLSTM a two-way long-short-term memory network can learn the front and rear associated information in the text. The CRF conditional random field can optimize the labeling result of the whole sequence. In the present application, a text sequence after word segmentation is input. BiLSTM learn the front and rear semantics of each word and obtain the semantic representation of the whole sequence. The CRF layer optimizes at the sequence level, labeling the most likely part-of-speech tag sequences. Extracting features by BiLSTM, and optimizing a result by using CRF to realize part-of-speech tagging of the segmented corpus. The part of speech of each of the segmented lemmas, such as nouns, verbs, adjectives, etc., is noted. The part-of-speech tagging can help understand semantics and promote downstream task effects. In this scenario, the part-of-speech information may help analyze the intent of the problem, enabling problem classification.
Wherein BiLSTM-CRF combined with the attention mechanism is based on a typical BiLSTM-CRF model, and a sequence labeling model of the attention mechanism is added. The attention mechanism can enable the model to automatically learn the importance degree of different words, so that more accurate named entity recognition can be performed. In the application, biLSTM network learning semantic features are constructed for the segmented text. An attention layer is added at BiLSTM to let the model learn the importance of each word. The attention mechanism may emphasize named entity words and attenuate the influence of extraneous words. The CRF layer completes sequence labeling and labels named entities appearing in the text. And identifying key entities such as person names, place names, organization names and the like in the text. These key entities can help analyze the intent of the text, judging the type of problem. The attention mechanism enables the model to focus on semantic key information, and named entity recognition and downstream task effects are improved. Finally, the named entity information can be accurately extracted from the word segmentation text by utilizing the algorithm.
The word embedding algorithm based on the BERT is a method for converting text into semantic feature vectors by utilizing semantic information obtained by learning a BERT model. The BERT is a language representation pre-training model, and can fully learn the semantic information of words. In the application, text data after word segmentation, part-of-speech tagging and named entity recognition are input. Semantic information of the text sequence is extracted using a pre-trained BERT model. The BERT model internalizes a large amount of knowledge of the corpus and can represent the semantics of the text. For each term in the text, the BERT is used to obtain its semantic representation in a particular context. The BERT vector for each token is stitched as a semantic feature of the entire text sequence. Text semantic features are enhanced by BERT and contain rich semantic knowledge. These semantic feature vectors may be used to represent text as input to a text classification or other model. Knowledge of BERT is used to boost the effect of feature representation, thereby improving the performance of downstream tasks.
Further, extracting visual features includes: performing target detection on images and video frames with the YOLOv5 algorithm to identify and locate objects in the images; performing semantic segmentation on images and video frames with the Deeplabv3 algorithm to divide each image into different semantic regions; and, according to the target detection results and semantic segmentation results, extracting visual feature vectors of the images and video frames with the I3D algorithm as visual features.
YOLOv5 is a deep-learning-based object detection algorithm. Object detection identifies all objects present in an image or video frame and gives their categories and location coordinates. In the present application, YOLOv5 is an end-to-end object detection model that can predict directly on an image and output detection results. Given input images or video frames, YOLOv5 outputs the target categories they contain and the bounding box positions of those targets in the images. YOLOv5 uses a convolutional neural network to extract image features, grids the image spatially, and predicts for each grid cell the probability of containing a target and the target box position, finally outputting the category prediction and precise location of each target and completing object detection. In the present application, YOLOv5 can identify major objects in an image, such as people, buildings and animals, and provide semantic information about those objects. The detection results help in understanding the image content and judging the scene type of the image, so that images can be classified and managed.
Deeplabv3 is a semantic segmentation algorithm that can perform pixel-level semantic understanding of an image. Semantic segmentation assigns each pixel in an image to a semantic category, such as person, vehicle or building. In the application, Deeplabv3 uses a convolutional neural network to extract image features and make semantic predictions; atrous (dilated) convolution and the ASPP module are introduced so that information at different scales can be captured. Given an input image, Deeplabv3 outputs a prediction of the semantic category of each pixel, and the image can ultimately be partitioned into semantic masks, for example separating people and vehicles. In the application, Deeplabv3 can analyze the semantic regions of an image and judge the scene type in the picture, for example distinguishing a building image from a natural landscape image. Semantic segmentation provides finer semantic understanding than whole-image classification and facilitates the classification and management of images.
Wherein I3D is a video understanding algorithm that can extract spatial and temporal features from video. In the application, the I3D utilizes a 3D convolution network to extract the space-time characteristics of the video and captures the dynamic information of the image sequence. A sequence of video frames subject to object detection and semantic segmentation is input. And I3D learns the spatial and time sequence characterization of the video through operations such as 3D convolution, pooling and the like. And finally outputting the unified feature expression of the whole video segment. This video feature vector fuses object, semantic, and dynamic information. In the present application, the I3D features may represent video content for classification and understanding. In combination with the previous detection and segmentation results, the video semantics can be described more precisely. The I3D feature vector of the final video is used as a visual feature, and the scene category of the video can be judged and classified.
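By way of illustration only (not part of the original disclosure), the object-detection stage of the visual pipeline above could be prototyped as follows; the Ultralytics torch.hub entry point and the yolov5s checkpoint are assumptions, and the Deeplabv3 segmentation and I3D steps are omitted.

```python
import torch

# Load a small pretrained YOLOv5 model from the Ultralytics hub
# (weights are downloaded on first use).
detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

def detect_objects(image_path: str):
    """Run object detection on one image and return (class name, confidence, box) tuples."""
    results = detector(image_path)
    df = results.pandas().xyxy[0]          # one row per detected object
    return list(zip(df["name"], df["confidence"],
                    df[["xmin", "ymin", "xmax", "ymax"]].values.tolist()))
```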
Further, extracting audio features includes: converting the audio data into text data with the Conformer algorithm; extracting acoustic features of the audio data, including MFCCs, with the XLSR algorithm; and, according to the text data and the acoustic features, extracting an audio feature vector of the audio data with the ECAPA-TDNN algorithm as the audio features.
Conformer is an automatic speech recognition algorithm that converts speech audio into text. In the present application, Conformer uses a convolution module to learn local information in the audio and a self-attention module to learn global context dependencies. The Conformer model is trained on audio input to learn the correspondence between audio and text; at prediction time, given input audio data, Conformer outputs the corresponding text transcription. In the present application, Conformer can transcribe the collected audio data, which may come from navigation narration, exhibition hall introductions and the like, into text. The transcribed text can be used for subsequent text analysis to enable understanding of the audio content. Audio transcription broadens the coverage of multimodal analysis and allows richer information sources to be handled.
XLSR (Cross-lingual Speech Representation) is an algorithm for extracting speech-recognition-oriented features from speech. MFCC (Mel-frequency cepstral coefficients) is a numerical sequence used to represent speech features. In the present application, XLSR uses a deep neural network to learn speech representations related to speech recognition; given raw audio data, XLSR outputs a feature representation of the speech, including MFCC features. MFCCs represent the spectral information of speech through Mel filtering and the discrete cosine transform, and are among the key data characterizing speech. In the present application, XLSR together with MFCCs expresses the acoustic feature information of the audio. These acoustic features can be used to represent audio content, determine the audio category and so on; for example, MFCCs can help determine whether audio is speech or music. Acoustic features enhance the understanding of audio semantics and assist audio analysis.
ECAPA-TDNN is an audio feature extraction algorithm that can learn the semantic information of speech. An audio feature vector is an audio content feature represented as a vector. In the present application, ECAPA-TDNN learns the timing information of speech using a TDNN structure, while external semantic information such as the text transcription result is added to strengthen the learning of speech semantics. Given the original audio and the corresponding text information, ECAPA-TDNN outputs a feature vector of the audio content that integrates the acoustic and semantic features of the speech. In the present application, ECAPA-TDNN features can be extracted for audio content; these semantically informed audio vectors can represent the audio for classification, retrieval and so on, for example to determine whether the audio is a guide narration or an explanation, or to search for similar audio. The audio features enrich the representation of the audio and facilitate subsequent processing and analysis.
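Illustrative sketch (not in the original text) of the acoustic-feature step, assuming librosa is available; this computes classic MFCCs directly as a stand-in, while the Conformer transcription and ECAPA-TDNN embedding stages would typically come from a dedicated speech toolkit and are omitted here.

```python
import librosa

def extract_mfcc(audio_path: str, n_mfcc: int = 13):
    """Load an audio file and compute its MFCC matrix of shape (n_mfcc, frames)."""
    y, sr = librosa.load(audio_path, sr=16000)          # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc
```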
Further, fusing the extracted text features, visual features and audio features to generate a multimodal fusion feature vector as structured vector data includes: constructing an attention fusion network and obtaining fusion weights among the text features, visual features and audio features; generating a multimodal fusion feature according to the fusion weights, text features, visual features and audio features; performing nonlinear mapping and feature transformation on the generated multimodal fusion feature to generate a multimodal fusion feature vector; and performing L2 normalization on the obtained multimodal fusion feature vector to serve as the structured vector data.
Specifically, the attention fusion network can learn the relation among different modal features and calculate the weight of feature fusion. Feature vectors of different modalities, such as text features, image features, audio features, are input. The correlation between feature vectors is calculated, and common methods are dot product attention (Dot Product Attention) and scaled dot product attention (Scaled Dot-Product Attention). Each feature vector is assigned a weight value representing the importance contribution of this feature to the final result. Weights are typically normalized using Softmax. The higher weighted feature is given greater value reflecting its importance to the task goal. And carrying out weighted summation or multiplication according to the weights of the features to obtain fusion representations of the features of different modes. The learned feature weights can be regarded as the attention allocation situation of different modalities. Modeling of importance of different features can be obtained through attention calculation, and automatic on-demand selection and fusion of the features are achieved.
Specifically, the generation of the multimodal fusion feature can be realized through the attention fusion network. Feature vectors of the different modalities, such as the text feature x_t, image feature x_i and audio feature x_a, are input. In the attention fusion network, the correlation weights w_tt, w_ti, w_ta among text, image and audio are calculated and normalized by the importance of the feature vectors so that Σw = 1. After the weights are obtained, the weighted fusion of the features is carried out: x_fusion = w_tt × x_t + w_ti × x_i + w_ta × x_a, where x_fusion is the weighted fusion result of the text, image and audio multimodal features. The learned weights reflect the importance of the different modality features to the current task, and the fusion feature integrates the complementary information of the three modalities. Compared with any single feature, the fusion feature expresses the semantics of the multimodal information more comprehensively and enhances the understanding capability of the model. The fusion feature is taken as output and can be used for subsequent processing such as classification and clustering.
L2 normalization is a vector normalization method that rescales vectors to a uniform scale. Structured vector data is a fixed-length vector representation converted from unstructured data such as text and images. In the application, the vector x obtained through feature fusion and transformation is input and L2-normalized: x_norm = x / ||x||_2, where ||x||_2 is the L2 norm of x, i.e. its Euclidean length. L2 normalization scales different vectors to a similar numerical range; it also increases the distinguishability of the vectors and improves the model's performance. In the present application, the fusion feature vector is L2-normalized to obtain a vector representation of fixed length and unified scale. This vector carries the semantic information of the multimodal data and is a structured representation that can be used directly in vector search, clustering and other algorithms.
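As an illustrative sketch only (not part of the disclosure), the attention-weighted fusion and L2 normalization described above might look like the following PyTorch module; the 512-dimensional features and the single linear scoring layer are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Fuse text/image/audio features with learned attention weights, then L2-normalize."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # scores each modality's contribution

    def forward(self, x_t, x_i, x_a):
        feats = torch.stack([x_t, x_i, x_a], dim=1)        # (batch, 3, dim)
        weights = F.softmax(self.score(feats), dim=1)      # (batch, 3, 1), sums to 1
        fused = (weights * feats).sum(dim=1)               # weighted sum over modalities
        return F.normalize(fused, p=2, dim=-1)             # L2-normalized fusion vector

# fusion = AttentionFusion(512)
# vec = fusion(torch.randn(1, 512), torch.randn(1, 512), torch.randn(1, 512))
```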
Further, performing nonlinear mapping and feature transformation on the generated multi-modal fusion feature to generate a multi-modal fusion feature vector includes: mapping the generated multi-modal fusion features to a high-dimensional feature space with a Gaussian kernel function to obtain the nonlinear relations between features, wherein the high-dimensional feature space is a feature space of higher dimension than the original feature space; calculating an intra-class divergence matrix and an inter-class divergence matrix in the high-dimensional feature space, wherein the intra-class divergence matrix reflects the compactness between samples of the same class and the inter-class divergence matrix reflects the separation between samples of different classes; constructing a generalized Rayleigh quotient and obtaining the optimal projection direction by maximizing the ratio of the inter-class divergence matrix to the intra-class divergence matrix; solving the generalized eigenvalue problem to obtain generalized eigenvalues and eigenvectors, selecting the generalized eigenvectors corresponding to the first M largest generalized eigenvalues, and constructing a transformation matrix; performing a linear transformation on the multi-modal fusion features in the high-dimensional feature space with the transformation matrix to obtain dimension-reduced feature vectors; and taking the dimension-reduced feature vectors as the final representation of the multi-modal fusion feature vector.
The Gaussian kernel function is a method of mapping original features into a high-dimensional space so that nonlinear relations between features can be found. A high-dimensional feature space refers to a new feature representation space of larger dimension than the original feature space. In the present application, the input is the multimodal fusion feature vector x, which is mapped to a new high-dimensional space using the RBF Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2σ^2)); the mapped representation φ(x) gives the coordinates of x in the high-dimensional space. The high-dimensional space can model nonlinear relations between the input features, and through the kernel mapping, data that was not linearly separable can become linearly separable. In the present application, the Gaussian kernel mapping strengthens the expressive power of the multimodal features; the data is more distinguishable in the high-dimensional space, which benefits subsequent classification or clustering. The high-dimensional mapping reveals the internal structure of the data, aiding the understanding of multimodal content, and the mapped high-dimensional features can be used as input to subsequent models.
Wherein the intra-class divergence matrix represents the degree of compactness between samples of the same class, and the inter-class divergence matrix represents the degree of separation between samples of different classes. In the present application, the representation after the multi-modal sample is mapped to the high-dimensional feature space is denoted as x. The covariance matrix Sw of the sample spread within the same class is calculated, representing the degree of compactness within the class. And calculating covariance matrixes Sb among centers of different classes to represent the separation degree among the classes. Smaller Sw means that the same class of samples are highly aggregated and the intra-class differences are small. Sb is large to mean that different categories can be clearly separated, and the difference between categories is large. In the present application, sw and Sb of high-dimensional characteristics are calculated. Small Sw and large Sb mean intra-class aggregation and inter-separation after mapping to high dimensions. The effect of gaussian kernel mapping, and the separability of high-dimensional features, was evaluated. If sufficiently separated high-dimensional features are obtained, subsequent classification and identification is facilitated. The class discrimination capability of the mapping features can be analyzed through Sw and Sb indexes.
The generalized Rayleigh quotient is a matrix function that can be used to solve for the most discriminative projection direction. The optimal projection direction is the direction along which projecting the samples achieves the best class separation. In the present application, the inputs are the intra-class divergence matrix Sw and the inter-class divergence matrix Sb in the high-dimensional space. The generalized Rayleigh quotient J(a) = (a^T Sb a) / (a^T Sw a) is constructed, and the eigenvector a corresponding to the maximum eigenvalue of J is solved; this eigenvector a is the optimal projection direction. Projecting onto this direction maximizes the between-class spacing while minimizing the within-class distance. In the application, the generalized Rayleigh quotient is used to find the optimal projection direction of the multimodal samples; this direction distinguishes samples of different classes and serves to reduce dimension while separating classes. The samples are transformed along the projection direction to obtain new, more discriminative features, which can be used for classification and recognition to improve the performance of subsequent multimodal understanding tasks.
Specifically, the transformation matrix is used to perform the linear transformation and dimension reduction on the high-dimensional features. The inputs are the constructed transformation matrix P and the multimodal feature x mapped into the high-dimensional space. The linear transformation y = P^T × x is applied, where y is the new, dimension-reduced feature representation; the column vectors of P are the selected principal generalized eigenvectors and P^T denotes the transpose of P. The matrix multiplication realizes the coordinate transformation and yields the dimension-reduced feature: the multimodal feature x is mapped into the low-dimensional subspace defined by P, which retains the dimensions that best express the class differences while removing redundant dimensions, achieving dimension reduction. The reduced y preserves the main discriminative information of the original feature and improves the distinguishability of the feature vectors; the new feature y can be used for subsequent processing such as classification and recognition.
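For illustration only, the following is a simplified numpy/scipy sketch of the kernel mapping and generalized-eigenvalue projection described above, roughly in the spirit of kernel Fisher discriminant analysis; the class labels y, the regularization term and the parameter values are assumptions, not details taken from the disclosure.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.metrics.pairwise import rbf_kernel

def kernel_lda_projection(X, y, n_components=8, gamma=1e-3):
    """Map features with a Gaussian kernel, then project along directions that
    maximize the between-class / within-class scatter ratio (generalized Rayleigh quotient)."""
    K = rbf_kernel(X, X, gamma=gamma)          # implicit high-dimensional mapping
    classes = np.unique(y)
    n = K.shape[0]
    mean_total = K.mean(axis=0)
    Sw = np.zeros((n, n))
    Sb = np.zeros((n, n))
    for c in classes:
        Kc = K[y == c]
        mean_c = Kc.mean(axis=0)
        Sw += (Kc - mean_c).T @ (Kc - mean_c)                            # within-class scatter
        Sb += len(Kc) * np.outer(mean_c - mean_total, mean_c - mean_total)  # between-class scatter
    # Generalized eigenproblem Sb a = λ Sw a; keep the M leading eigenvectors.
    eigvals, eigvecs = eigh(Sb, Sw + 1e-6 * np.eye(n))
    P = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return K @ P                                # dimension-reduced multimodal features
```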
Further, storing the vector data in the Milvus vector database includes: establishing a collection in the Milvus vector database and setting the dimension of the collection according to the dimension of the multimodal fusion feature vector; acquiring the timestamp of each multimodal fusion feature vector and creating time partitions in the Milvus vector database according to the timestamps; acquiring the category or label information of each multimodal fusion feature vector and creating semantic partitions in the Milvus vector database according to the categories or labels; converting the multimodal fusion feature vectors into floating-point vectors; inserting the floating-point vectors in batches into the established collection, time partitions and semantic partitions using the batch insertion interface of the Milvus vector database; and building an HNSW index on the Milvus vector database into which the floating-point vectors have been inserted, so that vector similarity search and query are performed by constructing a multi-level graph structure.
Milvus is a vector search engine that can be used to build collections of vectors and to retrieve them. In the application, the input is the obtained dimension-reduced multimodal fusion feature vector y. A new collection is created in Milvus, and its dimension is set to the length of y according to the dimension of the vector. The collection is the basic unit for storing and managing vectors in Milvus, and all vectors y are inserted into this collection as records. In the present application, the Milvus collection is used to store the feature vectors of all multimodal samples; when new data is inserted, its vector is also added to the collection. The vectors in the Milvus collection can be indexed to enable fast nearest-neighbour search: at search time the similarity between vectors is calculated and the best matching results are returned, realizing retrieval and similarity matching of multimodal data samples.
Specifically, in Milvus a time partition is created according to the timestamp: the generation timestamp corresponding to each multimodal sample feature vector is obtained, a time field is set in the Milvus collection to hold the timestamp of each vector, and partition rules are set according to a time range, for example one partition per month. A time partition is created in the Milvus collection with a statement of the form CREATE PARTITION collection_name TIME("2022-01"), where TIME indicates that the partition is created by time, followed by the specified month range. When a new sample vector is inserted, it is automatically placed in the corresponding time partition according to its timestamp, and at query time the search can be restricted to a partition range to optimize performance. In the application, time partitions are established according to the feature generation time, realizing time-ordered management and querying of the multimodal samples and improving the query efficiency of the vector index.
Specifically, in Milvus a semantic partition is created according to the category or label: the category label corresponding to each multimodal sample feature vector is obtained, a classification field is defined for the collection in Milvus to hold the category of each vector, and partitions are created by category, for example by exhibition hall or by cultural relic type. A semantic partition is created in the Milvus collection with a statement of the form CREATE PARTITION collection_name TAG("Category A"), which means that a partition is created from the tag value, followed by the category. When a new vector is inserted, it is placed in the corresponding semantic partition according to its category label, and queries can be restricted to a specified category partition. In the application, semantic partitions are established according to the class labels of the vectors, realizing management and querying of the multimodal data by content category; semantic partitioning optimizes indexing performance for vectors of a particular class.
A floating-point vector is a vector in which every element is represented as a floating-point number. In the present application, the vector y representing the multimodal feature is input; its elements may originally be of various numerical types, such as integer or double precision. Each element of y is converted to a 32-bit or 64-bit floating-point number, yielding a new floating-point vector representation y_float. In Milvus, vectors must be stored uniformly as floating-point numbers, because Milvus uses floating-point arithmetic for vector distance calculations and skipping the conversion could lead to calculation errors. In the present application, y is therefore converted to the floating-point vector y_float, which is then inserted into the Milvus collection so that vector distance metrics are computed correctly.
The batch insertion interface is an API provided by Milvus that can insert multiple vector records at once. In the application, the input is the converted multimodal floating-point feature vectors y_float, each corresponding to one multimodal sample. Milvus provides a batch insertion interface of the form insert(collection_name, [vector_arrays]), where [vector_arrays] is an array of multiple vectors; this interface inserts many vectors at the same time, which is faster than inserting them one by one. In the present application, the batch insertion interface is adopted: all y_float vectors are packed into an array and inserted into Milvus in a single call, and each vector is automatically routed to the corresponding partition according to its time and semantic tags. This achieves efficient storage of a large number of multimodal vectors; batch insertion is faster than one-by-one insertion and reduces insertion overhead.
The HNSW index is a graph-based index algorithm that builds a multi-level navigation graph in the vector database to enable efficient vector similarity search. In this disclosure, an HNSW index is built for the collection of vectors in Milvus. HNSW navigates queries by building a hierarchical graph structure in which graph vertices are the data vectors and edges represent similarity between vectors. At search time, the graph traversal query starts from an entry node, and the similarity relations between vectors are used to lock onto the target quickly. In this disclosure, the HNSW index is built on the inserted collection, constructing a multi-level similarity graph for the multimodal vectors. When searching for a vector, the node most similar to the query vector is found in the graph, i.e. the multimodal result semantically closest to the query vector is returned. The HNSW index enables efficient similarity search over large-scale vector sets.
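A hypothetical pymilvus sketch (not part of the disclosure) of the storage steps described above: creating the collection, partitions, batch-inserting float vectors and building an HNSW index. The collection name, field names, dimension and index parameters are all illustrative.

```python
import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

# Hypothetical schema: an auto-generated primary key plus the fused feature vector.
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=512),  # dim = fused vector length
]
collection = Collection("scenic_spot_kb", CollectionSchema(fields))

# Partitions standing in for the time and semantic partitions described above.
collection.create_partition("month_2022_01")
collection.create_partition("category_a")

# Batch insert: vectors converted to 32-bit floats, inserted in one call.
vectors = np.random.rand(100, 512).astype("float32")   # placeholder data
collection.insert([vectors.tolist()], partition_name="category_a")

# HNSW index for approximate nearest-neighbour similarity search.
collection.create_index(
    "embedding",
    {"index_type": "HNSW", "metric_type": "COSINE",
     "params": {"M": 16, "efConstruction": 200}},
)
collection.flush()
```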
Further, obtaining external information includes: determining the category of the functional problem according to the acquired functional problem, wherein the category of the functional problem comprises weather inquiry, ticket inquiry, hotel inquiry, collection inquiry, scenic spot open time or ticket purchase; according to the category of the functional problem, selecting an API interface corresponding to the category from a preset external API interface set, taking the functional problem as a request parameter, and acquiring corresponding external data in a JSON format through the API interface; analyzing and extracting information of JSON format data acquired from an external API interface by adopting a rule-based method to generate structured data; and performing de-duplication processing on the obtained structured data by adopting a fuzzy matching algorithm to obtain final external information.
Specifically, intent recognition of functional questions: a pre-trained intent recognition model performs intent recognition on the input functional question. The intent recognition model is obtained by fine-tuning a large language model such as BERT on a question classification dataset. It classifies functional questions into predefined categories such as weather queries, ticket queries, hotel queries and collection queries, each category corresponding to a specific query intent. Selecting the corresponding external API interface according to the question category: an external API interface set is preset, containing the API interfaces required for different kinds of queries, such as a weather query interface, a ticket query interface, a hotel query interface and a collection query interface; according to the question category obtained by intent recognition, the API interface corresponding to that category is selected from the set. Obtaining external data in JSON format through the API interface: key information such as the city, date and keywords of the query is extracted from the functional question as the request parameters of the API interface; the request parameters are spliced into the corresponding API interface URL, the external API interface is called via an HTTP request, and the returned JSON-format data is obtained. Parsing the JSON data with a rule-based method to generate structured data: the acquired JSON-format data is parsed and its information extracted using a rule-based method. A set of parsing rules is predefined according to the JSON data structures and field meanings of the different categories, for example resolving the temperature field in the weather JSON data into a temperature attribute and the weather condition field into a condition attribute. The key fields in the JSON data are extracted with the parsing rules to generate a structured data representation, stored as key-value pairs with one attribute per field. De-duplicating the structured data with a fuzzy matching algorithm: the structured data obtained by parsing is de-duplicated with a fuzzy matching algorithm. The text fields in the structured data are segmented and keyword features extracted; the keyword features of different structured records are compared pairwise and their similarity calculated, for example with edit distance or the Jaccard coefficient. A similarity threshold is set: structured records whose similarity exceeds the threshold are considered duplicates, only one of them is retained, and the rest are deleted. Obtaining the de-duplicated structured external information: through the above steps, the de-duplicated structured external information data is obtained; it undergoes format conversion and field mapping to unify the data format for subsequent storage and use, and the processed external information is input to the answer generation module, where it is fused with the knowledge base search results to generate the final answer text.
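Purely illustrative sketch (not from the disclosure) of this external-information step: calling a hypothetical API, applying rule-based JSON parsing, and de-duplicating with a simple string-similarity measure standing in for the fuzzy matching described above. The URLs, field names and threshold are assumptions.

```python
import json
import difflib
import requests

# Hypothetical parsing rules: source JSON field -> structured attribute, per question type.
PARSE_RULES = {
    "weather": {"temp": "temperature", "cond": "condition", "city": "city"},
    "ticket":  {"price": "price", "open": "opening_hours"},
}

def fetch_and_parse(question_type: str, url: str, params: dict) -> dict:
    """Call an external API, then extract key fields into a structured record."""
    raw = requests.get(url, params=params, timeout=5).json()
    rules = PARSE_RULES.get(question_type, {})
    return {target: raw.get(source) for source, target in rules.items()}

def deduplicate(records: list[dict], threshold: float = 0.9) -> list[dict]:
    """Drop near-duplicate records using a simple string-similarity ratio."""
    kept: list[dict] = []
    for rec in records:
        text = json.dumps(rec, ensure_ascii=False, sort_keys=True)
        if all(difflib.SequenceMatcher(
                   None, text,
                   json.dumps(k, ensure_ascii=False, sort_keys=True)).ratio() < threshold
               for k in kept):
            kept.append(rec)
    return kept
```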
Furthermore, the large model adopts the Tongyi Qianwen-14B-chat (Qwen-14B-Chat) model, which uses a Transformer encoder-decoder architecture comprising multiple self-attention encoding layers and multiple self-attention decoding layers; the knowledge content comprises information about scenic spots or exhibition halls, including their opening hours and ticket purchase information.
Tongyi Qianwen-14B-chat (Qwen-14B-Chat) is a very large Chinese open-domain dialogue pre-training model. In the present application, Qwen-14B-chat is a Transformer-based Seq2Seq dialogue model: the encoder contains multiple layers of self-attention that capture the context information of the text, and the decoder likewise uses self-attention to generate coherent, relevant replies. Through large-scale pre-training on a Chinese knowledge corpus, the model acquires strong semantic understanding capability. In the present application, dialogue is carried out with the Qwen-14B-chat model: the question prompt text is input and highly targeted reply content is output. Relying on the model's pre-trained knowledge, the questions can be accurately understood and replies conforming to the style of the knowledge base can be generated, realizing knowledge question answering over a specific knowledge base.
The Transformer is a neural network architecture based on the attention mechanism and is widely applied in natural language processing. In the application, the Qwen-14B-chat model adopts the Transformer architecture, which comprises an encoder and a decoder: the encoder learns an internal representation of the text through self-attention layers, and the decoder, which also uses self-attention, generates the target sequence. In the present application, the encoder part of the Transformer extracts features from and learns representations of the question prompt text; the text features output by the encoder serve as input to the decoder, which generates a complete, coherent reply based on those features. The self-attention mechanism of the Transformer gives it strong text representation capability, so the model can accurately understand the semantics of the prompt text and generate a reply in the style of the knowledge base.
3. Advantageous effects
Compared with the prior art, the application has the advantages that:
(1) Intelligent reasoning with a large model such as Qwen-14B-chat allows the questions posed by the user to be understood more accurately and answered precisely on the basis of the multi-modal fusion feature vectors, effectively improving the answer precision and the level of intelligence of the navigation service;
(2) By means of Milvus vector databases, the structured multi-mode fusion feature vectors are stored and managed, and the supported vector similarity searching and inquiring functions are utilized, so that efficient information retrieval and matching are realized, and powerful data support is provided for intelligent navigation;
(3) By fusing text, vision and audio features, a multi-mode fusion feature vector is constructed, so that contributions of different information sources are comprehensively considered, and the comprehensiveness and accuracy of scenic spot information are improved;
(4) Semantic understanding and named entity recognition are carried out on text data by using BERT, biLSTM-CRF and other algorithms, semantic feature vectors of the text are extracted, and a powerful semantic basis is provided for intelligent reasoning of a large model.
Drawings
The present specification will be further described by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. The embodiments are not limiting, in which like numerals represent like structures, wherein:
FIG. 1 is an exemplary flow chart of a large model-based navigation method according to some embodiments of the present description;
FIG. 2 is an exemplary flow chart for generating text features according to some embodiments of the present description;
FIG. 3 is an exemplary flow chart for generating visual features shown in accordance with some embodiments of the present description;
FIG. 4 is an exemplary flow chart for generating audio features according to some embodiments of the present description;
FIG. 5 is an exemplary flow chart for generating structured vector data according to some embodiments of the present description.
Detailed Description
The method and system provided in the embodiments of the present specification are described in detail below with reference to the accompanying drawings.
FIG. 1 is an exemplary flow chart of a large model-based navigation method according to some embodiments of the present disclosure. Knowledge content of a scenic spot or exhibition hall is collected, the collected knowledge content is converted into structured vector data, the vector data is stored using a Milvus vector database, and a knowledge base of the scenic spot is constructed, the knowledge content containing information of the scenic spot or exhibition hall. Question voice data of a user is acquired, the acquired question voice data is converted into text data, and the text data is preprocessed. The preprocessed text data is input into a pre-trained large model for reasoning, semantic information is extracted, and the question type is judged; the questions include knowledge-type questions and functional-type questions, and the large model denotes a deep learning language model obtained by pre-training on a large-scale corpus. A knowledge-type question is matched against the constructed scenic spot knowledge base through an embedding-based text search algorithm to obtain the top-K knowledge base contents with the highest similarity to the text data. For a functional-type question, an API interface corresponding to the question is selected from preset external API interfaces, and external information is acquired through the API interface, the external information including: real-time scenic spot weather, round-trip ticket information, hotel information, or collection information. The acquired external information is fused with the top-K knowledge base contents, a natural language generation algorithm is adopted, a final answer text is generated based on a preset answer generation template and fusion rules, and the generated answer text is returned to the user.
Data acquisition: multi-source heterogeneous data of the scenic spot, such as text, images, video, and audio, are collected via crawlers, API interfaces, manual entry, and other means, covering multiple dimensions such as attraction information, guided explanations, visitor comments, route guidance, and activity announcements. Text data acquisition: text data related to the scenic spot is obtained from sources such as attraction introductions and historical literature. Image data acquisition: picture data of all attractions in the scenic spot are collected. Video data acquisition: video data of the scenic spot are acquired, including attraction introductions, tour videos, and the like. Audio data acquisition: sound data such as the scenic spot's guided voice tours and audio explanations are collected.
FIG. 2 is an exemplary flow chart for generating text features according to some embodiments of the present description, for converting the collected knowledge content into structured vector data. Text data is input: "Zhang San visited the Palace Museum in Beijing; he strolled for more than two hours and saw many national-treasure-grade cultural relics, among which he most liked the blue-and-white porcelain of the Qianlong period." A BERT-based Chinese word segmentation algorithm segments the text data into semantic units: "Zhang San / visited / Beijing / Palace Museum / , / he / strolled / two / more / hours / , / saw / many / national-treasure-grade / cultural relics / , / among which / he / most / liked / Qianlong / period / blue-and-white / porcelain / .". The BiLSTM-CRF algorithm performs part-of-speech tagging on the segmented text: "Zhang San/nr visited/v /ul Beijing/ns Palace Museum/n ,/x he/r strolled/v two/m more/m hours/n ,/x saw/v many/m national-treasure-grade/b cultural relics/n ,/x among which/r he/r most/d liked/v Qianlong/nr period/n /ude1 blue-and-white/n porcelain/n ./x". The attention-based BiLSTM-CRF algorithm then performs named entity recognition: "Zhang San/person visited/v /ul Beijing/place Palace Museum/attraction ,/x he/r strolled/v /ul two/m more/m hours/t ,/x saw/v many/m national-treasure-grade/z cultural relics/n ,/x among which/r he/r most/d liked/v Qianlong/person period/n /ude1 blue-and-white/n porcelain/n ./x". The pre-trained BERT model is initialized and its vocabulary and model parameters are loaded; a Chinese BERT-Base pre-trained model is adopted here. The text after named entity recognition ("Zhang San/person visited/v /ul Beijing/place Palace Museum/attraction ,/x he/r strolled/v ...") is converted into the input format acceptable to BERT, i.e., WordPiece tokens beginning with the [CLS] tag and ending with the [SEP] tag. The WordPiece representation is input into the loaded BERT model and the encodings corresponding to the text are extracted, giving the BERT representation of the text. A pooling layer added on the last layer of BERT extracts the semantic feature vector of the text BERT encoding; the extracted vector is 768-dimensional. L2 normalization is applied to the semantic feature vector of the text BERT encoding to obtain the final text semantic feature vector, which serves as the text feature.
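As an illustrative, non-limiting sketch of this step (assuming the Hugging Face transformers library and the bert-base-chinese checkpoint as a stand-in for the Chinese BERT-Base model named above, and mean pooling as the added pooling layer), the 768-dimensional text semantic feature may be obtained roughly as follows:

    # Sketch: 768-dim text semantic vector from a Chinese BERT-Base model, then L2-normalized.
    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
    model = BertModel.from_pretrained("bert-base-chinese")
    model.eval()

    text = "Zhang San visited the Palace Museum in Beijing ..."      # NER-tagged text in practice
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)                                    # [CLS] ... [SEP] WordPiece input

    vec = outputs.last_hidden_state.mean(dim=1).squeeze(0)           # pooling over tokens, 768-d
    text_feature = vec / vec.norm(p=2)                               # L2 normalization -> text feature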
FIG. 3 is an exemplary flow chart for generating visual features according to some embodiments of the present description. A pre-trained YOLOv5s object-detection model, pre-trained on the COCO dataset, is loaded. The input image is read and resized to 640x640 as the model input picture. The resized picture is fed into the YOLOv5s model for forward computation to obtain prediction boxes and predicted categories. Filtering is performed with a confidence threshold of 0.5, removing low-confidence prediction boxes, and NMS processing is performed to remove duplicate prediction boxes. The final detection results are output as follows: class_id: 0, score: 0.95, bbox: [x1, y1, x2, y2] (showcase 1); class_id: 0, score: 0.92, bbox: [x3, y3, x4, y4] (showcase 2); class_id: 0, score: 0.90, bbox: [x5, y5, x6, y6] (showcase 3); class_id: 1, score: 0.85, bbox: [x7, y7, x8, y8] (vase 1); class_id: 1, score: 0.88, bbox: [x9, y9, x10, y10] (vase 2). The above is the detailed flow of using YOLOv5s for object detection and outputting detection boxes and categories; 3 showcase targets and 2 vase targets are detected in the picture.
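As an illustrative sketch of this detection step (assuming the public ultralytics/yolov5 torch.hub entry point as a stand-in for the pre-trained YOLOv5s model, and an assumed image file name):

    # Sketch: YOLOv5s inference with a 0.5 confidence threshold and NMS de-duplication.
    import torch

    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
    model.conf = 0.5          # drop low-confidence prediction boxes
    model.iou = 0.45          # NMS IoU threshold removes duplicate boxes

    results = model("exhibit_hall.jpg")                  # image is resized internally (letterboxed to 640)
    for *bbox, score, class_id in results.xyxy[0].tolist():
        print(int(class_id), round(score, 2), [round(v, 1) for v in bbox])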
A pre-trained DeepLabv3 model, pre-trained on pixel-level classification tasks, is loaded. The input image is read and its resolution is adjusted to 512x512 as the DeepLabv3 model input. The image is fed into the DeepLabv3 model for forward computation to obtain the predicted category of each pixel. The predicted categories are mapped to different semantic labels, for example: 0: background, 1: wall, 2: floor, 3: exhibition stand, 4: showcase, 5: portrait, 6: exhibit. Regularization is applied, reassigning categories for small scattered regions. Finally, a semantic segmentation map is output in which different semantics are represented by different colors: the background is white, walls are grey, the floor is brown, exhibition stands are yellow, portraits are pink, and exhibits are green.
The YOLOv5s detection results are input: class_id: 1, score: 0.85, bbox: [x7, y7, x8, y8] (vase 1); class_id: 1, score: 0.88, bbox: [x9, y9, x10, y10] (vase 2). The DeepLabv3 semantic segmentation results are input, and the coordinates of the exhibition-stand and wall semantic regions are extracted. The coordinates of vase 1 are analyzed, and it is judged to be in the exhibition-stand region; the coordinates of vase 2 are analyzed, and it is judged to be in the wall region. Then, based on the I3D algorithm, visual features are extracted for the two vase regions: the image region of vase 1 is cropped, input to the I3D model, and a 256-dimensional feature f1 is output; the image region of vase 2 is cropped, input to the I3D model, and a 256-dimensional feature f2 is output. f1 and f2 are concatenated in order to construct a unified visual feature for the vase category. Similarly, visual features for more exhibit types can be extracted by extension. All detected exhibit-region visual features are input, including f1 (vase 1 region feature), f2 (vase 2 region feature), f3 (showcase 1 region feature), and f4 (showcase 2 region feature). A multi-layer perceptron (MLP) applies a nonlinear mapping to all visual features to obtain high-level semantic feature representations, which serve as the visual embedding vectors v1, v2, .... All visual embedding vectors are weighted and fused according to attention weights to generate the unified visual feature vf of the image. L2 regularization is used to constrain the dimension of the visual feature and improve robustness, completing the acquisition of the final unified image visual feature vf.
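The MLP mapping, attention-weighted fusion, and L2 constraint described above can be sketched as follows (the layer sizes and the attention scorer are illustrative assumptions; only the 256-dimensional region features come from the text):

    # Sketch: per-region features -> MLP embeddings -> attention-weighted fusion -> L2-normalized vf.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    regions = torch.randn(4, 256)                        # f1..f4: vase/showcase region features (I3D outputs)
    mlp = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 512))
    attn_scorer = nn.Linear(512, 1)                      # one attention score per region (assumed form)

    v = mlp(regions)                                     # visual embedding vectors v1..v4
    weights = torch.softmax(attn_scorer(v), dim=0)       # attention weights sum to 1
    vf = (weights * v).sum(dim=0)                        # weighted fusion into one image-level feature
    vf = F.normalize(vf, p=2, dim=0)                     # L2 constraint on the unified visual feature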
FIG. 4 is an exemplary flow chart for generating audio features according to some embodiments of the present description. The input audio data is a segment of exhibition-hall explanation speech. A pre-trained Conformer speech recognition model, trained on thousands of hours of speech data, is initialized and loaded. The speech input data is imported, for example a 10-second exhibition-hall explanation audio clip in .wav format. The audio data is preprocessed, acoustic features including MFCCs are extracted, and they are input into the Conformer model. The Conformer model models the audio features with convolution modules and self-attention modules to obtain a semantic representation of the speech. Through the CTC decoding layer, the semantic representation is mapped to the word-sequence space, producing a probability distribution. The most probable word sequence is searched out and used as the speech recognition text result, for example: "This is the Wenchang Pavilion of the Palace Museum; the Wenchang Pavilion mainly displays tens of thousands of ancient painting and calligraphy artworks."
The text result of speech recognition is output, completing the speech-to-text processing. A pre-trained XLSR acoustic feature extraction model, trained on a large amount of speech data, is loaded. A speech segment is used as input, for example the 10-second exhibition-hall guide audio clip. The audio samples are sliced into frames, one frame every 25 milliseconds, and a Hamming window is applied to each frame to extract the raw audio waveform signal. Based on the XLSR model, the waveform signal of each frame is processed and the following acoustic features are extracted: MFCC features, characterizing the speech envelope spectrum; fundamental-frequency features, reflecting the pitch of the speech; and tone features, representing tone information. For each frame, 36-dimensional MFCC + 2-dimensional fundamental-frequency + 1-dimensional tone acoustic features are constructed. The acoustic features of all frames are concatenated to form the acoustic feature sequence vector of the audio sample. An acoustic feature vector of size 39xN is output and passed to the subsequent ECAPA-TDNN model.
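XLSR itself is a pre-trained wav2vec-style model; purely to illustrate assembling the 36+2+1-dimensional frame-level acoustic features, the following sketch uses librosa as an assumed substitute front end (the second fundamental-frequency dimension and the tone dimension are filled with simple placeholders):

    # Sketch: frame-level acoustic features, one frame every 25 ms, 39 dimensions per frame.
    import numpy as np
    import librosa

    y, sr = librosa.load("guide_clip.wav", sr=16000)               # assumed file name
    hop = int(0.025 * sr)                                          # 25 ms frame step

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=36, hop_length=hop)     # (36, N)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)          # (N,) pitch track
    n = min(mfcc.shape[1], len(f0))

    extras = np.zeros((2, n))                                      # placeholders for delta-F0 and tone
    frames = np.vstack([mfcc[:, :n], f0[None, :n], extras]).T      # (N, 39) acoustic feature sequence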
The text recognized by the Conformer is input: "This is the Wenchang Pavilion of the Palace Museum". The 39-dimensional acoustic feature sequence extracted by XLSR, 500 frames long, is also input. Both are fed into a pre-trained ECAPA-TDNN model. ECAPA-TDNN contains convolutional layers and time-delay layers that model the text semantics and the acoustic sequence, respectively. A multiplicative interaction layer realizes cross-modal interaction between the text semantic representation and the acoustic sequence. The final fully connected layer combines the semantic and acoustic representations and outputs a 512-dimensional audio feature embedding. L2 normalization is applied to the 512-dimensional vector to construct the embedded feature of the audio data. The input is the 512-dimensional audio embedding vector output by the ECAPA-TDNN model: emb = [0.7, 0.3, ..., 0.5]. The L2 norm (Euclidean norm) of the vector is computed: ||emb|| = (0.7^2 + 0.3^2 + ... + 0.5^2)^(1/2) = 1.23. Each element of the vector emb is normalized: emb_norm[i] = emb[i]/||emb||, giving the normalized vector emb_norm = [0.7/1.23, 0.3/1.23, ..., 0.5/1.23]. emb_norm is the L2-normalized audio feature, with the vector length normalized to approximately 1, and serves as the final feature representation of the audio data for subsequent clustering, retrieval, matching, and other algorithms.
FIG. 5 is an exemplary flow chart for generating structured vector data according to some embodiments of the present description. Text features x_t = [x_1, x_2, ..., x_1024], image features x_i = [y_1, y_2, ..., y_2048], and audio features x_a = [z_1, z_2, ..., z_512] are defined. The three features are input to the attention layer, where each is first passed through a separate fully connected layer, yielding the text representation h_t, the image representation h_i, and the audio representation h_a. The relevance scores of the three modalities are computed: s_t = w_1^T * h_t, s_i = w_2^T * h_i, s_a = w_3^T * h_a. Softmax normalization of the relevance scores gives the weights: a_1 = softmax(s_t) = 0.6, a_2 = softmax(s_i) = 0.3, a_3 = softmax(s_a) = 0.1. The final multimodal feature is x_f = a_1*x_t + a_2*x_i + a_3*x_a, where x_f is the 2048-dimensional preliminary multimodal feature representation after weighted fusion.
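Because x_t, x_i, and x_a have different dimensions (1024, 2048, 512), the weighted sum is only well defined after each modality is mapped into a common space; the sketch below makes that projection explicit as an assumption and otherwise follows the scoring and softmax steps above:

    # Sketch: project each modality to 2048-d, score it, softmax the scores, and fuse.
    import torch
    import torch.nn as nn

    x_t, x_i, x_a = torch.randn(1024), torch.randn(2048), torch.randn(512)

    proj_t, proj_i, proj_a = nn.Linear(1024, 2048), nn.Linear(2048, 2048), nn.Linear(512, 2048)
    w = nn.Parameter(torch.randn(3, 2048))                     # scoring vectors w1, w2, w3

    h = torch.stack([proj_t(x_t), proj_i(x_i), proj_a(x_a)])   # h_t, h_i, h_a
    scores = (w * h).sum(dim=1)                                # s_t, s_i, s_a
    a = torch.softmax(scores, dim=0)                           # e.g. roughly [0.6, 0.3, 0.1]
    xf = (a.unsqueeze(1) * h).sum(dim=0)                       # 2048-d preliminary fused feature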
The input is the 2048-dimensional fusion feature vector obtained in the previous step: x = [x_1, x_2, ..., x_2048]. An RBF (Gaussian) kernel is constructed, defined as K(x_1, x_2) = exp(-γ||x_1 - x_2||^2), where γ is the kernel parameter and ||x_1 - x_2|| is the Euclidean distance. For each sample x, a new feature representation is computed with the kernel function: φ(x) = [K(x, c_1), K(x, c_2), ..., K(x, c_M)], where the c_j are reference samples. Taking computational complexity into account, a 3000-dimensional kernel map is adopted (M = 3000), i.e., Gaussian kernels map the sample x into a 3000-dimensional high-dimensional feature space in which complex nonlinear patterns can be captured.
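The exact form of the 3000-dimensional kernel map is not spelled out above; as a stand-in under that assumption, an approximate Gaussian-kernel feature map (scikit-learn's Nystroem method) gives the same kind of 2048-d to 3000-d lift:

    # Sketch: approximate RBF (Gaussian) kernel mapping of fused vectors into a 3000-d space.
    import numpy as np
    from sklearn.kernel_approximation import Nystroem

    X = np.random.randn(5000, 2048)                       # batch of 2048-d fused feature vectors
    kernel_map = Nystroem(kernel="rbf", gamma=0.5, n_components=3000, random_state=0)
    X_high = kernel_map.fit_transform(X)                  # shape (5000, 3000); gamma is an assumed value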
A sample data set in the high-dimensional space is input, containing K categories with N_i samples in the i-th category. For the i-th category, the class center μ_i and the within-class sample covariance matrix C_i are computed, and the within-class scatter matrix is constructed as S_w = Σ_i C_i. The overall sample center μ is computed, the between-class difference d_i = μ_i - μ is calculated, and the between-class scatter matrix is constructed as S_b = Σ_i N_i * d_i * d_i^T, giving the S_w and S_b matrices in the high-dimensional space. With the within-class scatter matrix S_w and the between-class scatter matrix S_b, the generalized Rayleigh quotient is constructed: J(ω) = (ω^T S_b ω)/(ω^T S_w ω), which in matrix form corresponds to the Fisher matrix Fisher = S_w^(-1) S_b. Eigenvalue decomposition is performed on the Fisher matrix: [U, S, V] = SVD(Fisher), where U and V are the left and right singular-vector matrices and the values on the diagonal of S correspond to the eigenvalues. Sorted by eigenvalue, the eigenvectors corresponding to the d largest eigenvalues are selected: ω_1, ω_2, ..., ω_d, and the projection matrix W = [ω_1, ω_2, ..., ω_d] is constructed. The samples are linearly transformed to a new representation y = W^T x; the matrix W transforms the data into a new subspace in which the samples are best separated. Data projection based on generalized Rayleigh quotient optimization is thus completed.
Eigenvalue decomposition is performed on the constructed Fisher matrix: Fisher = U Σ V^T. The diagonal values of the Σ matrix are the eigenvalues λ; the 1000 largest eigenvalues are selected after sorting, and the column vectors of U corresponding to the selected eigenvalues are the eigenvectors. These 1000 column eigenvectors are retained to construct the projection matrix W = [ω_1, ω_2, ..., ω_1000]. The input is a sample x mapped to the high-dimensional space, with dimension 3000. A linear transformation with the W matrix, x_fusion = W^T x, yields the low-dimensional fusion feature x_fusion with dimension 1000. The operation is repeated for all samples, finally giving the low-dimensional multimodal fusion feature representation. Reducing the dimension to 1000 accelerates computation and facilitates subsequent vector clustering and other analyses.
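A compact sketch of the whole Fisher-style projection (scatter matrices, generalized eigenproblem, top-d directions) is given below; the ridge term added to S_w and the toy dimensions in the demo call are assumptions made for numerical stability and quick execution:

    # Sketch: within/between-class scatter, generalized eigenproblem, projection to d dimensions.
    import numpy as np
    from scipy.linalg import eigh

    def fisher_projection(X, labels, out_dim, reg=1e-3):
        mu = X.mean(axis=0)
        d = X.shape[1]
        Sw = np.zeros((d, d))
        Sb = np.zeros((d, d))
        for c in np.unique(labels):
            Xc = X[labels == c]
            mu_c = Xc.mean(axis=0)
            Sw += np.cov(Xc, rowvar=False) * (len(Xc) - 1)       # within-class scatter
            diff = (mu_c - mu)[:, None]
            Sb += len(Xc) * diff @ diff.T                        # between-class scatter
        Sw += reg * np.eye(d)                                    # assumed regularization
        vals, vecs = eigh(Sb, Sw)                                # solves Sb w = lambda Sw w
        W = vecs[:, np.argsort(vals)[::-1][:out_dim]]            # top out_dim eigenvectors
        return X @ W                                             # per-sample projection W^T x, in batch form

    # Toy dimensions for illustration; in the text, 3000-d inputs are projected to 1000-d.
    X_fusion = fisher_projection(np.random.randn(2000, 300), np.random.randint(0, 10, 2000), out_dim=100)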
The input is the generated 1000-dimensional fusion feature vector: x = [x_1, x_2, ..., x_1000]. The L2 norm (Euclidean norm) of the vector is computed: ||x|| = (x_1^2 + x_2^2 + ... + x_1000^2)^(1/2). Each element x_i in the vector x is normalized: x_norm[i] = x_i/||x||, giving the normalized multimodal feature x_norm. x_norm is the final L2-normalized multimodal feature vector; normalization removes the influence of feature scale and limits the range of the feature distribution, so that, as a structured vector, it facilitates retrieval, clustering, classification, and similar analyses.
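A few lines of NumPy reproduce this normalization step (the vector here is random; in practice it is the 1000-dimensional fused feature):

    # Sketch: L2 normalization of the fused feature vector so its Euclidean length is 1.
    import numpy as np

    x = np.random.randn(1000)                      # stands in for the 1000-d fused feature
    x_norm = x / np.linalg.norm(x, ord=2)          # structured vector handed to Milvus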
Knowledge storage: when the Milvus collection is created, the dimension parameter passed to the create_collection function is set to 1000, in accordance with the vector dimension of 1000, so that it is consistent with the feature dimension of the vector representation. The timestamp attribute is extracted from the vector data; for example, timestamps such as "20230305" and "20230306" are obtained. Based on these timestamps, the create_partition interface is called with each timestamp such as "20230305" as the name, and the corresponding time partition is created in Milvus to distinguish data from different periods. Category labels are obtained from the vector data; for example, "scenic spot", "cultural relic", "building", and the like are extracted. Then, using the obtained vector category, such as "scenic spot", as the partition name, a semantic partition corresponding to the category is created in the Milvus collection through the create_partition interface, so as to distinguish different types of data.
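An illustrative sketch with the pymilvus client (the host/port, collection name, field names, and partition labels are all assumptions) showing the collection plus time and semantic partitions:

    # Sketch: create a 1000-dim Milvus collection and its time/semantic partitions (pymilvus 2.x).
    from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

    connections.connect(host="127.0.0.1", port="19530")            # assumed local Milvus server

    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1000),
    ]
    collection = Collection("scenic_kb", CollectionSchema(fields, "scenic spot knowledge base"))

    for ts in ["20230305", "20230306"]:                 # time partitions from vector timestamps
        collection.create_partition(ts)
    for tag in ["scenic_spot", "relic", "building"]:    # semantic partitions from category labels
        collection.create_partition(tag)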
To facilitate Milvus storage and processing, the generated multimodal fusion feature vectors need to be converted into a floating-point array format. For example, a 1000-dimensional vector is converted into a Python list containing 1000 floating-point numbers, e.g., [0.1, 0.3, 0.5, ...], whose length equals the dimension of the vector. Milvus provides an interface for inserting data in batches, where multiple vector records can be inserted simultaneously. Therefore, multiple floating-point vectors are collected and stored in one Python list to form batch data. The insert interface provided by Milvus is then called, with the floating-point vectors of the batch passed as parameters of the insert function to the Milvus server, realizing the batch insertion of vector data. This process persists the batch of floating-point vectors into the previously created collection. The above steps are repeated, finally achieving batch import of vectors into Milvus and providing support for subsequent vector retrieval and search.
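Continuing the same sketch (the collection handle and partition name come from the previous snippet; batch_vectors stands for the accumulated floating-point vectors), the batch insert may be written as:

    # Sketch: batch-insert the float vectors and persist them.
    batch = [list(vec) for vec in batch_vectors]              # list of 1000-float Python lists
    collection.insert([batch], partition_name="scenic_spot")  # column-format batch insert
    collection.flush()                                        # persist the inserted batch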
Milvus supports various index structures, such as FLAT, IVF, and HNSW, each with its own characteristics. Based on the attributes of the vector data and the application scenario, the HNSW indexing scheme is chosen. An HNSW index organizes the relationships between vectors by constructing a multi-layer graph network: the nodes in the graph are vectors, and the edges represent the proximity or similarity between vectors. Such a graph structure can optimize the vector search process. The index-creation function of the Milvus SDK is called, with the index type explicitly specified as HNSW. At the same time, meta-information related to HNSW, such as the number of layers and the recall rate, is set in the configuration parameters of this function. These parameters instruct the Milvus server how to build the multi-layer HNSW graph index; they can be adjusted as required so that the index configuration best matches the vector data. The HNSW index is a multi-layer graph structure, and the number of layers directly determines the levels of the index. The number of layers is set to 5, i.e., the HNSW index graph contains 5 levels; such a multi-layer structure makes the retrieval process more efficient. The recall rate determines the coverage of vector retrieval: if the recall rate is set to 90%, then in a search query the relevance vectors of the returned results cover the 90% most similar vectors to the query vector. The larger the coverage, the higher the result relevance, but also the larger the search computation. These two parameters need to be weighed against index performance and search accuracy: the number of layers controls index complexity, and the recall rate determines the proportion of approximate nearest neighbors that can be covered. The two parameters should be set jointly according to the vector data set and the application requirements to obtain the optimal index configuration.
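In the pymilvus client, the concrete HNSW knobs are exposed as M and efConstruction rather than literal layer-count and recall settings; the values below are illustrative assumptions continuing the earlier sketch:

    # Sketch: build an HNSW index on the embedding field and load the collection for search.
    index_params = {
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 16, "efConstruction": 200},      # assumed graph-construction parameters
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    collection.load()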
Upon receiving the index-creation request, Milvus starts constructing the corresponding index structure according to the specified type HNSW; this process is executed on the server side. For vector data that has already been imported, for example into a collection named "Collection1", Milvus begins building the index corresponding to that set of data. According to the user-specified parameter configuration, such as the number of layers and the recall rate, Milvus automatically constructs the multi-layer HNSW graph network through its algorithm. Each node in this HNSW index graph is a vector in the corresponding collection, while the edges represent the similarity or distance between two nodes, i.e., two vectors. A multi-layer, graph-based index structure is thus formed that organizes all the vectors in the collection. Such a graph index organizes the vector data and accelerates subsequent nearest-neighbor search, insertion, deletion, and similar operations.
The user inputs a functional question: "What are the famous collections in the Palace Museum in Beijing?" First, the question text is input into a pre-trained intent recognition model, and model reasoning determines that the question belongs to the "collection query" category with a recognition confidence of 0.92. Then, the collection API interface corresponding to the collection-query category is selected from the preset set of external API interfaces. Next, key information is extracted from the question text to obtain the museum name "Palace Museum", and the museum-name parameter is spliced into the URL of the collection API interface to obtain the complete request URL. The collection API interface is called to obtain the returned JSON-format data, which contains the collection information of the Palace Museum, each collection item having fields such as name, dynasty, material, and description. A rule-based method is adopted to parse the acquired collection JSON data and extract information. According to the structure and field meanings of the collection JSON data, parsing rules are predefined, key fields are extracted, and structured data is generated. The parsing rules include: the status field indicates the request state, the collections field indicates the collection list, the name field indicates the collection name, the dynasty field indicates the collection period, the material field indicates the collection material, the description field indicates the collection description, and so on. The parsing rules are applied to extract the key fields in the JSON data and generate structured data, with each collection item corresponding to one structured record whose fields include the collection name, dynasty, material, description, and the like. A fuzzy matching algorithm is adopted to deduplicate the extracted structured collection data: the similarity between different structured records is calculated, records whose similarity is above the set threshold are judged to be duplicates, and only the most complete record among them is kept. For example, if two structured records have identical collection name, dynasty, and material fields and their description fields differ only in a few words, the calculated similarity of 0.95 is higher than the set threshold of 0.8, so the two records are judged to be duplicates; the record with the more detailed description field is retained and the other is removed. Finally, the deduplicated data constitutes the final external collection information, containing structured information such as the names, dynasties, materials, and descriptions of famous collections of the Palace Museum. This external collection information is input into the answer generation module and fused with the existing Palace Museum collection information in the knowledge base, forming comprehensive and rich collection knowledge that is used to generate the answer text introducing the museum's famous collections to the user.
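A sketch of this call-parse-deduplicate chain follows (the endpoint URL is hypothetical, and difflib's SequenceMatcher stands in for the fuzzy matching algorithm):

    # Sketch: query an external collection API, extract rule-named fields, and fuzzy-deduplicate.
    import requests
    from difflib import SequenceMatcher

    url = "https://api.example.com/collections?museum=palace_museum"   # hypothetical endpoint
    data = requests.get(url, timeout=5).json()

    records = []
    if data.get("status") == "ok":
        for item in data.get("collections", []):
            records.append({
                "name": item.get("name"),
                "dynasty": item.get("dynasty"),
                "material": item.get("material"),
                "description": item.get("description", ""),
            })

    deduped = []
    for r in records:
        dup = None
        for d in deduped:
            same_keys = (d["name"], d["dynasty"], d["material"]) == (r["name"], r["dynasty"], r["material"])
            if same_keys and SequenceMatcher(None, d["description"], r["description"]).ratio() > 0.8:
                dup = d
                break
        if dup is None:
            deduped.append(r)
        elif len(r["description"]) > len(dup["description"]):
            deduped[deduped.index(dup)] = r          # keep the record with the fuller description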
Knowledge retrieval: the user's text query is first converted into a structured representation using a semantic parsing model, which is the basis for knowledge retrieval. The constructed scenic spot knowledge graph contains semantic components such as entities, attributes, and relations. Based on the structured representation of the user query, the graph components with the highest relevance are retrieved through a semantic matching algorithm, and the graph knowledge corresponding to the query is acquired. Meanwhile, the unified semantic space generated by multimodal fusion contains image, audio, and video vectors, so non-text resources related to the intent of the text query can be obtained through multimodal retrieval, enriching the knowledge sources. In addition, the text query is converted into a vector representation and matched in vector space against the vectors built into the knowledge base; knowledge-base vectors whose matching degree exceeds a set threshold are found, and the corresponding knowledge content is obtained. All the knowledge acquired from the graph, the multimodal resources, and the knowledge-base vectors is synthesized to construct knowledge support for the user's query and realize an intelligent response. If, in the above knowledge retrieval and matching process, no knowledge content related to and matching the user's query intent can ultimately be found in the knowledge base, a question-answering processing mechanism is triggered. A semantic-analysis question-answering model oriented to the industry domain has been trained in advance, which can analyze the text query directly and give an answer. Trained on a large number of question-answer pairs, the model develops clear semantic analysis and matching capabilities; it can analyze the intent of the query sentence and generate an answer directly. The question-answering result generated by the model is combined with the related knowledge acquired from the knowledge graph, the multimodal resources, and so on, to provide comprehensive support for the user's query. Knowledge is acquired from multiple angles, and the final response takes both knowledge matching and direct question answering into account, so that the system can answer more diverse and open user questions.
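For the vector-matching part of this retrieval, a sketch using the collection built earlier (query_vector is assumed to be the 1000-dimensional L2-normalized query embedding; the distance threshold is illustrative):

    # Sketch: top-K similarity search against the knowledge-base vectors in Milvus.
    results = collection.search(
        data=[query_vector],                      # 1000-d query embedding as a list of floats
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"ef": 64}},
        limit=5,                                  # top-K knowledge base entries
    )
    hits = [h for h in results[0] if h.distance < 0.8]   # keep matches within the assumed distance threshold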
External information fusion: the query posed by the user belongs to the travel-planning category, for example "How many days would be good to spend in Beijing in May?". Static knowledge content such as the average temperature in Beijing in May and attraction information is retrieved from the knowledge base. In addition, the open API interface of the Beijing weather forecast is called to obtain, in real time, the daily weather details from May 2 to May 5. The data returned by the API shows that these days are sunny, with temperatures around 20 degrees. The real-time weather forecast content is combined with the previously acquired static weather knowledge to strengthen the reply, and finally a response that takes both dynamic information and static knowledge into account is constructed. For a user query about an attraction visiting plan, the open interface for real-time urban traffic conditions can also be called. This interface provides the real-time traffic status of major urban roads, whether they are congested, how long they take, and so on. Real-time traffic flow and vehicle speed can be obtained for the main roads around an attraction, such as the main street outside Xuanwumen. The data returned by the API shows that the road speed outside Xuanwumen is currently normal and the road conditions are good. The real-time road-condition information is organized together with the previously acquired static information such as weather: road conditions for traveling to the Palace Museum are currently good, the weather is sunny, and the next two days are forecast to be suitable for going out. The reply integrates the dynamic real-time data interfaces with the static knowledge-base content, so the information is more comprehensive and reliable and better fits the user's query intent. Answer generation: the user raises the request "Beijing Palace Museum tour route recommendation". The entities and relations of the Palace Museum attractions are retrieved from the knowledge graph, such as the Gate of Divine Prowess (Shenwumen), the Meridian Gate, and their spatial connections. Related tour-route introductions are matched from the knowledge base, for example introductions to famous attractions along the eastern route. Real-time weather conditions are obtained from an external weather API; for instance, the current day is sunny with a suitable temperature. The above three parts of content are input into a pre-trained general question-answer generation model. The model automatically assigns weights to the different contents, and context information such as the question type and user characteristics is added on the decoder side to generate a personalized tour-route recommendation answer.
For the question "What should be prepared for climbing Huangshan this weekend?", the Huangshan scenic-area graph is searched to obtain static knowledge such as the relevant climbing routes and attractions, with a weight of 0.4. The knowledge-base documents matched to the climbing-equipment checklist contain specific equipment recommendations, with a weight of 0.3. The weather API is called and shows overcast and rainy weather for the weekend, with the weight set to 0.2. In the encoding stage, the graph knowledge and the equipment-checklist document representations are assigned the higher weight values of 0.4 and 0.3 through the self-attention mechanism, while the less relevant weather information is given the lower attention weight of 0.2. When decoding and generating the reply, context factors such as the question-type tag and user-profile information are also input. The model combines all semantic inputs, integrates the attention weights, generates a coherent, fluent, and personalized answer text, and finally presents it to the user. The pre-trained model used adopts a Transformer architecture; through the self-attention layers in the encoder and the self-attention layers in the decoder, it completes the sequence-to-sequence knowledge conversion and semantic generation.
Returning the result: for the question "a one-day tour route in Hangzhou", the system's text generation module produces a tour-route plan covering attractions such as West Lake and a mosque. Using speech synthesis techniques, a voice file that reads out the answer text is generated from audio samples. At the same time, an image generation model is called, keywords appearing in the text such as "West Lake" are input, and images of the related attractions are generated. A video generation tool synthesizes the generated voice audio and the corresponding attraction pictures into a roughly one-minute Hangzhou tour video. In the applet, the final text answer is displayed as text while the content is also presented in speech and video form.

Claims (10)

1. A large model-based navigation method, comprising:
Collecting knowledge content of a scenic spot or exhibition hall, converting the collected knowledge content into structured vector data, storing the vector data by utilizing a Milvus vector database, and constructing a knowledge base of the scenic spot, wherein the knowledge content contains information of the scenic spot or exhibition hall;
Acquiring questioning voice data of a user, converting the acquired questioning voice data into text data, and preprocessing the text data;
Inputting the preprocessed text data into a pre-trained large model for reasoning, extracting semantic information, and judging the type of the problem; wherein the questions include knowledge-type questions and functional-type questions; the large model represents a deep learning language model obtained by pre-training on a large-scale corpus;
Matching the knowledge type problem with the constructed scenic spot knowledge base through an embedded text search algorithm to obtain the first K knowledge base contents with the highest similarity with the text data;
According to the acquired functional problem, selecting an API interface corresponding to the functional problem from preset external API interfaces, and acquiring external information of the API interface, wherein the external information comprises: scenic spot real-time weather, round trip ticket information, hotel information or collection information;
fusing the acquired external information with the first K knowledge base contents, adopting a natural language generation algorithm, generating a final answer text based on a preset answer generation template and a fusion rule, and returning the generated answer text to the user.
2. The large model-based navigation method of claim 1, wherein:
converting the collected knowledge content into structured vector data, comprising:
performing word segmentation, part-of-speech tagging and named entity recognition processing on the text data, and extracting text features;
Performing target detection, image segmentation and key frame extraction processing on the image and video data, and extracting visual features;
Performing voice recognition and acoustic feature extraction processing on the audio data to extract audio features;
And fusing the extracted text features, visual features and audio features to generate a multi-modal fusion feature vector which is used as structured vector data.
3. The large model-based navigation method according to claim 2, wherein:
Extracting text features, including:
performing word segmentation processing on text data by adopting a Chinese word segmentation algorithm based on BERT, and segmenting the text data into semantic units;
performing part-of-speech tagging on the segmented text by adopting BiLSTM-CRF algorithm, and identifying part-of-speech information of each semantic unit;
carrying out named entity recognition by adopting BiLSTM-CRF algorithm combined with an attention mechanism according to the segmented text;
And extracting semantic feature vectors of text data as text features by adopting a BERT-based word embedding algorithm according to the word segmentation processing result, the part-of-speech tagging result and the named entity recognition result.
4. A large model based navigation method according to claim 3, characterized in that:
extracting visual features, comprising:
Performing target detection on the image and the video frame by adopting a YOLOv5 algorithm, and identifying and positioning objects in the image;
carrying out semantic segmentation on the image and the video frame by adopting a DeepLabv3 algorithm, and dividing the image into different semantic areas;
and extracting visual feature vectors of the image and the video frame by adopting an I3D algorithm according to the target detection result and the semantic segmentation result to serve as visual features.
5. The large model based navigation method of claim 4, wherein:
Extracting audio features, comprising:
converting the audio data into text data by adopting a Conformer algorithm;
extracting acoustic features of the audio data by adopting XLSR algorithm, wherein the acoustic features comprise MFCC;
and extracting an audio feature vector of the audio data by adopting an ECAPA-TDNN algorithm according to the text data and the acoustic features, to serve as audio features.
6. The large model based navigation method of claim 5, wherein:
Fusing the extracted text features, visual features and audio features to generate a multimodal fusion feature vector as structured vector data, comprising:
constructing an attention fusion network, and acquiring fusion weights among text features, visual features and audio features;
Generating a multi-mode fusion feature according to the fusion weight, the text feature, the visual feature and the audio feature;
performing nonlinear transformation and feature transformation on the generated multi-modal fusion features to generate multi-modal fusion feature vectors;
And carrying out L2 normalization on the obtained multi-mode fusion feature vector to be used as structured vector data.
7. The large model based navigation method of claim 6, wherein:
performing nonlinear transformation and feature transformation on the generated multi-modal fusion features to generate multi-modal fusion feature vectors, comprising:
Mapping the generated multi-modal fusion features to a high-dimensional feature space by adopting a Gaussian kernel function to acquire a nonlinear relation between the features, wherein the high-dimensional feature space represents a feature space higher than the original feature space in dimension;
Calculating an intra-class divergence matrix and an inter-class divergence matrix in a high-dimensional feature space, wherein the intra-class divergence matrix reflects the compactness degree between similar samples, and the inter-class divergence matrix reflects the separation degree between different samples;
constructing a generalized Rayleigh quotient, and obtaining an optimal projection direction by maximizing the ratio of the inter-class divergence matrix to the intra-class divergence matrix;
solving the generalized eigenvalue problem to obtain generalized eigenvalues and generalized eigenvectors, selecting the generalized eigenvectors corresponding to the first M largest generalized eigenvalues, and constructing a transformation matrix;
performing linear transformation on the multi-mode fusion features in the high-dimensional feature space by using a transformation matrix to obtain feature vectors after dimension reduction;
and taking the feature vector after the dimension reduction as the final representation of the multi-mode fusion feature vector.
8. The large model based navigation method of claim 7, wherein:
Storing vector data with a Milvus vector database, comprising:
Establishing a collection of the Milvus vector database, and setting the dimension of the collection according to the dimension of the multi-modal fusion feature vector;
Acquiring a time stamp of the multimodal fusion feature vector, and creating a time partition in a Milvus vector database according to the time stamp;
Acquiring category or label information of the multi-mode fusion feature vector, and creating a semantic partition in a Milvus vector database according to the category or label;
converting the multimodal fusion feature vector into a floating point vector;
utilizing a batch insertion interface of Milvus vector databases to insert floating point vectors into the established set, the established time partition and the semantic partition in batches;
According to the Milvus vector database into which the floating point vectors have been inserted, setting an HNSW index, and performing vector similarity search and query by constructing a multi-level graph structure.
9. The large model based navigation method of claim 8, wherein:
acquiring external information, including:
Determining the category of the functional problem according to the acquired functional problem, wherein the category of the functional problem comprises weather inquiry, ticket inquiry, hotel inquiry, collection inquiry, scenic spot opening time, or ticket purchase;
According to the category of the functional problem, selecting an API interface corresponding to the category from a preset external API interface set, taking the functional problem as a request parameter, and acquiring corresponding external data in a JSON format through the API interface;
analyzing and extracting information of JSON format data acquired from an external API interface by adopting a rule-based method to generate structured data;
and performing de-duplication processing on the obtained structured data by adopting a fuzzy matching algorithm to obtain final external information.
10. The large model based navigation method of claim 9, wherein:
The large model adopts a Qwen-14B-chat (Tongyi Qianwen-14B-chat) model, and the Qwen-14B-chat model adopts a Transformer encoder-decoder architecture comprising a plurality of self-attention encoding layers and a plurality of self-attention decoding layers.
CN202410315557.8A 2024-03-19 2024-03-19 Navigation method based on large model Active CN118069812B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410315557.8A CN118069812B (en) 2024-03-19 2024-03-19 Navigation method based on large model

Publications (2)

Publication Number Publication Date
CN118069812A true CN118069812A (en) 2024-05-24
CN118069812B CN118069812B (en) 2024-08-09

Family

ID=91099099

Country Status (1)

Country Link
CN (1) CN118069812B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118364089A (en) * 2024-06-14 2024-07-19 天津恒达文博科技股份有限公司 Question and answer method applied to navigation system
CN118586992A (en) * 2024-08-06 2024-09-03 浙江哈雷华铁数字科技有限公司 Aloft work equipment leasing service method and system based on AI large model
CN118643148A (en) * 2024-08-16 2024-09-13 杭州建易建设信息技术有限公司 Electronic archive retrieval method and system based on large language model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516055A (en) * 2019-08-16 2019-11-29 西北工业大学 A kind of cross-platform intelligent answer implementation method for teaching task of combination BERT
CN110992215A (en) * 2019-12-10 2020-04-10 浙江力石科技股份有限公司 Semantic analysis-based travel service recommendation system, database and recommendation method
CN112115238A (en) * 2020-10-29 2020-12-22 电子科技大学 Question-answering method and system based on BERT and knowledge base
CN117010907A (en) * 2023-08-03 2023-11-07 济南明泉数字商务有限公司 Multi-mode customer service method and system based on voice and image recognition
CN117056471A (en) * 2023-07-11 2023-11-14 数字郑州科技有限公司 Knowledge base construction method and question-answer dialogue method and system based on generation type large language model
CN117669598A (en) * 2023-12-06 2024-03-08 航天信息股份有限公司 Safe and intelligent question-answering method and device and related equipment


Also Published As

Publication number Publication date
CN118069812B (en) 2024-08-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant