WO2023113400A1

WO2023113400A1 - Apparatus for processing embedding-based dataset, and method therefor

Info

Publication number: WO2023113400A1
Application number: PCT/KR2022/020118
Authority: WO
Inventors: 박종빈; 정종진; 김경원
Original assignee: 한국전자기술연구원
Priority date: 2021-12-14
Filing date: 2022-12-12
Publication date: 2023-06-22
Also published as: KR20230090094A; KR102592515B1

Abstract

The present invention relates to a data processing apparatus and a method therefor, the apparatus being usable in the artificial intelligence technology development and machine learning field, and, more specifically, to an apparatus for processing an embedding-based dataset, and a method therefor. The apparatus for processing an embedding-based dataset, according to the present invention, comprises: a multi-vector space embedding unit for expressing input data in a vector space; a first dataset-based search range determination unit for determining a search range by using an embedding result of the multi-vector space embedding unit; and a second dataset search and output unit for constructing a new dataset by extracting subset data from a second dataset, which is a data search space.

Description

Apparatus and method for processing embedding-based data set

The present invention relates to a data processing device and method that can be used in the field of artificial intelligence technology development and machine learning, and more particularly, to a processing device and method for embedding-based data sets.

In machine learning fields such as deep learning, a large amount of training data is required to construct and train a neural network model. Model training is the process of determining the values of parameters, such as weights and biases, that describe the neural network. Thousands to tens of thousands of training samples, or more, are required to train these parameters. Depending on the artificial neural network model, training with insufficient data may cause over-fitting, in which the model responds correctly only to the small amount of data it was trained on and does not handle other input data well. Solving this problem requires a large amount of training data, and securing that data often takes considerable time, resources, and money.

Data augmentation according to the prior art was proposed to address the lack of data needed for training; it increases the amount of training data by transforming given data. For example, for images, transformation operations such as brightness conversion, geometric distortion, sharpness processing, and color processing are applied to the given data to increase diversity. Even with a small number of transformation techniques, an astronomical number of data items can be created by repeatedly adjusting parameters, combining transformation methods, and feeding the transformed output back into the transformer. However, when some of the proliferated data is used for training, the performance of the trained model may decrease, and the computation time and resources required for training may increase excessively because there is too much data.

According to the prior art, it is not easy to control the difficulty of the data used to train or test a machine-learning classifier. For example, training is easy when only training data with clear characteristics is used, and measured performance can be high when only test data that is clear and easy to distinguish is used; this is a kind of over-fitting problem. To solve this problem, injecting difficult data, which even humans find hard to distinguish, into the training data or test data can help improve performance. Metaphorically, the situation resembles a student with excellent learning ability who makes no progress because he solves only easy problems, or a student without basic skills who attempts problems that are too difficult to solve. Conventionally, it was assumed that the user selected and preprocessed the data in advance, but it is not easy to manually secure data in a way that takes the difficulty of the training data into account.

In data handling processes such as machine learning, configuring training data, verification data, and test data often requires a data set with diverse or specifically desired characteristics just as much as it requires a larger amount of data. For example, when training a classifier that identifies a person's ID (identification) from a face, data under controlled conditions such as a specific gender, age, or race may be required. Conventionally, a person had to classify and prepare such data in advance, which takes a lot of time and money, and even already classified data is limited in that it is not classified in finer detail. For example, even if a face is classified as being in its thirties, it is impossible to distinguish in detail whether the face is 31 years old or 39 years old. As another example, assume that many "road signs photographed in the rain" should be added for training when creating a neural network model that recognizes road sign objects appearing while driving. Conventionally, there was no effective and appropriate method for finding "road signs photographed in the rain" in a previously obtained data set.

In the field of machine learning or artificial intelligence, when training a classifier or a generative model such as a GAN, simply increasing the amount of training data is not enough. The performance of machine learning can be improved by injecting ambiguous data into the training process. As an example, consider a machine learning-based classifier that is given pictures of "apple" and "strawberry" and classifies them. Conventionally, using a large number of "apple" and "strawberry" pictures for training has been considered a reasonable choice, but if only data that is easy to identify as "apple" or "strawberry" is injected into the training process, it becomes difficult to make correct decisions on ambiguous data. If such ambiguous data is called decision boundary data, training is commonly performed without explicitly considering it.

The present invention is proposed to solve the above-mentioned problems and provides a practical data processing technology for the development of artificial intelligence and machine learning technology, such as improving the diversity of training data and adjusting the difficulty of learning by artificial neural networks. An object of the present invention is to provide an embedding-based data set processing apparatus and method capable of improving the quality of training data for machine learning.

An apparatus for processing an embedding-based data set according to the present invention includes: a multi-vector space embedding unit that represents input data in a vector space; a first data set-based search range determination unit that determines a search range by using the embedding result of the multi-vector space embedding unit; and a second data set search and output unit that constructs a new data set by extracting subset data from the second data set, which is a data search space.

The multi-vector space embedding unit passes elements of the input data through the plurality of embedding units and embeds them into respective vector spaces.

The first data set-based search range determination unit determines a reference position or boundary for search, and determines a search range using a preset search distance parameter from the determined position.

The first data set-based search range determination unit sets positions of elements of the first data set as reference positions, and the second data set search and output unit performs a search in the second data set and outputs a subset of the second data set.

The first data set-based search range determination unit determines the linearly combined position of the elements of the first data set as a reference position, and the second data set search and output unit performs a search on the search range and outputs a subset of the second data set.

The first data set-based search range determination unit determines a boundary by considering the positions of elements of the first data set, and the second data set search and output unit performs a search based on the boundary and outputs a subset of the second data set.

The second data set search and output unit obtains a subset of output candidates and randomly selects and outputs them, outputs all of them, or outputs a processing result of a specific embedding unit.

The embedding-based data set processing apparatus according to the present invention further includes an embedding vector space reduction unit that performs dimensionality reduction on the embedding result.

A method for processing an embedding-based data set according to the present invention includes the steps of: (a) receiving a data set; (b) performing multi-vector space embedding; (c) determining a search range based on a first data set; and (d) performing a search and output of a second data set.

In the step (b), the data set is expressed in a vector space, and elements of the input data are passed through a plurality of embedding units to be embedded in each vector space.

In the step (c), a reference position or boundary for search is determined, and a search range is determined using a preset search distance parameter.

In step (c), positions of elements of the first data set are determined as reference positions.

In step (c), linearly combined positions of elements of the first data set are determined as reference positions.

In step (c), the boundary is determined by considering the positions of elements of the first data set.

In the step (d), a subset of output candidates is acquired and output is performed according to a preset method.

The embedding-based data set processing method according to the present invention further includes performing dimensionality reduction on the embedding result after step (b) and before step (c).

According to the present invention, there is an advantage in that it is easy to secure data for machine learning.

According to the present invention, training data can be processed in various ways: the diversity of data can be increased, a desired data distribution can be created, a data set having predetermined characteristics can be constructed, data at a decision boundary can be selected to form a data set, and outlier data can be removed or included.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 illustrates an embedding-based data set processing apparatus according to an embodiment of the present invention.

FIG. 2 shows a utilization embodiment according to the present invention.

FIGS. 3 and 4 illustrate data embedding according to an embodiment of the present invention.

FIG. 5 illustrates a method for determining a search range by linear combination (when the first data set has two elements) according to an embodiment of the present invention.

FIG. 6 illustrates a method for determining a search range by linear combination (when the first data set has three elements) according to an embodiment of the present invention.

FIG. 7 illustrates a method for determining a search range by linear combination (when the first data set has four elements) according to an embodiment of the present invention.

FIG. 8 illustrates an example of applying a plurality of distance parameters ε in determining a search range by linear combination according to an embodiment of the present invention.

FIG. 9 illustrates an embedding-based data set processing apparatus according to another embodiment of the present invention.

FIG. 10 illustrates a method of processing an embedding-based data set according to an embodiment of the present invention.

FIG. 11 is a block diagram illustrating a computer system for implementing a method according to an embodiment of the present invention.

The foregoing and other objects, advantages and characteristics of the present invention, and a method of achieving them will become clear with reference to the detailed embodiments described below in conjunction with the accompanying drawings.

However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The following embodiments are provided only to readily inform those skilled in the art of the purpose, configuration, and effect of the invention, and the scope of the present invention is defined by the claims.

Meanwhile, terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, singular forms also include plural forms unless specifically stated otherwise in a phrase. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more other components, steps, operations, and/or elements in addition to the stated component, step, operation, and/or element.

Data augmentation according to the prior art creates new data by transforming and processing given data. In contrast, an embodiment of the present invention has the technical feature of determining a data search range through a "first data set", which can be regarded as a kind of "seed", "guide", or "query", and constructing a new data set by extracting subset data from a "second data set", which is a kind of data search space.

According to an embodiment of the present invention, similarities and differences between data are calculated by applying an embedding technique to media types such as image, text, and audio.

According to an embodiment of the present invention, there is a technical feature of determining a reference position or boundary for search for the “first data set” and specifying a search range through a predetermined search distance parameter ε at the determined position.

According to an embodiment of the present invention, a plurality of embedding techniques are applied so that the "second data set" is searched in a plurality of ways, and some or all of the results are randomly extracted or selected to increase diversity. This has the technical feature of overcoming the bias caused by a specific embedding technique and securing high-quality data for machine learning.

An embedding-based data set processing apparatus 100 according to an embodiment of the present invention includes a multi-vector space embedding unit 101, a first data set-based search range determination unit 103, and a second data set search and output unit 105.

From a data input/output point of view, the embedding-based data set processing apparatus 100 according to an embodiment of the present invention receives a first data set Q = {q₁, q₂, q₃, ..., q_N} and a second data set S = {s₁, s₂, s₃, ..., s_M} as input, and outputs a data set S'.

Output data can be data with controlled characteristics, outliers (anomalies or outliers), or data related to decision boundaries.

N and M represent the respective set sizes of the first data set and the second data set of input data, and the output data S' may be a subset of S including an empty set (null set).

Input data can be in various forms such as image, text, and audio. Media types between elements in the first data set, media types between elements in the second data set, and media types between elements in the first data set and the second data set are preferably the same. For example, it is easy to compare each other only when data sets are made in the form of text for text, image for image, and audio for audio.

Figure 2 shows a utilization embodiment according to the present invention.

Referring to (1) of FIG. 2 , a subset of the second data set is output after searching in the second data set using the positions of elements of the first data set as reference positions. To help those skilled in the art understand, the example shown in (1) of FIG. 2 shows that elements having a predetermined similarity are searched for and output using the positions of elements of the first data set as reference positions. For example, when the first data set is given as Q={q ₁ , q ₂ , q ₃ }, data similar to elements of Q is extracted from the second data set.

In FIG. 2 , q ₁ is indicated by a triangle, q ₂ is a rounded rectangle, and q ₃ is a pentagon, and a region around each element within a critical distance is shown as a green region. If the green area is regarded as a search range, it is possible to search for and output elements of the second data set corresponding to the search range. That is, in the second data set, data resembling a triangle such as q ₁ , data resembling a rectangle with rounded corners such as q ₂ , and data resembling a pentagon such as q ₃ are searched and output so that they can be used as a new data set.

This is similar to the conventional search method at first glance, but differs in that a search range is set by setting a predetermined range with a plurality of elements in an embedded vector space and a new data set is formed.

According to an embodiment of the present invention, the scope of application is not limited only to extracting similar data.

Referring to (2) of FIG. 2, a search is performed in the second data set using the linearly combined position of the elements of the first data set as a reference position and a predetermined range as the search range, and a subset of the second data set is output. In the example of (2) of FIG. 2, the first data set is given as Q = {q₁, q₂, q₃}; a reference position is determined by a linear combination (weighted summation) of the elements, a predetermined search range is determined around it, and the subset of the second data set corresponding to the search range is extracted and displayed. Here, the linearly combined position of the first data set may be an average position representing the elements, a position at a decision boundary where the distinction between elements is ambiguous, a position closer to a specific element, and so on; various effects can be induced by varying the weights. For this processing, according to an embodiment of the present invention, each element of the first data set Q and the second data set S is processed in a linear vector space through a predetermined embedding process.

Referring to (3) of FIG. 2, a boundary is determined in consideration of the positions of elements of the first data set, an area within the threshold range along the determined boundary is set as the search range, the second data set is searched, and a subset of the second data set is output. Boundaries can be obtained by applying a method that divides a space based on given vector positions, such as a Voronoi diagram, on a 2D plane or in a space of three or more dimensions. Alternatively, each position of an element may be regarded as the mean position of a high-dimensional Gaussian function, and the points where the Gaussian functions meet each other may be determined as a boundary. If the boundary is determined according to the above-described methods and the threshold range is used as the search range, it is possible to obtain data located at the decision boundary, which is difficult to distinguish with a machine learning technique.
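As a non-limiting illustration of the Voronoi-based variant described above, the following Python sketch selects embedded points that lie within a threshold distance of a Voronoi cell boundary. The function names, the sample coordinates, and the threshold value are assumptions made for this example and do not appear in the original disclosure; the Gaussian-based variant is not shown.

```python
import math

def distance_to_cell_boundary(p, sites):
    """Distance from p to the edge of the Voronoi cell that contains p.

    Assumes at least two distinct sites. With a the site nearest to p and b any
    other site, the bisector of a and b lies at distance
    (|p-b|^2 - |p-a|^2) / (2 |a-b|) from p; the cell edge is the closest bisector.
    """
    nearest = min(sites, key=lambda s: math.dist(p, s))
    d_near_sq = math.dist(p, nearest) ** 2
    return min(
        (math.dist(p, other) ** 2 - d_near_sq) / (2.0 * math.dist(nearest, other))
        for other in sites
        if other != nearest
    )

def near_boundary(points, sites, threshold):
    """Keep the points that lie within `threshold` of a Voronoi boundary."""
    return [p for p in points if distance_to_cell_boundary(p, sites) < threshold]

# Hypothetical embedded first data set (the Voronoi sites) and second data set.
E_Q = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
E_S = [(1.9, 0.5), (0.5, 0.5), (2.1, 2.0)]

boundary_data = near_boundary(E_S, E_Q, threshold=0.25)
```

In this example, the point (0.5, 0.5) sits well inside the cell of the first site and is excluded, while the other two points lie close to a bisector and are kept as decision boundary candidates.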

Referring to (4) of FIG. 2, a search is performed in the second data set by combining the methods for setting the search range of (1) to (3) of FIG. 2, and a subset of the second data set is output. Through this combination, it is possible to secure data in a wide variety of forms. For example, for a given first data set, various parameters can be adjusted: the search range that uses each element as a reference position, the per-element weights for linear combination, the per-element weights for determining the boundary, and the predetermined range searched around the linearly combined position or boundary. Adjusting these parameters makes it possible to achieve goals such as data diversity, data difficulty control, and securing decision boundary data.
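One way to picture the combination of search ranges is to treat each range as a predicate on a position in the embedding space and to compose the predicates. The Python sketch below is only an illustration under that assumption; the helper names (`inside_ball`, `outside_ball`, `any_of`, `all_of`) and all numeric values are hypothetical. It combines ranges of type (1) around two elements with a ring around their equally weighted linear combination.

```python
import math

def inside_ball(center, eps):
    """Predicate: position p is closer than eps to center."""
    return lambda p: math.dist(center, p) < eps

def outside_ball(center, eps):
    """Predicate: position p is farther than eps from center."""
    return lambda p: math.dist(center, p) > eps

def any_of(*preds):
    """Union of search ranges."""
    return lambda p: any(pred(p) for pred in preds)

def all_of(*preds):
    """Intersection of search ranges."""
    return lambda p: all(pred(p) for pred in preds)

# Hypothetical embedded elements of the first data set.
e_q1, e_q2 = (0.0, 0.0), (4.0, 0.0)
# Linear combination with equal weights 0.5 and 0.5.
midpoint = tuple(0.5 * a + 0.5 * b for a, b in zip(e_q1, e_q2))

combined_range = any_of(
    inside_ball(e_q1, 1.0),                                            # around q1
    inside_ball(e_q2, 1.0),                                            # around q2
    all_of(inside_ball(midpoint, 1.5), outside_ball(midpoint, 0.5)),   # ring around T
)

embedded_S = [(0.5, 0.0), (2.0, 0.2), (3.0, 0.0), (2.0, 3.0), (4.5, 0.5)]
selected = [p for p in embedded_S if combined_range(p)]
```

Representing ranges as predicates keeps each method independent, so further ranges, such as a boundary-based range, could be added to the same composition.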

The multi-vector space embedding unit 101 passes the input data elements through a plurality of embedding units (embedding unit 1, embedding unit 2, ..., embedding unit K) and embeds (encodes, maps, or injects) them into the respective vector spaces.

The multi-vector space embedding unit 101 includes different embedding units, and through this, it is possible to prevent biased data output due to a specific embedding method in expressing input data in a vector space.

[Table 1] shows vectors obtained by embedding arbitrary elements of the first data set Q and the second data set S in K different vector spaces.

	Embedding vectors of the first data set Q	Embedding vectors of the second data set S
Embedding unit 1	E₁(Q) = { E₁(q₁), E₁(q₂), ..., E₁(q_N) }	E₁(S) = { E₁(s₁), E₁(s₂), ..., E₁(s_M) }
Embedding unit 2	E₂(Q) = { E₂(q₁), E₂(q₂), ..., E₂(q_N) }	E₂(S) = { E₂(s₁), E₂(s₂), ..., E₂(s_M) }
Embedding unit 3	E₃(Q) = { E₃(q₁), E₃(q₂), ..., E₃(q_N) }	E₃(S) = { E₃(s₁), E₃(s₂), ..., E₃(s_M) }
...	...	...
Embedding unit K	E_K(Q) = { E_K(q₁), E_K(q₂), ..., E_K(q_N) }	E_K(S) = { E_K(s₁), E_K(s₂), ..., E_K(s_M) }

In [Table 1], the calculation result of the i-th embedding unit for an element q is denoted E_i(q). That is, for the input first data set Q = {q₁, q₂, q₃, ..., q_N}, embedding unit 1 produces E₁(Q) = { E₁(q₁), E₁(q₂), ..., E₁(q_N) }. This embedding process makes the calculation of similarity and difference between data effective and convenient.

Embedding results generated through a plurality of embedding units can basically be mutually compared only for the corresponding embedding method. That is, the data embedded through the embedding unit 1 and the data embedded through the embedding unit 2 are not related to each other, and therefore, similarities or differences cannot be calculated.
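The following Python sketch illustrates one possible reading of this constraint: each search is carried out entirely within the vector space of a single embedding unit, and only the resulting candidates are merged afterwards, with optional random selection. The two toy embedding units, the function names, the per-unit distances, and the fixed random seed are assumptions made for the example, not part of the original disclosure.

```python
import math
import random

def unit_1(x):
    """Toy embedding unit 1: uses the (width, height) pair as-is."""
    return (x[0], x[1])

def unit_2(x):
    """Toy embedding unit 2: maps the pair to a 1-D vector holding its area."""
    return (x[0] * x[1],)

def search_in_space(embed, first_set, second_set, eps):
    """Indices of second-set elements within eps of any reference, in one space only."""
    refs = [embed(q) for q in first_set]
    return {
        idx for idx, s in enumerate(second_set)
        if any(math.dist(ref, embed(s)) < eps for ref in refs)
    }

def multi_space_search(units, eps_per_unit, first_set, second_set, k=None, seed=0):
    """Merge per-space candidates; optionally keep a random sample of size k."""
    candidates = set()
    for embed, eps in zip(units, eps_per_unit):
        candidates |= search_in_space(embed, first_set, second_set, eps)
    ordered = sorted(candidates)
    if k is not None and k < len(ordered):
        ordered = sorted(random.Random(seed).sample(ordered, k))
    return [second_set[i] for i in ordered]

Q = [(2.0, 3.0)]
S = [(2.1, 3.1), (1.0, 6.0), (6.0, 1.0), (5.0, 5.0)]

all_candidates = multi_space_search([unit_1, unit_2], [0.5, 0.6], Q, S)
two_random = multi_space_search([unit_1, unit_2], [0.5, 0.6], Q, S, k=2)
```

Here unit 1 alone finds only the first element of S, whereas unit 2 also finds the two elements with the same area as the query, which illustrates how using several embedding units can reduce the bias of any single one.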

An embedding technique according to an embodiment of the present invention is defined as a process of mapping arbitrary data to a vector space of an arbitrary dimension. It is preferable that the data output by the embedding process be smaller in size than the input. This process, commonly referred to as encoding, has the effect of removing redundant or unnecessary information and compressing the data into concentrated information. For such embedding, schemes such as PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), Laplacian eigenmaps, Isomap, LLE (Locally Linear Embedding), and t-SNE (t-distributed Stochastic Neighbor Embedding) may be used.

Another embedding method is to pass the input data through an artificial neural network capable of obtaining high-level semantic information, and then to use a layer of the network that contains this semantic information as a vector space in which information is enriched.

According to an embodiment of the present invention, it is preferable to modify and supplement methods such as an autoencoder, a variational autoencoder (VAE), and a generative adversarial neural network as an artificial neural network that acquires such enriched information. Such a neural network layer may have the shape of an array or matrix in which numbers are listed, and these values mean data embedded in a predetermined vector space. As an arbitrary position on this vector space can also be referred to as a representation vector, main characteristics of input data are expressed in different ways for each position.

For example, the vector (1, 0, 0) can express meanings such as "laughing", the vector (0, 1, 0) "normal", and the vector (0, 0, 1) "crying".

This vector space is an algebraic space and can also be called a latent space, and becomes a different vector space depending on the neural network model or learning method.

It is desirable to perform such embedding consistently for each media format, such as image, text, and audio. For example, images are mapped to a vector space using an embedding method for images, and text is mapped using a separate embedding method for text.

Conventionally well-known artificial neural networks for images include Google's Inception Net, Oxford University's VGG Net, and Stanford University's SqueezeNet, and for text, word2vec, doc2vec, ELMo (Embeddings from Language Model), fastText, GloVe, and BERT (Bidirectional Encoder Representations from Transformers). For example, the process of embedding an image into a vector space with an artificial neural network is as follows. A color image with 100 pixels in width, 100 pixels in height, and three RGB color channels can be expressed as a tensor of 100x100x3. If this value is embedded in a 2x1 vector space through an artificial neural network, it can be seen that as a result, a 100x100x3 image is embedded (encoded, mapped, or injected) into a 2-dimensional vector space.
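The dimensions in this example can be made concrete with a short Python sketch. Note that it does not use an artificial neural network: a fixed random linear projection stands in for the trained network purely to show a 100x100x3 input being mapped to a 2-dimensional vector. The function names, the random pixel values, and the projection itself are assumptions made for illustration.

```python
import random

HEIGHT, WIDTH, CHANNELS = 100, 100, 3

def make_image(seed):
    """Hypothetical 100x100x3 image with random values in [0, 1)."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(CHANNELS)]
             for _ in range(WIDTH)]
            for _ in range(HEIGHT)]

def make_projection(out_dim=2, seed=0):
    """Stand-in for a trained network: a fixed random linear map to out_dim values."""
    rng = random.Random(seed)
    n_inputs = HEIGHT * WIDTH * CHANNELS
    return [[rng.gauss(0.0, 1.0) / n_inputs for _ in range(n_inputs)]
            for _ in range(out_dim)]

def embed_image(image, projection):
    """Flatten the image tensor and project it to a low-dimensional vector."""
    flat = [value for row in image for pixel in row for value in pixel]
    return tuple(sum(w * v for w, v in zip(weights, flat)) for weights in projection)

projection = make_projection()
vector = embed_image(make_image(seed=1), projection)
```

The 30,000 input values are reduced to the two components of `vector`; a real embedding unit would replace `make_projection` with a model whose layers carry semantic information.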

FIGS. 3 and 4 show only the first data set to help those skilled in the art understand the concept of embedding, and it is assumed that the multi-vector space embedding unit 101 is composed of only embedding unit 1 and embedding unit 2. If data q is input to embedding unit 1 and embedding unit 2, the output embedding vectors are represented by E₁(q) and E₂(q), respectively.

Referring to FIG. 3, it is assumed that the first data set is composed of two elements, Q = {q₁, q₂}, where q₁ is a triangle and q₂ is a rectangle with rounded corners. When the elements of the first data set Q pass through the two embedding units, {E₁(q₁), E₁(q₂)} and {E₂(q₁), E₂(q₂)} are output, respectively, and mapped to positions in each embedding vector space. What is noteworthy in the left and right figures of FIG. 3 is that, when different embedding units are used, even the same data can be embedded at different positions, and the relative distance in the embedding space can also differ. In FIG. 3, the triangular q₁ is embedded at different positions, E₁(q₁) and E₂(q₁), in embedding vector space 1 and embedding vector space 2. The relative distance between q₁ and q₂ in the embedding space is E₁(q₁) - E₁(q₂) on the left side of FIG. 3 and E₂(q₁) - E₂(q₂) on the right side of FIG. 3, and these values will differ from each other.

FIG. 4 is an example in which the first data set has four elements, Q = {q₁, q₂, q₃, q₄}, showing the data expanded from two elements to four compared with FIG. 3. In this way, Q can have multiple elements, the number of which is denoted by N.

As described above, the first data set Q is embedded in a predetermined vector space through the multi-vector space embedding unit.

The first data set-based search range determination unit 103 determines the search range based on the embedded data according to the four methods described above with reference to FIG. 2 .

(i) Determining a predetermined search range using elements of the first data set as reference positions

(ii) determining a predetermined range by using the linearly combined positions of the elements of the first data set as reference positions;

(iii) A boundary is determined in consideration of the positions of elements of the first data set, and an area related to the critical range is determined as a predetermined search range along the determined boundary.

(iv) Determine a predetermined search range by combining the methods for determining the search range of (i) to (iii)

Hereinafter, "(i) determining a predetermined search range by using elements of the first data set as reference positions" will be described.

As shown in (1) of FIG. 2, with arbitrary elements of the first data set as reference positions, the distance ε near the periphery is used to determine the search range.

For example, when the first data set is given as Q = {q₁, q₂} and the i-th embedding method is applied, E_i(Q) = { E_i(q₁), E_i(q₂) } is obtained. If the search distances given for q₁ and q₂ are ε₁ and ε₂, respectively, the search range is determined from the distances ε₁ and ε₂ centered on the vectors E_i(q₁) and E_i(q₂) in the embedding vector space. That is, all points p closer than ε₁ to the vector E_i(q₁), and all points p closer than ε₂ to the vector E_i(q₂), are found and set as the search range.

Expressing this as an absolute value and an inequality for data q ₁ is as follows. The position of p on the vector space that satisfies the inequality of [Equation 1] becomes the search range.

[Equation 1]

| E _i (q ₁ ) - p | < ε ₁

For data q₂, the position of p on the vector space that satisfies the inequality of [Equation 2] becomes another search range.

[Equation 2]

| E _i (q ₂ ) - p | < ε ₂

The search range satisfying [Equation 1] and [Equation 2] is the range represented by the green area in (1) of FIG. 2 .
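As a non-limiting illustration, the search defined by [Equation 1] and [Equation 2] can be sketched in Python as a membership test against a union of balls, one per element of the first data set. The function name `within_any_ball` and all coordinates and distances are assumptions made for the example, and the Euclidean distance is assumed as the meaning of | · |.

```python
import math

def within_any_ball(p, centers, radii):
    """True if p satisfies | center - p | < eps for at least one (center, eps) pair."""
    return any(math.dist(center, p) < eps for center, eps in zip(centers, radii))

# Hypothetical embedding results for Q = {q1, q2} and their search distances.
E_i_Q = [(0.0, 0.0), (5.0, 5.0)]   # E_i(q1), E_i(q2)
epsilons = [1.0, 2.0]              # eps_1, eps_2

# Hypothetical embedded elements of the second data set.
E_i_S = [(0.5, 0.5), (1.0, 0.0), (4.0, 4.0), (9.0, 9.0)]

hits = [p for p in E_i_S if within_any_ball(p, E_i_Q, epsilons)]
```

Because the inequalities are strict, the point (1.0, 0.0), which lies exactly at distance ε₁ from E_i(q₁), is not included in the result.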

This can be modified in various ways as shown in [Equation 3]. In other words, the search range can be all internal, external, and boundary points with the ε distance as the boundary around the predetermined reference point.

The reason for supporting such diverse possibilities is to combine and organize machine learning data in various ways.

[Equation 3]

| E_i(q₁) - p | > ε₁    (1)

| E_i(q₁) - p | = ε₁    (2)

| E_i(q₂) - p | > ε₂    (3)

| E_i(q₂) - p | = ε₂    (4)

In [Equation 3], (1) is the set of vectors p farther than ε₁ from the vector E_i(q₁) in the embedding vector space, (2) is the set of vectors p located exactly ε₁ away from the vector E_i(q₁), (3) is the set of vectors p farther than ε₂ from the vector E_i(q₂), and (4) is the set of vectors p located exactly ε₂ away from the vector E_i(q₂).
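The inside, outside, and boundary variants can be sketched in Python with a single function that takes the desired mode as a parameter. The function name, the mode strings, and the tolerance `tol` are assumptions made for the example; in particular, the tolerance is introduced here because exact equality rarely holds for floating-point distances, and it is not part of the original disclosure.

```python
import math

def in_region(p, center, eps, mode, tol=1e-9):
    """Test p against a ball of radius eps around center.

    mode "inside"   : | center - p | < eps
    mode "outside"  : | center - p | > eps
    mode "boundary" : | center - p | = eps, up to the tolerance tol (assumption)
    """
    d = math.dist(center, p)
    if mode == "inside":
        return d < eps
    if mode == "outside":
        return d > eps
    if mode == "boundary":
        return math.isclose(d, eps, abs_tol=tol)
    raise ValueError(f"unknown mode: {mode}")

center, eps = (0.0, 0.0), 5.0
points = [(1.0, 1.0), (3.0, 4.0), (6.0, 8.0)]

inside = [p for p in points if in_region(p, center, eps, "inside")]
boundary = [p for p in points if in_region(p, center, eps, "boundary")]
outside = [p for p in points if in_region(p, center, eps, "outside")]
```

In practice a boundary search would likely use a larger tolerance, which turns the boundary into a thin shell around the reference position.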

Hereinafter, “(ii) determining a predetermined range by using linearly combined positions of elements of the first data set as reference positions” will be described.

Assuming that the first data set is given as Q = {q₁, q₂}, the embedding vectors { E_i(q₁), E_i(q₂) } are obtained by applying the i-th embedding method. If the weights for linear combination are given as {w₁, w₂} and the search distance is given as ε, the linearly combined position T = w₁·E_i(q₁) + w₂·E_i(q₂) is obtained.

Then, based on the distance of ε with T as the center, the inside, outside, and boundary can be set as the search range, which is as shown in [Equation 4] below.

Here, the sum of all weights is w ₁ + w ₂ = 1. In this case, there are only two elements, and in the case of N elements, w ₁ + w ₂ + . . . + w _N = 1.

[Equation 4]

| T - p | < ε

| T - p | > ε

| T - p | = ε

In the two-element case (FIG. 5), E_i(q₁) and E_i(q₂) are linearly combined with the weights w₁ and w₂ to point to a new position T. Here d₁ and d₂ denote the distances between the vectors, calculated as the absolute value of the difference between two vectors: d₁ = | T - E_i(q₁) | and d₂ = | T - E_i(q₂) |. Since w₁ = d₂ / (d₁ + d₂) and w₂ = d₁ / (d₁ + d₂), the weights and the distances are related to each other. That is, the distances can be determined from the weights, and the weights can be calculated from the distances.
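The relationship between the weights and the distances can be checked numerically with a small Python sketch. The function name `linear_combination` and the sample vectors and weights are assumptions made for the example; the check assumes non-negative weights that sum to 1, so that T lies between the two embedded elements.

```python
import math

def linear_combination(vectors, weights):
    """Weighted sum of vectors; the weights are required to sum to 1."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    dim = len(vectors[0])
    return tuple(sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim))

# Hypothetical embedding vectors E_i(q1), E_i(q2) and weights w1, w2.
e_q1, e_q2 = (0.0, 0.0), (10.0, 0.0)
w1, w2 = 0.75, 0.25

T = linear_combination([e_q1, e_q2], [w1, w2])

d1 = math.dist(T, e_q1)   # distance from T to E_i(q1)
d2 = math.dist(T, e_q2)   # distance from T to E_i(q2)

# Recover the weights from the distances.
w1_from_d = d2 / (d1 + d2)
w2_from_d = d1 / (d1 + d2)
```

With these values T is (2.5, 0.0): the larger weight w₁ pulls T toward E_i(q₁), so d₁ is the smaller distance, and the recovered weights match the original ones.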

Referring to FIG. 6, when Q = {q₁, q₂, q₃}, the elements are embedded as E_i(q₁), E_i(q₂), and E_i(q₃), which are linearly combined with the weights w₁, w₂, and w₃ to point to a new position T. The search range is determined through the same method as in [Equation 4] described above or a modification thereof.

In the case of Q = {q ₁ , q ₂ , q ₃ , q ₄ }, it is embedded as E _i (q ₁ ), E _i (q ₂ ), E _i (q ₃ ), and E _i (q ₄ ). In this case, a new position T is indicated by linear combination with weights of w ₁ , w ₂ , w ₃ , and w ₄ . The search range is determined through the same method as in [Equation 4] described above or a modification thereof.

As described with reference to FIG. 6, FIG. 8 is a case where the first data set has three elements: when Q = {q₁, q₂, q₃}, the embeddings E_i(q₁), E_i(q₂), and E_i(q₃) are linearly combined with the weights w₁, w₂, and w₃ to point to a new position T. In this case, two parameters, {ε_a, ε_b}, are used to determine the search range. In FIG. 8, when a region that is farther from the vector T than the distance parameter ε_a and nearer than the distance parameter ε_b must be selected in the embedding vector space, the range can be selected as in [Equation 5]. By increasing the number of distance parameters ε and modifying how they are combined, a much wider variety of regions can be selected.

[Equation 5]

$$\{\, |T - p| > \varepsilon_a \,\} \ \text{and} \ \{\, |T - p| < \varepsilon_b \,\}$$
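The two-parameter range of [Equation 5] selects a shell around T. A minimal sketch follows, assuming Euclidean distance and that ε_a is smaller than ε_b (the specification does not state this explicitly); the name `in_shell` and the sample points are hypothetical.

```python
import math

def in_shell(center, p, eps_a, eps_b):
    """True when p is farther than eps_a and nearer than eps_b from center."""
    if eps_a >= eps_b:
        # Assumption: an empty range is treated as a configuration error.
        raise ValueError("eps_a must be smaller than eps_b")
    d = math.dist(center, p)
    return eps_a < d < eps_b

T = (0.0, 0.0)
points = [(0.5, 0.0), (1.5, 0.0), (0.0, 2.5), (3.0, 4.0)]

# Keep only the points whose distance from T lies strictly between 1.0 and 3.0.
shell = [p for p in points if in_shell(T, p, 1.0, 3.0)]
print(shell)
```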

Hereinafter, "(iii) determining a boundary considering the positions of elements of the first data set, and determining an area related to a critical range along the determined boundary as a predetermined search range" will be described.

A boundary is determined in consideration of positions of elements of the first data set, and an area included in the critical range along the determined boundary is determined as a predetermined search range.

The description and principle are the same as in case (ii). However, whereas (ii) takes the ε distance around a single vector position T as the search range, method (iii) finds a boundary using a Voronoi diagram, as described above, or a high-dimensional Gaussian function, and takes the ε distance around that boundary as the search range. Through this, it is possible to effectively find data that is expected to be located at the decision boundary, and to secure data having various characteristics by referring to the location of the decision boundary.
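As an illustration of searching along a boundary, consider the simplest case of two reference vectors, where the Voronoi boundary is the perpendicular bisector hyperplane between them. The sketch below is an assumption-laden simplification, not the method of the specification: it handles only two reference points, and the names `distance_to_bisector` and `near_boundary` are hypothetical.

```python
import math

def distance_to_bisector(a, b, p):
    """Distance from p to the hyperplane of points equidistant from a and b."""
    da2 = sum((x - y) ** 2 for x, y in zip(p, a))
    db2 = sum((x - y) ** 2 for x, y in zip(p, b))
    # |‖p-a‖² - ‖p-b‖²| / (2‖a-b‖) is the distance to the bisector.
    return abs(da2 - db2) / (2.0 * math.dist(a, b))

def near_boundary(a, b, candidates, eps):
    """Select candidates lying within eps of the two-point Voronoi boundary."""
    return [p for p in candidates if distance_to_bisector(a, b, p) < eps]

# Illustrative stand-ins for two embedded elements of the first data set.
a, b = (0.0, 0.0), (4.0, 0.0)
candidates = [(2.1, 5.0), (0.5, 0.0), (1.5, -3.0)]
band = near_boundary(a, b, candidates, 1.0)
print(band)
```

With more than two reference vectors the boundary is piecewise, and a full Voronoi computation (or the Gaussian alternative mentioned in the text) would be needed.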

Hereinafter, "(iv) determining a predetermined search range by combining the methods for determining the search range of (i) to (iii)" will be described.

In the case of (iv), desired data can be output in various ways by determining a predetermined search range by combining the methods for determining the search range of (i) to (iii) described above.

Hereinafter, the role of the second data set search and output unit 105 will be described.

When the search range is determined using the first data set, data suitable for the search range may be extracted from the second data set. That is, just as the i-th embedding method is applied to the first data set Q = {q₁, q₂, ..., q_N} to obtain the vectors {E_i(q₁), E_i(q₂), ..., E_i(q_N)}, the i-th embedding unit is applied to the input second data set S = {s₁, s₂, ..., s_M} to obtain E_i(S) = {E_i(s₁), E_i(s₂), ..., E_i(s_M)}.

Then, from the embedded second data set {E_i(s₁), E_i(s₂), ..., E_i(s_M)}, the data that fits the search range determined by the first data set-based search range determination unit 103 is selected.

The i-th embedding unit then outputs a subset S'_i of S.

That is, K output candidate subsets {S'₁, S'₂, ..., S'_K} are obtained according to the plurality of embedding schemes used.

For E_i(S) = {E_i(s₁), E_i(s₂), ..., E_i(s_M)} obtained by applying the i-th embedding method, whether each vector fits within the predetermined search range, as described above with reference to [Equation 1] to [Equation 5], can be checked easily by measuring the distance between the determined center point and the embedded vector.

For example, to check whether the embedding vector E_i(s₁) is within distance ε of the vector T calculated by linear combination, the condition |E_i(s₁) - T| < ε is tested: if it holds, the condition is met; otherwise, it is not. In this way, data that meets the condition can be extracted over all embedding vectors.
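The distance check can be sketched as a simple filter over the second data set. This is a minimal illustration under stated assumptions: `toy_embed` is a hypothetical stand-in for an embedding unit (string length and vowel count), and the center T and ε are arbitrary sample values.

```python
import math

def select_in_range(items, embed, center, eps):
    """Return the items whose embedding lies within eps of center."""
    return [s for s in items if math.dist(embed(s), center) < eps]

def toy_embed(s):
    """Hypothetical 2-D embedding: (length, number of vowels)."""
    return (float(len(s)), float(sum(c in "aeiou" for c in s)))

S = ["cat", "horse", "elephant", "ox"]
T = (3.0, 1.0)  # illustrative reference position
selected = select_in_range(S, toy_embed, T, 2.5)
print(selected)
```

The returned list corresponds to the subset S'_i for one embedding unit; repeating the call with each of the K embedding units yields the K candidate subsets.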

From the K output candidate subsets {S'₁, S'₂, ..., S'_K}, the second data set search and output unit 105 randomly selects and outputs one or more of them, outputs all of them, or outputs the processing result of an explicitly specified embedding unit.

Even when a plurality of subsets is output, all output data belongs to the second data set S, so the same item may appear more than once. In this case, it is preferable to output only unique data.
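The three output options and the removal of duplicates can be sketched as follows. The function name `output_subsets`, its parameters, and the mode strings are hypothetical; the sketch also assumes that items of S are hashable so that duplicates can be detected.

```python
import random

def output_subsets(candidates, mode="all", index=None, k=1, seed=None):
    """Merge selected candidate subsets into one list of unique items."""
    if mode == "specific":
        chosen = [candidates[index]]          # result of one embedding unit
    elif mode == "random":
        chosen = random.Random(seed).sample(candidates, k)
    elif mode == "all":
        chosen = candidates
    else:
        raise ValueError(f"unknown mode: {mode}")
    # dict.fromkeys keeps the first occurrence of each item, in order.
    return list(dict.fromkeys(item for subset in chosen for item in subset))

# Illustrative candidate subsets S'_1, S'_2, S'_3.
candidate_subsets = [["s1", "s3"], ["s3", "s4"], ["s1"]]
print(output_subsets(candidate_subsets))
print(output_subsets(candidate_subsets, mode="specific", index=1))
```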

Referring to FIG. 9, an embedding vector space reduction unit 102 is additionally included, which operates after embedding is performed through the multi-vector space embedding unit 101.

Embeddings that show meaningful performance usually have a size of 1000 dimensions or more. The higher the dimension, the less information is lost due to dimension reduction, which is advantageous for inversely estimating the input data from the embedded information or for analyzing its unique features.

Conversely, however, as the dimension of embedding increases, it may be difficult for people to intuitively understand a vector space and it may be difficult to set a search range in a high dimension.

According to an embodiment of the present invention, the embedding vector space reduction unit 102 performs a process of reducing the embedded data to a lower dimension.

The target dimension is likely to be two or three, which humans can easily understand and handle. In 2D or 3D, analysis and control are easy and calculation processing can be simplified.

Preferably, the embedding vector space reduction unit 102 performs the reduction processing independently and in parallel for each of the K embedding units.

For example, the embedding unit 1 may be reduced from 100 dimensions to 2 dimensions, and the embedding unit 2 from 1000 dimensions to 3 dimensions. In addition, for each embedding unit, the same dimension reduction method must be applied identically to the first data set Q and the second data set S.

In relation to the reduction of the embedding unit, an example will be described as follows.

Assume that, for the first data set Q, the set of vectors embedded by the i-th embedding unit is E_i(Q) = {E_i(q₁), E_i(q₂), ..., E_i(q_N)}, that for the second data set S it is E_i(S) = {E_i(s₁), E_i(s₂), ..., E_i(s_M)}, and that each element has become a 1000-dimensional vector by passing through the embedding method E_i(). Then E_i(q₁), E_i(q₂), ..., E_i(q_N), E_i(s₁), E_i(s₂), ..., E_i(s_M) all have the same dimension of 1000. That is, each represents its own vector position in the same 1000-dimensional vector space.

In this situation, passing through the embedding vector space reduction unit 102 has the effect of turning E_i() into a new mapping newE_i(). newE_i() yields a vector of dimension lower than 1000, such as 2D or 3D. That is, through the processing of the embedding vector space reduction unit 102, the vectors embedded in the previous step become newE_i(q₁), newE_i(q₂), ..., newE_i(q_N), newE_i(s₁), newE_i(s₂), ..., newE_i(s_M), each representing a vector position in the reduced vector space.

As a specific method of reducing an embedded dimension to a lower dimension, there are the following methods.

(1) Conventional PCA (Principal Component Analysis) is applied. PCA can be used not only to embed input data, but also to reduce the dimensionality of an already embedded vector space. For example, if PCA is performed on 1000-dimensional data, the data can be mapped (projected) to a low-dimensional vector space while losing as little as possible of the amount of information (energy of information) contained in the original vector space.

(2) A conventional vector quantization method is applied. Since vector quantization can be defined as mapping a set X of Z vectors to a set Y of K vectors, a set of high-dimensional embedding vectors is mapped to a set of low-dimensional embedding vectors.

(3) The dimension is reduced by extracting some of the embedded vectors. For example, if the set of embedded vectors is X = {A, B, C, D, E}, its dimensionality is reduced by simply extracting Y = {A, B} or Y = {C, D, E} according to a rule. This is possible in principle because each position of the embedded vector space is likely to contain its own meaning and characteristics. For example, if vector A constituting the above X is "person's gender", vector B is "age group", vector C is "race", vector D is "interest", and vector E is "hobby", the desired analysis can be performed even when only selected elements are extracted. This presupposes that the vectors in the embedded vector space carry exact meanings and characteristics that can be effectively interpreted.
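Method (3) is the simplest to illustrate. The sketch below assumes each embedded item is a tuple of components (A, B, C, D, E) and builds one reduction function that is then applied identically to the embeddings of both Q and S, as the description requires. The name `make_reducer` and the sample vectors are hypothetical.

```python
def make_reducer(keep):
    """Build a function that keeps only the components at the given indices."""
    keep = tuple(keep)

    def reducer(vec):
        return tuple(vec[k] for k in keep)

    return reducer

# Keep components A and B (indices 0 and 1) out of (A, B, C, D, E).
reducer = make_reducer([0, 1])

# Illustrative 5-dimensional embeddings of Q and S.
Q_emb = [(1.0, 2.0, 3.0, 4.0, 5.0)]
S_emb = [(0.5, 1.5, 9.0, 9.0, 9.0), (7.0, 8.0, 0.0, 0.0, 0.0)]

# The same reducer must be used for both data sets.
new_Q = [reducer(v) for v in Q_emb]
new_S = [reducer(v) for v in S_emb]
print(new_Q, new_S)
```

Methods (1) and (2) follow the same pattern: the projection or codebook is determined once and then reused for both data sets, so that all reduced vectors share one vector space.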

After the dimension is reduced in this way, a 2-dimensional or 3-dimensional Voronoi diagram method or Gaussian method can be used to calculate a decision boundary, so processing is much simpler than with a high-dimensional algorithm.

Meanwhile, the embedding-based data set processing method according to an embodiment of the present invention may be implemented in a computer system or recorded on a recording medium. A computer system may include at least one processor, a memory, a user input device, a data communication bus, a user output device, and a storage. Each of the aforementioned components communicates data through a data communication bus.

The computer system may further include a network interface coupled to the network. The processor may be a central processing unit (CPU) or a semiconductor device that processes instructions stored in memory and/or storage.

The memory and storage may include various types of volatile or non-volatile storage media. For example, memory may include ROM and RAM.

Accordingly, the embedding-based data set processing method according to an embodiment of the present invention may be implemented in a computer-executable method. When the embedding-based data set processing method according to an embodiment of the present invention is executed in a computer device, computer readable instructions may perform the embedding-based data set processing method according to the present invention.

Meanwhile, the above-described embedding-based data set processing method according to the present invention can be implemented as computer readable codes on a computer readable recording medium. Computer-readable recording media includes all types of recording media in which data that can be decoded by a computer system is stored. For example, there may be read only memory (ROM), random access memory (RAM), a magnetic tape, a magnetic disk, a flash memory, an optical data storage device, and the like. In addition, the computer-readable recording medium may be distributed in computer systems connected through a computer communication network, and stored and executed as readable codes in a distributed manner.

A method of processing an embedding-based data set according to an embodiment of the present invention includes receiving a data set (S1001), performing multi-vector space embedding (S1002), determining a first data set-based search range (S1004), and performing a second data set search and output (S1005).

In step S1002, the data set is expressed in a vector space, and elements of the input data are passed through a plurality of embedding units to be embedded in each vector space.

In step S1004, a reference position or boundary for search is determined, and a search range is determined using a preset search distance parameter.

In step S1004, positions of elements of the first data set are determined as reference positions.

In step S1004, linearly combined positions of elements of the first data set are determined as reference positions.

In step S1004, a boundary is determined in consideration of positions of elements of the first data set.

In step S1005, a subset of output candidates is acquired and output is performed according to a preset method.

The embedding-based data set processing method according to an embodiment of the present invention further includes performing dimensionality reduction on the embedding result (S1003) after step S1002 and before step S1004.
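Steps S1001 to S1005 can be put together in a short end-to-end sketch. Everything concrete here is an assumption made for illustration: the function name `process`, the toy embedders, the use of an equally weighted linear combination as the reference position, and the choice to output the union of all candidate subsets without duplicates.

```python
import math

def process(Q, S, embedders, eps, reducers=None):
    """Sketch of S1001-S1005 for K embedding units."""
    outputs = []
    for i, embed in enumerate(embedders):
        # S1003 (optional): the same reducer is applied to Q and S.
        reducer = reducers[i] if reducers else (lambda v: v)
        # S1002: embed the first data set.
        q_vecs = [reducer(embed(q)) for q in Q]
        dim = len(q_vecs[0])
        # S1004: reference position T with equal weights 1/N.
        center = tuple(sum(v[k] for v in q_vecs) / len(q_vecs)
                       for k in range(dim))
        # S1005: keep elements of S within eps of T in this embedding.
        outputs.append([s for s in S
                        if math.dist(reducer(embed(s)), center) < eps])
    # Output all candidate subsets, keeping only unique items.
    return list(dict.fromkeys(x for subset in outputs for x in subset))

# Two hypothetical embedding units acting on numeric data.
embed_a = lambda x: (x, 2.0 * x)
embed_b = lambda x: (x, x * x, 1.0)

Q = [1.0, 3.0]
S = [0.0, 2.0, 2.5, 9.0]
result = process(Q, S, [embed_a, embed_b], eps=1.2)
print(result)
```

Here 2.5 is selected by the first embedding unit but not by the second, so the union still contains it exactly once.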

Referring to FIG. 11, a computer system 1000 may include at least one of a processor 1010, a memory 1030, an input interface device 1050, an output interface device 1060, and a storage device 1040, which communicate through a bus 1070. The computer system 1000 may also include a communication device 1020 coupled to a network. The processor 1010 may be a central processing unit (CPU) or a semiconductor device that executes instructions stored in the memory 1030 or the storage device 1040. The memory 1030 and the storage device 1040 may include various types of volatile or non-volatile storage media. For example, the memory may include read only memory (ROM) and random access memory (RAM). In an embodiment of the present description, the memory may be located inside or outside the processor, and the memory may be connected to the processor through various known means.

Accordingly, an embodiment of the present invention may be implemented as a computer-implemented method or as a non-transitory computer-readable medium in which computer-executable instructions are stored. In one embodiment, when executed by a processor, the computer readable instructions may perform a method according to at least one aspect of the present disclosure.

The communication device 1020 may transmit or receive a wired signal or a wireless signal.

In addition, the method according to the embodiment of the present invention may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer readable medium.

The computer readable medium may include program instructions, data files, data structures, etc. alone or in combination. Program instructions recorded on the computer readable medium may be specially designed and configured for the embodiments of the present invention, or may be known and usable to those skilled in the art in the field of computer software. A computer-readable recording medium may include a hardware device configured to store and execute program instructions. For example, computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes, optical media such as CD-ROMs and DVDs, magneto-optical media such as floptical disks, and ROM, RAM, flash memory, and the like. The program instructions may include high-level language codes that can be executed by a computer through an interpreter, as well as machine language codes generated by a compiler.

Although the embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto, and various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also fall within the scope of the present invention.

Claims

1. An apparatus for processing an embedding-based data set, comprising:

a multi-vector space embedding unit that expresses input data in a vector space;

a first data set-based search range determination unit that determines a search range using an embedding result of the multi-vector space embedding unit; and

a second data set search and output unit that constructs a new data set by extracting subset data from a second data set, which is a data search space.

2. The apparatus of claim 1, wherein the multi-vector space embedding unit passes elements of the input data through a plurality of embedding units and embeds them in respective vector spaces.

3. The apparatus of claim 1, wherein the first data set-based search range determination unit determines a reference position or boundary for search, and determines the search range using a preset search distance parameter from the determined position.

4. The apparatus of claim 3, wherein the first data set-based search range determination unit sets positions of elements of the first data set as reference positions, and the second data set search and output unit performs a search in the second data set and outputs a subset of the second data set.

5. The apparatus of claim 3, wherein the first data set-based search range determination unit determines linearly combined positions of elements of the first data set as reference positions, and the second data set search and output unit performs a search over the search range and outputs a subset of the second data set.

6. The apparatus of claim 3, wherein the first data set-based search range determination unit determines a boundary by considering positions of elements of the first data set, and the second data set search and output unit performs a search based on the boundary and outputs a subset of the second data set.

7. The apparatus of any one of claims 4 to 6, wherein the second data set search and output unit obtains output candidate subsets and randomly selects and outputs them, outputs all of them, or outputs the processing result of a specific embedding unit.

8. The apparatus of claim 1, further comprising an embedding vector space reduction unit that performs dimensionality reduction on the embedding result.

9. A method of processing an embedding-based data set, comprising:

(a) receiving a data set;

(b) performing multi-vector space embedding;

(c) determining a search range based on a first data set; and

(d) performing a second data set search and output.

10. The method of claim 9, wherein step (b) expresses the data set in a vector space and embeds elements of the input data in respective vector spaces by passing them through a plurality of embedding units.

11. The method of claim 9, wherein step (c) determines a reference position or boundary for search and determines the search range using a preset search distance parameter.

12. The method of claim 11, wherein step (c) determines positions of elements of the first data set as reference positions.

13. The method of claim 11, wherein step (c) determines linearly combined positions of elements of the first data set as reference positions.

14. The method of claim 11, wherein step (c) determines the boundary by considering positions of elements of the first data set.

15. The method of claim 9, wherein step (d) obtains output candidate subsets and performs output according to a preset method.

16. The method of claim 9, further comprising, after step (b) and before step (c), performing dimensionality reduction on the embedding result.