WO2024173841A1 - Systems and methods for seeded neural topic modeling - Google Patents
- Publication number
- WO2024173841A1 (application PCT/US2024/016227)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- topic
- words
- bag
- seed
- word
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
Definitions
- a method may include: (1) receiving, by a computer program, a seed topic word distribution comprising a plurality of seed topic words and having a plurality of topics; (2) receiving, by the computer program, a corpus of documents; (3) generating, by the computer program, bag of words representations for the corpus of documents; (4) converting, by the computer program, the corpus of documents to vector representations; (5) concatenating, by the computer program, the bag of words representations and the vector representations; (6) training, by the computer program, a topic modeling system using the seed topic word distribution and the concatenated bag of words representations and the vector representations resulting in a topic word distribution and a document word distribution; (7) generating, by the computer program, a plurality of new generated topics based on
- the method may also include receiving, by the computer program, a number of unseeded topics.
- the method may also include lemmatizing bag of words tokens for the bag of word representations and the seed topic words; and filtering, by the computer program, each bag of word token.
- each of the bag of words tokens may be filtered according to the number of documents in the corpus of documents in which the word associated with the bag of words token appears, may be filtered according to the percentage of documents in the corpus of documents in which the word associated with the bag of words token appears, may be filtered by removing seed topic words having a cosine similarity that is lower than a threshold when compared to each seed topic word, etc.
- the threshold may be user defined.
- the method may also include removing a subset of the seed topic words based on an entity attribution and/or a part of speech tag.
- the topic word distribution penalty and the topic word distribution reward have different scalings in the total loss.
- the topic modeling system may be trained for a plurality of epochs.
- the neural network loss may be from a Contextualized Topic Models loss function. PATENT APPLICATION ATTORNEY DOCKET NO.
- a non-transitory computer readable storage medium may include instructions stored thereon, which when read and executed by one or more computer processors, cause the one or more computer processors to perform steps comprising: receiving a seed topic word distribution comprising a plurality of seed topic words and having a plurality of topics; receiving a corpus of documents; generating bag of words representations for the corpus of documents; converting the corpus of documents to vector representations; concatenating the bag of words representations and the vector representations; training a topic modeling system using the seed topic word distribution and the concatenated bag of words representations and the vector representations resulting in a topic word distribution and a document word distribution; generating a plurality of new generated topics based on the topic word distribution; precomputing a topic word distribution penalty and a topic word distribution reward for the plurality of topics; penalizing the topic modeling system in response to the plurality of new generated topics diverging from the seed topic words and rewarding the topic modeling system in response to
- the non-transitory computer readable storage medium may also include instructions stored thereon, which when read and executed by one or more computer processors, cause the one or more computer processors to receive a number of unseeded topics.
PATENT APPLICATION ATTORNEY DOCKET NO. 052227.501508
- the non-transitory computer readable storage medium may also include instructions stored thereon, which when read and executed by one or more computer processors, cause the one or more computer processors to lemmatize the bag of words tokens for the bag of words representations and the seed topic words; and filter each bag of words token.
- each of the bag of words tokens may be filtered according to the number of documents in the corpus of documents in which the word associated with the bag of words token appears, may be filtered according to the percentage of documents in the corpus of documents in which the word associated with the bag of words token appears, may be filtered by removing seed topic words having a cosine similarity that is lower than a threshold when compared to each seed topic word, wherein the threshold may be user defined, etc.
- the non-transitory computer readable storage medium may also include instructions stored thereon, which when read and executed by one or more computer processors, cause the one or more computer processors to remove a subset of the seed topic words based on an entity attribution and/or a part of speech tag.
- the topic word distribution penalty and the topic word distribution reward have different scalings in the total loss.
- the topic modeling system may be trained for a plurality of epochs.
- a method for seeded neural topic modeling may include: (1) receiving, by a computer program, a plurality of topic seed words (i.e., a seed topic word distribution) and a corpus of documents in text form and, optionally, a number of unseeded topics K; (2) generating, by the computer program, a vocabulary comprising a bag of words for the corpus of documents; (3) pre-processing, by the computer program, the bag of words; (4) training, by the computer program, a topic modeling system using the topic seed words resulting in a topic word distribution and a document word distribution; (5) comparing, by the computer program, the N generated topics using the seed topic word distribution; (6) penalizing, by the computer program, the topic modeling system in response to the N generated topics diverging from the seed topic words; (7) rewarding, by the computer program, the topic modeling system in response to the N generated topics being similar to the seed topic words; (8) determining, by the computer program, a total loss from a neural network loss (reconstruction loss), a penalty loss from the penalizing, and a reward loss from the rewarding; and (
- the step of pre-processing the bag of words may include: lemmatizing bag of words tokens and the seed topic words to reduce the vocabulary as all tokens become the lemma; and reducing, by the computer program, a size of the vocabulary and the bag of words by filtering each bag of word token.
- the step of pre-processing the bag of words may include: removing topic seed words based on an entity attribution and/or a part of speech tag.
- each of the bag of words tokens may be filtered according to the number of documents in the document corpus in which the word associated with the bag of words token appears.
- each of the bag of words tokens may be filtered according to the percentage of documents in the document corpus in which the word associated with the bag of words token appears.
- each of the bag of words tokens may be filtered by removing words whose BERT, Word2Vec, or GloVe embedding has a cosine similarity that is lower than a threshold when compared to each seed topic word across all seed topics.
- the threshold may be user defined.
- the penalizing and the rewarding are two separate factors and have different scalings in the final loss function.
- the rewarding may be promoted more than the penalizing.
- the penalizing may be promoted more than the rewarding.
- the rewarding and the penalizing may scale at the same rate.
- the topic modeling system may be trained for a plurality of epochs.
- the neural network loss may be from a Contextualized Topic Models loss function.
BRIEF DESCRIPTION OF THE DRAWINGS
- should not be construed as limiting the present invention but are intended only to illustrate different aspects and embodiments.
- Figure 1 depicts a system for seeded neural topic modeling according to an embodiment.
- Figure 2 depicts a method for training a topic modeling system according to one embodiment.
- Figure 3 depicts an exemplary computing system for implementing aspects of the present disclosure.
- Embodiments relate generally to systems and methods for seeded neural topic modeling.
- Embodiments may introduce a seed to a Neural Topic Model architecture resulting in an effective semi-supervised topic modeling framework that yields cleaner, more domain-relevant topics when compared to the state-of-the-art open-source alternatives.
- Embodiments may be used with Neural Topic Models and Contextualized Topic Models.
- Embodiments may leverage a list of seed words by topic to guide the neural topic modeling process to generate topic words that are relevant to the domain of interest.
- Embodiments may surface un-specified topics in addition to the seeded topics.
- the framework disclosed herein may provide the following:
- 1. Neural seeding. Seeding is the process of initializing the topic modeling algorithms with a number of known topics, usually curated by a subject matter expert (SME).
- Embodiments may use, for example, a neural topic modeling architecture by defining a novel loss function.
- the loss function may combine concepts from the ProdLDA model and a reward and penalty factor aiming at promoting more distinctive topics based on the seed.
- 2. Emerging topic detection. While the framework expects users to input seeding, it may also promote learning new topics when those do not fit within the existing seed. As a result, emerging topics tend to be more distinctive and useful.
- 3. Taxonomy curation. Even with a great seeding mechanism, human language poses significant challenges if one attempts to create a taxonomy. For instance, between the words “account”, “payments”, “card”, and “loan”, it is not immediately clear which ones should be used as the topic’s name and which ones should be members of that topic’s word distribution. This may lead to bad seeding.
- Embodiments may optimize the taxonomy by considering the dataset each time, and may also include actions to improve the taxonomy.
- embodiments may combine contextualized representations obtained from large language models with bag-of-words-based neural topic models to obtain superior performance.
- this topic model learns two key components: (1) β, which represents the topic-word distribution, and (2) θ, which represents the document-topic distribution. From the final β distribution learnt, the most relevant keywords may be extracted for each topic group. From the final θ learnt, the most likely topic group each text document belongs to may be identified.
- the loss signal to learn these components is given by Equation 1 (not reproduced in this text), whose terms include: θ, the document-topic distribution; β, the topic-word distribution; the parameters of the neural network; w, the word distribution matrix; J, the number of topics; μ and σ, statistics of the Gaussian distribution sampled from the network; the scaling factors of the reward and penalize losses; and β_reward and β_penalize, the reward and penalize distributions that are initialized based on the seeding strategy.
- embodiments input seeded words, thereby guiding the model training process to retrieve keywords that are relevant to the seeded words.
- seeded words for topic group “Rewards” could be “cash back”, “bonus”, “points”, etc., and users may add any number of such topic groups and their corresponding seed words, as long as they would intuitively be useful for the particular dataset.
- the number of output topic groups is equal to or higher than the number of seeded topic groups.
- embodiments may generate topic keywords for both topic groups that are seeded and not seeded.
- Embodiments may guide the model training process to retrieve keywords that are relevant to the seeded words of each seeded topic group with the loss function outlined in equation 1.
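As a rough illustration of the seeding input described above, the following sketch turns per-topic seed words into an N-row seed matrix and derives the number of output topic groups. The "Payments" group, the vocabulary, the uniform weighting, and the names `build_seed_matrix` and `num_unseeded` are assumptions made for this example and do not come from the disclosure.

```python
# Hypothetical seed configuration: topic group name -> seed words.
seeds = {
    "Rewards": ["cash back", "bonus", "points"],
    "Payments": ["payment", "transfer"],
}
vocab = ["bonus", "cash back", "fee", "payment", "points", "transfer"]
num_unseeded = 1  # K, the optional number of unseeded topic groups


def build_seed_matrix(seeds, vocab):
    """Return one uniform word distribution per seeded topic (N rows)."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for words in seeds.values():
        row = [0.0] * len(vocab)
        present = [w for w in words if w in index]
        for w in present:
            row[index[w]] = 1.0 / len(present)
        matrix.append(row)
    return matrix


seed_matrix = build_seed_matrix(seeds, vocab)
num_topics = len(seed_matrix) + num_unseeded  # N + K output topic groups
```

With K set to zero the number of output topic groups equals the number of seeded groups; any positive K makes it higher.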
- System 100 may include electronic device 110, which may be a server (e.g., cloud-based and/or physical), a computer (e.g., workstation, desktop, laptop, notebook, tablet, etc.), a smart device (e.g., smart phone, smart watch, etc.), an Internet of Things (IoT) appliance, etc.
- Electronic device 110 may execute modeling computer program 115, which may receive data from data source(s) 120.
- Data sources 120 may provide documents, such as text documents.
- the documents may be any type of data in text form - printed publications, web articles, emails, customer reviews, chat messages, call transcripts, etc.
- the documents may include topics that are not detectable.
- Modeling computer program 115 may receive the data and may use contextual topic modelling to assign a likely topic group to the data. Modeling computer program 115 may output a filtered and/or ranked list of topics for the document.
- Referring to Figure 2, a method for seeded neural topic modeling is disclosed according to one embodiment.
- a user may provide a seed topic word distribution β' having a number N of topics to a computer program, such as a modeling computer program.
- a topic modelling system may identify, from the received word distribution β', a learnable topic: word distribution β and a learnable document: word distribution θ.
- the β distribution may be initialized as neural network parameter weights to be tuned throughout the training process, and the θ distribution may be sampled from the μ and σ distributions, which are generated from hidden layer outputs of the neural network.
- the user may also provide a number of unseeded topics.
- the topic modelling system may generate a trainable variational autoencoder neural network using β and θ.
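The sampling of the document distribution from Gaussian statistics can be sketched as follows. This is a minimal sketch that assumes the reparameterization commonly used in variational autoencoders followed by a softmax; the names `mu`, `sigma`, `theta`, and `sample_theta`, and the numeric values, are hypothetical.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_theta(mu, sigma, rng):
    """Draw theta = softmax(mu + sigma * eps), with eps ~ N(0, 1) per topic."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return softmax([m + s * e for m, s, e in zip(mu, sigma, eps)])


rng = random.Random(0)
mu = [0.2, -0.1, 0.5]    # hypothetical hidden-layer output (means)
sigma = [0.3, 0.3, 0.3]  # hypothetical hidden-layer output (standard deviations)
theta = sample_theta(mu, sigma, rng)  # one weight per topic, summing to 1
```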
- the computer program may receive a plurality of documents in textual form.
- the computer program may convert documents to word embedding vector representations using, for example, BERT.
- the computer program may also convert the documents to Bag of Words (BOW) vector representations.
- the computer program may concatenate the BOW vector representations and the word embedding vector representations. The concatenation is then provided to the topic modelling system.
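The concatenation step can be illustrated with plain Python lists. In this sketch, `toy_embedding` is a hypothetical stand-in for a contextual encoder such as BERT (it is not a real encoder), and the vocabulary and document are invented for the example.

```python
from collections import Counter


def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocab]


def toy_embedding(tokens, dim=4):
    """Hypothetical stand-in for a contextual document embedding."""
    vec = [0.0] * dim
    for token in tokens:
        vec[len(token) % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]


vocab = ["bonus", "card", "loan", "points"]
doc = "points bonus on my card card".split()

# The combined input is the BOW counts followed by the embedding values.
combined = bow_vector(doc, vocab) + toy_embedding(doc)
```

The resulting vector has length |vocabulary| + embedding dimension, which is the input width the first hidden layer would need to accept.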
- This combined representation of the input documents may be transformed by the hidden layers of the neural network to generate the μ and σ distributions, and μ and σ may be transformed to sample a θ distribution.
- the model’s β distribution and the sampled θ distribution may be used to reconstruct the bag of words representation.
- This reconstructed bag-of-words may be compared with the original bag-of-words to generate a reconstruction loss.
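One way to picture the reconstruction loss is as a negative log-likelihood of the original counts under the reconstructed word probabilities. The sketch below assumes a ProdLDA-style decoder, in which the topic weights are mixed before a softmax over the vocabulary; the function names and the numeric values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruct(theta, beta):
    """Word probabilities softmax(theta @ beta); beta is topics x vocabulary."""
    vocab_size = len(beta[0])
    logits = [sum(theta[k] * beta[k][v] for k in range(len(theta)))
              for v in range(vocab_size)]
    return softmax(logits)


def reconstruction_loss(bow, theta, beta):
    """Negative log-likelihood of the original counts under the reconstruction."""
    probs = reconstruct(theta, beta)
    return -sum(count * math.log(p + 1e-10) for count, p in zip(bow, probs))


beta = [[2.0, 0.0, -1.0], [-1.0, 0.5, 2.0]]  # hypothetical weights, 2 topics x 3 words
theta = [0.9, 0.1]                           # document leans towards topic 0
bow = [3.0, 1.0, 0.0]                        # original bag-of-words counts
loss = reconstruction_loss(bow, theta, beta)
```

A document whose counts match the words favoured by its dominant topic yields a lower loss than the same counts paired with a mismatched topic mixture.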
- the bag of words may be pre-processed.
- the bag of words and the seed topic words may be lemmatized to reduce the vocabulary as all tokens become the lemma.
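A minimal sketch of the document-frequency filtering described for the pre-processing step follows. It assumes the tokens have already been lemmatized; the function name `filter_vocabulary` and both threshold values are assumptions chosen for the example.

```python
def filter_vocabulary(docs, min_docs=2, max_doc_fraction=0.9):
    """Keep tokens found in at least min_docs documents and in at most
    max_doc_fraction of all documents (both thresholds are assumptions)."""
    doc_freq = {}
    for tokens in docs:
        for token in set(tokens):  # count each token once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    limit = max_doc_fraction * len(docs)
    return sorted(t for t, n in doc_freq.items() if min_docs <= n <= limit)


# Tokens are assumed to be lemmatized already.
docs = [
    ["card", "bonus", "the"],
    ["card", "loan", "the"],
    ["bonus", "point", "the"],
    ["loan", "the", "fee"],
]
vocab = filter_vocabulary(docs, min_docs=2, max_doc_fraction=0.75)
```

Here "the" is dropped because it appears in every document, and "point" and "fee" are dropped because each appears in only one.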
- In step 245, once training is complete (e.g., the training epochs are complete), the computer program may, in step 285, generate N+K new topics using the topic: word distribution β, where K is the number of unseeded topics.
- the final β provides the distribution of words in both seeded and non-seeded topic groups based on the user’s configuration. This allows the computer program to identify salient keywords for each topic group analyzed.
- the final θ provides information to signal which topic group each document may belong to.
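Reading keywords and topic assignments off the learnt distributions can be sketched as follows. The names `top_words` and `most_likely_topic`, the vocabulary, and the weights are hypothetical values standing in for a trained model's output.

```python
def top_words(beta, vocab, k=2):
    """Return the k highest-weighted vocabulary words for each topic row."""
    result = []
    for row in beta:
        ranked = sorted(range(len(vocab)), key=lambda i: row[i], reverse=True)
        result.append([vocab[i] for i in ranked[:k]])
    return result


def most_likely_topic(theta):
    """Index of the topic group with the largest weight for one document."""
    return max(range(len(theta)), key=lambda k: theta[k])


vocab = ["bonus", "card", "loan", "points"]
beta = [[0.9, 0.1, 0.0, 0.8], [0.0, 0.4, 0.7, 0.1]]  # hypothetical learnt weights
keywords = top_words(beta, vocab)      # salient keywords per topic group
topic = most_likely_topic([0.2, 0.8])  # topic group for one document
```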
- the computer program may generate N + K new topics using the topic: word distribution β by taking the model’s current β distribution weights.
- the computer program may compare the topic: word distribution β with both β (penalize) and β (reward) to generate, in step 260, a penalty loss for divergence and, in step 265, a penalty loss for similarity.
- β (penalize) and β (reward) may be pre-computed based on the seed topic-word distribution provided by the user (β').
- β (reward) has a higher value for BOW tokens that are more similar to the seed, and β (penalize) has a higher value for BOW tokens that are more divergent from the seed.
- This similarity value may be assigned using various methods such as cosine semantic similarity, scaling with inverse document frequency (IDF), etc.
- a higher reward signal may be given when the current β learnt is more similar to β (reward), and a higher penalize signal may be given when the current β learnt is more similar to β (penalize).
- a penalty loss for the first N generated Topic:Words distributions divergent from the seed may be generated, and in step 265, a penalty loss for first N generated Topic:Words distributions similar to seed may be generated.
- the computer program may generate a total loss from the penalty losses. The total loss may promote N topics similar to the seed, and K new topics divergent from the seed.
- the computer program may also reconstruct the original Bag of Words vector representation from the output of the topic modelling system.
- the computer program may generate a reconstruction loss based on the comparison of the original Bag of Words vector representation and the reconstruction of the original Bag of Words vector representation.
- the reconstruction loss may be fed to the topic modelling system.
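How the three loss components might combine is sketched below. The exact form of Equation 1 is not available in this text, so the dot-product similarity, the sign convention (penalty added, reward subtracted), and every name in the block are assumptions rather than the disclosed loss. Only the first N rows of the topic-word weights are compared against the seed-derived rows.

```python
def similarity(row_a, row_b):
    """Dot product, used here as an assumed similarity between two rows."""
    return sum(a * b for a, b in zip(row_a, row_b))


def total_loss(recon_loss, beta, beta_reward, beta_penalize,
               reward_scale=1.0, penalize_scale=1.0):
    """Combine reconstruction, penalty, and reward terms for the seeded topics.

    beta_reward and beta_penalize hold one row per seeded topic, so zip()
    compares only the first N rows of beta; extra rows are not constrained.
    """
    reward = sum(similarity(b, r) for b, r in zip(beta, beta_reward))
    penalty = sum(similarity(b, p) for b, p in zip(beta, beta_penalize))
    return recon_loss + penalize_scale * penalty - reward_scale * reward


beta_reward = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # N = 2 seeded topics
beta_penalize = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
aligned = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]  # N + K = 3 rows
swapped = [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]

loss_aligned = total_loss(5.0, aligned, beta_reward, beta_penalize)
loss_swapped = total_loss(5.0, swapped, beta_reward, beta_penalize)
```

Keeping `reward_scale` and `penalize_scale` as separate arguments reflects the statement that the two factors may be scaled differently, with either one promoted over the other.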
- the following pseudocode outlines how the seeding components, namely β_seed, β_reward, and β_penalize, are initialized.
- sim is a similarity measurement that measures the similarity between bag-of-words token b and the seed words in seed i.
- scale is the scaling method that scales this similarity value.
- bows is the bag-of-words tokens.
- β_reward defines which bag-of-words tokens should be rewarded for having higher values learnt in the current β.
- scale defines a scaling factor to re-weight these rewards.
- β_reward gives higher values to relevant bag-of-words tokens that are more similar to the seed words in the current topic, but less similar to seed words in other topics.
- β_penalize defines which bag-of-words tokens should be penalized for having higher values learnt in the current β, and scale defines a scaling factor to re-weight these penalties.
- β_penalize gives higher values to relevant bag-of-words tokens that are least similar to the seed words in the current topic, but more similar to seed words in other topics.
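The initialization of the reward and penalize components can be sketched in Python under several assumptions. The toy two-dimensional embeddings, the choice of maximum cosine similarity for `sim`, the identity default for `scale`, and the own-minus-other contrast are all illustrative choices, not the disclosed pseudocode.

```python
import math

# Hypothetical 2-d embeddings; a real system might use BERT, Word2Vec, or GloVe.
EMB = {
    "bonus": (1.0, 0.1), "points": (0.9, 0.2), "cash": (0.8, 0.3),
    "loan": (0.1, 1.0), "mortgage": (0.2, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def sim(token, seed_words):
    """Highest cosine similarity between a token and any word of one seed."""
    return max(cosine(EMB[token], EMB[s]) for s in seed_words)


def init_seed_weights(bows, seeds, scale=lambda x: x):
    """Build reward and penalize rows, one per seeded topic."""
    reward, penalize = [], []
    for i, seed in enumerate(seeds):
        others = [s for j, s in enumerate(seeds) if j != i]
        r_row, p_row = [], []
        for b in bows:
            own = sim(b, seed)
            other = max((sim(b, s) for s in others), default=0.0)
            # Reward tokens closer to this seed than to any other seed;
            # penalize tokens closer to another seed than to this one.
            r_row.append(scale(max(own - other, 0.0)))
            p_row.append(scale(max(other - own, 0.0)))
        reward.append(r_row)
        penalize.append(p_row)
    return reward, penalize


bows = ["bonus", "points", "cash", "loan", "mortgage"]
seeds = [["bonus", "points"], ["loan"]]
beta_reward, beta_penalize = init_seed_weights(bows, seeds)
```

In this toy setup "cash" receives a reward weight under the first seed, while "mortgage" receives a penalize weight there and a reward weight under the second seed.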
- FIG. 3 depicts an exemplary computing system for implementing aspects of the present disclosure.
- Figure 3 depicts exemplary computing device 300.
- Computing device 300 may represent the system components described herein.
- Computing device 300 may include processor 305 that may be coupled to memory 310.
- Memory 310 may include volatile memory.
- Processor 305 may execute computer-executable program code stored in memory 310, such as software programs 315.
- Software programs 315 may include one or more of the logical steps disclosed herein as a programmatic instruction, which may be executed by processor 305.
- Memory 310 may also include data repository 320, which may be nonvolatile memory for data persistence.
- Processor 305 and memory 310 may be coupled by bus 330.
- Bus 330 may also be coupled to one or more network interface connectors 340, such as wired network interface 342 or wireless network interface 344.
- Computing device 300 may also have user interface components, such as a screen for displaying graphical user interfaces and receiving input from the user, a mouse, a keyboard and/or other input/output components (not shown).
- Embodiments of the system or portions of the system may be in the form of a “processing machine,” such as a general-purpose computer, for example.
- the term “processing machine” is to be understood to include at least one processor that uses at least one memory.
- the at least one memory stores a set of instructions.
- the instructions may be either permanently or temporarily stored in the memory or memories of the processing machine.
- the processor executes the instructions that are stored in the memory or memories in order to process data.
- the set of instructions may include various instructions that perform a particular task or tasks, such as those tasks described above.
- Such a set of instructions for performing a particular task may be characterized as a program, software program, or simply software.
- the processing machine may be a specialized processor.
- the processing machine may be a cloud-based processing machine, a physical processing machine, or combinations thereof.
- the processing machine executes the instructions that are stored in the memory or memories to process data. This processing of data may be in response to commands by a user or users of the processing machine, in response to previous processing, in response to a request by another processing machine and/or any other input, for example.
- the processing machine used to implement embodiments may be a general-purpose computer.
- the processing machine described above may also utilize any of a wide variety of other technologies including a special purpose computer, a computer system including, for example, a microcomputer, mini-computer or mainframe, a programmed microprocessor, a micro-controller, a peripheral integrated circuit element, a CSIC (Customer Specific Integrated Circuit) or ASIC (Application Specific Integrated Circuit) or other integrated circuit, a logic circuit, a digital signal processor, a programmable logic device such as an FPGA (Field-Programmable Gate Array), PLD (Programmable Logic Device), PLA (Programmable Logic Array), or PAL (Programmable Array Logic), or any other device or arrangement of devices that is capable of implementing the steps of the processes disclosed herein.
- the processing machine used to implement embodiments may utilize a suitable operating system.
- it is not necessary that the processors and/or the memories of the processing machine be physically located in the same geographical place. That is, each of the processors and the memories used by the processing machine may be located in geographically distinct locations and connected so as to communicate in any suitable manner. Additionally, it is appreciated that each of the processor and/or the memory may be composed of different physical pieces of equipment. Accordingly, it is not necessary that the processor be one single piece of equipment in one location and that the memory be another single piece of equipment in another location.
- the processor may be two pieces of equipment in two different physical locations.
- the two distinct pieces of equipment may be connected in any suitable manner.
- the memory may include two or more portions of memory in two or more physical locations.
- processing, as described above, is performed by various components and various memories. However, it is appreciated that the processing performed by two distinct components as described above, in accordance with a further embodiment, may be performed by a single component. Further, the processing performed by one distinct component as described above may be performed by two distinct components.
- In a similar manner, the memory storage performed by two distinct memory portions as described above, in accordance with a further embodiment, may be performed by a single memory portion.
- the memory storage performed by one distinct memory portion as described above may be performed by two memory portions.
- various technologies may be used to provide communication between the various processors and/or memories, as well as to allow the processors and/or the memories to communicate with any other entity; i.e., so as to obtain further instructions or to access and use remote memory stores, for example.
- Such technologies used to provide such communication might include a network, the Internet, Intranet, Extranet, a LAN, an Ethernet, wireless communication via cell tower or satellite, or any client server system that provides communication, for example.
- Such communications technologies may use any suitable protocol such as TCP/IP, UDP, or OSI, for example.
- a set of instructions may be used in the processing of embodiments.
- the set of instructions may be in the form of a program or software.
- the software may be in the form of system software or application software, for example.
- the software might also be in the form of a collection of separate programs, a program module within a larger program, or a portion of a program module, for example.
- the software used might also include modular programming in the form of object-oriented programming.
- the software tells the processing machine what to do with the data being processed.
- the instructions or set of instructions used in the implementation and operation of embodiments may be in a suitable form such that the processing machine may read the instructions.
- the instructions that form a program may be in the form of a suitable programming language, which is converted to machine language or object code to allow the processor or processors to read the instructions. That is, written lines of programming code or source code, in a particular programming language, are converted to machine language using a compiler, assembler or interpreter.
- the machine language is binary coded machine instructions that are specific to a particular type of processing machine, i.e., to a particular type of computer, for example. The computer understands the machine language.
- Any suitable programming language may be used in accordance with the various embodiments.
- the instructions and/or data used in the practice of embodiments may utilize any compression or encryption
- the embodiments may illustratively be embodied in the form of a processing machine, including a computer or computer system, for example, that includes at least one memory.
- the set of instructions, i.e., the software, for example, that enables the computer operating system to perform the operations described above may be contained on any of a wide variety of media or medium, as desired.
- the data that is processed by the set of instructions might also be contained on any of a wide variety of media or medium.
- the particular medium, i.e., the memory in the processing machine, utilized to hold the set of instructions and/or the data used in embodiments may take on any of a variety of physical forms or transmissions, for example.
- the medium may be in the form of a compact disc, a DVD, an integrated circuit, a hard disk, a floppy disk, an optical disc, a magnetic tape, a RAM, a ROM, a PROM, an EPROM, a wire, a cable, a fiber, a communications channel, a satellite transmission, a memory card, a SIM card, or other remote transmission, as well as any other medium or source of data that may be read by the processors.
- the memory or memories used in the processing machine that implements embodiments may be in any of a wide variety of forms to allow the memory to hold instructions, data, or other information, as is desired.
- the memory might be in the form of a database to hold data.
- the database might use any desired arrangement of files such as a flat file arrangement or a relational database arrangement, for example.
- a variety of “user interfaces” may be utilized to allow a user to interface with the processing machine or machines that are used to implement embodiments.
- a user interface includes any hardware, software, or combination of hardware and software used by the processing machine that allows a user to interact with the processing machine.
- a user interface may be in the form of a dialogue screen for example.
- a user interface may also include any of a mouse, touch screen, keyboard, keypad, voice reader, voice recognizer, dialogue screen, menu box, list, checkbox, toggle switch, a pushbutton or any other device that allows a user to receive information regarding the operation of the processing machine as it processes a set of instructions and/or provides the processing machine with information.
- the user interface is any device that provides communication between a user and a processing machine.
- the information provided by the user to the processing machine through the user interface may be in the form of a command, a selection of data, or some other input, for example.
- a user interface is utilized by the processing machine that performs a set of instructions such that the processing machine processes data for a user.
- the user interface is typically used by the processing machine for interacting with a user either to convey information or receive information from the user.
- the other processing machine might be characterized as a user.
- it is contemplated that a user interface utilized in the system and method may interact partially with another processing machine or processing machines, while also interacting partially with a human user.
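The flat file and relational database arrangements mentioned above for holding data can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the record contents, the table name `documents`, and the column names `doc_id` and `label` are hypothetical and do not come from the application, and Python's standard `csv` and `sqlite3` modules stand in for whatever storage an embodiment might actually use.

```python
import csv
import io
import sqlite3

# Hypothetical records; the application does not specify any schema.
records = [("doc-1", "finance"), ("doc-2", "sports")]

# Flat file arrangement: one record per line of comma-separated text.
# An in-memory text buffer stands in for a file on disk.
flat_file = io.StringIO()
csv.writer(flat_file).writerows(records)
flat_file.seek(0)
flat_rows = [tuple(row) for row in csv.reader(flat_file)]

# Relational arrangement: the same records in a table with named columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT PRIMARY KEY, label TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)", records)
relational_rows = conn.execute(
    "SELECT doc_id, label FROM documents ORDER BY doc_id"
).fetchall()
conn.close()
```

Both arrangements return the same records; the relational form adds a declared schema and query support, while the flat file keeps only the raw rows.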
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a method that may include: receiving a seed topic-word distribution; receiving a corpus of documents; generating a set of word representations for the corpus of documents; converting the corpus of documents into vector representations; training a topic modeling system using the seed topic-word distribution and a concatenated set of word representations and the vector representations, resulting in a topic-word distribution and a document-word distribution; generating a plurality of new generated topics based on the topic-word distribution; precomputing a topic-word distribution penalty and a topic-word distribution reward for the plurality of topics; penalizing the topic modeling system in response to a divergence and rewarding the topic modeling system in response to a similarity; determining a total loss from a neural network loss, the topic-word distribution penalty, and the topic-word distribution reward; and training the topic modeling system based on the total loss.
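The last steps of the abstract (penalty for divergence, reward for similarity, and a total loss built from three terms) can be sketched as follows. This is a minimal illustration, not the method claimed in the application: the abstract does not state which divergence or similarity measure is used or how the terms are weighted, so the KL divergence, the cosine similarity, the additive form of the total loss, the names `penalty_weight` and `reward_weight`, and all toy numbers below are assumptions.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over a shared vocabulary (assumed divergence measure)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cosine_similarity(p, q):
    """Cosine similarity of two word distributions (assumed similarity measure)."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_p * norm_q)


def total_loss(nn_loss, seed_topics, learned_topics,
               penalty_weight=1.0, reward_weight=1.0):
    """Combine the three terms; the additive form and weights are assumptions."""
    penalty = sum(kl_divergence(s, t) for s, t in zip(seed_topics, learned_topics))
    reward = sum(cosine_similarity(s, t) for s, t in zip(seed_topics, learned_topics))
    return nn_loss + penalty_weight * penalty - reward_weight * reward


# Toy example: one seed topic over a four-word vocabulary (made-up values).
seed = [normalize([5, 3, 1, 1])]
close = [normalize([5, 3, 1, 1])]  # learned topic that matches the seed
far = [normalize([1, 1, 3, 5])]    # learned topic that drifts from the seed

loss_close = total_loss(2.0, seed, close)
loss_far = total_loss(2.0, seed, far)
```

With the same neural network loss, the learned topic that matches the seed yields a lower total loss than the one that diverges from it, which is the behavior the penalty and reward terms are meant to encourage.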
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GR20230100129 | 2023-02-16 | ||
GR20230100129 | 2023-02-16 | ||
US18/442,982 | 2024-02-15 | ||
US18/442,982 US20240281603A1 (en) | 2023-02-16 | 2024-02-15 | Systems and methods for seeded neural topic modeling |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024173841A1 (fr) | 2024-08-22 |
Family
ID=90458029
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2024/016227 WO2024173841A1 (fr) | Systems and methods for seeded neural topic modeling | 2023-02-16 | 2024-02-16 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024173841A1 (fr) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210103608A1 (en) * | 2019-10-08 | 2021-04-08 | International Business Machines Corporation | Rare topic detection using hierarchical clustering |
- 2024-02-16: WO application PCT/US2024/016227 (WO2024173841A1, fr), status unknown
Non-Patent Citations (1)
Title |
---|
YU ZHANG ET AL: "Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 11 January 2023 (2023-01-11), XP091411878 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 24713831; Country of ref document: EP; Kind code of ref document: A1 |