
WO2021118697A1 - Process to learn new image classes without labels - Google Patents

Process to learn new image classes without labels

Info

Publication number
WO2021118697A1
Authority
WO
WIPO (PCT)
Prior art keywords
source
target data
image components
generating
models
Prior art date
Application number
PCT/US2020/057404
Other languages
French (fr)
Inventor
Heiko Hoffmann
Soheil KOLOURI
Original Assignee
Hrl Laboratories, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hrl Laboratories, Llc
Priority to CN202080075144.7A (patent CN114600130A)
Priority to EP20808575.3A (patent EP4073704A1)
Publication of WO2021118697A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/0088 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0231 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0246 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2137 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on criteria of topology preservation, e.g. multidimensional scaling or self-organising maps
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present invention relates to a system for learning new image classes from new sensory input and, more particularly, to a system for learning new image classes from new sensory input without labels.
  • (2) Description of Related Art. Existing methods for learning from few labeled samples use a deep neural network that is pre-trained on a different, but similar, dataset with many labeled samples. The existing methods then re-tune the final layer or layers of the network to classify the new target dataset. This approach has two weaknesses.
  • the system comprises one or more processors and a non-transitory computer-readable medium having executable instructions encoded thereon such that when executed, the one or more processors perform multiple operations.
  • the system performs pseudo-task optimization to identify an optimal pseudo-task for each source model of one or more source models.
  • An initial target network is trained with self-supervised learning using the optimal pseudo-task.
  • a plurality of source image components is extracted from the one or more source models.
  • An attribute dictionary of abstract attributes is generated from the plurality of source image components.
  • a set of unlabeled target data is aligned with the one or more source models that are similar to the set of unlabeled target data.
  • the set of unlabeled target data are mapped onto a plurality of abstract attributes in the attribute dictionary.
  • a new target network is generated from the mapping. Using the new target network, an object label is assigned to an object in the unlabeled target data. The autonomous platform is controlled based on the assigned object label.
  • the set of unlabeled target data is an input image
  • mapping the unlabeled target data onto abstract attributes further comprises dissecting the input image into a plurality of target image components; comparing the plurality of target image components with the plurality of source image components; assigning the object label to the object based on the comparison; generating an executable control script appropriate for the object label; and causing the autonomous platform to execute the control script and perform an action corresponding to the control script.
  • a source similarity graph is used to select the one or more source models that are similar to the set of unlabeled target data. Performing pseudo-task optimization further comprises computing a similarity measure between the one or more source models; generating the source similarity graph based on the similarity measure; and using the source similarity graph, identifying one or more source models in the plurality of source models that are similar to the set of unlabeled target data.
  • extracting the plurality of source image components and generating the attribute dictionary further comprises generating the plurality of source image components for each source model using unsupervised data decomposition; mapping the plurality of source image components and their corresponding labels onto the plurality of abstract attributes, resulting in clusters of abstract attributes; and generating the attribute dictionary from the clusters of abstract attributes.
  • the autonomous platform is a vehicle, and the system causes the vehicle to perform a driving operation in accordance with the assigned object label.
  • a source similarity graph between two or more source models is generated, pseudo-tasks for each source model are learned, a plurality of source image components is extracted from each source model, an attribute dictionary of abstract attributes is generated from the plurality of source image components, a set of target data from a new target domain is mapped onto the attribute dictionary, and a new target network is generated from the mapping.
  • data from the new target network is collected, and object labels are propagated in a latent feature space, resulting in an improved dictionary of abstract attributes and a refined target network.
  • the present invention also includes a computer program product and a computer implemented method.
  • the computer program product includes computer-readable instructions stored on a non-transitory computer-readable medium that are executable by a computer having one or more processors, such that upon execution of the instructions, the one or more processors perform the operations listed herein.
  • the computer implemented method includes an act of causing a computer to execute such instructions and perform the resulting operations.
  • FIG.1 is a block diagram depicting the components of a system for learning new image classes without labels according to some embodiments of the present disclosure
  • FIG.2 is an illustration of a computer program product according to some embodiments of the present disclosure
  • FIG.3 is an illustration of key components of the system for learning new image classes without labels according to some embodiments of the present disclosure
  • FIG.4 is a flow chart illustrating a process to leverage old data for learning from new data according to some embodiments of the present disclosure
  • FIG.5A is an illustration of Steps 1 through 3 of a process that enables learning with a fraction of labels per class in a new target domain according to some embodiments of the present disclosure
  • FIG.5B is an illustration of Steps 4 through
  • the following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of aspects. Thus, the present invention is not intended to be limited to the aspects presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
  • Various embodiments of the invention include three “principal” aspects.
  • the first is a system for learning new image classes without labels.
  • the system is typically in the form of a computer system operating software or in the form of a “hard-coded” instruction set. This system may be incorporated into a wide variety of devices that provide different functionalities.
  • the second principal aspect is a method, typically in the form of software, operated using a data processing system (computer).
  • the third principal aspect is a computer program product.
  • the computer program product generally represents computer-readable instructions stored on a non-transitory computer-readable medium such as an optical storage device, e.g., a compact disc (CD) or digital versatile disc (DVD), or a magnetic storage device such as a floppy disk or magnetic tape.
  • Other, non-limiting examples of computer-readable media include hard disks, read-only memory (ROM), and flash-type memories. These aspects will be described in more detail below.
  • A block diagram depicting an example of a system (i.e., computer system 100) of the present invention is provided in FIG.1.
  • the computer system 100 is configured to perform calculations, processes, operations, and/or functions associated with a program or algorithm.
  • certain processes and steps discussed herein are realized as a series of instructions (e.g., software program) that reside within computer readable memory units and are executed by one or more processors of the computer system 100. When executed, the instructions cause the computer system 100 to perform specific actions and exhibit specific behavior, such as described herein.
  • the computer system 100 may include an address/data bus 102 that is configured to communicate information. Additionally, one or more data processing units, such as a processor 104 (or processors), are coupled with the address/data bus 102.
  • the processor 104 is configured to process information and instructions. In an aspect, the processor 104 is a microprocessor.
  • the processor 104 may be a different type of processor such as a parallel processor, application-specific integrated circuit (ASIC), programmable logic array (PLA), complex programmable logic device (CPLD), or a field programmable gate array (FPGA).
  • the computer system 100 is configured to utilize one or more data storage units.
  • the computer system 100 may include a volatile memory unit 106 (e.g., random access memory (“RAM”), static RAM, dynamic RAM, etc.) coupled with the address/data bus 102, wherein a volatile memory unit 106 is configured to store information and instructions for the processor 104.
  • the computer system 100 further may include a non-volatile memory unit 108 (e.g., read-only memory (“ROM”), programmable ROM (“PROM”), erasable programmable ROM (“EPROM”), electrically erasable programmable ROM (“EEPROM”), flash memory, etc.) coupled with the address/data bus 102, wherein the non-volatile memory unit 108 is configured to store static information and instructions for the processor 104.
  • the computer system 100 may execute instructions retrieved from an online data storage unit such as in “Cloud” computing.
  • the computer system 100 also may include one or more interfaces, such as an interface 110, coupled with the address/data bus 102.
  • the one or more interfaces are configured to enable the computer system 100 to interface with other electronic devices and computer systems.
  • the communication interfaces implemented by the one or more interfaces may include wireline (e.g., serial cables, modems, network adaptors, etc.) and/or wireless (e.g., wireless modems, wireless network adaptors, etc.) communication technology.
  • the computer system 100 may include an input device 112 coupled with the address/data bus 102, wherein the input device 112 is configured to communicate information and command selections to the processor 104.
  • the input device 112 is an alphanumeric input device, such as a keyboard, that may include alphanumeric and/or function keys.
  • the input device 112 may be an input device other than an alphanumeric input device.
  • the computer system 100 may include a cursor control device 114 coupled with the address/data bus 102, wherein the cursor control device 114 is configured to communicate user input information and/or command selections to the processor 104.
  • the cursor control device 114 is implemented using a device such as a mouse, a track-ball, a track-pad, an optical tracking device, or a touch screen. The foregoing notwithstanding, in an aspect, the cursor control device 114 is directed and/or activated via input from the input device 112, such as in response to the use of special keys and key sequence commands associated with the input device 112.
  • the cursor control device 114 is configured to be directed or guided by voice commands.
  • the computer system 100 further may include one or more optional computer usable data storage devices, such as a storage device 116, coupled with the address/data bus 102.
  • the storage device 116 is configured to store information and/or computer executable instructions.
  • the storage device 116 is a storage device such as a magnetic or optical disk drive (e.g., hard disk drive (“HDD”), floppy diskette, compact disk read only memory (“CD-ROM”), digital versatile disk (“DVD”)).
  • a display device 118 is coupled with the address/data bus 102, wherein the display device 118 is configured to display video and/or graphics.
  • the display device 118 may include a cathode ray tube ("CRT"), liquid crystal display (“LCD”), field emission display (“FED”), plasma display, or any other display device suitable for displaying video and/or graphic images and alphanumeric characters recognizable to a user.
  • the computer system 100 presented herein is an example computing environment in accordance with an aspect.
  • the non-limiting example of the computer system 100 is not strictly limited to being a computer system.
  • an aspect provides that the computer system 100 represents a type of data processing analysis that may be used in accordance with various aspects described herein.
  • other computing systems may also be implemented. Indeed, the spirit and scope of the present technology is not limited to any single data processing environment.
  • one or more operations of various aspects of the present technology are controlled or implemented using computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components and/or data structures that are configured to perform particular tasks or implement particular abstract data types.
  • an aspect provides that one or more aspects of the present technology are implemented by utilizing one or more distributed computing environments, such as where tasks are performed by remote processing devices that are linked through a communications network, or such as where various program modules are located in both local and remote computer-storage media including memory-storage devices.
  • An illustrative diagram of a computer program product embodying the present invention is depicted in FIG.2.
  • the computer program product is depicted as floppy disk 200 or an optical disk 202 such as a CD or DVD.
  • the computer program product generally represents computer-readable instructions stored on any compatible non-transitory computer-readable medium.
  • the term “instructions” as used with respect to this invention generally indicates a set of operations to be performed on a computer, and may represent pieces of a whole program or individual, separable, software modules.
  • Non-limiting examples of “instruction” include computer program code (source or object code) and “hard-coded” electronics (i.e. computer operations coded into a computer chip).
  • the “instruction” is stored on any non-transitory computer-readable medium, such as in the memory of a computer or on a floppy disk, a CD-ROM, or a flash drive. In either event, the instructions are encoded on a non-transitory computer-readable medium.
  • (3) Specific Details of Various Embodiments. Described herein is a process that enables autonomous platforms (e.g., robots, vehicles) in the field to quickly learn object classes without any training labels or with only a small set of training labels.
  • the invented process learns a new machine learning model based on novel (i.e., previously unseen) sensory input, where the input may be from an optical camera.
  • This learning process leverages a dataset of prior trained models by finding structure in these models that enables learning, without labels, an object class within new sensory data.
  • the system described herein automatically finds object components in the prior trained models as well as in the novel sensory data to automatically match the distribution of components between prior models and novel data. By matching the component distributions, even if they come from different sensors, new data can be identified by mapping component compositions to objects in a database.
  • Non-limiting examples of “different sensors” include two (or more) cameras with different lenses, and a regular camera and an infrared camera.
  • the method according to embodiments of the present disclosure is used to bootstrap the learning of new machine learning models.
  • the system optimizes a pseudo, or surrogate, task for learning a new model.
  • a pseudo task is, for example, removing the color from an image and training a network to predict the color of each pixel (this process is also called self-supervised learning).
  • This pseudo task is not the actual task but trains the network in a way that makes it easier to learn the actual task. Finding the right pseudo task has been done manually in the prior art. In this invention, the pseudo task is found automatically, as will be described in detail below.
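  • As a minimal sketch of the colorization pseudo task mentioned above, the following builds one self-supervised training pair: a grayscale input and the original color image as the prediction target. The function name `make_colorization_pair` and the use of the ITU-R BT.601 luma weights are assumptions made for illustration; the source does not specify how color is removed.

```python
def make_colorization_pair(rgb_image):
    """Return (grayscale_input, color_target) for one image.

    rgb_image is a list of rows; each row is a list of (r, g, b) tuples
    with values in 0..255. The color image itself serves as the label,
    so no human annotation is needed.
    """
    gray = [
        # BT.601 luma weights (an illustrative choice, not from the source).
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]
    return gray, rgb_image


image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
gray_input, color_target = make_colorization_pair(image)
```

  • A network trained on such pairs would take `gray_input` as input and be penalized for deviating from `color_target`.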
  • FIG.3 shows the key elements of the invention.
  • an autonomous platform (e.g., robot (element 301)) processes the camera input (element 302), which consists of images of a new object.
  • a new object means that the robot (element 301) was not trained on the object for its mission, but the nature of the object is known to civilization, non-limiting examples of which include a traffic cone, a vehicle that is different from one the robot (element 301) has been trained on, an animal, etc.
  • the invented process begins to automatically dissect the image (i.e., camera input (element 302)) into its components (element 306).
  • An example embodiment for such automatic dissection is described in Literature Reference No.3.
  • the image components (element 306) are matched with components from a database (element 308).
  • the identity of the object is revealed (i.e., identify new object based on components (element 310)) and assigned an object label (element 312).
  • the system described herein is used to generate a control script to automate execution of a task (or action) to be performed by an autonomous platform, such as a robot (element 301).
  • the autonomous platform executes the control script (element 314) that is appropriate for, or corresponds to, the labeled object (element 312), such as instructions to disassemble the object, and the autonomous platform performs the action (element 316) in accordance with the control script (element 314).
  • the control script is a predefined control script that can be either manually written, learned from human observation, or automatically generated in an optimization process (e.g., using genetic algorithms). Learning from human observation is described in detail in Literature Reference No.9.
  • the robot (element 301) trains a new machine learning model with the images of the object (i.e., camera input (element 302)) and the corresponding label (i.e., object label (element 312)).
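  • The step from an assigned object label to an executed control script can be sketched as a simple lookup followed by a dispatch loop. The table `CONTROL_SCRIPTS`, the label names, the command names, and the fallback script are all hypothetical; the source only states that a predefined script corresponding to the label is executed.

```python
# Hypothetical label-to-script table; labels and commands are illustrative.
CONTROL_SCRIPTS = {
    "traffic_cone": ["slow_down", "steer_around"],
    "vehicle": ["keep_distance"],
}
# Assumed fallback for labels without a predefined script.
DEFAULT_SCRIPT = ["stop", "request_operator"]


def select_control_script(object_label):
    """Return the predefined script for a label, or the fallback."""
    return CONTROL_SCRIPTS.get(object_label, DEFAULT_SCRIPT)


def execute(script, actuator):
    """Send each command of the script to the platform's actuator callback."""
    for command in script:
        actuator(command)


log = []  # stands in for the platform; records the commands it receives
execute(select_control_script("traffic_cone"), log.append)
```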
  • FIG.4 provides more detail on identifying a new object based on its components (FIG.3, element 310), which are matched with components in the database of object components (FIG.3, element 308).
  • Both the components from the database (i.e., source or old data) as well as the components from the new data are mapped onto a feature space (element 400).
  • the data points are clustered, revealing clusters of components (element 402).
  • the clustering is carried out based solely on the components from the database.
  • the components in feature space are aligned with the feature space representation of the components from the database (i.e., old data) (element 404).
  • the components from the new data can be mapped onto corresponding attributes in the database to identify new data (element 406). Since each object in the database has a known composition of attributes, the object identity or label can be retrieved based on the attributes.
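  • The flow of FIG.4 can be sketched under strong simplifying assumptions: components are 2-D feature vectors, the attribute clusters are given by their centroids (assumed to come from a prior clustering of the database components), and alignment is a plain mean shift of the new data onto the old data. None of these choices is taken from the source; they stand in for the feature mapping, clustering, and alignment steps. All names and values are hypothetical.

```python
import math


def mean(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dims = range(len(points[0]))
    return tuple(sum(p[i] for p in points) / len(points) for i in dims)


def nearest_attribute(point, centroids):
    """Name of the attribute cluster whose centroid is closest to the point."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))


def identify(images, source_points, centroids, compositions):
    """Label each image by aligning its components and matching attribute sets."""
    all_points = [p for comps in images.values() for p in comps]
    t_mean, s_mean = mean(all_points), mean(source_points)
    # Alignment stand-in: shift new data so its mean matches the old data.
    shift = tuple(s - t for s, t in zip(s_mean, t_mean))
    labels = {}
    for name, comps in images.items():
        attrs = frozenset(
            nearest_attribute(tuple(x + d for x, d in zip(p, shift)), centroids)
            for p in comps
        )
        # Retrieve the object whose known composition overlaps most (Jaccard).
        labels[name] = max(
            compositions,
            key=lambda lab: len(attrs & compositions[lab])
            / len(attrs | compositions[lab]),
        )
    return labels


centroids = {"wheel": (0.0, 0.0), "window": (4.0, 0.0), "wing": (0.0, 4.0)}
source_points = list(centroids.values())
compositions = {"car": {"wheel", "window"}, "bird": {"wing"}}
# New data from a different sensor, offset by roughly (10, 10).
images = {"img1": [(10.2, 9.9), (13.9, 10.1)], "img2": [(9.9, 14.0)]}
labels = identify(images, source_points, centroids, compositions)
```

  • Mean matching only works here because the new data contains the same mix of components as the old data; a real alignment method would need to be more robust.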
  • Steps 1 to 4 correspond to Steps 1 to 4 of the six-step outline below, which is illustrated in FIGs.5A and 5B.
  • FIG.5A depicts Steps 1 to 3 of the process described herein.
  • FIG.5B depicts Steps 4 to 6 of the process.
  • Step 5 includes the alignment (element 404), and Step 6 is optional to improve the new machine learning model with a few auto-selected labels.
  • the new six-step process consists of 1) creating a similarity graph between source models (element 500); 2) learning optimal pseudo tasks for each source model (element 502); 3) extracting components from source models (element 504); 4) building a dictionary of abstract attributes from the components (element 506); 5) aligning target data with model data to map the target data onto abstract attributes in the dictionary (element 508); and 6) active learning for data for which there is not a good match with existing attributes (element 510).
  • the six steps are first outlined and then the entire system is described in detail below.
  • Step 1 Source-Similarity-Graph (SSG) Creation (element 500)
  • the SSG creation step leverages a multitude of pre-existing data sources from a database of models with training data (element 512), which include machine learning datasets and associated models.
  • the system measures similarity between models (element 514) and, based on a similarity measure computed between the source models, the SSG (element 516) is created that links the models based on their nearest neighbors.
  • the benefit of the SSG (element 516) is to quickly find a small set of models that are closest to the target data.
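  • A minimal sketch of Step 1, assuming the pairwise dissimilarities between source models have already been computed: each model is linked to its k nearest neighbors, and the sources closest to the target data are then looked up. The function names, model names, and numeric values are hypothetical and purely illustrative.

```python
def build_ssg(names, dist, k=1):
    """Link every source model to its k nearest neighbors.

    dist[a][b] is the dissimilarity between models a and b.
    """
    graph = {}
    for a in names:
        others = sorted((b for b in names if b != a), key=lambda b: dist[a][b])
        graph[a] = others[:k]
    return graph


def nearest_sources(target_dist, k=2):
    """Return the k source models closest to the target data."""
    return sorted(target_dist, key=target_dist.get)[:k]


names = ["source_a", "source_b", "source_c", "source_d"]
# Illustrative dissimilarities, not measured values.
pairs = {
    ("source_a", "source_b"): 0.2, ("source_a", "source_c"): 0.9,
    ("source_a", "source_d"): 0.8, ("source_b", "source_c"): 0.7,
    ("source_b", "source_d"): 0.6, ("source_c", "source_d"): 0.1,
}
dist = {n: {} for n in names}
for (a, b), d in pairs.items():
    dist[a][b] = d
    dist[b][a] = d  # dissimilarity is symmetric

ssg = build_ssg(names, dist, k=1)
closest = nearest_sources(
    {"source_a": 0.3, "source_b": 0.25, "source_c": 0.9, "source_d": 0.8}, k=2
)
```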
  • Step 2 Pseudo-Tasks Optimization (element 502)
  • Pseudo-task optimization for self-supervised learning (SSL) is performed on all source models (element 518) to find the best pseudo task (element 520) for each source model.
  • the optimization criterion is to create networks that allow a discriminative component extraction comparable to what is obtained from a network trained in a supervised way, since source labels are known.
  • a network is trained for target data with SSL using pseudo tasks from the closest source models (element 524).
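  • The selection logic of Step 2 can be sketched as follows, assuming each candidate pseudo task has already been scored per source model by how well the resulting network supports component extraction (higher is better). How the score is computed, and how the target pseudo task is derived from the closest sources, are not specified here; the majority vote below is an assumption, as are all names and scores.

```python
from collections import Counter


def optimal_pseudo_tasks(scores):
    """Pick the best-scoring pseudo task for every source model.

    scores[model][task] is an assumed quality score (higher is better).
    """
    return {model: max(tasks, key=tasks.get) for model, tasks in scores.items()}


def target_pseudo_task(best_per_source, nearest_sources):
    """Assumed rule: use the task most common among the closest sources."""
    votes = Counter(best_per_source[s] for s in nearest_sources)
    return votes.most_common(1)[0][0]


# Illustrative scores, not measured values.
scores = {
    "source_a": {"colorization": 0.8, "rotation": 0.6},
    "source_b": {"colorization": 0.7, "rotation": 0.5},
    "source_c": {"colorization": 0.4, "rotation": 0.9},
}
best = optimal_pseudo_tasks(scores)
chosen = target_pseudo_task(best, ["source_a", "source_b", "source_c"])
```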
  • Step 3 Unsupervised Data Decomposition (element 504)
  • the process according to embodiments of the present disclosure automatically extracts salient components from input data (e.g., extracting the cropped image of a car tire) using unsupervised data decomposition (element 528), as described in detail in Literature Reference No.3.
  • the end result is a set of components for each source model (element 530) in input space for each labeled source input.
  • Step 4 Attribute-Dictionary Creation (element 506)
  • components of all source models and corresponding labels (element 532) are mapped across input data and sources onto abstract attributes (element 534).
  • This mapping can be carried out through unsupervised clustering of the abstract attributes (element 536), resulting in semantically meaningful clusters, such as a cluster of bike wheels (described in Literature Reference No.3).
  • a dictionary of sets of abstract attributes is built (element 538), resulting in an attribute dictionary (element 540).
  • the corresponding labels from the sources are obtained.
  • a mapping from the attributes onto labels can be learned.
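  • One simple way to picture the attribute dictionary of Step 4 is a table from sets of abstract attributes to the source labels observed with them. The sketch below assumes each labeled source input has already been reduced to a list of attribute names by the clustering step; the function names and sample data are hypothetical, and a learned mapping could replace the exact-match lookup.

```python
from collections import Counter, defaultdict


def build_attribute_dictionary(samples):
    """Map each attribute set to a count of the labels seen with it.

    samples is an iterable of (label, attribute_names) pairs.
    """
    dictionary = defaultdict(Counter)
    for label, attrs in samples:
        dictionary[frozenset(attrs)][label] += 1
    return dictionary


def predict_label(dictionary, attrs):
    """Return the most frequent label for an attribute set, or None."""
    key = frozenset(attrs)
    if key not in dictionary:
        return None
    return dictionary[key].most_common(1)[0][0]


samples = [
    ("bicycle", ["wheel", "handlebar"]),
    ("bicycle", ["handlebar", "wheel"]),
    ("car", ["wheel", "window"]),
]
attribute_dictionary = build_attribute_dictionary(samples)
car_label = predict_label(attribute_dictionary, ["window", "wheel"])
unknown_label = predict_label(attribute_dictionary, ["wing"])
```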
  • Step 5 Zero-Shot Attribute Distillation (element 508)
  • the unlabeled target data (element 522) needs to be processed so that it maps onto the right set of attributes (i.e., map target data onto abstract attributes (element 542)).
  • the closest source model or models are selected based on the SSG (element 516) and then a zero-shot attribute distillation method (element 544) is used to align the sources with the target.
  • Distilled embedding refers to the set of representations (i.e., deep neural activations) of the unlabeled target data (element 602) that are already aligned with one or more source models.
  • Update embedding refers to updating the target representations (via updating the target model) using the information provided by the oracle (element 616) through active learning (element 510) (through inquiry).
  • Step 6 Active Learning (element 510)
  • a good attribute representation may not be found, where goodness is, for instance, measured by having high values in the attribute probability distribution or high confidence in these values.
  • an active learning method (element 548) is used.
  • a trained model from the six-step process is used as a starting point, which becomes the new source model, which is input to the feature space (element 400) and part of the different source datasets (element 604).
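  • The selection of data for active learning in Step 6 can be sketched with the first goodness measure named above: a sample is sent to the oracle when the highest value in its attribute probability distribution is low. The threshold, the labeling budget, and all names are assumptions for illustration.

```python
def select_for_labeling(attribute_probs, threshold=0.6, budget=2):
    """Return ids of the least confident samples, up to the budget.

    attribute_probs maps a sample id to its attribute probability
    distribution. A sample qualifies when its highest probability is
    below the threshold (an assumed goodness criterion).
    """
    confidence = {sid: max(probs) for sid, probs in attribute_probs.items()}
    uncertain = [sid for sid, c in confidence.items() if c < threshold]
    return sorted(uncertain, key=confidence.get)[:budget]


# Illustrative distributions, not measured values.
attribute_probs = {
    "a": [0.90, 0.05, 0.05],
    "b": [0.40, 0.35, 0.25],
    "c": [0.50, 0.30, 0.20],
    "d": [0.34, 0.33, 0.33],
}
queries = select_for_labeling(attribute_probs)
```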
  • the invented process is described in more detail.
  • a functional-level representation of the architecture for learning a target model is shown in FIG.6.
  • the output of this process is a new target model that is generated by zero-shot attribute distillation (element 508) and used to identify a new object based on components (element 310).
  • the inputs (element 600) to the system described herein are the target dataset with no labels (element 602) and a set of annotated/labeled source datasets (element 604) that includes datasets both relevant and irrelevant to the target dataset.
  • Meta-learning on the source datasets is used to learn three things: the SSG (element 516) for different datasets, optimal pseudo-tasks (OPT) for source datasets (element 520), and a set of canonical attributes (element 608).
  • the meta-learning module (element 606) utilizes source similarity graph generation (element 500), attribute generation (element 506), and meta learning of pseudo-tasks for SSL (element 502).
  • the target data (element 602) is then placed on the SSG (element 516) in source dataset retrieval (element 610) to retrieve relevant source datasets.
  • the optimal pseudo-tasks (OPT) (element 520) for relevant sources (i.e., K-NN sources (element 612)) are used to define a target pseudo-task and an SSL model is learned from scratch for the target dataset (i.e., SSL of target with optimal pseudo-tasks (element 524)).
  • the relevant sources (i.e., K-NN sources (element 612)) and their corresponding attributes (i.e., canonical attributes (element 608)) are used together with the target data (i.e., initial target model (element 526)) to tune the SSL-trained model and perform zero-shot attribute distillation (element 508).
  • Step 1 Source Similarity Graph (element 516)
  • Transfer learning is a natural choice for learning a target dataset with few (or no) labels (element 602) in the presence of relevant source dataset(s) with abundant labeled data (element 604).
  • Identifying the relevant source dataset(s) (i.e., K-NN sources (element 612)), however, remains a core challenge in transfer learning applications. This difficulty is due to a phenomenon known as ‘negative transfer’, which occurs when knowledge is transferred from irrelevant source domains and in which transfer learning degrades the learning performance instead of improving it.
  • Current transfer learning methods often assume known relevant source domains, which are handpicked by a human expert.
  • Obtaining a fully automated transfer learning system requires a quantitative notion of relevance between different datasets, so that a machine is capable of choosing the relevant source dataset(s) from a large pool of datasets to solve a task for the input target dataset.
  • a unique similarity measure is computed between different datasets in the present invention.
  • FIG. 7 illustrates measuring similarity between datasets $X_1$ (e.g., MNIST (Modified National Institute of Standards and Technology)) (element 700) and $X_2$ (e.g., USPS (U.S. Postal Service) or SVHN (Street View House Numbers)) (element 702), where the similarity measures their distributional distance in a lower-dimensional shared latent space (element 706).
  • SVHN Street View House Numbers
  • a shared encoder (element 708) is used for both datasets (elements 700 and 702), and the latent space (element 706) of the encoder (i.e., the output of the encoder) is required to be generative for both datasets in the sense that both domains can be recovered from the latent space (element 706) via decoders (elements 710 and 712).
  • the process according to embodiments of the present disclosure compares the distributions in the latent space for two different datasets and generates a similarity graph.
  • Let $X_i$ denote the samples from the i’th source dataset.
  • a mapping $\phi$ (e.g., a deep neural encoder) that encodes the datasets into a shared latent space, while requiring this latent space to be generative for the two datasets, as described above.
  • decoders $\psi_i$ and $\psi_j$, which map the latent space back to the data space, are learned alongside the shared encoder $\phi$ such that $\psi_i(\phi(x)) \approx x$ for samples $x$ of $X_i$ and $\psi_j(\phi(x)) \approx x$ for samples $x$ of $X_j$.
  • the dissimilarity between the datasets is defined as a metric between the empirical distributions of the two datasets in the latent space.
  • a possible metric is the sliced-Wasserstein metric, which has theoretical merits against classic information-theoretic dissimilarity metrics, such as KL-divergence (see Literature Reference Nos.4 and 5).
  • K-NN K-Nearest Neighbors
  • the system described herein utilizes an efficient approximation of a K-NN graph, e.g., the one proposed by Dong et al. (see Literature Reference No. 6).
  • a computationally efficient algorithm (e.g., Literature Reference No. 10) is used for proximity search on large graphs. The result is an automated way of obtaining the nearest (i.e., most relevant) source datasets to the target dataset, which leads to seamless transfer learning with a minimized negative transfer effect.
  • FIG.8 depicts the Source Similarity Graph (SSG) (element 516) allowing efficient retrieval of relevant sources (element 612) when presenting a new target dataset (element 800).
  • SSG Source Similarity Graph
  • an SSG is formed, where each node (represented by a circle, e.g., element 802) of the SSG (element 516) is a source dataset, and the length/weight of each edge (represented by a line, e.g., element 804) identifies the similarity between two datasets.
  • an efficient proximity search is used on the SSG (element 516) to retrieve the most relevant source datasets (element 612).
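The retrieval described in Step 1 can be sketched in a few lines. The following is a minimal illustration, not the disclosed implementation: it assumes the latent codes have already been produced by a shared encoder, uses a Monte Carlo estimate of the sliced-Wasserstein distance for equal-size sample sets, and replaces the approximate K-NN graph and proximity search with a brute-force sort. The names `sliced_wasserstein` and `nearest_sources` and the toy Gaussian data are hypothetical.

```python
import numpy as np

def sliced_wasserstein(a, b, n_projections=64, seed=0):
    """Monte Carlo sliced-Wasserstein-2 distance between two sample sets.

    a, b: arrays of shape (n_samples, latent_dim) with the same n_samples.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_projections, a.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit directions
    # Project onto each direction; in 1-D, optimal transport pairs sorted samples.
    proj_a = np.sort(a @ theta.T, axis=0)
    proj_b = np.sort(b @ theta.T, axis=0)
    return float(np.sqrt(np.mean((proj_a - proj_b) ** 2)))

def nearest_sources(target_latent, source_latents, k):
    """Return indices of the k sources closest to the target, plus all distances."""
    dists = np.array([sliced_wasserstein(target_latent, s) for s in source_latents])
    return np.argsort(dists)[:k], dists

# Toy latent codes standing in for encoder outputs (hypothetical data).
rng = np.random.default_rng(1)
sources = [rng.normal(loc=m, size=(200, 8)) for m in (0.0, 0.5, 5.0)]
target = rng.normal(loc=0.1, size=(200, 8))
knn_idx, dists = nearest_sources(target, sources, k=2)
```

With many source datasets, the brute-force loop would be replaced by a graph-based proximity search such as those cited above; the distance function itself would stay the same.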
  • Step 2 Pseudo-Tasks Optimization (element 502)
  • SSL Self-Supervised Learning
  • the pseudo-tasks in current SSL methods are hand-designed such that they uncover relevant features of the data for a target task, while the performance of the network is easily measured for the pseudo-task.
  • Non-limiting examples of common pseudo-tasks are 1) recovering the data from a lower dimensional embedding space, as in auto-encoders; 2) leaving out parts of the input data and requiring the network to reconstruct the missing data; and 3) permuting the information in the input data and requiring the network to recover the un-permuted data.
  • the pseudo tasks are learned.
  • FIG.9 depicts training of the meta pseudo-task generator for SSL, which enables SSL to generate attribute/component distributions similar to those learned by a model trained with full supervision.
  • the pseudo task generator (element 900) is a function that is applied to the input image to discard some of the information in the input image (e.g., cut out part of the image, convert from color to grayscale).
  • the neural encoder (element 902) and neural decoder (element 904) are deep convolutional neural networks with mirrored architectures.
  • the neural encoder (element 902) squeezes the information of the pseudo-input image (i.e., the input image that has gone through the pseudo task generator (element 900)) into a low-dimensional vector space/latent space.
  • the neural decoder (element 904) aims to recover the information of the original input image from this low-dimensional vector representation.
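As a rough illustration of the pseudo task generator described above, the sketch below implements the two information-discarding operations named in the text (cutting out part of the image and converting color to grayscale) and the reconstruction error that an encoder/decoder pair would be trained to minimize. The function names and the random test image are hypothetical, and no encoder or decoder is trained here.

```python
import numpy as np

def to_grayscale(img):
    """Discard color by averaging the channels of an H x W x 3 image."""
    return img.mean(axis=2, keepdims=True)

def cutout(img, top, left, size):
    """Discard a square patch by setting it to zero."""
    out = img.copy()
    out[top:top + size, left:left + size, :] = 0.0
    return out

def reconstruction_error(original, reconstructed):
    """Mean squared error between the original image and a reconstruction."""
    return float(np.mean((original - reconstructed) ** 2))

# Hypothetical 8 x 8 RGB image with values in [0, 1).
rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))
masked = cutout(img, top=2, left=2, size=4)   # pseudo-input for inpainting
gray = to_grayscale(img)                      # pseudo-input for colorization
# Error of a trivial "decoder" that returns the pseudo-input unchanged.
baseline_error = reconstruction_error(img, masked)
```

In the meta-learned setting described above, the choice and parameters of such operations would be optimized rather than hand-picked.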
  • the SSG module (element 516) described above provides the relevance of source datasets with the target dataset. The relevance is used for two things: 1) attribute distillation; and 2) providing a near optimal pseudo-task for the target data that enables one to learn a good initial model for the target dataset from scratch.
  • an input data point e.g., an image
  • a trained neural network model e.g., via SSL, as above.
  • NMF Nonnegative Matrix Factorization
  • CNN pre-trained convolutional neural network
  • As shown in FIG. 10, the system according to embodiments of the present disclosure dissects an input image (element 302) into its image components (element 306), or visual words.
  • the image components (element 306) have a higher chance of being shared among different source datasets.
  • each unsupervised data decomposition module enables the use of the learned/trained models (element 1100) (e.g., an a priori trained model on a related dataset) and dissection of the input images into image components (element 306).
  • Image components (element 306) extracted from all source models and datasets (element 604) are unified to provide groups of components, referred to as attributes, that cover all source datasets (element 604).
  • Another challenge is to unify the extracted image components (element 306) from different source datasets (element 604) into a shared set of components, or attributes.
  • a joint embedding, or attribute embedding, is learned for all extracted image components (element 306) from source datasets (element 604) via a joint encoder (joint model; element 1104).
  • SSL is used to learn such a neural encoder (element 1104) for all image components (element 306). With a joint embedding on the image components (element 306), clustering is performed in the embedding space (e.g., using a Sliced-Wasserstein Gaussian Mixture Model (see Literature Reference No. 7)) to obtain machine-learned attributes for source datasets (i.e., attribute embedding (element 1102)). As described above, the source datasets (element 604) are processed to obtain the attributes for each sample in each dataset.
  • the input is dissected into its components (i.e., into textual or visual words) (element 306), then the image components (element 306) are embedded into the joint embedding via the joint model (element 1104), and the corresponding attributes are obtained.
  • the source datasets will have data, labels, and attributes, which are collected into a dictionary. The mined attributes enable performance of zero-shot learning on the target dataset.
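The clustering step that turns embedded image components into attributes can be sketched as follows. This is only an illustration under stated assumptions: plain k-means with a farthest-point initialization is swapped in for the Sliced-Wasserstein Gaussian Mixture Model named in the text, the component embeddings are synthetic 2-D points rather than outputs of the joint encoder, and all function and variable names are hypothetical.

```python
import numpy as np

def farthest_point_init(x, k):
    """Pick k initial centers: the first point, then repeatedly the farthest point."""
    centers = [x[0]]
    for _ in range(k - 1):
        d = np.min([np.sum((x - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(d)])
    return np.array(centers, dtype=float)

def kmeans(x, k, n_iter=20):
    """Plain Lloyd iterations; returns cluster centers and per-point cluster ids."""
    centers = farthest_point_init(x, k)
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave a center untouched if its cluster is empty
                centers[j] = x[labels == j].mean(axis=0)
    return centers, labels

# Synthetic component embeddings: two well-separated groups (hypothetical data).
rng = np.random.default_rng(0)
components = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                        rng.normal(10.0, 0.5, size=(20, 2))])
centers, attribute_ids = kmeans(components, k=2)

# Attribute dictionary: attribute id -> indices of the components assigned to it.
attribute_dictionary = {int(a): np.flatnonzero(attribute_ids == a).tolist()
                        for a in np.unique(attribute_ids)}
```

Each cluster id plays the role of one machine-learned attribute; a sample's attributes would then be the set of cluster ids of its components.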
  • Step 5 Zero-Shot Attribute Distillation (element 544)
  • images from the source datasets (element 604) and the target dataset (element 602) both go through the shared neural encoder (element 1200), and a latent representation (a vector) for each image from each dataset/domain is obtained.
  • Images from the source datasets (element 604) have ground truth attributes, and the goal is to train the shared neural encoder (element 1200) such that its output is representative of the attributes for the source datasets (element 604) (i.e., one should be able to recover the attributes of images from the source domain from this latent space).
  • the target dataset doesn’t have ground truth attributes, so the output of the shared neural encoder (element 1200) must be made representative of the target domain for images from the target dataset (element 602).
  • a neural decoder (element 1202) that reconstructs the target images from the target dataset using the same latent representation as for the source data, resulting in a reconstructed target dataset (element 1204).
  • the decoder (element 1202) forces the shared neural encoder (element 1200) to maintain the critical information of the target dataset (element 602).
  • the zero-shot attribute distillation module receives the target dataset (element 602), samples from relevant source datasets (K nearest neighbor sources), their corresponding mined attributes (i.e., canonical attributes (element 608)), and the SSL-trained target model (element 526).
  • the attributes are different from the labels and are mined as described above. For instance, while the label for a sample is ‘car’, its attributes could be ‘has wheels’, ‘is metal’, etc.
  • the twist is that since the attributes are mined by a machine there is no human labeling them (i.e., they are abstract, but still encode class-specific characteristics).
  • the challenge is to map new target data onto the same attributes.
  • Zero-shot attribute distillation uses a shared embedding to map the target dataset (element 602) onto mined attributes.
  • the SSL trained model (element 526) for the target dataset (element 602) is jointly tuned on relevant source datasets (element 604), such that the latent space is predictive of the components/attributes for the source datasets, while the latent space should remain a generative space for the target domain.
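The joint tuning in Step 5 combines two requirements: the latent space must predict the attributes of source images, and it must remain generative for target images. A minimal sketch of such a two-term objective is given below. It assumes a single-layer encoder, a linear attribute head, and a linear decoder in place of deep networks, binary attribute vectors, and a weighting factor `recon_weight`; none of these choices comes from the disclosure, and only the loss is evaluated (no training loop).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distillation_loss(params, x_src, attr_src, x_tgt, recon_weight=1.0):
    """Attribute-prediction loss on source data plus reconstruction loss on target data."""
    enc, head, dec = params["enc"], params["head"], params["dec"]
    z_src = np.tanh(x_src @ enc)   # shared encoder applied to source samples
    z_tgt = np.tanh(x_tgt @ enc)   # same encoder applied to target samples
    # Source term: binary cross-entropy between predicted and mined attributes.
    p = np.clip(sigmoid(z_src @ head), 1e-7, 1.0 - 1e-7)
    attr_term = -np.mean(attr_src * np.log(p) + (1.0 - attr_src) * np.log(1.0 - p))
    # Target term: the decoder must recover target samples from the latent code.
    recon_term = np.mean((z_tgt @ dec - x_tgt) ** 2)
    return attr_term + recon_weight * recon_term, attr_term, recon_term

# Hypothetical shapes: 6-D inputs, 4-D latent space, 3 binary attributes.
rng = np.random.default_rng(0)
params = {"enc": rng.normal(scale=0.1, size=(6, 4)),
          "head": rng.normal(scale=0.1, size=(4, 3)),
          "dec": rng.normal(scale=0.1, size=(4, 6))}
x_src = rng.normal(size=(10, 6))
attr_src = rng.integers(0, 2, size=(10, 3)).astype(float)
x_tgt = rng.normal(size=(12, 6))
total, attr_term, recon_term = distillation_loss(params, x_src, attr_src, x_tgt,
                                                 recon_weight=0.5)
```

Minimizing the first term alone would ignore the target domain; the reconstruction term is what keeps the shared latent space informative about the target data.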
  • Step 6 Active Learning of Novel Classes and/or Attributes (element 548)
  • the target samples first go through the zero-shot distillation pipeline (element 544), which predicts target attributes (element 1202).
  • the target model (element 1100) is then used together with the unsupervised data decomposition (element 528) to dissect data into their components (element 306).
  • the dissected components (element 306) are then fed to the joint component model (element 1104), and novel attributes are detected via cluster analysis in the latent space of the joint component model (element 1104). This process filters the samples that are ambiguous and require further clarification.
  • active learning (element 548) is performed on these ambiguous samples such that the ambiguity is resolved based on minimal inquiries to a human user/operator.
  • a non-limiting example is a combination of least confidence, margin sampling, and entropy sampling (see Literature Reference No.8) to select the top most informative/uncertain data points among samples that have the largest uncertainty.
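The three uncertainty criteria named above can be combined into a single score for selecting which samples to send to a human operator. The sketch below assumes the model outputs class probabilities and averages the three criteria with equal weight; the equal weighting, the normalization of entropy, and the function names are assumptions made for illustration.

```python
import numpy as np

def uncertainty_scores(probs):
    """Average of least-confidence, margin, and normalized-entropy scores per sample."""
    sorted_p = np.sort(probs, axis=1)[:, ::-1]                 # descending per row
    least_conf = 1.0 - sorted_p[:, 0]                          # low top probability
    margin = 1.0 - (sorted_p[:, 0] - sorted_p[:, 1])           # small top-2 gap
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # spread of the prediction
    entropy /= np.log(probs.shape[1])                          # scale to [0, 1]
    return (least_conf + margin + entropy) / 3.0

def select_queries(probs, budget):
    """Indices of the `budget` most uncertain samples, most uncertain first."""
    return np.argsort(-uncertainty_scores(probs))[:budget]

# Hypothetical predictions for three samples over three classes.
probs = np.array([[0.98, 0.01, 0.01],
                  [0.34, 0.33, 0.33],
                  [0.60, 0.30, 0.10]])
queries = select_queries(probs, budget=2)
```

Here the near-uniform second sample is ranked first and the confident first sample is never queried, which matches the goal of resolving ambiguity with minimal inquiries.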
  • the invention meta-learns unsupervised learning algorithms by optimizing pseudo tasks for the source datasets. Moreover, it will generate machine-learned attributes from the source datasets that are leveraged to predict classes of the target dataset without requiring labels.
  • the system described herein is capable of inquiring new attributes and learning new classes using its unsupervised component decomposition together with active learning.
  • a shared latent space is used, and the old and new target datasets are embedded into this space.
  • the models are then tuned to achieve three things: 1) solve an optimal pseudo task for the new target data; 2) remain discriminative for the old target dataset; and 3) be domain agnostic.
  • By requiring the latent space to be domain agnostic, the latent distributions of the two datasets are forced to be indistinguishable from one another.
  • the pseudo-task optimization for the target dataset forces the model to carefully preserve information in the new target domain while, simultaneously, being discriminative for the old dataset.
  • zero-shot attribute distillation is implemented, which recovers shared attributes between the source and target datasets.
  • the target data could contain attributes that did not exist in the source datasets, and therefore, there are no corresponding attributes in the attribute dictionary.
  • active learning is leveraged, where the system inquires class labels from a human labeler for samples that would disambiguate the uncertainty associated with the novel attributes. The feedback from the human is then used to further tune the network to achieve an optimal embedding. Finally, the newly identified attributes are added to the set of mined dictionary attributes for solving future tasks.
  • the system and method according to embodiments of the present disclosure can be used in automatic control of an autonomous platform, such as a robot, autonomous self-driving ground vehicle, or unmanned aerial vehicle (UAV).
  • UAV unmanned aerial vehicle
  • devices that can be controlled via the processor 104 include a motor vehicle or a motor vehicle component (electrical, non-electrical, mechanical), such as a brake, a steering mechanism, suspension, or safety device (e.g., airbags, seatbelt tensioners, etc.).
  • the action to be performed can be a driving operation/maneuver (such as steering or another command) in line with driving parameters in accordance with the now labeled object.
  • the system described herein can cause a vehicle maneuver/operation to be performed to avoid a collision with the bicyclist or vehicle (or any other object that should be avoided while driving).
  • the system can cause the autonomous vehicle to apply a functional movement response, which may be the task to be performed, such as a braking operation followed by a steering operation (etc.), to redirect the vehicle away from the object, thereby avoiding a collision.
  • Other appropriate actions may include one or more of a steering operation, a throttle operation to increase speed or to decrease speed, or a decision to maintain course and speed without change.
  • the responses may be appropriate for avoiding a collision, improving travel speed, or improving efficiency.
  • control of other device types is also possible.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Remote Sensing (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Databases & Information Systems (AREA)
  • Automation & Control Theory (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Electromagnetism (AREA)
  • Image Analysis (AREA)

Abstract

Described is a system for learning object labels for control of an autonomous platform. Pseudo-task optimization is performed to identify an optimal pseudo-task for each source model of one or more source models. An initial target network is trained using the optimal pseudo-task. Source image components are extracted from source models, and an attribute dictionary of attributes is generated from the source image components. Using zero-shot attribution distillation, the unlabeled target data is aligned with the source models similar to the unlabeled target data. The unlabeled target data are mapped onto attributes in the attribute dictionary. A new target network is generated from the mapping, and the new target network is used to assign an object label to an object in the unlabeled target data. The autonomous platform is controlled based on the object label.

Description

[0001] PROCESS TO LEARN NEW IMAGE CLASSES WITHOUT LABELS

[0002] CROSS-REFERENCE TO RELATED APPLICATIONS

[0003] This is a Continuation-in-Part Application of U.S. Application No. 16/532,321 filed August 5, 2019, entitled “System and Method for Few-Shot Transfer Learning”, the entirety of which is incorporated herein by reference. U.S. Application No. 16/532,321 is a Non-Provisional Application of U.S. Provisional Patent Application No. 62/752,166, filed October 29, 2018, entitled “System and Method for Few-Shot Transfer Learning”, the entirety of which is incorporated herein by reference.

[0004] This is a Non-Provisional Application of U.S. Provisional Patent Application No. 62/946,277, filed December 10, 2019, entitled, “Process to Learn New Image Classes Without Labels”, the entirety of which is incorporated herein by reference.

[0005] BACKGROUND OF INVENTION

[0006] (1) Field of Invention

[0007] The present invention relates to a system for learning new image classes from new sensory input and, more particularly, to a system for learning new image classes from new sensory input without labels.

[0008] (2) Description of Related Art

[0009] Existing methods for learning from few labeled samples use a deep neural network that is pre-trained on a different, but similar, dataset with many labeled samples. The existing methods then re-tune the final layer or layers of the network to classify the new target dataset. This approach has two weaknesses. First, the approach assumes common features between datasets without enforcing commonality, leading to errors. Second, the approach neglects the abundance of unlabeled data, limiting its performance (label reduction is limited to about 100x before dramatically losing accuracy).
[00010] For learning without labels, state-of-the-art zero-shot learning (ZSL) approaches struggle with two things: 1) defining semantically meaningful attributes, which often come from human annotation or from paired textual domains; and 2) not knowing if the input samples belong to seen or unseen classes of data (i.e., generalized-ZSL versus classic ZSL), leading to performance much lower than for supervised learning (8x higher prediction error than that described in Literature Reference No. 1 in the List of Incorporated Literature References). Self-supervised learning methods have been recently used for transfer learning, leading to few-shot learning accuracies of approximately 90% of fully supervised learning accuracy in the target domain. These methods, however, still require 10-100 labels per class, such as that described in Literature Reference No. 2.

[00011] Thus, a continuing need exists for a method that allows for learning object classes without any training label or with only a small set of training labels.

[00012] SUMMARY OF INVENTION

[00013] The present invention relates to a system for learning new image classes from new sensory input and, more particularly, to a system for learning new image classes from new sensory input without labels. The system comprises one or more processors and a non-transitory computer-readable medium having executable instructions encoded thereon such that when executed, the one or more processors perform multiple operations. The system performs pseudo-task optimization to identify an optimal pseudo-task for each source model of one or more source models. An initial target network is trained with self-supervised learning using the optimal pseudo-task. A plurality of source image components is extracted from the one or more source models. An attribute dictionary of abstract attributes is generated from the plurality of source image components. Using zero-shot attribution distillation, a set of unlabeled target data is aligned with the one or more source models that are similar to the set of unlabeled target data. The set of unlabeled target data are mapped onto a plurality of abstract attributes in the attribute dictionary. A new target network is generated from the mapping. Using the new target network, an object label is assigned to an object in the unlabeled target data. The autonomous platform is controlled based on the assigned object label.

[00014] In another aspect, the set of unlabeled target data is an input image, and mapping the unlabeled target data onto abstract attributes further comprises dissecting the input image into a plurality of target image components; comparing the plurality of target image components with the plurality of source image components; assigning the object label to the object based on the comparison; generating an executable control script appropriate for the object label; and causing the autonomous platform to execute the control script and perform an action corresponding to the control script.

[00015] In another aspect, a source similarity graph is used to select the one or more source models that are similar to the set of unlabeled target data, and performing pseudo-task optimization further comprises computing a similarity measure between the one or more source models; generating the source similarity graph based on the similarity measure; and using the source similarity graph, identifying one or more source models in the plurality of source models that are similar to the set of unlabeled target data.

[00016] In another aspect, extracting the plurality of source image components and generating the attribute dictionary further comprises generating the plurality of source image components for each source model using unsupervised data decomposition; mapping the plurality of source image components and their corresponding labels onto the plurality of abstract attributes, resulting in clusters of abstract attributes; and generating the attribute dictionary from the clusters of abstract attributes.

[00017] In another aspect, the autonomous platform is a vehicle, and the system causes the vehicle to perform a driving operation in accordance with the assigned object label.

[00018] In another aspect, a source similarity graph between two or more source models is generated, pseudo-tasks for each source model are learned, a plurality of source image components is extracted from each source model, an attribute dictionary of abstract attributes is generated from the plurality of source image components, a set of target data from a new target domain is mapped onto the attribute dictionary, and a new target network is generated from the mapping.

[00019] In another aspect, data from the new target network is collected, and object labels are propagated in a latent feature space, resulting in an improved dictionary of abstract attributes and a refined target network.

[00020] Finally, the present invention also includes a computer program product and a computer implemented method. The computer program product includes computer-readable instructions stored on a non-transitory computer-readable medium that are executable by a computer having one or more processors, such that upon execution of the instructions, the one or more processors perform the operations listed herein. Alternatively, the computer implemented method includes an act of causing a computer to execute such instructions and perform the resulting operations.
[00021] BRIEF DESCRIPTION OF THE DRAWINGS

[00022] The objects, features and advantages of the present invention will be apparent from the following detailed descriptions of the various aspects of the invention in conjunction with reference to the following drawings, where:

[00023] FIG. 1 is a block diagram depicting the components of a system for learning new image classes without labels according to some embodiments of the present disclosure;

[00024] FIG. 2 is an illustration of a computer program product according to some embodiments of the present disclosure;

[00025] FIG. 3 is an illustration of key components of the system for learning new image classes without labels according to some embodiments of the present disclosure;

[00026] FIG. 4 is a flow chart illustrating a process to leverage old data for learning from new data according to some embodiments of the present disclosure;

[00027] FIG. 5A is an illustration of Steps 1 through 3 of a process that enables learning with a fraction of labels per class in a new target domain according to some embodiments of the present disclosure;

[00028] FIG. 5B is an illustration of Steps 4 through 6 of a process that enables learning with a fraction of labels per class in a new target domain according to some embodiments of the present disclosure;

[00029] FIG. 6 is an illustration of an architecture that integrates meta-learning, self-supervision, and zero-shot learning to achieve learning with less than one label per class according to some embodiments of the present disclosure;

[00030] FIG. 7 is an illustration of measuring similarity between datasets using a shared encoder according to some embodiments of the present disclosure;

[00031] FIG. 8 is an illustration of using a Source Similarity Graph (SSG) for efficient retrieval of relevant sources, depicted as source dataset retrieval in FIG. 6, when presenting a new target dataset according to some embodiments of the present disclosure;

[00032] FIG. 9 is an illustration of pseudo-task optimization, depicted as meta learning of pseudo-tasks for SSL in FIG. 6, enabling self-supervised learning (SSL) to generate attribute distributions similar to those learned by a model trained with full supervision according to some embodiments of the present disclosure;

[00033] FIG. 10 is an illustration of dissecting an input image into its components according to some embodiments of the present disclosure;

[00034] FIG. 11 is an illustration of using learned models to dissect the input images into components using unsupervised data decomposition, depicted as attribute generation in FIG. 6, according to some embodiments of the present disclosure; and

[00035] FIG. 12 is an illustration of zero-shot attribute distillation using a shared embedding to map target data onto mined attributes according to some embodiments of the present disclosure.

[00036] DETAILED DESCRIPTION

[00037] The present invention relates to a system for learning new image classes without labels and, more particularly, to a system for learning new image classes without labels. The following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of aspects. Thus, the present invention is not intended to be limited to the aspects presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

[00038] In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without necessarily being limited to these specific details.
In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.

[00039] The reader’s attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.

[00040] Furthermore, any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6. In particular, the use of “step of” or “act of” in the claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.

[00041] Before describing the invention in detail, first a list of cited references is provided. Next, a description of the various principal aspects of the present invention is provided. Finally, specific details of various embodiments of the present invention are provided to give an understanding of the specific aspects.

[00042] (1) List of Incorporated Literature References

[00043] The following references are cited and incorporated throughout this application. For clarity and convenience, the references are listed herein as a central resource for the reader. The following references are hereby incorporated by reference as though fully set forth herein.
The references are cited in the application by referring to the corresponding literature reference number, as follows:

1. Xian, Y., Schiele, B. and Akata, Z., 2017. Zero-shot learning-the good, the bad and the ugly. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4582-4591.
2. Noroozi, M., Vinjimoor, A., Favaro, P. and Pirsiavash, H., 2018. Boosting Self-Supervised Learning via Knowledge Transfer. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
3. Kolouri, S., Martin, C.E. and Hoffmann, H., 2017, July. Explaining distributed neural activations via unsupervised learning. CVPR Workshop on Explainable Computer Vision and Job Candidate Screening Competition (Vol. 2).
4. Kolouri, S., Zou, Y. and Rohde, G.K., 2016. Sliced Wasserstein Kernels for Probability Distributions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5258-5267.
5. Kolouri, S., Pope, P.E., Martin, C.E. and Rohde, G.K., 2018. Sliced-Wasserstein Autoencoder: An Embarrassingly Simple Generative Model. arXiv preprint arXiv:1804.01947.
6. Dong, W., Moses, C. and Li, K., 2011, March. Efficient k-Nearest Neighbor Graph Construction for Generic Similarity Measures. Proceedings of the 20th International Conference on World Wide Web, pp. 577-586.
7. Kolouri, S., Rohde, G.K. and Hoffmann, H., Sliced Wasserstein Distance for Learning Gaussian Mixture Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3427-3436.
8. Wang, K., Zhang, D., Li, Y., Zhang, R. and Lin, L., 2017. Cost-Effective Active Learning for Deep Image Classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12), pp. 2591-2600.
9. Pastor, P., Hoffmann, H., Asfour, T., and Schaal, S., 2009. Learning and Generalization of Motor Skills by Learning from Demonstration. IEEE International Conference on Robotics and Automation.
10. Sarkar, P., Moore, A. W., & Prakash, A., 2008. Fast Incremental Proximity Search in Large Graphs. In Proceedings of the 25th International Conference on Machine Learning, pp. 896-903.

[00044] (2) Principal Aspects

[00045] Various embodiments of the invention include three “principal” aspects. The first is a system for learning new image classes without labels. The system is typically in the form of a computer system operating software or in the form of a “hard-coded” instruction set. This system may be incorporated into a wide variety of devices that provide different functionalities. The second principal aspect is a method, typically in the form of software, operated using a data processing system (computer). The third principal aspect is a computer program product. The computer program product generally represents computer-readable instructions stored on a non-transitory computer-readable medium such as an optical storage device, e.g., a compact disc (CD) or digital versatile disc (DVD), or a magnetic storage device such as a floppy disk or magnetic tape. Other, non-limiting examples of computer-readable media include hard disks, read-only memory (ROM), and flash-type memories. These aspects will be described in more detail below.

[00046] A block diagram depicting an example of a system (i.e., computer system 100) of the present invention is provided in FIG. 1. The computer system 100 is configured to perform calculations, processes, operations, and/or functions associated with a program or algorithm. In one aspect, certain processes and steps discussed herein are realized as a series of instructions (e.g., software program) that reside within computer readable memory units and are executed by one or more processors of the computer system 100. When executed, the instructions cause the computer system 100 to perform specific actions and exhibit specific behavior, such as described herein.

[00047] The computer system 100 may include an address/data bus 102 that is configured to communicate information.
Additionally, one or more data processing units, such as a processor 104 (or processors), are coupled with the address/data bus 102. The processor 104 is configured to process information and instructions. In an aspect, the processor 104 is a microprocessor. Alternatively, the processor 104 may be a different type of processor such as a parallel processor, application-specific integrated circuit (ASIC), programmable logic array (PLA), complex programmable logic device (CPLD), or a field programmable gate array (FPGA). [00048] The computer system 100 is configured to utilize one or more data storage units. The computer system 100 may include a volatile memory unit 106 (e.g., random access memory ("RAM"), static RAM, dynamic RAM, etc.) coupled with the address/data bus 102, wherein the volatile memory unit 106 is configured to store information and instructions for the processor 104. The computer system 100 further may include a non-volatile memory unit 108 (e.g., read-only memory ("ROM"), programmable ROM ("PROM"), erasable programmable ROM ("EPROM"), electrically erasable programmable ROM ("EEPROM"), flash memory, etc.) coupled with the address/data bus 102, wherein the non-volatile memory unit 108 is configured to store static information and instructions for the processor 104. Alternatively, the computer system 100 may execute instructions retrieved from an online data storage unit such as in “Cloud” computing. In an aspect, the computer system 100 also may include one or more interfaces, such as an interface 110, coupled with the address/data bus 102. The one or more interfaces are configured to enable the computer system 100 to interface with other electronic devices and computer systems. The communication interfaces implemented by the one or more interfaces may include wireline (e.g., serial cables, modems, network adaptors, etc.) and/or wireless (e.g., wireless modems, wireless network adaptors, etc.) communication technology. 
[00049] In one aspect, the computer system 100 may include an input device 112 coupled with the address/data bus 102, wherein the input device 112 is configured to communicate information and command selections to the processor 104. In accordance with one aspect, the input device 112 is an alphanumeric input device, such as a keyboard, that may include alphanumeric and/or function keys. Alternatively, the input device 112 may be an input device other than an alphanumeric input device. In an aspect, the computer system 100 may include a cursor control device 114 coupled with the address/data bus 102, wherein the cursor control device 114 is configured to communicate user input information and/or command selections to the processor 104. In an aspect, the cursor control device 114 is implemented using a device such as a mouse, a track-ball, a track-pad, an optical tracking device, or a touch screen. The foregoing notwithstanding, in an aspect, the cursor control device 114 is directed and/or activated via input from the input device 112, such as in response to the use of special keys and key sequence commands associated with the input device 112. In an alternative aspect, the cursor control device 114 is configured to be directed or guided by voice commands. [00050] In an aspect, the computer system 100 further may include one or more optional computer usable data storage devices, such as a storage device 116, coupled with the address/data bus 102. The storage device 116 is configured to store information and/or computer executable instructions. In one aspect, the storage device 116 is a storage device such as a magnetic or optical disk drive (e.g., hard disk drive ("HDD"), floppy diskette, compact disk read only memory ("CD-ROM"), digital versatile disk ("DVD")). Pursuant to one aspect, a display device 118 is coupled with the address/data bus 102, wherein the display device 118 is configured to display video and/or graphics. 
In an aspect, the display device 118 may include a cathode ray tube ("CRT"), liquid crystal display ("LCD"), field emission display ("FED"), plasma display, or any other display device suitable for displaying video and/or graphic images and alphanumeric characters recognizable to a user. [00051] The computer system 100 presented herein is an example computing environment in accordance with an aspect. However, the non-limiting example of the computer system 100 is not strictly limited to being a computer system. For example, an aspect provides that the computer system 100 represents a type of data processing analysis that may be used in accordance with various aspects described herein. Moreover, other computing systems may also be implemented. Indeed, the spirit and scope of the present technology is not limited to any single data processing environment. Thus, in an aspect, one or more operations of various aspects of the present technology are controlled or implemented using computer-executable instructions, such as program modules, being executed by a computer. In one implementation, such program modules include routines, programs, objects, components and/or data structures that are configured to perform particular tasks or implement particular abstract data types. In addition, an aspect provides that one or more aspects of the present technology are implemented by utilizing one or more distributed computing environments, such as where tasks are performed by remote processing devices that are linked through a communications network, or such as where various program modules are located in both local and remote computer-storage media including memory-storage devices. [00052] An illustrative diagram of a computer program product (i.e., storage device) embodying the present invention is depicted in FIG.2. The computer program product is depicted as floppy disk 200 or an optical disk 202 such as a CD or DVD. 
However, as mentioned previously, the computer program product generally represents computer-readable instructions stored on any compatible non-transitory computer-readable medium. The term “instructions” as used with respect to this invention generally indicates a set of operations to be performed on a computer, and may represent pieces of a whole program or individual, separable, software modules. Non-limiting examples of “instruction” include computer program code (source or object code) and “hard-coded” electronics (i.e., computer operations coded into a computer chip). The “instruction” is stored on any non-transitory computer-readable medium, such as in the memory of a computer or on a floppy disk, a CD-ROM, or a flash drive. In either event, the instructions are encoded on a non-transitory computer-readable medium. [00053] (3) Specific Details of Various Embodiments [00054] Described herein is a process that enables autonomous platforms (e.g., robots, vehicles) in the field to quickly learn object classes without any training labels or with only a small set of training labels. The invented process learns a new machine learning model based on novel (i.e., previously unseen) sensory input, where the input may be from an optical camera. This learning process leverages a dataset of prior trained models by finding structure in these models that enables learning, without labels, an object class within new sensory data. The system described herein automatically finds object components in the prior trained models as well as in the novel sensory data to automatically match the distribution of components between prior models and novel data. By matching the component distributions, even if they come from different sensors, new data can be identified by mapping component compositions to objects in a database. Non-limiting examples of “different sensors” include two (or more) cameras with different lenses, and a regular camera and an infrared camera. 
In addition, the method according to embodiments of the present disclosure is used to bootstrap the learning of new machine learning models. Here, the system optimizes a pseudo, or surrogate, task for learning a new model. A pseudo task is, for example, taking away the color of an image and training a network to predict the color of each pixel again (this process is also called self-supervised learning). This pseudo task is not the actual task but trains the network in a way that makes it easier to learn the actual task. Finding the right pseudo task has been done manually in the prior art. In this invention, the pseudo task is found automatically, as will be described in detail below. [00055] FIG.3 shows the key elements of the invention. Upon an autonomous platform (e.g., robot (element 301)) discovering a new physical object in the field (element 300), the robot (element 301) processes the camera input (element 302), which are images of the object. Here, a new object means that the robot (element 301) was not trained on the object for its mission, but the nature of the object is known to mankind, non-limiting examples of which include a traffic cone, a vehicle that is different than one the robot (element 301) has been trained on, an animal, etc. [00056] As a next step (element 304), the invented process begins to automatically dissect the image (i.e., camera input (element 302)) into its components (element 306). An example embodiment for such automatic dissection is described in Literature Reference No.3. The image components (element 306) are matched with components from a database (element 308). Based on this component-wise matching, the identity of the object is revealed (i.e., identify new object based on components (element 310)) and assigned an object label (element 312). The system described herein is used to generate a control script to automate execution of a task (or action) to be performed by an autonomous platform, such as a robot (element 301). 
The autonomous platform executes the control script (element 314) that is appropriate for, or corresponds to, the labeled object (element 312), such as instructions to disassemble the object, and the autonomous platform performs the action (element 316) in accordance with the control script (element 314). The control script (element 314) is a predefined control script that can be either manually written, learned from human observation, or automatically generated in an optimization process (e.g., using genetic algorithms). Learning from human observation is described in detail in Literature Reference No.9. In addition, the robot (element 301) trains a new machine learning model with the images of the object (i.e., camera input (element 302)) and the corresponding label (i.e., object label (element 312)). [00057] FIG.4 provides more details on identifying the object based on components (element 310), which includes identifying a new object based on components (FIG.3, element 310) matched with components in the database of object components (FIG.3, element 308). Both the components from the database (i.e., source or old data) as well as the components from the new data are mapped onto a feature space (element 400). In this feature space, the data points are clustered, revealing clusters of components (element 402). Initially, the clustering is carried out based solely on the components from the database. When presenting new data, the components in feature space are aligned with the feature space representation of the components from the database (i.e., old data) (element 404). After alignment, the components from the new data can be mapped onto corresponding attributes in the database to identify new data (element 406). Since each object in the database has a known composition of attributes, the object identity or label can be retrieved based on the attributes. 
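As a non-limiting illustration of the retrieval step described in paragraph [00057], the sketch below assumes that each object in the database is stored as a set of attribute names and that a new object is identified by set overlap. The database contents, the Jaccard score, and the `min_score` threshold are hypothetical choices made for illustration only; the disclosure does not prescribe them.

```python
# Hypothetical attribute database: each known object label maps to the set of
# abstract attributes (component clusters) it is composed of.
ATTRIBUTE_DB = {
    "car": {"wheel", "windshield", "metal_body"},
    "bicycle": {"wheel", "handlebar", "saddle"},
    "traffic_cone": {"cone_body", "reflective_stripe"},
}


def jaccard(a, b):
    """Jaccard similarity between two attribute sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def identify(observed_attributes, database=ATTRIBUTE_DB, min_score=0.5):
    """Return (label, score) of the best match, or (None, score) if below min_score."""
    best_label, best_score = None, 0.0
    for label, attributes in sorted(database.items()):
        score = jaccard(observed_attributes, attributes)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_score:
        return None, best_score
    return best_label, best_score
```

Under these assumptions, `identify({"wheel", "handlebar"})` selects the bicycle entry, while an attribute set that overlaps with no database entry returns no label and could instead be routed to active learning.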
[00058] Before one can map data onto the feature space (element 400) and carry out the component alignment (element 404), several steps have to be computed in advance. These initial steps are Steps 1 to 4 in the 6-step outline below and are illustrated in FIGs.5A and 5B. FIG.5A depicts Steps 1 to 3 of the process described herein, and FIG.5B depicts Steps 4-6 of the process. Step 5 includes the alignment (element 404), and Step 6 is an optional step that improves the new machine learning model with a few auto-selected labels. The new six-step process consists of 1) creating a similarity graph between source models (element 500); 2) learning optimal pseudo tasks for each source model (element 502); 3) extracting components from source models (element 504); 4) building a dictionary of abstract attributes from the components (element 506); 5) aligning target data with model data to map the target data onto abstract attributes in the dictionary (element 508); and 6) active learning for data for which there is not a good match with existing attributes (element 510). In the following, the six steps are first outlined and then the entire system is described in detail. [00059] (3.1) Step 1: Source-Similarity-Graph (SSG) Creation (element 500) [00060] The SSG creation step (Step 1; element 500) leverages a multitude of pre-existing data sources from a database of models with training data (element 512), which include machine learning datasets and associated models. The system measures similarity between models (element 514) and, based on a similarity measure computed between the source models, the SSG (element 516) is created that links the models based on their nearest neighbors. The benefit of the SSG (element 516) is to quickly find a small set of models that are closest to the target data. 
[00061] (3.2) Step 2: Pseudo-Tasks Optimization (element 502) [00062] Pseudo-task optimization for self-supervised learning (SSL) is performed on all source models (element 518) to find the best pseudo task (element 520) for each source model. The optimization criterion is to create networks that allow a discriminative component extraction comparable to what is obtained from a network trained in a supervised way, since source labels are known. Then, on a large set of unlabeled target data (element 522), SSL is again carried out by combining the best pseudo tasks (element 520) from the closest source models. In other words, a network is trained for target data with SSL using pseudo tasks from the closest source models (element 524). The result is an initial target network (element 526) that maps the input onto a latent feature space. [00063] (3.3) Step 3: Unsupervised Data Decomposition (element 504) [00064] Given the trained networks from the sources, the process according to embodiments of the present disclosure automatically extracts salient components from input data (e.g., extracting the cropped image of a car tire) using unsupervised data decomposition (element 528), as described in detail in Literature Reference No.3. The end result is a set of components for each source model (element 530) in input space for each labeled source input. [00065] (3.4) Step 4: Attribute-Dictionary Creation (element 506) [00066] In the attribute-dictionary creation step (element 506), components of all source models and corresponding labels (element 532) are mapped across input data and sources onto abstract attributes (element 534). This mapping can be carried out through unsupervised clustering of the abstract attributes (element 536), resulting in semantically meaningful clusters, such as a cluster of bike wheels (described in Literature Reference No.3). 
Given the mined attributes, a dictionary of sets of abstract attributes is built (element 538), resulting in an attribute dictionary (element 540). For each set, the corresponding labels from the sources are obtained. Thus, for objects and actions, a mapping from the attributes onto labels can be learned. [00067] (3.5) Step 5: Zero-Shot Attribute Distillation (element 508) [00068] Given the attributes from the sources in the attribute dictionary (element 540), the initial target network (element 526), and the source models, the unlabeled target data (element 522) needs to be processed so that the unlabeled target data (element 522) maps onto the right set of attributes (i.e., map target data onto abstract attributes (element 542)). To achieve this mapping, the closest source model or models are selected based on the SSG (element 516) and then a zero-shot attribute distillation method (element 544) is used to align the sources with the target. Alignment happens by learning a shared latent space for the source and target models that is predictive of the attributes for the source data and simultaneously generative of the target data. As a result, there is a new target network (element 546) that maps target data onto a probability distribution over the abstract attributes in the attribute dictionary (element 540). This distribution can then be mapped onto a label or sentence. Note that in all of these five steps, not a single training label was used to assign labels to data from a new domain. [00069] Distilled embedding (element 620) refers to the set of representations (i.e., deep neural activations) of the unlabeled target data (element 602) that are already aligned with one or more source models. Update embedding (element 622) refers to updating the target representations (via updating the target model) using the information provided by the oracle (element 616) through active learning (element 510) (through inquiry). 
[00070] (3.6) Step 6: Active Learning (element 510) [00071] For some data, a good attribute representation may not be found, where goodness is, for instance, measured by having high values in the attribute probability distribution or high confidence in these values. For data without a good representation, an active learning method (element 548) is used. To carry out this learning in a label-efficient way using selected labels from an oracle (element 550) (i.e., human labeler), all data for which active learning is required is collected from the new target network (element 546), and then label propagation in the latent feature space of the data is carried out. The result is an improved attribute dictionary and refined target network (element 552). [00072] The output of these six steps (elements 500, 502, 504, 506, 508, and 510) is a trained network (new target network (element 546)) that returns labels for a target domain. These steps describe the process for training a new network from scratch. If the goal is to only adapt a network to a new sensor or different viewpoint or lighting, the process described above can be simplified. Here, a trained model from the six-step process is used as a starting point; it becomes the new source model, which is input to the feature space (element 400) and becomes part of the different source datasets (element 604). [00073] In the following, the invented process is described in more detail. A functional-level representation of the architecture for learning a target model is shown in FIG.6. The output of this process is a new target model that is generated by zero-shot attribute distillation (element 508) and used in identifying a new object based on components (element 310). The inputs (element 600) to the system described herein are the target dataset with no labels (element 602) and a set of annotated/labeled source datasets (element 604) that contains datasets both relevant and irrelevant to the target dataset. 
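As a non-limiting illustration of the goodness criterion in Step 6 (paragraph [00071]), the sketch below assumes that the target network outputs one attribute probability distribution per sample, that a representation is "not good" when its highest probability falls below a threshold, and that the most uncertain candidates (by Shannon entropy) are sent to the oracle first. The threshold, the entropy ranking, and the query budget are hypothetical choices for illustration; label propagation in the latent space is not shown.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(attribute_probs, max_prob_threshold=0.6, budget=2):
    """Pick the most uncertain samples to send to the oracle.

    attribute_probs: dict mapping sample id -> attribute probability distribution.
    A sample is a candidate when its highest attribute probability is below
    max_prob_threshold; candidates are ranked by entropy, highest first.
    """
    candidates = [
        (entropy(p), sample_id)
        for sample_id, p in attribute_probs.items()
        if max(p) < max_prob_threshold
    ]
    candidates.sort(reverse=True)
    return [sample_id for _, sample_id in candidates[:budget]]
```

Keeping the budget small reflects the stated aim of label efficiency: only the few samples whose attribute distributions are least informative are labeled by the oracle.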
[00074] Meta-learning on the source datasets (element 606) is used to learn three things: the SSG (element 516) for different datasets, optimal pseudo-tasks (OPT) for source datasets (element 520), and a set of canonical attributes (element 608). The meta-learning module (element 606) utilizes source similarity graph generation (element 500), attribute generation (element 506), and meta learning of pseudo-tasks for SSL (element 502). The target data (element 602) is then placed on the SSG (element 516) in source dataset retrieval (element 610) to retrieve relevant source datasets. The optimal pseudo-tasks (OPT) (element 520) for relevant sources (i.e., K-NN sources (element 612)) are used to define a target pseudo-task and an SSL model is learned from scratch for the target dataset (i.e., SSL of target with optimal pseudo-tasks (element 524)). [00075] Next, the relevant sources (i.e., K-NN sources (element 612)) and their corresponding attributes (i.e., canonical attributes (element 608)) are used together with the target data (i.e., initial target model (element 526)) to tune the SSL trained model and perform zero-shot attribute distillation (element 508). Finally, the uncertain attributes/components of the target data are used as a proxy for active learning (element 510) to inquire/query (element 614) class labels for a few target samples from an oracle (element 616). The inquired labels (i.e., selected labels from oracle (element 550)) are then used as a feedback signal to further tune the model and enrich/update the set of canonical attributes (element 618). [00076] (3.7) Step 1: Source Similarity Graph (element 516) [00077] Transfer learning is a natural choice for learning a target dataset with few (or no) labels (element 602) in the presence of relevant source dataset(s) with abundant labeled data (element 604). 
Identifying the relevant source dataset(s) (i.e., K-NN sources (element 612)), however, remains a core challenge in transfer learning applications. This difficulty is due to a phenomenon known as ‘negative transfer’, which occurs when knowledge is transferred from irrelevant source domains and in which transfer learning degrades the learning performance instead of improving it. Current transfer learning methods often assume known relevant source domains, which are handpicked by a human expert. Obtaining a fully automated transfer learning system requires a quantitative notion of relevance between different datasets, so that a machine is capable of choosing the relevant source dataset(s) from a large pool of datasets to solve a task for the input target dataset. [00078] To address this challenge, a unique similarity measure is computed between different datasets in the present invention. FIG.7 illustrates measuring similarity between datasets, X1 (e.g., MNIST (Modified National Institute of Standards and Technology)) (element 700) and X2 (e.g., USPS (U.S. Postal Service) or SVHN (Street View House Numbers)) (element 702), where the similarity measures their distributional distance in a lower dimensional shared latent space (element 706). An example for such a similarity measure is the sliced-Wasserstein distance (Literature Reference No.4). A shared encoder, $\phi_{\mathrm{SSG}}$ (element 708), is used for both datasets (elements 700 and 702), and the latent space (element 706) of the encoder (i.e., the output of the encoder) is required to be generative for both datasets in the sense that both domains can be recovered from the latent space (element 706) via decoders, $\psi_1$ (element 710) and $\psi_2$ (element 712). The process according to embodiments of the present disclosure compares the distributions in the latent space for two different datasets and generates a similarity graph. [00079] Let $X_i = \{x_n^i\}_n$ denote the samples from the i’th source dataset. To measure the similarity between two datasets $X_i$ and $X_j$, first ensure that they are in the same Hilbert space (i.e., $x_n^i, x_m^j \in \mathcal{H}$). If they are not already in the same space, there are two different options. The first involves resizing the input images. The second involves using different preprocessing networks to provide a preliminary map to the same Hilbert space. Then, a mapping $\phi_{\mathrm{SSG}}: \mathcal{H} \to \mathcal{Z}$ (e.g., a deep neural encoder) is identified that encodes the datasets into a shared latent space, while requiring this latent space to be generative for the two datasets, as described above. This means that decoders $\psi_i^{\mathrm{SSG}}$ and $\psi_j^{\mathrm{SSG}}: \mathcal{Z} \to \mathcal{H}$ are learned alongside the shared encoder $\phi_{\mathrm{SSG}}$ such that $\psi_i^{\mathrm{SSG}}(\phi_{\mathrm{SSG}}(x_n^i)) \approx x_n^i$ and $\psi_j^{\mathrm{SSG}}(\phi_{\mathrm{SSG}}(x_m^j)) \approx x_m^j$. Finally, the dissimilarity between the datasets is defined as a metric between the empirical distributions of the two datasets in the latent space. Here, a possible metric is the sliced-Wasserstein metric, which has theoretical merits against classic information-theoretic dissimilarity metrics, such as KL-divergence (see Literature Reference Nos.4 and 5). Therefore, the distance is:
$$d(X_i, X_j) = SW_2\left(\{\phi_{\mathrm{SSG}}(x_n^i)\}_n, \{\phi_{\mathrm{SSG}}(x_m^j)\}_m\right).$$
Having the distances between pairs of source datasets, a K-Nearest Neighbors (K-NN) similarity graph is formed, where each node of a graph is a source dataset and the edges identify the closeness of the datasets to one another. For example, if the distance d is below a threshold (e.g., 0.5), an edge is formed in the similarity graph. To overcome the computational expense of calculating pairwise distances between all source datasets, the system described herein utilizes an efficient approximation of a K-NN graph, e.g., the one proposed by Dong et al. (see Literature Reference No.6). Finally, for a target dataset, a computationally efficient algorithm (e.g., Literature Reference No.10) is used for proximity search on large graphs. The result is an automated way of obtaining the nearest (i.e., most relevant) source datasets to the target dataset, which leads to seamless transfer learning with minimized negative transfer effect. [00080] FIG.8 depicts the Source Similarity Graph (SSG) (element 516) allowing efficient retrieval of relevant sources (element 612) when a new target dataset (element 800) is presented. Using the pairwise distance between source datasets, an SSG (element 516) is formed, where each node (represented by a circle, e.g., element 802) of the SSG (element 516) is a source dataset, and the length/weight of each edge (represented by a line, e.g., element 804) identifies the similarity between two datasets. 
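As a non-limiting illustration of the distance and graph construction described in paragraph [00079], the sketch below assumes that each dataset has already been encoded into latent vectors by a shared encoder, that the datasets being compared hold equal numbers of samples, and that the graph is built by brute force. The function names, the number of random projections, and the brute-force pairwise search are hypothetical simplifications; the approximate K-NN construction of Literature Reference No.6 is not shown.

```python
import math
import random


def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    xs and ys are equal-sized lists of latent vectors (lists of floats).
    """
    assert len(xs) == len(ys), "this sketch assumes equal sample counts"
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction on the unit sphere.
        theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        # Project both sample sets onto the direction and sort: in 1-D the
        # optimal transport plan matches sorted samples.
        px = sorted(sum(t * v for t, v in zip(theta, x)) for x in xs)
        py = sorted(sum(t * v for t, v in zip(theta, y)) for y in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return math.sqrt(total / n_projections)


def knn_graph(latents, k=1):
    """Brute-force K-NN graph over datasets, keyed by dataset name."""
    graph = {}
    for name, xs in latents.items():
        dists = sorted(
            (sliced_wasserstein(xs, ys), other)
            for other, ys in latents.items()
            if other != name
        )
        graph[name] = [other for _, other in dists[:k]]
    return graph
```

Sorting the one-dimensional projections is what keeps the estimate inexpensive: each projection costs one sort per dataset rather than a full optimal transport solve.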
For a target dataset (element 800), an efficient proximity search is used on the SSG (element 516) to retrieve the most relevant source datasets (element 612). [00081] (3.8) Step 2: Pseudo-Tasks Optimization (element 502) [00082] Self-Supervised Learning (SSL) involves solving a pseudo-task over unlabeled data to learn an internal representation that allows for solving the real task(s) with far fewer labels. The pseudo-tasks in current SSL methods are hand-designed such that they uncover relevant features of the data for a target task, while the performance of the network is easily measured for the pseudo-task. Non-limiting examples of common pseudo-tasks are 1) recovering the data from a lower dimensional embedding space, as in auto-encoders; 2) leaving out parts of the input data and requiring the network to reconstruct the missing data; and 3) permuting the information in the input data and requiring the network to recover the un-permuted data. In the present invention, the pseudo tasks are learned. FIG.9 depicts training of the meta pseudo-task generator for SSL, which enables SSL to generate attribute/component distributions similar to those learned by a model trained with full supervision. The pseudo task generator (element 900) is a function that is applied to the input image to discard some of the information in the input image (e.g., cut out part of the image, convert from color to grayscale). The neural encoder (element 902) and neural decoder (element 904) are deep convolutional neural networks with mirrored architectures. The neural encoder (element 902) squeezes the information of the pseudo-input image (i.e., input image that has gone through the pseudo task generator (element 900)) into a low-dimensional vector space/latent space. The neural decoder (element 904) aims to recover the information of the original input image from this low-dimensional vector representation. 
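As a non-limiting illustration of the pseudo task generator described in paragraph [00082], the sketch below assumes that an image is a nested list of RGB tuples and shows the two examples named above (color removal and cut-out), together with the pairing of the degraded input with the original image as the reconstruction target. The function names and the image representation are hypothetical; the encoder and decoder networks are not shown.

```python
def to_grayscale(image):
    """Discard color: replace each RGB pixel by its mean intensity."""
    return [[sum(pixel) / 3.0 for pixel in row] for row in image]


def cut_out(image, top, left, height, width, fill=(0, 0, 0)):
    """Discard a rectangular region by overwriting it with a fill value."""
    out = [list(row) for row in image]  # copy rows so the input is untouched
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = fill
    return out


def make_training_pair(image, pseudo_task):
    """Return (network input, reconstruction target) for one SSL example."""
    return pseudo_task(image), image
```

In this sketch the target is always the unmodified image, so no manual label is needed to train the encoder and decoder.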
[00083] The SSG module (element 516) described above provides the relevance of source datasets to the target dataset. The relevance is used for two things: 1) attribute distillation; and 2) providing a near optimal pseudo-task for the target data that enables one to learn a good initial model for the target dataset from scratch. Let $\{X_i\}_i$ be the set of all sources and let $\{\tau_i: \mathcal{H} \to \mathcal{H}\}_i$ be the corresponding optimal pseudo-tasks for these datasets. For the target dataset $X_T$, let $\{\alpha_i\}_i$ denote the relevance or similarities of the target to the source datasets such that $\sum_i \alpha_i = 1$. Then, a new pseudo-task is designed for the target data as a function of $\{\tau_i: \mathcal{H} \to \mathcal{H}\}_i$ and $\{\alpha_i\}_i$. As is obvious to one skilled in the art, there are various options to do this. The most straightforward way is to use pseudo-task $\tau_i$ in training the target model a fraction $\alpha_i$ of the time. In addition, a composition rule is learned to find a pseudo-task as a function of the source pseudo-tasks (i.e., $\tau_T = f(\{\tau_i\}_i, \{\alpha_i\}_i)$). [00084] (3.9) Step 3: Unsupervised Data Decomposition (element 504) [00085] It is challenging to interpret data as a combination of its components that could be shared between different datasets. The data components are standalone entities of information that piece together a data sample. Here, the rationale is that comparing samples from different source datasets (with potential appearance changes) on a component basis is more effective than as a whole. To obtain data components, an input data point (e.g., an image) is dissected based on its neural activation patterns with respect to a trained neural network model (e.g., via SSL, as above). In Literature Reference No.3, the authors disclosed that the Nonnegative Matrix Factorization (NMF) of the final convolutional layer of a pre-trained convolutional neural network (CNN) leads to blob-like masks that identify the semantically meaningful components in the input image. As shown in FIG. 10, the system according to embodiments of the present disclosure dissects an input image (element 302) into its image components (element 306), or visual words. The rationale here is that the image components (element 306) have a higher chance of being shared among different source datasets. The pre-trained model (i.e., trained CNN (element 1000)) for a dataset is used, and its neural activations (i.e., NMF components of activations (element 1002)) for the input image (element 302) are analyzed in an unsupervised manner via blob detection (element 1004) to identify the image components (element 306). 
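As a non-limiting illustration of the decomposition described in paragraph [00085], the sketch below assumes that the activations of the final convolutional layer have been flattened into a nonnegative matrix with one row per spatial position and one column per channel, and factors it with the standard Lee-Seung multiplicative updates. The function names, rank, iteration count, and random initialization are hypothetical choices; the trained CNN and the subsequent blob detection are not shown.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor nonnegative v (m x n) as w (m x rank) times h (rank x n).

    Uses the Lee-Seung multiplicative updates for the squared error.
    """
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(m)]
    return w, h


def component_masks(w, height, width):
    """Reshape each column of w into a (height x width) spatial mask."""
    return [
        [[w[r * width + c][k] for c in range(width)] for r in range(height)]
        for k in range(len(w[0]))
    ]
```

Each column of `w` then acts as a spatial mask over the feature map; under the stated assumptions, thresholding or blob detection on such a mask would yield the cropped image component.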
Similar ideas could be extended to the video and textual domains. [00086] As depicted in FIG.11, each unsupervised data decomposition module (element 528) enables the use of the learned/trained models (element 1100) (e.g., an a priori trained model on a related dataset) and dissection of the input images into image components (element 306). Image components (element 306) extracted from all source models and datasets (element 604) are unified to provide groups of components, referred to as attributes, that cover all source datasets (element 604). [00087] (3.10) Step 4: Attribute Dictionary Creation (element 506) [00088] Another challenge is to unify the extracted image components (element 306) from different source datasets (element 604) into a shared set of components, or attributes. To solve this challenge, a joint embedding, or attribute embedding (element 1102), is learned for all extracted image components (element 306) from source datasets (element 604) via a joint encoder ^ (joint model; element 1104). SSL is used to learn such neural encoder, ^ (element 1104) for all image components (element 306). Having a joint embedding on the image components (element 306), in the embedding space, clustering is performed (e.g., using Sliced-Wasserstein Gaussian Mixture Model (see Literature Reference No.7) to obtain machine learned attributes for source datasets (i.e., attribute embedding (element 1102)). [00089] As described above, the source datasets (element 604) are processed to obtain the attributes for each sample in each dataset. In short, for the n’th sample from the i’th source dataset,
Figure imgf000028_0001
, the input is dissected into its components (i.e., into textual or visual words) (element 306), then the image components (element 306) are embedded into the joint embedding via the joint model (element 1104), and the corresponding attributes for
Figure imgf000028_0002
are obtained. In this manner, the source datasets will have data, labels, and attributes, (
Figure imgf000028_0003
which are collected into a dictionary. The mined attributes enable zero-shot learning on the target dataset.
[00090] (3.11) Step 5: Zero-Shot Attribute Distillation (element 544)
[00091] As depicted in FIG. 12, images from the source datasets (element 604) and the target dataset (element 602) both go through the shared neural encoder (element 1200), and a latent representation (a vector) for each image from each dataset/domain is obtained. Images from the source datasets (element 604) have ground truth attributes, and the goal is to train the shared neural encoder (element 1200) such that its output is representative of the attributes for the source datasets (element 604) (i.e., one should be able to recover the attributes of images from the source domain from this latent space). However, the target dataset (element 602) does not have ground truth attributes, so the output of the shared neural encoder (element 1200) must be made representative of the target domain for images from the target dataset (element 602). This is done by using a neural decoder (element 1202) that reconstructs the target images from the target dataset using the same latent representation as for the source data, resulting in a reconstructed target dataset (element 1204). In other words, the decoder (element 1202) forces the shared neural encoder (element 1200) to maintain the critical information of the target dataset (element 602).
[00092] The zero-shot attribute distillation module (element 544) receives the target dataset (element 602), samples from relevant source datasets (the K nearest neighbor sources), their corresponding mined attributes (i.e., canonical attributes (element 608)), and the SSL-trained target model (element 526). The attributes are different from the labels and are mined as described above. For instance, while the label for a sample is ‘car’, its attributes could be ‘has wheels’, ‘is metal’, etc.
The twist here is that, since the attributes are mined by a machine, no human labels them (i.e., they are abstract but still encode class-specific characteristics).
[00093] The challenge is to map new target data onto the same attributes. Starting from a good model for the target dataset (i.e., the SSL-trained model (element 526)), the model is retuned such that the provided embedding satisfies two conditions: 1) it is generative for the target dataset, and 2) the mined attributes for the relevant source datasets can be predicted from it via a shallow regressor (element 1206). Having tuned such a model, that is, the encoder (element 1200) and the regressor (element 1206), the resulting prediction (Figure imgf000029_0001) is then used as an approximation of the attributes for the target dataset (element 602). The catch here is that only attributes predicted with high confidence are used: a thresholding function is applied to the predicted attributes based on the certainty of the prediction.
[00094] Zero-shot attribute distillation (element 544) uses a shared embedding to map the target dataset (element 602) onto mined attributes. The SSL-trained model (element 526) for the target dataset (element 602) is jointly tuned on relevant source datasets (element 604), such that the latent space is predictive of the components/attributes for the source datasets while remaining a generative space for the target domain.
[00095] (3.12) Step 6: Active Learning of Novel Classes and/or Attributes (element 548)
[00096] Referring to FIG. 12, with zero-shot attribute distillation (element 544), one is able to recognize classes of data for which there exists an attribute representation in the source datasets. However, this representation can be insufficient when new classes or new attributes exist in the target dataset (element 602). In the presence of new classes and/or attributes, active learning (element 548) is used to disambiguate the new information. The approach according to embodiments of the present disclosure is as follows. Starting with an SSL-trained model (element 526) for the target dataset (element 602), the target samples first go through the zero-shot distillation pipeline (element 544), which predicts target attributes (element 1202). Referring to FIG. 11, the target model (element 1100) is then used together with the unsupervised data decomposition (element 528) to dissect data into their components (element 306). The dissected components (element 306) are then fed to the joint component model (element 1104), and novel attributes are detected via cluster analysis in its latent space.
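One simple way to picture the cluster analysis mentioned above is a distance test against the centroids of the attribute clusters already held in the dictionary: a component embedding that lies far from every known centroid is flagged as a candidate novel attribute. This is an assumed stand-in rather than the disclosed method; the names `nearest_centroid` and `flag_novel`, the two-dimensional embeddings, and the distance threshold are all hypothetical.

```python
import math

def nearest_centroid(point, centroids):
    """Return (index, distance) of the closest centroid in Euclidean distance."""
    dists = [math.dist(point, c) for c in centroids]
    idx = min(range(len(dists)), key=dists.__getitem__)
    return idx, dists[idx]

def flag_novel(embeddings, centroids, max_dist):
    """Split component embeddings into known-attribute assignments
    (sample index, centroid index) and indices of novel candidates."""
    known, novel = [], []
    for i, e in enumerate(embeddings):
        idx, d = nearest_centroid(e, centroids)
        if d <= max_dist:
            known.append((i, idx))
        else:
            novel.append(i)
    return known, novel

# Assumed toy data: two known attribute clusters and three target components.
CENTROIDS = [(0.0, 0.0), (5.0, 5.0)]
TARGET_EMBEDDINGS = [(0.2, -0.1), (5.1, 4.8), (10.0, 0.0)]
KNOWN, NOVEL = flag_novel(TARGET_EMBEDDINGS, CENTROIDS, max_dist=1.0)
# KNOWN == [(0, 0), (1, 1)]; NOVEL == [2]
```

Samples whose components land in `NOVEL` are the ambiguous ones that would be passed on for clarification by a human operator.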
This process filters the samples that are ambiguous and require further clarification. Then active learning (element 548) is performed on these ambiguous samples such that the ambiguity is resolved with minimal inquiries to a human user/operator. For active learning (element 548), a non-limiting example is a combination of least confidence, margin sampling, and entropy sampling (see Literature Reference No. 8) to select the most informative data points, i.e., the samples with the largest uncertainty.
[00097] In summary, the invention according to embodiments of the present disclosure meta-learns unsupervised learning algorithms by optimizing pseudo-tasks for the source datasets. Moreover, it generates machine-learned attributes from the source datasets that are leveraged to predict classes of the target dataset without requiring labels. Finally, the system described herein is capable of inquiring about new attributes and learning new classes using its unsupervised component decomposition together with active learning. To adapt a target model learned from scratch to a new target domain, a shared latent space is used, and the old and new target datasets are embedded into this space. The models are then tuned to achieve three things: 1) solve an optimal pseudo-task for the new target data; 2) remain discriminative for the old target dataset; and 3) be domain agnostic. Requiring the latent space to be domain agnostic forces the latent distributions of the two datasets to be indistinguishable from one another. The pseudo-task optimization for the target dataset forces the model to carefully preserve information in the new target domain while, simultaneously, being discriminative for the old dataset.
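The combination of least confidence, margin sampling, and entropy sampling mentioned above for active learning can be sketched as follows. The equal-weight average of the three scores, the function names, and the toy class probabilities are assumptions for illustration; the disclosure does not state how the three criteria are combined.

```python
import math

def least_confidence(p):
    """1 minus the probability of the most likely class."""
    return 1.0 - max(p)

def margin_score(p):
    """1 minus the gap between the two most likely classes;
    a small gap means high uncertainty."""
    top = sorted(p, reverse=True)
    return 1.0 - (top[0] - top[1])

def entropy_score(p):
    """Shannon entropy normalized to [0, 1] by log(number of classes)."""
    h = -sum(q * math.log(q) for q in p if q > 0)
    return h / math.log(len(p))

def select_queries(probabilities, budget):
    """Rank samples by the mean of the three scores (assumed combination)
    and return the indices of the samples to send to the human labeler."""
    scores = [
        (least_confidence(p) + margin_score(p) + entropy_score(p)) / 3.0
        for p in probabilities
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:budget]

# Assumed class probabilities for three ambiguous target samples.
PREDICTIONS = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # highly uncertain prediction
    [0.70, 0.20, 0.10],  # moderately uncertain prediction
]
QUERIES = select_queries(PREDICTIONS, budget=1)
# QUERIES == [1]
```

A small `budget` keeps the number of inquiries to the human operator low, which matches the stated aim of resolving ambiguity with minimal labeling effort.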
[00098] The label-efficient learning of a model in the present invention, without utilizing or relying on any previous work, emerges from the interplay between meta-learning, self-supervision, unsupervised component decomposition, and transfer learning. At the meta-level, the following is used: 1) a unique similarity measure between source datasets that enables generation of the Source Similarity Graph (SSG) on these datasets; 2) a memory of shared attributes that are building blocks (e.g., all known object parts for object detection) of the source datasets; and 3) learning the optimal pseudo-tasks for source datasets that constrain a self-supervised learner to learn data attributes which are similar to those learned in a model with full supervision. For optimal self-supervision in the target domain, relevant source pseudo-tasks are composed. Finally, for transferring learned knowledge from relevant source datasets to the target dataset, zero-shot attribute distillation is implemented, which recovers shared attributes between the source and target datasets. The target data, however, could contain attributes that did not exist in the source datasets and therefore have no corresponding entries in the attribute dictionary. To enable the system to learn such novel attributes, active learning is leveraged, where the system requests class labels from a human labeler for samples that would resolve the uncertainty associated with the novel attributes. The feedback from the human is then used to further tune the network to achieve an optimal embedding. Finally, the newly identified attributes are added to the set of mined dictionary attributes for solving future tasks.
[00099] The invention described herein enables a significant reduction in the amount of labeled training data required for learning new image classes. State-of-the-art machine learning models require millions of labeled samples.
Labeling data, which usually requires manual labor, can be expensive, particularly for sensitive data. Moreover, such labeling is time critical when needed to adapt robots in the field. The system and method according to embodiments of the present disclosure can be used in automatic control of an autonomous platform, such as a robot, an autonomous self-driving ground vehicle, or an unmanned aerial vehicle (UAV). Non-limiting examples of devices that can be controlled via the processor 104 include a motor vehicle or a motor vehicle component (electrical, non-electrical, mechanical), such as a brake, a steering mechanism, suspension, or safety device (e.g., airbags, seatbelt tensioners, etc.). For instance, upon labeling, and thus identification, of an object in the target domain, the action to be performed can be a driving operation/maneuver (such as steering or another command) in line with driving parameters in accordance with the now labeled object. For example, if the system recognizes a bicyclist, another vehicle, or a pedestrian in the environment surrounding the autonomous driving system/vehicle, the system described herein can cause a vehicle maneuver/operation to be performed to avoid a collision with the bicyclist or vehicle (or any other object that should be avoided while driving). The system can cause the autonomous vehicle to apply a functional movement response, which may be the task to be performed, such as a braking operation followed by a steering operation (etc.), to redirect the vehicle away from the object, thereby avoiding a collision.
[000100] Other appropriate actions may include one or more of a steering operation, a throttle operation to increase speed or to decrease speed, or a decision to maintain course and speed without change. The responses may be appropriate for avoiding a collision, improving travel speed, or improving efficiency. As can be appreciated by one skilled in the art, control of other device types is also possible.
Thus, there are a number of automated actions that can be initiated by the autonomous platform given the particular object assigned a label and the target domain in which the system is implemented.
[000101] Finally, while this invention has been described in terms of several embodiments, one of ordinary skill in the art will readily recognize that the invention may have other applications in other environments. It should be noted that many embodiments and implementations are possible. Further, the following claims are in no way intended to limit the scope of the present invention to the specific embodiments described above. In addition, any recitation of “means for” is intended to evoke a means-plus-function reading of an element and a claim, whereas, any elements that do not specifically use the recitation “means for”, are not intended to be read as means-plus-function elements, even if the claim otherwise includes the word “means”. Further, while particular method steps have been recited in a particular order, the method steps may occur in any desired order and fall within the scope of the present invention.

Claims

What is claimed is:
1. A system for learning object labels for control of an autonomous platform, the system comprising: one or more processors and a non-transitory computer-readable medium having executable instructions encoded thereon such that when executed, the one or more processors perform operations of: performing pseudo-task optimization to identify an optimal pseudo-task for each source model of one or more source models; training an initial target network with self-supervised learning using the optimal pseudo-task; extracting a plurality of source image components from the one or more source models; generating an attribute dictionary of abstract attributes from the plurality of source image components; using zero-shot attribution distillation, aligning a set of unlabeled target data with the one or more source models that are similar to the set of unlabeled target data; mapping the set of unlabeled target data onto a plurality of abstract attributes in the attribute dictionary; generating a new target network from the mapping; using the new target network, assigning an object label to an object in the unlabeled target data; and controlling the autonomous platform based on the assigned object label.
2. The system as set forth in Claim 1, wherein the set of unlabeled target data is an input image, and wherein mapping the unlabeled target data onto abstract attributes further comprises: dissecting the input image into a plurality of target image components; comparing the plurality of target image components with the plurality of source image components; assigning the object label to the object based on the comparison; generating an executable control script appropriate for the object label; and causing the autonomous platform to execute the control script and perform an action corresponding to the control script.
3. The system as set forth in Claim 1, wherein a source similarity graph is used to select the one or more source models that are similar to the set of unlabeled target data and performing pseudo-task optimization further comprises: computing a similarity measure between the one or more source models; generating the source similarity graph based on the similarity measure; and using the source similarity graph, identifying one or more source models that are similar to the set of unlabeled target data.
4. The system as set forth in Claim 1, wherein extracting the plurality of source image components and generating the attribute dictionary further comprises: generating the plurality of source image components for each source model using unsupervised data decomposition; mapping the plurality of source image components and their corresponding labels onto the plurality of abstract attributes, resulting in clusters of abstract attributes; and generating the attribute dictionary from the clusters of abstract attributes.
5. The system as set forth in Claim 1, wherein the autonomous platform is a vehicle, and wherein the one or more processors further perform an operation of causing the vehicle to perform a driving operation in accordance with the assigned object label.
6. A computer implemented method for learning object labels for control of an autonomous platform, the method comprising an act of: causing one or more processors to execute instructions encoded on a non-transitory computer-readable medium, such that upon execution, the one or more processors perform operations of: performing pseudo-task optimization to identify an optimal pseudo-task for each source model of one or more source models; training an initial target network with self-supervised learning using the optimal pseudo-task; extracting a plurality of source image components from the one or more source models; generating an attribute dictionary of abstract attributes from the plurality of source image components; using zero-shot attribution distillation, aligning a set of unlabeled target data with the one or more source models that are similar to the set of unlabeled target data; mapping the set of unlabeled target data onto a plurality of abstract attributes in the attribute dictionary; generating a new target network from the mapping; using the new target network, assigning an object label to an object in the unlabeled target data; and controlling the autonomous platform based on the assigned object label.
7. The method as set forth in Claim 6, wherein the set of unlabeled target data is an input image, and wherein mapping the unlabeled target data onto abstract attributes further comprises: dissecting the input image into a plurality of target image components; comparing the plurality of target image components with the plurality of source image components; assigning the object label to the object based on the comparison; generating an executable control script appropriate for the object label; and causing the autonomous platform to execute the control script and perform an action corresponding to the control script.
8. The method as set forth in Claim 6, wherein a source similarity graph is used to select the one or more source models that are similar to the set of unlabeled target data and performing pseudo-task optimization further comprises: computing a similarity measure between the one or more source models; generating the source similarity graph based on the similarity measure; and using the source similarity graph, identifying one or more source models that are similar to the set of unlabeled target data.
9. The method as set forth in Claim 6, wherein extracting the plurality of source image components and generating the attribute dictionary further comprises: generating the plurality of source image components for each source model using unsupervised data decomposition; mapping the plurality of source image components and their corresponding labels onto the plurality of abstract attributes, resulting in clusters of abstract attributes; and generating the attribute dictionary from the clusters of abstract attributes.
10. The method as set forth in Claim 6, wherein the autonomous platform is a vehicle, and wherein the one or more processors further perform an operation of causing the vehicle to perform a driving operation in accordance with the assigned object label.
11. A computer program product for learning object labels for control of an autonomous platform, the computer program product comprising: computer-readable instructions stored on a non-transitory computer-readable medium that are executable by a computer having one or more processors for causing the processor to perform operations of: performing pseudo-task optimization to identify an optimal pseudo-task for each source model of one or more source models; training an initial target network with self-supervised learning using the optimal pseudo-task; extracting a plurality of source image components from the one or more source models; generating an attribute dictionary of abstract attributes from the plurality of source image components; using zero-shot attribution distillation, aligning a set of unlabeled target data with the one or more source models that are similar to the set of unlabeled target data; mapping the set of unlabeled target data onto a plurality of abstract attributes in the attribute dictionary; generating a new target network from the mapping; using the new target network, assigning an object label to an object in the unlabeled target data; and controlling the autonomous platform based on the assigned object label.
12. The computer program product as set forth in Claim 11, wherein the set of unlabeled target data is an input image, and wherein mapping the unlabeled target data onto abstract attributes further comprises: dissecting the input image into a plurality of target image components; comparing the plurality of target image components with the plurality of source image components; assigning the object label to the object based on the comparison; generating an executable control script appropriate for the object label; and causing the autonomous platform to execute the control script and perform an action corresponding to the control script.
13. The computer program product as set forth in Claim 11, wherein a source similarity graph is used to select the one or more source models that are similar to the set of unlabeled target data and performing pseudo-task optimization further comprises: computing a similarity measure between the one or more source models; generating the source similarity graph based on the similarity measure; and using the source similarity graph, identifying one or more source models that are similar to the set of unlabeled target data.
14. The computer program product as set forth in Claim 11, wherein extracting the plurality of source image components and generating the attribute dictionary further comprises: generating the plurality of source image components for each source model using unsupervised data decomposition; mapping the plurality of source image components and their corresponding labels onto the plurality of abstract attributes, resulting in clusters of abstract attributes; and generating the attribute dictionary from the clusters of abstract attributes.
15. The computer program product as set forth in Claim 11, wherein the autonomous platform is a vehicle, and wherein controlling the autonomous platform further comprises causing the vehicle to perform a driving operation in accordance with the assigned object label.
16. A method for learning with a fraction of labels per object class in a new target domain, the method comprising an act of: causing one or more processors to execute instructions encoded on a non-transitory computer-readable medium, such that upon execution, the one or more processors perform operations of: generating a source similarity graph between two or more source models; learning pseudo-tasks for each source model; extracting a plurality of source image components from each source model; generating an attribute dictionary of abstract attributes from the plurality of source image components; mapping a set of target data from a new target domain onto the attribute dictionary; and generating a new target network from the mapping.
17. The method as set forth in Claim 16, further comprising acts of: collecting data from the new target network; and propagating object labels in a latent feature space, resulting in an improved dictionary of abstract attributes and a refined target network.
PCT/US2020/057404 2019-12-10 2020-10-26 Process to learn new image classes without labels WO2021118697A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080075144.7A CN114600130A (en) 2019-12-10 2020-10-26 Process for learning new image classes without labels
EP20808575.3A EP4073704A1 (en) 2019-12-10 2020-10-26 Process to learn new image classes without labels

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962946277P 2019-12-10 2019-12-10
US62/946,277 2019-12-10

Publications (1)

Publication Number Publication Date
WO2021118697A1 true WO2021118697A1 (en) 2021-06-17

Family

ID=73476248

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/057404 WO2021118697A1 (en) 2019-12-10 2020-10-26 Process to learn new image classes without labels

Country Status (3)

Country Link
EP (1) EP4073704A1 (en)
CN (1) CN114600130A (en)
WO (1) WO2021118697A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130220A1 (en) * 2017-10-27 2019-05-02 GM Global Technology Operations LLC Domain adaptation via class-balanced self-training with spatial priors

Non-Patent Citations (13)

* Cited by examiner, † Cited by third party
Title
AHMED AMR ET AL: "Training Hierarchical Feed-Forward Visual Recognition Models Using Transfer Learning from Pseudo-Tasks", 12 October 2008, LECTURE NOTES IN COMPUTER SCIENCE; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], PAGE(S) 69 - 82, ISBN: 978-3-030-67069-6, ISSN: 0302-9743, XP047530154 *
DONG, W.; MOSES, C.; LI, K.: "Efficient k-Nearest Neighbor Graph Construction for Generic Similarity Measures", PROCEEDINGS OF THE 20TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB, March 2011 (2011-03-01), pages 577 - 586, XP058001428, DOI: 10.1145/1963405.1963487
KOLOURI, S.; MARTIN, C.E.; HOFFMANN, H.: "Explaining distributed neural activations via unsupervised learning", CVPR WORKSHOP ON EXPLAINABLE COMPUTER VISION AND JOB CANDIDATE SCREENING COMPETITION, vol. 2, July 2017 (2017-07-01)
KOLOURI, S.; POPE, P.E.; MARTIN, C.E.; ROHDE, G.K.: "Sliced-Wasserstein Autoencoder: An Embarrassingly Simple Generative Model", ARXIV PREPRINT ARXIV:1804.01947, 2018
KOLOURI, S.; ROHDE, G.K.; HOFFMANN, H.: "Sliced Wasserstein Distance for Learning Gaussian Mixture Models", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, pages 3427 - 3436
KOLOURI, S.; ZOU, Y.; ROHDE, G.K.: "Sliced Wasserstein Kernels for Probability Distributions", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2016, pages 5258 - 5267, XP033021721, DOI: 10.1109/CVPR.2016.568
MARC'AURELIO RANZATO ET AL: "Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition", CVPR '07. IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION; 18-23 JUNE 2007; MINNEAPOLIS, MN, USA, IEEE, PISCATAWAY, NJ, USA, 1 June 2007 (2007-06-01), pages 1 - 8, XP031114414, ISBN: 978-1-4244-1179-5 *
MENG YE ET AL: "Self-Training Ensemble Networks for Zero-Shot Image Recognition", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 19 May 2018 (2018-05-19), XP080880007 *
NOROOZI, M.; VINJIMOOR, A.; FAVARO, P.; PIRSIAVASH, H.: "Boosting Self-Supervised Learning via Knowledge Transfer", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2018
PASTOR, P.; HOFFMANN, H.; ASFOUR, T.; SCHAAL, S.: "Learning and Generalization of Motor Skills by Learning from Demonstration", IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, 2009
SARKAR, P.; MOORE, A. W.; PRAKASH, A.: "Fast Incremental Proximity Search in Large Graphs", PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING, 2008, pages 896 - 903, XP058106403, DOI: 10.1145/1390156.1390269
WANG, K.; ZHANG, D.; LI, Y.; ZHANG, R.; LIN, L.: "Cost-Effective Active Learning for Deep Image Classification", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, vol. 27, no. 12, 2017, pages 2591 - 2600, XP011674537, DOI: 10.1109/TCSVT.2016.2589879
XIAN, Y.; SCHIELE, B.; AKATA, Z.: "Zero-shot learning-the good, the bad and the ugly", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2017, pages 4582 - 4591

Also Published As

Publication number Publication date
EP4073704A1 (en) 2022-10-19
CN114600130A (en) 2022-06-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20808575

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020808575

Country of ref document: EP

Effective date: 20220711