
WO2024127554A1 - Information processing device, inference method, inference program, and method for generating feature value generation model


Info

Publication number
WO2024127554A1
WO2024127554A1 (PCT/JP2022/046044)
Authority
WO
WIPO (PCT)
Prior art keywords
inference
context
feature
frame images
information indicating
Prior art date
Application number
PCT/JP2022/046044
Other languages
French (fr)
Japanese (ja)
Inventor
あずさ 澤田
尚司 谷内田
Original Assignee
日本電気株式会社
Priority date
Filing date
Publication date
Application filed by 日本電気株式会社
Priority to PCT/JP2022/046044
Publication of WO2024127554A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion

Definitions

  • One aspect of the present invention aims to realize an information processing device or the like that is capable of performing inference that takes into account context while keeping computational costs down.
  • An information processing device includes a feature generating means for generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference means for making a predetermined inference regarding the object based on the feature.
  • An inference method includes at least one processor generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, features of the object appearing in the frame images according to the context using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured, and performing a predetermined inference regarding the object based on the features.
  • An inference program causes a computer to function as a feature generating means for generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference means for making a predetermined inference regarding the object based on the feature.
  • A method for generating a feature generation model includes: at least one processor inputting, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, and performing a predetermined inference regarding the object based on the calculated features; and updating the feature generation model so that the result of the inference approaches predetermined ground truth data.
  • FIG. 1 is a block diagram showing the configuration of an information processing device according to a first exemplary embodiment of the present invention.
  • FIG. 2 is a flow diagram showing the flow of a method for generating a feature generation model and an inference method according to the first exemplary embodiment of the present invention.
  • FIG. 3 is a diagram for explaining a method of inspection for checking for foreign matter.
  • FIG. 4 is a diagram for explaining an overview of an inference method according to a second exemplary embodiment of the present invention.
  • FIG. 5 is a block diagram showing an example of the configuration of an information processing device according to the second exemplary embodiment of the present invention.
  • FIG. 6 is a diagram showing an example in which a difference occurs between contexts during learning and inference.
  • FIG. 7 is a flowchart showing the flow of processing performed by an information processing device according to the second exemplary embodiment of the present invention during learning.
  • FIG. 8 is a flowchart showing the flow of processing performed during inference by an information processing device according to the second exemplary embodiment of the present invention.
  • FIG. 9 is a diagram showing an example of a computer that executes instructions of a program, which is software that realizes the functions of each device according to each exemplary embodiment of the present invention.
  • Exemplary Embodiment 1: A first exemplary embodiment of the present invention will be described in detail with reference to the drawings. This exemplary embodiment is a basic form of the exemplary embodiments described below. First, information processing devices 1 and 2 according to this exemplary embodiment will be described with reference to Fig. 1. Fig. 1 is a block diagram showing the configuration of the information processing devices 1 and 2.
  • the information processing device 1 includes an inference unit 11 and a learning unit 12.
  • the inference unit 11 performs a predetermined inference regarding an object moving according to a predetermined context, based on features calculated by inputting time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context.
  • the learning unit 12 updates the feature generation model so that the result of the inference by the inference unit 11 approaches the predetermined correct answer data.
  • the information processing device 1 includes an inference unit 11 that performs a predetermined inference regarding the object based on features calculated by inputting, for each of a plurality of frame images extracted from a video capturing an object moving according to a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, and a learning unit 12 that updates the feature generation model so that the result of the inference by the inference unit 11 approaches predetermined correct answer data.
  • the above configuration makes it possible to generate a feature generation model that can generate features according to the context from time information and location information. This has the effect of making it possible to perform inference that takes into account the context while reducing computational costs compared to conventional techniques that generate context features from the entire video.
  • the information processing device 2 includes a feature generating unit 21 and an inference unit 22. For each of a plurality of frame images extracted from a moving image capturing an object moving along a predetermined context, the feature generating unit 21 generates a feature corresponding to the context of the object captured in the frame image by using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image.
  • the inference unit 22 performs a predetermined inference regarding the object based on the features generated by the feature generation unit 21.
  • the information processing device 2 includes a feature generation unit 21 that generates features according to the context of an object appearing in a plurality of frame images extracted from a video capturing an object moving along a predetermined context, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference unit 22 that performs a predetermined inference regarding the object based on the features generated by the feature generation unit 21.
  • An object that moves according to a specific context is affected differently at each point in time based on that context. Therefore, with the above configuration, it is possible to generate features according to the context of the object and perform inference based on those features.
  • the information processing device 2 has the advantage of being able to perform inference that takes context into account while keeping computational costs down compared to conventional techniques that generate context features from the entire video image.
  • the above-mentioned functions of the information processing device 1 can also be realized by a program.
  • the learning program according to this exemplary embodiment causes a computer to function as an inference unit 11 that performs a predetermined inference regarding the object based on a feature calculated by inputting time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image into a feature generation model for generating a feature according to the context, for each of a plurality of frame images extracted from a video image capturing an object moving along a predetermined context, and a learning unit 12 that updates the feature generation model so that the result of the inference by the inference unit 11 approaches a predetermined correct answer data.
  • a feature generation model capable of generating a feature according to a context from time information and position information can be generated, and thus an effect is obtained in which it is possible to perform inference taking into account the context while suppressing calculation costs.
  • the inference program causes a computer to function as a feature amount generating unit 21 that generates a feature amount according to the context of an object appearing in a plurality of frame images extracted from a video in which an object moving along a predetermined context is captured, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image, and an inference unit 22 that performs a predetermined inference regarding the object based on the feature amount generated by the feature amount generating unit 21.
  • This inference program provides the effect of being able to perform inference taking into account the context while suppressing calculation costs.
  • Fig. 2 is a flow diagram showing the flow of the method for generating a feature generation model and the inference method. Note that the execution subject of each step in the methods shown in Fig. 2 may be a processor provided in the information processing device 1 or 2 or a processor provided in another device, and each step may be executed by a processor provided in a different device.
  • the flow diagram shown on the left side of FIG. 2 illustrates a method for generating a feature generation model according to this exemplary embodiment.
  • In S11, at least one processor inputs, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, and performs a predetermined inference regarding the object based on the calculated features.
  • In S12, at least one processor updates the feature generation model so that the result of the inference in S11 approaches the predetermined ground truth data.
  • As described above, the method for generating a feature generation model includes at least one processor inputting, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, thereby performing a predetermined inference regarding the object based on the calculated features (S11), and updating the feature generation model so that the result of the inference in S11 approaches predetermined ground truth data (S12).
  • This makes it possible to generate a feature generation model capable of generating features according to the context from the time information and position information, which has the effect of making it possible to perform inference taking into account the context while suppressing calculation costs.
  • the flow diagram shown on the right side of FIG. 2 illustrates an inference method according to this exemplary embodiment.
  • In S21, at least one processor generates, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a feature quantity corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image.
  • In S22, at least one processor performs a predetermined inference regarding the object based on the features generated in S21.
  • the inference method includes at least one processor generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, feature values of the object appearing in the frame images according to the above-mentioned context using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured (S21), and making a predetermined inference about the object based on the feature values generated in S21 (S22).
  • This provides the effect of being able to make inferences that take into account the context while keeping computational costs down.
  • Exemplary Embodiment 2: A second exemplary embodiment of the present invention will be described in detail with reference to the drawings.
  • an inference method according to this exemplary embodiment (hereinafter, referred to as this inference method) is used in an inspection to check whether a foreign object is present in a liquid (e.g., medicine, beverage, etc.) sealed in a transparent container (hereinafter, referred to as a foreign object confirmation inspection).
  • FIG. 3 is a diagram for explaining the method of foreign object inspection.
  • A container filled with a predetermined liquid, which is the object to be inspected, is fixed in a device (not shown in FIG. 3), and the device is used to rock the container.
  • a control sequence for rocking the container is determined in advance.
  • the control sequence in the example of FIG. 3 is such that the container is rotated in a vertical plane for a predetermined time, then stopped for a predetermined time, and then rotated in a horizontal plane for a predetermined time. This control sequence may be repeated multiple times.
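  • As a rough illustration of how such a control sequence might be represented in software, the sketch below (a hypothetical Python example, not part of the disclosed embodiment) encodes the sequence as (action, duration) pairs and looks up the control phase at a given shooting time; the action names and durations are placeholders.

```python
# Hypothetical sketch: representing the rocking control sequence as (action, duration) pairs
# and looking up which control phase a given shooting time falls into.

CONTROL_SEQUENCE = [
    ("rotate_vertical", 5.0),    # rotate in a vertical plane for a predetermined time (placeholder: 5 s)
    ("rest", 3.0),               # stop for a predetermined time (placeholder: 3 s)
    ("rotate_horizontal", 5.0),  # rotate in a horizontal plane for a predetermined time (placeholder: 5 s)
]

def phase_at(t: float, sequence=CONTROL_SEQUENCE, repeat: bool = True) -> str:
    """Return the control phase active at time t (seconds from the start of the sequence)."""
    total = sum(duration for _, duration in sequence)
    if repeat:
        t = t % total  # the control sequence may be repeated multiple times
    elapsed = 0.0
    for action, duration in sequence:
        if t < elapsed + duration:
            return action
        elapsed += duration
    return sequence[-1][0]

if __name__ == "__main__":
    for t in (1.0, 6.0, 9.0, 14.0):
        print(t, phase_at(t))
```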
  • the container is rocked using the above-mentioned control while a moving image of the liquid inside the container is captured.
  • Frame images are extracted from the captured moving image, and object detection is performed on each frame image.
  • Then, for each object detected by this object detection, it is determined whether the object is an air bubble or a foreign body. If there is no object determined to be a foreign body, the inspected item is determined to be a good product, and if there is even one object determined to be a foreign body, it is determined to be a defective product.
  • The device rocks the container according to a specific control pattern, so that the object inside the container (an air bubble or a foreign body) moves along a specific context based on this pattern.
  • For example, the flow of liquid inside the container accelerates for a while after the container starts to rotate, so the object also accelerates along this flow.
  • Also, the object moves in the direction of the container's rotation.
  • After the container stops rotating, the flow rate of the liquid gradually slows down and stabilizes in a steady state.
  • The speed and direction of the object's movement during this time also follow the flow rate and direction of the liquid. The same applies to subsequent controls.
  • the inference in this inference method is to determine whether the object is an air bubble or a foreign object.
  • this inference method makes it possible to perform inference that takes context into account while keeping computational costs down compared to conventional techniques that generate context features from the entire video image.
  • Fig. 4 is a diagram for explaining an overview of this inference method.
  • Prior to execution of this inference method, frame images are first extracted from a video image capturing an object.
  • In the example of Fig. 4, n frame images FR1 to FRn are extracted.
  • In the frame image FR1, a container filled with liquid is captured, as well as objects OB1 and OB2. The same is true for the other frame images.
  • the objects are air bubbles and foreign objects. Since both are small in size and have similar appearances, it is difficult to accurately identify whether a detected object is an air bubble or a foreign object based on only one frame image.
  • trajectory data is generated that indicates the trajectory of the movement of the objects.
  • Fig. 4 shows schematic diagrams of trajectory data A1 of the object OB1 and trajectory data A2 of the object OB2.
  • The trajectory data A1 includes time information indicating the timing at which the frame image in which the object OB1 was detected was shot, and position information indicating the detected position of the object OB1 in the frame image. For example, assume that the object OB1 was detected in each of the frame images FR1 to FR10. In this case, the trajectory data A1 includes time information indicating the shooting timing of each of the frame images FR1 to FR10, and position information indicating the detected position of the object OB1 in the frame images FR1 to FR10.
  • the trajectory data A1 may also include information indicating the characteristics of the detected object OB1.
  • the trajectory data A1 may include an image patch that is an image cut out of an area in the frame image that includes the object OB1, or feature amounts extracted from the image patch, information indicating the size of the detected object, and information indicating the moving speed of the detected object.
  • normalized position information may be used to avoid being affected by differences in container size and liquid volume.
  • normalized position information can be generated by applying at least one of translation, rotation, and scale transformation to position information indicating the position of the object on the frame image.
  • the time information may also be normalized.
  • The time information of the frame image FR1 may be set to 0, the time information of the frame image FRn captured at the timing when a series of control sequences ends may be set to 1, and the time information of the frame images FR2 to FRn-1 may be determined based on these values.
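  • The following sketch illustrates one possible way to hold such trajectory data and to apply the normalization described above; the data layout, field names, and transformation parameters are illustrative assumptions, not the format used by the embodiment.

```python
# Hypothetical sketch of trajectory data for one detected object, with normalization of
# position information (translation / rotation / scale) and of time information (mapped to [0, 1]).
from dataclasses import dataclass, field
from typing import List, Tuple
import math

@dataclass
class Trajectory:
    times: List[float]                    # shooting timing of each frame in which the object was detected
    positions: List[Tuple[float, float]]  # detected (x, y) position of the object in each frame
    patches: List[object] = field(default_factory=list)  # optional: image patches or other appearance features

def normalize_positions(traj, center=(0.0, 0.0), angle=0.0, scale=1.0):
    """Apply translation, rotation and scale so results do not depend on container size or liquid volume."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for x, y in traj.positions:
        x, y = x - center[0], y - center[1]                   # translation
        x, y = cos_a * x - sin_a * y, sin_a * x + cos_a * y   # rotation
        out.append((x / scale, y / scale))                    # scale transformation
    return out

def normalize_times(traj, t_start, t_end):
    """Map shooting times so that the start of the control sequence is 0 and its end is 1."""
    return [(t - t_start) / (t_end - t_start) for t in traj.times]

# usage example with made-up values
traj = Trajectory(times=[0.0, 0.5, 1.0], positions=[(120, 80), (125, 78), (131, 77)])
print(normalize_positions(traj, center=(100, 100), scale=50.0))
print(normalize_times(traj, t_start=0.0, t_end=10.0))
```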
  • Similarly, trajectory data A2 is generated, which includes time information indicating the timing at which the frame image in which the object OB2 was detected was captured, and position information indicating the detected position of the object OB2 in the frame image.
  • the trajectory data A2 may also include information indicating the characteristics of the detected object OB2.
  • The trajectory data A1 and A2 may differ from each other. Also, only the trajectory data A1 and A2 for the two objects OB1 and OB2 are shown here, but a larger number of objects may be detected in an actual foreign object confirmation inspection.
  • the time information and position information contained in each of the trajectory data A1 and A2 are input into a feature generation model to generate feature quantities B1 and B2 according to the context.
  • These feature quantities are generated for each time. For example, a feature quantity at time t1 is generated from the time information and position information at that time t1.
  • A previously generated judgment function may be used to judge whether the detected position in the frame image is inside or outside the liquid region, and a value indicating the judgment result may be included in the feature quantities B1 and B2. This makes it possible to eliminate the influence of objects and the like detected outside the liquid region and obtain valid inference results.
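  • A minimal sketch of such a judgment function is shown below, assuming for illustration a circular liquid region in normalized coordinates; the region model and its parameters are placeholders.

```python
# Hypothetical sketch: a pre-generated judgment function that decides whether a detected
# position lies inside the liquid region, so that the result can be appended to the feature.
def inside_liquid_region(x: float, y: float,
                         center=(0.5, 0.5), radius=0.45) -> float:
    """Return 1.0 if (x, y) (normalized coordinates) falls inside an assumed circular liquid region."""
    dx, dy = x - center[0], y - center[1]
    return 1.0 if (dx * dx + dy * dy) ** 0.5 <= radius else 0.0

def append_region_flag(feature: list, x: float, y: float) -> list:
    """Append the judgment result as an extra dimension of the feature (B1 / B2 in the text)."""
    return feature + [inside_liquid_region(x, y)]

print(append_region_flag([0.2, -0.1], 0.5, 0.6))    # inside the assumed region -> flag 1.0
print(append_region_flag([0.2, -0.1], 0.98, 0.98))  # outside the assumed region -> flag 0.0
```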
  • the feature generation model makes it possible to generate features according to the context using time and position information, which are significantly smaller in data size than video images.
  • the trajectory data and features generated as described above are integrated to generate integrated data, and the integrated data is input to the inference model to output the inference result.
  • the inference result indicates whether each of the objects OB1 and OB2 is an air bubble or a foreign body.
  • the objects OB1 and OB2 detected in the above-mentioned foreign body confirmation inspection are both small in size and similar in appearance. For this reason, it is difficult to accurately determine whether a detected object is an air bubble or a foreign body.
  • this inference method uses integrated data that reflects features generated using a feature generation model to perform estimation that takes into account the context, making it possible to perform difficult estimations with high accuracy.
  • Fig. 5 is a block diagram showing an example of the configuration of the information processing device 3.
  • the information processing device 3 includes a control unit 30 that controls each unit of the information processing device 3, and a storage unit 31 that stores various data used by the information processing device 3.
  • the information processing device 3 also includes a communication unit 32 that allows the information processing device 3 to communicate with other devices, an input unit 33 that accepts input of various data to the information processing device 3, and an output unit 34 that allows the information processing device 3 to output various data.
  • the control unit 30 also includes an object detection unit 301, a trajectory data generation unit 302, a feature generation unit 303, an integration unit 304, an inference unit 305, a learning unit 306, a difference identification unit 307, an adjustment unit 308, and a similarity calculation unit 309.
  • the memory unit 31 stores trajectory data 311, a feature generation model 312, an inference model 313, and teacher data 314.
  • The learning unit 306, the similarity calculation unit 309, and the teacher data 314 will be described later in the section "About learning", and the difference identification unit 307 and the adjustment unit 308 will be described later in the section "About absorbing context differences".
  • The object detection unit 301 detects a specific object from each of multiple frame images extracted from a video. If the target video has been shot at a high frame rate, the object detection unit 301 can detect the object from each frame image with relatively lightweight image processing by utilizing the positional continuity. There are no particular limitations on the method of detecting the object. For example, the object detection unit 301 may detect the object using a detection model that has been trained to detect the object using an image of the object as training data. There are no particular limitations on the algorithm of the detection model. For example, the object detection unit 301 may use a detection model such as a convolutional neural network, a recurrent neural network, or a transformer, or a detection model that combines a plurality of these.
  • the trajectory data generation unit 302 generates trajectory data indicating the trajectory of the movement of an object based on the detection result of the object from multiple frame images by the object detection unit 301.
  • the trajectory data includes time information indicating the timing when the frame image in which the object is detected was captured, and position information indicating the detected position of the object in the frame image, and may also include image patches or the like as information indicating the characteristics of the detected object.
  • the time information may also include information indicating the time difference between the frame images.
  • the generated trajectory data is stored in the memory unit 31 as trajectory data 311.
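  • One simple way to build such trajectory data is to link detections in consecutive frames by nearest-neighbour association, as in the sketch below; this association rule and its distance threshold are illustrative assumptions, not the method actually used by the trajectory data generation unit 302.

```python
# Hypothetical sketch: linking per-frame detections into trajectories by nearest-neighbour
# association between consecutive frames. Each detection is (x, y); each frame has a time stamp.
from math import hypot

def link_detections(frame_times, detections_per_frame, max_dist=0.1):
    """detections_per_frame[i] is a list of (x, y) detections for the frame captured at frame_times[i].
    Returns a list of trajectories, each a list of (t, x, y)."""
    trajectories = []   # all trajectories built so far
    active = []         # indices of trajectories that may still be extended
    for t, dets in zip(frame_times, detections_per_frame):
        used = set()
        next_active = []
        for ti in active:
            last_t, last_x, last_y = trajectories[ti][-1]
            # pick the closest unused detection in the current frame
            best, best_d = None, max_dist
            for j, (x, y) in enumerate(dets):
                d = hypot(x - last_x, y - last_y)
                if j not in used and d < best_d:
                    best, best_d = j, d
            if best is not None:
                used.add(best)
                trajectories[ti].append((t, dets[best][0], dets[best][1]))
                next_active.append(ti)
        for j, (x, y) in enumerate(dets):
            if j not in used:                       # start a new trajectory for unmatched detections
                trajectories.append([(t, x, y)])
                next_active.append(len(trajectories) - 1)
        active = next_active
    return trajectories

print(link_detections([0.0, 0.1], [[(0.2, 0.2)], [(0.22, 0.21), (0.8, 0.8)]]))
```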
  • the feature generation unit 303 generates features according to the context of an object appearing in each of a plurality of frame images extracted from a video capturing an object moving along a specific context, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image. Specifically, the feature generation unit 303 generates features by inputting the time information and position information indicated in the trajectory data 311 to the feature generation model 312.
  • the feature generation model 312 is a trained model that has been trained to generate features according to a context. More specifically, the feature generation model 312 is generated by learning the relationship between time information indicating the timing at which an object moving in accordance with a specified context was photographed, and position information indicating the detected position of the object in the image photographed at that timing, and the feature of the object at that timing.
  • the feature generation model 312 may be a function that uses the above-mentioned feature as a response variable and time information and position information as explanatory variables.
  • the context that influences the movement of the object used in learning may be the same as or similar to the context that influences the movement of the target object that is the subject of inference.
  • the algorithm of the feature generation model 312 is not particularly limited.
  • the feature generation model 312 may be a model such as a convolutional neural network, a recurrent neural network, or a transformer, or may be a model that combines two or more of these.
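  • As a concrete illustration only, the sketch below defines such a feature generation model as a small multilayer perceptron that maps time and position information for one frame to a feature vector; the framework (PyTorch), layer sizes, and feature dimension are assumptions, and the embodiment may equally use a recurrent network or a transformer.

```python
# Hypothetical sketch of a feature generation model: a small MLP that maps
# (time information, position information) -> a feature vector per time step.
import torch
import torch.nn as nn

class FeatureGenerationModel(nn.Module):
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 32),            # input: (t, x, y) for one frame
            nn.ReLU(),
            nn.Linear(32, 32),
            nn.ReLU(),
            nn.Linear(32, feature_dim),  # output: feature according to the context at that time
        )

    def forward(self, time_info: torch.Tensor, position_info: torch.Tensor) -> torch.Tensor:
        # time_info: (N, 1), position_info: (N, 2); N = number of frames in which the object was detected
        return self.net(torch.cat([time_info, position_info], dim=-1))

# usage example with made-up trajectory values
model = FeatureGenerationModel()
t = torch.tensor([[0.0], [0.1], [0.2]])
pos = torch.tensor([[0.40, 0.55], [0.42, 0.53], [0.45, 0.52]])
features = model(t, pos)   # one feature vector per time step
print(features.shape)      # torch.Size([3, 16])
```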
  • the integration unit 304 integrates the trajectory data 311 generated by the trajectory data generation unit 302 and the features generated by the feature generation unit 303 to generate integrated data.
  • the integration method is not particularly limited.
  • the integration unit 304 may combine the feature as an additional dimension with respect to each time component in the trajectory data 311 (specifically, position information or an image patch associated with one piece of time information, etc.).
  • the integration unit 304 may generate integrated data reflecting the feature by adding the feature to each time component or multiplying the feature by each time component.
  • the integration unit 304 may reflect the feature in each time component by an attention mechanism.
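  • The sketch below illustrates, with placeholder tensor shapes, the three integration options mentioned above: concatenating the feature as additional dimensions, adding it to each time component, and reflecting it through a simple attention weighting.

```python
# Hypothetical sketch of the integration unit: combining per-time trajectory components
# with per-time features by concatenation, addition, or a simple attention weighting.
import torch
import torch.nn.functional as F

N, D_TRAJ, D_FEAT = 3, 4, 4            # placeholder sizes: frames, trajectory dims, feature dims
trajectory = torch.randn(N, D_TRAJ)    # e.g. (x, y, speed, size) per time step
features = torch.randn(N, D_FEAT)      # context features generated per time step

# (a) concatenate the feature as additional dimensions of each time component
integrated_concat = torch.cat([trajectory, features], dim=-1)        # (N, D_TRAJ + D_FEAT)

# (b) add the feature to each time component (requires matching dimensions)
integrated_add = trajectory + features                               # (N, D_TRAJ)

# (c) reflect the feature in each time component via a simple attention mechanism:
# weights computed from feature/trajectory similarity re-weight the trajectory components
attn = F.softmax(features @ trajectory.T / D_TRAJ ** 0.5, dim=-1)    # (N, N)
integrated_attn = attn @ trajectory                                  # (N, D_TRAJ)

print(integrated_concat.shape, integrated_add.shape, integrated_attn.shape)
```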
  • the inference unit 305 performs a predetermined inference regarding the object based on the features generated by the feature generation unit 303. Specifically, the inference unit 305 inputs integrated data reflecting the features generated by the feature generation unit 303 to the inference model 313, thereby obtaining an inference result, i.e., a determination result as to whether the object is an air bubble or a foreign object.
  • the inference model 313 is a model generated by using integrated data generated from an image showing air bubbles or foreign objects to learn whether an object shown in the image is an air bubble or a foreign object.
  • the algorithm of the inference model 313 is not particularly limited.
  • The inference model 313 may be a model such as a convolutional neural network, a recurrent neural network, or a transformer, or may be a model that combines two or more of these.
  • the information processing device 3 includes a feature generating unit 303 that generates features according to the context of an object appearing in a frame image using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, and an inference unit 305 that performs a predetermined inference regarding the object based on the features generated by the feature generating unit 303.
  • This provides the effect of being able to perform inference that takes into account the context while suppressing calculation costs, similar to the information processing device 2 according to the exemplary embodiment 1.
  • Note that still images in a time series obtained by continuous capturing of still images are also included in the category of "a plurality of frame images extracted from a video".
  • the feature generation unit 303 may generate features using a feature generation model 312 that has learned the relationship between time information indicating the timing at which an object moving along a context that is the same as or similar to the context in which the target object moves was photographed, position information indicating the detected position of the object in the image photographed at that timing, and features according to the context of the object at that timing. This provides the effect of being able to generate appropriate features based on the learning results, in addition to the effect provided by the information processing device 2 according to exemplary embodiment 1.
  • The information processing device 3 includes a trajectory data generation unit 302 that generates trajectory data 311 indicating the trajectory of the movement of an object based on the detection result of the object from a plurality of frame images, and an integration unit 304 that integrates the trajectory data 311 and the feature amount generated by the feature amount generation unit 303 to generate integrated data. The feature amount generation unit 303 generates the feature amount using the position information and time information extracted from the trajectory data 311, and the inference unit 305 performs inference using the integrated data.
  • (About learning) The teacher data 314 is data in which information indicating whether an object is an air bubble or a foreign object is associated, as correct answer data, with the trajectory data of the object.
  • the teacher data 314 may also include multiple frame images that are the source of the trajectory data.
  • the feature generation unit 303 generates features from the time information and position information included in the trajectory data, and the integration unit 304 integrates the generated features with the trajectory data to generate integrated data.
  • the inference unit 305 then performs inference using the integrated data, thereby obtaining an inference result based on the trajectory data included in the teacher data 314.
  • the learning unit 306 updates the inference model 313 and the feature generation model 312 so that the result of inference based on the trajectory data included in the teacher data 314 approaches the predetermined correct answer data indicated in the teacher data 314.
  • the learning unit 306 may use a gradient descent method to update each of the inference model 313 and the feature generation model 312 so as to minimize a loss function that is the sum of the errors between the inference result and the correct answer data.
  • In this inference method, video images and frame images are not used as they are to generate features, but frame images may be used during learning.
  • The similarity between frame images may be used to update the feature generation model 312.
  • The similarity calculation unit 309 calculates the similarity between frame images, and is used when the similarity is reflected in updating the feature generation model 312.
  • In this case, the learning unit 306 updates the feature generation model 312 so that the similarity between multiple frame images is reflected in the similarity between features generated by the feature generation model 312 for the frame images. For example, by adding a regularization term to the loss function described above, the feature generation model 312 can be updated so that the similarity of the features becomes closer to the similarity between the frame images.
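  • A rough sketch of such a loss is shown below, assuming PyTorch and cosine similarity between features; the task loss, similarity measures, and weighting are placeholders rather than the loss actually used by the learning unit 306.

```python
# Hypothetical sketch: loss with a regularization term so that the similarity between
# frame images is reflected in the similarity between the generated features.
import torch
import torch.nn.functional as F

def training_loss(logits, labels, features, frame_similarity, reg_weight=0.1):
    """logits: (B, 2) bubble/foreign-object scores, labels: (B,) correct answer data,
    features: (B, D) generated features, frame_similarity: (B, B) precomputed image similarities."""
    task_loss = F.cross_entropy(logits, labels)              # error between inference result and correct answer data
    feat = F.normalize(features, dim=-1)
    feature_similarity = feat @ feat.T                       # cosine similarity between generated features
    reg = F.mse_loss(feature_similarity, frame_similarity)   # pull feature similarity toward frame-image similarity
    return task_loss + reg_weight * reg

# usage example with made-up tensors (requires_grad so the loss can be backpropagated)
B, D = 4, 16
logits = torch.randn(B, 2, requires_grad=True)
features = torch.randn(B, D, requires_grad=True)
loss = training_loss(logits, torch.randint(0, 2, (B,)), features, torch.rand(B, B))
loss.backward()   # during training, gradients would flow back into the feature generation model
```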
  • (About absorbing context differences) As mentioned above, the context that influences the movement of the object used in learning may be the same as or similar to the context that influences the movement of the target object that is the subject of inference. Furthermore, the contexts need only be at least partially the same or similar, and do not have to be entirely the same or similar.
  • the difference identification unit 307 and adjustment unit 308 are used when there is a difference between the context of the movement of the object used in learning and the context of the movement of the target object that is the subject of inference.
  • the adjustment unit 308 adjusts at least one of the time information and the position information used to generate the features so as to absorb the difference between the context of the movement of the object used in learning and the context of the movement of the target object to be inferred.
  • The difference identification unit 307 identifies the difference between the context in which the object used for learning moves and the context in which the target object moves, based on the difference between the environment surrounding the object used for learning and the environment surrounding the target object.
  • The difference identification unit 307 may also identify the difference between the context in which the object used for learning moves and the context in which the target object moves, based on the difference between the object used for learning and the target object to be inferred.
  • FIG. 6 is a diagram showing an example in which a difference occurs between contexts.
  • In the example EX1 shown in FIG. 6, as in FIG. 3, it is assumed that a foreign object confirmation inspection is performed using a control sequence of rotation, rest, and rotation, and the controls included in each control sequence during learning and inference and their execution order are the same. However, the rest period during inference is shorter than during learning. More specifically, in the example EX1, during both learning and inference, rotation starts at time t1 and ends at time t2 to enter a rest state, and the movement of the liquid in the container becomes steady at time t3.
  • During learning, the second rotation starts at time t4 and ends at time t5.
  • During inference, on the other hand, the second rotation starts at time t4', which is Δt earlier than time t4.
  • The time at which the second rotation ends is also Δt earlier than time t5, at time t5'.
  • In this case, the adjustment unit 308 performs an adjustment by adding Δt to the time indicated in each piece of time information corresponding to the period from time t4' to time t5' among the time information used to generate the features. This makes it possible to absorb the difference in context between learning and inference. Note that, contrary to example EX1, if the rest period during inference is made longer by Δt than during learning, the adjustment unit 308 can perform an adjustment by subtracting Δt from the time indicated in each piece of time information after the end of the rest period.
  • the adjustment unit 308 can absorb the difference in context by adjusting the time information accordingly.
  • the adjustment unit 308 can adjust the position information used to generate features so as to absorb the difference.
  • For example, the adjustment unit 308 may absorb the difference between contexts by performing a left-right inversion process on the position information of the target object. For example, when coordinate values are used as position information, the adjustment unit 308 may perform a conversion that inverts left and right about a specified axis for each coordinate value during the period in which the pattern of movement is inverted left-right. Also, for example, when the pattern of movement of the target object and the pattern of movement of the object used during learning are in a rotationally symmetric relationship, the adjustment unit 308 may absorb the difference between contexts by rotationally transforming the position information of the target object. Note that the adjustment unit that adjusts the time information and the adjustment unit that adjusts the position information may be provided as separate blocks.
  • the time information corresponding to each frame image used during inference may differ from the time information corresponding to each frame image used for learning.
  • For example, assume that the time of the frame image at the start of the first rotation among the frame images used for learning is t1.
  • In this case, if the time of the frame image at the start of the first rotation among the frame images used for inference is t1' (t1' < t1), a difference of (t1 - t1') will occur between the contexts.
  • the adjustment unit 308 can perform an adjustment by adding the value of (t1 - t1') to the time indicated in each piece of time information used for inference. Furthermore, if the time of the frame image at the start of the first rotation among the frame images used in inference is t1" (t1" > t1), the adjustment unit 308 can perform an adjustment by subtracting the value of (t1" - t1) from the time indicated in each piece of time information used in inference. By making such an adjustment, the relationship between the time indicated in the time information and the control timing can be aligned with that during learning, thereby allowing the feature generation model 312 to output appropriate features.
  • factors that cause differences in context are not limited to control sequences.
  • differences between contexts can arise when there is a difference between the object used in learning and the subject of inference, or when there is a difference between the environment surrounding the object used in learning and the environment surrounding the subject of inference.
  • The difference identification unit 307 identifies the difference between the context in which the object used in learning moves and the context in which the target object moves, based on at least one of the differences described above, i.e., the difference between the object used in learning and the target object to be inferred, and the difference between the environment surrounding the object used in learning and the environment surrounding the target object. Therefore, according to the information processing device 3 of this exemplary embodiment, in addition to the effect achieved by the information processing device 2 of exemplary embodiment 1, the effect of being able to automatically identify context differences can be obtained. Furthermore, since the information processing device 3 is equipped with the adjustment unit 308, it is possible to cause the adjustment unit 308 to make adjustments to absorb the differences identified by the difference identification unit 307.
  • example EX2 in Figure 6 shows a case where the viscosity of the liquid sealed in the container is different during learning and inference in a foreign object confirmation inspection that is performed in a sequence of rotation, rest, and rotation. That is, in example EX2, the environment surrounding the object used in learning is different from the environment surrounding the target object. Specifically, the liquid in the container used in inference has a higher viscosity than the liquid used in learning, and therefore, during inference, the time from when the container is brought to rest until the inside of the container becomes steady is shorter than during learning. In other words, the time t3' when the container becomes steady during inference is earlier than the time t3 when the container becomes steady during learning (t3 > t3').
  • the difference identification unit 307 identifies the time t3' at which the liquid stabilized, which is the difference between the contexts at the time of learning and the time of inference, based on the viscosity of the liquid in the container used for inference. If the relationship between the viscosity and the time required for the liquid to stabilize is identified and modeled in advance, the difference identification unit 307 can identify the time t3' using the model and the viscosity of the liquid in the container used for inference.
  • the adjustment unit 308 absorbs the above-mentioned difference by adjusting the time information based on the result of the identification by the difference identification unit 307. Specifically, the adjustment unit 308 performs an adjustment by adding the value of (t3-t3') to the time indicated in each piece of time information from time t3' to time t4.
  • the adjustment unit 308 may adjust all times in the steady state to the same value. In this case, the adjustment unit 308 may replace the times indicated in each piece of time information from time t3' to t4 with, for example, time t3. In this way, the adjustment unit 308 may set the time information for a period in which the context is constant during inference to a constant value. Furthermore, the constant value may be selected from the time values for a period in which an object moved according to a context that was the same as or similar to the above context during learning (the period from time t3 to t4 in the above example).
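  • The sketch below illustrates, with placeholder time values, the kinds of time-information adjustments described for examples EX1 and EX2: shifting the times in a given period by a fixed offset, and replacing the times in a steady-state period with a constant value.

```python
# Hypothetical sketch of the adjustment unit: absorbing context differences by adjusting time information.
def shift_times(times, start, end, delta):
    """Add delta to every time in [start, end] (e.g. +dt for the period from t4' to t5' in example EX1)."""
    return [t + delta if start <= t <= end else t for t in times]

def clamp_steady_period(times, steady_start, steady_end, constant):
    """Replace times in a period where the context is constant (steady state) with a constant value,
    e.g. replace times in [t3', t4] with t3 as in example EX2."""
    return [constant if steady_start <= t <= steady_end else t for t in times]

# usage example with made-up times (seconds)
times = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(shift_times(times, start=6.0, end=10.0, delta=1.5))        # align a rotation that started early
print(clamp_steady_period(times, steady_start=4.0, steady_end=8.0, constant=4.0))
```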
  • The information processing device 3 includes an adjustment unit 308 that adjusts at least one of the time information and the position information used to generate features so as to absorb the difference between the context of the movement of the object used in learning and the context of the movement of the target object to be inferred. Therefore, according to the information processing device 3 according to this exemplary embodiment, even if there is a difference between the context of the movement of the target object and the context of the movement of the object used in learning, it is possible to obtain the effect that appropriate features can be generated using the same feature generation model 312.
  • (Learning process flow) FIG. 7 is a flow diagram showing the flow of processing performed by the information processing device 3 during learning. Note that, when learning is performed, the teacher data 314 and the feature generation model 312 are stored in advance in the storage unit 31.
  • the feature generation model 312 stored in the storage unit 31 may have parameters in an initial state, or may be a model in which learning has progressed to a certain degree.
  • In S31, the learning unit 306 acquires the teacher data 314 stored in the storage unit 31.
  • the teacher data 314 is data in which information indicating whether an object is an air bubble or a foreign object is associated as correct answer data with respect to the trajectory data of the object.
  • the teacher data 314 also includes multiple frame images that are the basis of the trajectory data.
  • In S32, the feature generation unit 303 uses the teacher data 314 acquired in S31 to generate features according to the context of the object used in learning. More specifically, the feature generation unit 303 generates features by inputting the time information and position information indicated in the trajectory data included in the teacher data 314 acquired in S31 to the feature generation model 312.
  • In S33, the integration unit 304 integrates the trajectory data included in the teacher data 314 acquired in S31 with the features generated in S32 to generate integrated data. Then, in S34, the inference unit 305 performs a predetermined inference using the integrated data generated in S33. Specifically, the inference unit 305 inputs the integrated data generated in S33 to the inference model 313 to obtain an inference result, i.e., a determination result as to whether the object is an air bubble or a foreign object.
  • In S35, the similarity calculation unit 309 calculates the similarity between the frame images included in the teacher data 314 acquired in S31.
  • the frame images may be clipped from around a location corresponding to the position information indicated in the trajectory data.
  • the similarity calculation unit 309 may calculate the similarity for each combination of multiple frame images (corresponding to one trajectory data) included in the teacher data 314, or may calculate the similarity for some combinations.
  • The process of S35 only needs to be performed before S36; for example, it may be performed before S32, or in parallel with the processes of S32 to S34.
  • In S36, the learning unit 306 updates the feature generation model 312 so that the result of the inference in S34 approaches the predetermined correct answer data indicated in the teacher data 314.
  • In addition, the learning unit 306 updates the feature generation model 312 so that the similarity between the frame images calculated in S35 is reflected in the similarity between the features generated by the feature generation model 312 for the frame images.
  • In S37, the learning unit 306 determines whether or not to end learning.
  • the condition for ending learning may be determined in advance, and learning may end, for example, when the number of updates of the feature generation model 312 reaches a predetermined number. If the learning unit 306 determines NO in S37, it returns to the process of S31 and acquires new teacher data 314. On the other hand, if the learning unit 306 determines YES in S37, it stores the updated feature generation model 312 in the memory unit 31 and ends the process of FIG. 7.
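  • Assuming the PyTorch-style placeholders used in the earlier sketches, the learning flow of S31 to S37 could be outlined as below; the models, teacher data, and stopping condition are stand-ins rather than the concrete implementation of the information processing device 3.

```python
# Hypothetical outline of the learning flow (S31 to S37), with placeholder models and data.
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 16))        # feature generation model 312
inference_model = nn.Sequential(nn.Linear(3 + 16, 32), nn.ReLU(), nn.Linear(32, 2))  # inference model 313
optimizer = torch.optim.SGD(list(feature_model.parameters()) + list(inference_model.parameters()), lr=0.01)

def get_teacher_batch(batch=8, frames=5):
    """S31: acquire teacher data 314 (placeholder random data: trajectories plus bubble/foreign labels)."""
    traj = torch.rand(batch, frames, 3)              # (t, x, y) per frame
    labels = torch.randint(0, 2, (batch,))           # 0 = air bubble, 1 = foreign object
    frame_sim = torch.rand(batch, frames, frames)    # similarity between frame images (for S35)
    return traj, labels, frame_sim

for step in range(100):                              # S37: end after a predetermined number of updates
    traj, labels, frame_sim = get_teacher_batch()    # S31
    feats = feature_model(traj)                      # S32: features from time and position information
    integrated = torch.cat([traj, feats], dim=-1)    # S33: integrate trajectory data and features
    logits = inference_model(integrated).mean(dim=1) # S34: inference (pooled over frames)
    f = F.normalize(feats, dim=-1)
    feat_sim = f @ f.transpose(1, 2)                 # similarity between per-frame features (for S36)
    loss = F.cross_entropy(logits, labels) + 0.1 * F.mse_loss(feat_sim, frame_sim)  # S36
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

torch.save(feature_model.state_dict(), "feature_generation_model.pt")  # store the updated model
```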
  • Note that, instead of acquiring the teacher data 314 in S31, a video image showing a predetermined object (specifically, at least one of an air bubble and a foreign object) or frame images extracted from the video image may be acquired.
  • the object detection unit 301 detects the object from the acquired frame image, and the trajectory data generation unit 302 generates trajectory data 311 of the detected object.
  • the teacher data 314 is generated by labeling this trajectory data 311 with the correct answer data.
  • the processing after the teacher data 314 is generated is the same as the processing from S32 onwards described above.
  • the method for generating the feature generation model 312 includes: for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, inputting time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into the feature generation model 312 for generating features according to the context, performing a predetermined inference regarding the object based on the calculated features (S34); and updating the feature generation model 312 so that the result of the inference approaches predetermined correct answer data (S36).
  • the above configuration makes it possible to generate a feature generation model 312 capable of generating features according to a context.
  • This makes it possible to generate a feature generation model capable of generating features according to a context from time information and location information, and has the effect of making it possible to perform inference that takes into account the context while suppressing computational costs.
  • the method for generating the feature generation model 312 includes calculating the similarity between a plurality of frame images (S35), and in updating the feature generation model 312, the feature generation model 312 is updated so that the similarity between a plurality of frame images is reflected in the similarity between the features generated by the feature generation model 312 for the frame images. Since similar frame images are considered to have similar contexts, the above configuration makes it possible to generate a feature generation model 312 that can generate more valid features that take into account the similarity between frame images.
  • Fig. 8 is a flow diagram showing the flow of processing (inference method) performed by the information processing device 3 during inference.
  • Fig. 8 shows processing after a plurality of frame images extracted from a moving image to be inferred are input to the information processing device 3.
  • the moving image shows an object to be determined as being an air bubble or a foreign object.
  • the information processing device 3 may also perform the processing of extracting frame images from the moving image.
  • the object detection unit 301 detects an object from each of the frame images. Then, in S42, the trajectory data generation unit 302 generates trajectory data 311 indicating the trajectory of the object movement based on the object detection result in S41. Note that the following describes the process when one object and one trajectory data 311 indicating the trajectory of the object movement are generated. When multiple trajectory data 311 are generated, the processes of S43 to S47 described below are performed for each trajectory data 311.
  • In S43, the difference identification unit 307 identifies the difference between the context in which the object used in learning moves and the context in which the target object moves, based on at least one of the difference between the object used in learning and the target object to be inferred, and the difference between the environment surrounding the object used in learning and the environment surrounding the target object. For example, if the viscosity of the liquid sealed in the container is different during learning and during inference, the difference identification unit 307 may calculate the time at which the liquid in the container becomes steady based on the difference in the viscosity of the liquid, and calculate the difference between this time and the time at which the liquid in the container became steady during learning.
  • In S44, the adjustment unit 308 adjusts at least one of the time information and the position information used to generate the features so as to absorb the difference between the contexts identified in S43. For example, if a difference in the time at which the liquid becomes steady is calculated as described above in S43, the adjustment unit 308 adjusts the time information so as to absorb the time difference. Note that if there is no difference between the contexts, the processes of S43 and S44 are omitted. Furthermore, if at least one of the time information and the position information was normalized during learning, the adjustment unit 308 similarly normalizes the time information and the position information used to generate the features.
  • In S45, the feature generation unit 303 generates features according to the context. Specifically, for each of a plurality of frame images corresponding to one piece of trajectory data 311, the feature generation unit 303 extracts position information and time information of the object appearing in the frame image from the trajectory data 311. The feature generation unit 303 then inputs the extracted time information and position information into the feature generation model 312 to generate features. As a result, for each frame image, features according to the context of the object appearing in the frame image are generated.
  • In S46, the integration unit 304 integrates the trajectory data 311 generated in S42 with the features generated in S45 to generate integrated data. Then, in S47, the inference unit 305 performs a predetermined inference regarding the object based on the features generated in S45. Specifically, the inference unit 305 obtains an inference result by inputting the integrated data reflecting the features generated in S45 to the inference model 313, and the processing in FIG. 8 ends. Note that the inference unit 305 may output the inference result via the output unit 34 or the like, or may store it in the storage unit 31 or the like.
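  • Under the same placeholder assumptions, the inference flow of S41 to S47 could be sketched as follows; object detection, context-difference adjustment, and the trained models are stubbed out for illustration.

```python
# Hypothetical outline of the inference flow (S41 to S47) with stubbed detection and trained models.
import torch
import torch.nn as nn

feature_model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 16))     # trained model 312 (placeholder)
inference_model = nn.Sequential(nn.Linear(19, 32), nn.ReLU(), nn.Linear(32, 2))   # trained model 313 (placeholder)

def detect_objects(frame_images):
    """S41: object detection stub; returns one (x, y) detection per frame."""
    return [(0.4 + 0.01 * i, 0.5) for i, _ in enumerate(frame_images)]

def run_inference(frame_images, frame_times, time_offset=0.0):
    detections = detect_objects(frame_images)                                        # S41
    traj = torch.tensor([[t, x, y] for t, (x, y) in zip(frame_times, detections)])   # S42: trajectory data 311
    traj[:, 0] += time_offset                                                        # S43/S44: absorb context difference (time shift)
    with torch.no_grad():
        feats = feature_model(traj)                                                  # S45: features according to the context
        integrated = torch.cat([traj, feats], dim=-1)                                # S46: integrated data
        logits = inference_model(integrated).mean(dim=0)                             # S47: inference over the whole trajectory
    return "foreign object" if logits.argmax().item() == 1 else "air bubble"

print(run_inference(frame_images=[None] * 5, frame_times=[0.0, 0.1, 0.2, 0.3, 0.4]))
```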
  • the inference method includes generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, features corresponding to the context of the object appearing in the frame images using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured (S45), and making a predetermined inference about the object based on the generated features (S47).
  • This has the effect of making it possible to perform inference that takes into account the context while keeping computational costs down.
  • The specifics of each process described in the above exemplary embodiments are arbitrary and are not limited to the above examples.
  • an information processing system having the same functions as the information processing devices 1 to 3 can be constructed by a plurality of devices that can communicate with each other.
  • the process in the flow chart of FIG. 7 and the process in the flow chart of FIG. 8 may be executed by different information processing devices (or processors).
  • each process in the flow chart shown in FIG. 7 or FIG. 8 can be shared and executed by a plurality of information processing devices (or processors).
  • the content of the predetermined inference executed by the inference units 11, 22, and 305 is not particularly limited as long as it is related to the object.
  • it may be prediction, conversion, etc.
  • the factors that give rise to a context are also arbitrary.
  • For example, according to the information processing device 2 or 3, it is possible to perform inference that takes into account the context for an object that moves in accordance with a context that arises due to various devices whose operations change at a predetermined cycle, natural phenomena that change at a predetermined cycle, or the like.
  • Furthermore, according to the information processing device 1 or 3, it is possible to generate a feature generation model that makes it possible to perform inference that takes into account the above-mentioned context.
  • the movement of moving objects (vehicles, people, etc.) around a traffic light is affected by the periodic light emission control of the traffic light.
  • the moving objects move according to the context resulting from the light emission control of the traffic light.
  • the information processing device 1 or 3 performs a predetermined inference regarding the moving object based on the feature calculated by inputting time information and position information into the feature generation model for each of a plurality of frame images extracted from a video image capturing a moving object moving along the context, and updates the feature generation model so that the inference result approaches predetermined correct answer data.
  • the information processing device 2 or 3 performs a predetermined inference regarding the moving object based on the feature generated using the feature generation model thus generated, thereby obtaining a highly valid inference result that takes the context into account.
  • the content of the inference is not particularly limited, and may be, for example, a position prediction of the moving object after a predetermined time, a behavior classification of the moving object, or detection of abnormal behavior of the moving object. It is preferable that these inferences also take into account interactions between vehicles and pedestrians, between vehicles, etc.
  • Some or all of the functions of the information processing devices 1 to 3 may be realized by hardware such as an integrated circuit (IC chip), or may be realized by software.
  • information processing devices 1 to 3 are realized, for example, by a computer that executes instructions of a program (inference program/learning program), which is software that realizes each function.
  • An example of such a computer (hereinafter referred to as computer C) is shown in Figure 9.
  • Computer C has at least one processor C1 and at least one memory C2.
  • Memory C2 stores program P for operating computer C as any one of information processing devices 1 to 3.
  • processor C1 reads and executes program P from memory C2, thereby realizing the function of any one of information processing devices 1 to 3.
  • the processor C1 may be, for example, a CPU (Central Processing Unit), GPU (Graphics Processing Unit), DSP (Digital Signal Processor), MPU (Micro Processing Unit), FPU (Floating-Point Processing Unit), PPU (Physics Processing Unit), microcontroller, or a combination of these.
  • the memory C2 may be, for example, a flash memory, HDD (Hard Disk Drive), SSD (Solid State Drive), or a combination of these.
  • Computer C may further include a RAM (Random Access Memory) for expanding program P during execution and for temporarily storing various data.
  • Computer C may further include a communications interface for sending and receiving data to and from other devices.
  • Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.
  • the program P can also be recorded on a non-transitory, tangible recording medium M that can be read by the computer C.
  • a recording medium M can be, for example, a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit.
  • the computer C can obtain the program P via such a recording medium M.
  • the program P can also be transmitted via a transmission medium.
  • a transmission medium can be, for example, a communications network or broadcast waves.
  • the computer C can also obtain the program P via such a transmission medium.
  • (Appendix 1) An information processing device comprising: a feature generation means for generating features corresponding to the context of an object appearing in a plurality of frame images extracted from a video capturing an object moving along a specified context, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image; and an inference means for making a specified inference regarding the object based on the features.
  • (Appendix 2) The information processing device described in Appendix 1, wherein the feature generation means generates the feature using a feature generation model that learns the relationship between time information indicating a timing at which an object moving in accordance with a context identical or similar to the context was photographed, position information indicating the detection position of the object in the image photographed at that timing, and feature values corresponding to the context of the object at that timing.
  • (Appendix 3) The information processing device described in Appendix 1, further comprising: a trajectory data generation means for generating trajectory data indicating a trajectory of movement of an object based on a detection result of the object from the plurality of frame images; and an integration means for integrating the trajectory data and a feature generated by the feature generation means to generate integrated data, wherein the feature generation means generates the feature using the position information and the time information extracted from the trajectory data, and the inference means performs the inference using the integrated data.
  • (Appendix 5) An information processing device as described in Appendix 4, comprising a difference identification means for identifying a difference between the context in the movement of the object and the context in the movement of the target object based on at least one of a difference between the object and the target object and a difference between an environment surrounding the object and an environment surrounding the target object.
  • An inference method comprising: at least one processor generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, features of the object appearing in the frame images corresponding to the context, using position information indicating the detection position of the object in the frame images and time information indicating the timing when the frame images were captured; and making a predetermined inference regarding the object based on the features.
  • An inference program that causes a computer to function as a feature generation means that generates features corresponding to the context of an object appearing in a plurality of frame images extracted from a video of an object moving along a specified context, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image, and an inference means that makes a specified inference about the object based on the features.
  • (Appendix 8) A method for generating a feature generation model comprising: at least one processor inputting, for each of a plurality of frame images extracted from a video capturing an object moving in accordance with a predetermined context, time information indicating the timing at which the frame image was captured and positional information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, thereby making a predetermined inference about the object based on the calculated features; and updating the feature generation model so that the result of the inference approaches predetermined correct answer data.
  • (Appendix 9) The method for generating a feature generation model described in Appendix 8, further comprising: at least one processor calculating a similarity between the plurality of frame images; and updating the feature generation model such that the similarity between the plurality of frame images is reflected in a similarity between features generated by the feature generation model for the frame images, in updating the feature generation model.
  • An information processing device including at least one processor, which executes, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a process of generating a feature amount of the object appearing in the frame image according to the context using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image, and a process of making a predetermined inference regarding the object based on the feature amount.
  • the information processing device may further include a memory, and the memory may store an inference program for causing the processor to execute the process of generating the feature amount and the process of performing the predetermined inference.
  • the inference program may also be recorded on a computer-readable, non-transitory, tangible recording medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

In order to enable context-based inference while suppressing calculation cost, this information processing device (2) is provided with: a feature value generation unit (21) that uses time information indicating timing at which a frame image was captured and position information indicating the detection position of an object in the frame image to generate a feature value according to the context; and an inference unit (22) that makes an inference on the basis of the generated feature value.

Description

Information processing device, inference method, inference program, and method for generating feature quantity generation model
 The present invention relates to an information processing device that performs inferences about an object using video images of the object.
 In recent years, by using technologies such as deep learning, it has become possible to perform object detection and object identification (hereafter collectively referred to as inference) in images with a very high degree of accuracy. Research into inference for video images is also progressing.
 One known method for improving the accuracy of inferences made on video images is to perform inferences taking into account the context of the video images. For example, Non-Patent Document 1 below discloses a technology that uses a trained convolutional neural network to extract context features from video images, and then uses the context features to identify the actions of people appearing in the video images.
 However, the above-mentioned conventional techniques have a problem in that the computational costs involved in inputting video images into a convolutional neural network and extracting context features are very high. One aspect of the present invention aims to realize an information processing device or the like that is capable of performing inference that takes into account context while keeping computational costs down.
 An information processing device according to one aspect of the present invention includes a feature generating means for generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference means for making a predetermined inference regarding the object based on the feature.
 An inference method according to one aspect of the present invention includes at least one processor generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, features of the object appearing in the frame images according to the context using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured, and performing a predetermined inference regarding the object based on the features.
 An inference program according to one aspect of the present invention causes a computer to function as a feature generating means for generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference means for making a predetermined inference regarding the object based on the feature.
 A method for generating a feature generation model according to one aspect of the present invention includes: at least one processor inputs, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, thereby performing a predetermined inference regarding the object based on the calculated features; and updating the feature generation model so that the result of the inference approaches predetermined ground truth data.
 According to one aspect of the present invention, it is possible to perform inference that takes into account context while keeping computational costs down.
FIG. 1 is a block diagram showing the configuration of an information processing device according to a first exemplary embodiment of the present invention.
FIG. 2 is a flow diagram showing the flow of a method for generating a feature generation model and an inference method according to the first exemplary embodiment of the present invention.
FIG. 3 is a diagram for explaining a method of inspection for checking for foreign matter.
FIG. 4 is a diagram for explaining an overview of an inference method according to a second exemplary embodiment of the present invention.
FIG. 5 is a block diagram showing an example of the configuration of an information processing device according to the second exemplary embodiment of the present invention.
FIG. 6 is a diagram showing an example in which a difference occurs between contexts during learning and inference.
FIG. 7 is a flowchart showing the flow of processing performed during learning by the information processing device according to the second exemplary embodiment of the present invention.
FIG. 8 is a flowchart showing the flow of processing performed during inference by the information processing device according to the second exemplary embodiment of the present invention.
FIG. 9 is a diagram showing an example of a computer that executes instructions of a program, which is software that realizes the functions of each device according to each exemplary embodiment of the present invention.
 [Example embodiment 1]
 A first exemplary embodiment of the present invention will be described in detail with reference to the drawings. This exemplary embodiment is a basic form of the exemplary embodiments described below. First, information processing devices 1 and 2 according to this exemplary embodiment will be described with reference to Fig. 1. Fig. 1 is a block diagram showing the configuration of the information processing devices 1 and 2.
 (Configuration of information processing device 1)
 As shown in the figure, the information processing device 1 includes an inference unit 11 and a learning unit 12. For each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, the inference unit 11 performs a predetermined inference regarding the object based on features calculated by inputting, into a feature generation model for generating features according to the context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image.
 The learning unit 12 updates the feature generation model so that the result of the inference by the inference unit 11 approaches the predetermined correct answer data.
 In this manner, the information processing device 1 according to this exemplary embodiment includes an inference unit 11 that performs a predetermined inference regarding the object based on features calculated by inputting, for each of a plurality of frame images extracted from a video capturing an object moving according to a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, and a learning unit 12 that updates the feature generation model so that the result of the inference by the inference unit 11 approaches predetermined correct answer data.
 The above configuration makes it possible to generate a feature generation model that can generate features according to the context from time information and location information. This has the effect of making it possible to perform inference that takes into account the context while reducing computational costs compared to conventional techniques that generate context features from the entire video.
 (Configuration of information processing device 2)
 The information processing device 2 includes a feature generating unit 21 and an inference unit 22. For each of a plurality of frame images extracted from a moving image capturing an object moving along a predetermined context, the feature generating unit 21 generates a feature corresponding to the context of the object captured in the frame image by using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image.
 The inference unit 22 performs a predetermined inference regarding the object based on the features generated by the feature generation unit 21.
 In this way, the information processing device 2 according to this exemplary embodiment includes a feature generation unit 21 that generates features according to the context of an object appearing in a plurality of frame images extracted from a video capturing an object moving along a predetermined context, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference unit 22 that performs a predetermined inference regarding the object based on the features generated by the feature generation unit 21.
 An object that moves according to a specific context is affected differently at each point in time based on that context. Therefore, with the above configuration, it is possible to generate features according to the context of the object and perform inference based on those features.
 Furthermore, the data size of location information and time information is significantly smaller than that of video images. Therefore, the information processing device 2 according to this exemplary embodiment has the advantage of being able to perform inference that takes context into account while keeping computational costs down compared to conventional techniques that generate context features from the entire video image.
 (Learning program)
 The functions of the information processing device 1 described above can also be realized by a program. The learning program according to this exemplary embodiment causes a computer to function as an inference unit 11 that performs a predetermined inference regarding the object based on a feature calculated by inputting time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image into a feature generation model for generating a feature according to the context, for each of a plurality of frame images extracted from a video image capturing an object moving along a predetermined context, and a learning unit 12 that updates the feature generation model so that the result of the inference by the inference unit 11 approaches predetermined correct answer data. According to this learning program, a feature generation model capable of generating a feature according to a context from time information and position information can be generated, and thus an effect is obtained in which it is possible to perform inference taking into account the context while suppressing calculation costs.
 (Inference program)
 Similarly, the functions of the information processing device 2 described above can also be realized by a program. The inference program according to this exemplary embodiment causes a computer to function as a feature amount generating unit 21 that generates a feature amount according to the context of an object appearing in a plurality of frame images extracted from a video in which an object moving along a predetermined context is captured, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image, and an inference unit 22 that performs a predetermined inference regarding the object based on the feature amount generated by the feature amount generating unit 21. This inference program provides the effect of being able to perform inference taking into account the context while suppressing calculation costs.
 (Flow of the feature generation model generation method and the inference method)
 The flow of the method for generating a feature generation model and the inference method according to this exemplary embodiment will be described with reference to Fig. 2. Fig. 2 is a flow diagram showing the flow of the method for generating the feature generation model and the inference method. Note that the execution subject of each step in the methods shown in Fig. 2 may be a processor provided in the information processing device 1 or 2, or a processor provided in another device, and the steps may each be executed by processors provided in different devices.
 The flow diagram shown on the left side of FIG. 2 illustrates a method for generating a feature generation model according to this exemplary embodiment. In S11, at least one processor inputs time information indicating the timing at which each of a plurality of frame images extracted from a video of an object moving along a predetermined context was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, and performs a predetermined inference regarding the object based on the calculated features.
 In S12, at least one processor updates the feature generation model so that the result of the inference in S11 approaches the predetermined ground truth data.
 As described above, the method for generating a feature generation model according to this exemplary embodiment includes at least one processor inputting, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, thereby performing a predetermined inference regarding the object based on the calculated features (S11), and updating the feature generation model so that the result of the inference in S11 approaches predetermined ground truth data (S12). Thus, it is possible to generate a feature generation model capable of generating features according to the context from the time information and position information, which has the effect of making it possible to perform inference taking into account the context while suppressing calculation costs.
 On the other hand, the flow diagram shown on the right side of FIG. 2 illustrates an inference method according to this exemplary embodiment. In S21, at least one processor generates, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a feature quantity corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image.
 In S22, at least one processor performs a predetermined inference regarding the object based on the features generated in S21.
 As described above, the inference method according to this exemplary embodiment includes at least one processor generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, feature values of the object appearing in the frame images according to the above-mentioned context using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured (S21), and making a predetermined inference about the object based on the feature values generated in S21 (S22). This provides the effect of being able to make inferences that take into account the context while keeping computational costs down.
 [Exemplary embodiment 2]
 A second exemplary embodiment of the present invention will be described in detail with reference to the drawings. In the following, an example will be described in which an inference method according to this exemplary embodiment (hereinafter, referred to as this inference method) is used in an inspection to check whether a foreign object is present in a liquid (e.g., medicine, beverage, etc.) sealed in a transparent container (hereinafter, referred to as a foreign object confirmation inspection).
 (Method of inspection for foreign matter)
 Prior to the description of this inference method, a method of foreign object inspection will be described with reference to FIG. 3. FIG. 3 is a diagram for explaining the method of foreign object inspection. In the foreign object inspection, a container filled with a predetermined liquid, which is an object to be inspected, is fixed in a device (not shown in FIG. 3), and the device is used to rock the container. A control sequence for rocking the container is determined in advance. For example, the control sequence in the example of FIG. 3 is such that the container is rotated in a vertical plane for a predetermined time, then stopped for a predetermined time, and then rotated in a horizontal plane for a predetermined time. This control sequence may be repeated multiple times.
 In this foreign body inspection, the container is rocked using the above-mentioned control while a moving image of the liquid inside the container is captured. Next, frame images are extracted from the captured moving image, and object detection is performed on each frame image. Then, for each object detected by this object detection, it is determined whether the object is an air bubble or a foreign body, and if there is no object determined to be a foreign body, the inspected item is determined to be a good product, and if there is even one object determined to be a foreign body, it is determined to be a defective product.
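 As a concrete illustration of the pass/fail rule described in the preceding paragraph, the following is a minimal sketch in Python; the function name and the form of the per-object classification labels are assumptions introduced for illustration and are not part of the original disclosure.

```python
from typing import Iterable

def judge_container(object_labels: Iterable[str]) -> str:
    """Apply the pass/fail rule of the foreign-matter confirmation inspection.

    object_labels: classification result for each detected object,
    assumed to be either "bubble" or "foreign_matter".
    Returns "good" when no object is judged to be foreign matter,
    and "defective" when at least one object is.
    """
    for label in object_labels:
        if label == "foreign_matter":
            return "defective"
    return "good"

# Usage example with hypothetical labels:
print(judge_container(["bubble", "bubble"]))          # -> "good"
print(judge_container(["bubble", "foreign_matter"]))  # -> "defective"
```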
 In the foreign body inspection described above, the device rocks the container by controlling a specific pattern, so that the object inside the container (air bubbles or foreign body) moves along a specific context based on this pattern. For example, the flow of liquid inside the container accelerates for a while after the container starts to rotate, so the object also accelerates along this flow. In addition, the object moves in the direction of the container's rotation. After that, when the container stops rotating, the flow rate of the liquid gradually slows down and stabilizes in a steady state. The speed and direction of the object's movement during this time also follow the flow rate and direction of the liquid. The same applies to subsequent controls.
 In this inference method, for each of multiple frame images extracted from a video that captures an object moving in accordance with the above-mentioned context, a feature amount corresponding to the above-mentioned context of the object appearing in the frame image is generated. Then, based on the generated feature amount, it is determined whether the object is an air bubble or a foreign object. In other words, the inference in this inference method is to determine whether the object is an air bubble or a foreign object.
 The details will be described later, but this inference method makes it possible to perform inference that takes context into account while keeping computational costs down compared to conventional techniques that generate context features from the entire video image.
 (Overview of the inference method)
 Next, an overview of this inference method will be described with reference to Fig. 4. Fig. 4 is a diagram for explaining an overview of this inference method. Prior to execution of this inference method, frame images are first extracted from a video image capturing an object. In the example of Fig. 4, n frame images from FR1 to FRn are extracted. In the illustrated frame image FR1, a container filled with liquid is captured, as well as objects OB1 and OB2. The same is true for the other frame images.
 In this inference method, first, objects are detected from each frame image. As mentioned above, the objects are air bubbles and foreign objects. Since both are small in size and have similar appearances, it is difficult to accurately identify whether a detected object is an air bubble or a foreign object based on only one frame image.
 Next, in this inference method, based on the detection results of the objects OB1 and OB2 from the frame images FR1 to FRn, trajectory data is generated that indicates the trajectory of the movement of the objects. Fig. 4 schematically shows trajectory data A1 of the object OB1 and trajectory data A2 of the object OB2.
 The trajectory data A1 includes time information indicating the timing at which the frame image in which the object OB1 was detected was shot, and position information indicating the detected position of the object OB1 in the frame image. For example, assume that the object OB1 was detected in each of the frame images FR1 to FR10. In this case, the trajectory data A1 includes time information indicating the shooting timing of each of the frame images FR1 to FR10, and position information indicating the detected position of the object OB1 in the frame images FR1 to FR10.
 The trajectory data A1 may also include information indicating the characteristics of the detected object OB1. For example, the trajectory data A1 may include an image patch that is an image cut out of an area in the frame image that includes the object OB1, or feature amounts extracted from the image patch, information indicating the size of the detected object, and information indicating the moving speed of the detected object.
 In addition, in this inference method, normalized position information may be used to avoid being affected by differences in container size and liquid volume. In this case, for example, normalized position information can be generated by applying at least one of translation, rotation, and scale transformation to position information indicating the position of the object on the frame image.
 The time information may also be normalized. In this case, for example, the time information of the frame image FR1 may be set to 0, the time information of the frame image FRn captured at the timing when a series of control sequences ends may be set to 1, and the time information of the frame images FR2 to FRn-1 may be determined based on these values.
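 The normalization of position and time information described above can be sketched as follows in Python; the specific affine parameters and the linear time scaling are assumptions made for illustration, since the disclosure only states that translation, rotation, scale transformation, and time normalization may be applied.

```python
import numpy as np

def normalize_positions(xy, center, angle_rad, scale):
    """Apply translation, rotation, and scale transformation to detected
    positions (N x 2 array of pixel coordinates) so that the result does
    not depend on the container size or its placement in the frame."""
    xy = np.asarray(xy, dtype=float) - np.asarray(center, dtype=float)  # translation
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s], [s, c]])
    return (xy @ rot.T) / scale                                         # rotation + scale

def normalize_times(t, t_start, t_end):
    """Map capture times so that the first frame becomes 0 and the frame
    captured when the control sequence ends becomes 1."""
    t = np.asarray(t, dtype=float)
    return (t - t_start) / (t_end - t_start)

# Usage example with hypothetical values:
pos = normalize_positions([[320, 240], [330, 250]], center=(320, 240),
                          angle_rad=0.0, scale=100.0)
ts = normalize_times([0.0, 0.5, 1.5, 3.0], t_start=0.0, t_end=3.0)
```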
 The same is true for the trajectory data A2, which includes time information indicating the timing at which the frame image in which the object OB2 was detected was captured, and position information indicating the detected position of the object OB2 in the frame image. The trajectory data A2 may also include information indicating the characteristics of the detected object OB2.
 Note that, for example, if the object is an air bubble, it may appear and disappear during the oscillation, so the range of time information in the trajectory data A1 and A2 may differ. Also, only the trajectory data A1 and A2 for two objects OB1 and OB2 are shown here, but a larger number of objects may be detected in an actual foreign object confirmation inspection.
 Next, in this inference method, the time information and position information contained in each of the trajectory data A1 and A2 are input into a feature generation model to generate feature quantities B1 and B2 according to the context. These feature quantities are generated for each time. For example, a feature quantity at time t1 is generated from the time information and position information at that time t1. Also, for example, a previously generated judgment function may be used to judge whether a position is inside or outside the liquid region in the frame image, and a value indicating the judgment result may be included in the feature quantities B1 and B2. This makes it possible to eliminate the influence of objects, etc. detected outside the liquid region and obtain valid inference results.
 The feature generation model makes it possible to generate features according to the context using time and position information, which are significantly smaller in data size than video images. In this inference method, the trajectory data and features generated as described above are integrated to generate integrated data, and the integrated data is input to the inference model to output the inference result. This completes the inference method. Specifically, the inference result indicates whether each of the objects OB1 and OB2 is an air bubble or a foreign body.
 The objects OB1 and OB2 detected in the above-mentioned foreign body confirmation inspection are both small in size and similar in appearance. For this reason, it is difficult to accurately determine whether a detected object is an air bubble or a foreign body. However, this inference method uses integrated data that reflects features generated using a feature generation model to perform estimation that takes into account the context, making it possible to perform difficult estimations with high accuracy.
 (Configuration of information processing device 3)
 The configuration of the information processing device 3 will be described with reference to Fig. 5. Fig. 5 is a block diagram showing an example of the configuration of the information processing device 3. The information processing device 3 includes a control unit 30 that controls each unit of the information processing device 3, and a storage unit 31 that stores various data used by the information processing device 3. The information processing device 3 also includes a communication unit 32 that allows the information processing device 3 to communicate with other devices, an input unit 33 that accepts input of various data to the information processing device 3, and an output unit 34 that allows the information processing device 3 to output various data.
 The control unit 30 also includes an object detection unit 301, a trajectory data generation unit 302, a feature generation unit 303, an integration unit 304, an inference unit 305, a learning unit 306, a difference identification unit 307, an adjustment unit 308, and a similarity calculation unit 309. The storage unit 31 stores trajectory data 311, a feature generation model 312, an inference model 313, and teacher data 314. The learning unit 306, the similarity calculation unit 309, and the teacher data 314 will be described later in the section "(About learning)", and the difference identification unit 307 and the adjustment unit 308 will be described later in the section "(Regarding absorption of contextual differences)".
 The object detection unit 301 detects a specific object from each of multiple frame images extracted from a video. If the target video has been shot at a high frame rate, the object detection unit 301 can detect the object from each frame image with relatively lightweight image processing by utilizing the positional continuity. There are no particular limitations on the method of detecting the object. For example, the object detection unit 301 may detect the object using a detection model that has been trained to detect the object using an image of the object as training data. There are no particular limitations on the algorithm of the detection model. For example, the object detection unit 301 may use a detection model such as a convolutional neural network, a recursive neural network, or a transformer, or a detection model that combines a plurality of these.
 The trajectory data generation unit 302 generates trajectory data indicating the trajectory of the movement of an object based on the detection result of the object from multiple frame images by the object detection unit 301. As described above, the trajectory data includes time information indicating the timing when the frame image in which the object is detected was captured, and position information indicating the detected position of the object in the frame image, and may also include image patches or the like as information indicating the characteristics of the detected object. The time information may also include information indicating the time difference between the frame images. The generated trajectory data is stored in the storage unit 31 as trajectory data 311.
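 As a concrete illustration of the kind of per-object record the trajectory data 311 could hold, the following is a minimal sketch in Python; the field names and types are assumptions for illustration and are not specified in the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TrajectoryPoint:
    time: float                            # normalized capture time of the frame
    position: Tuple[float, float]          # normalized (x, y) detection position
    patch_feature: Optional[list] = None   # optional features extracted from an image patch
    size: Optional[float] = None           # optional detected object size
    speed: Optional[float] = None          # optional moving speed

@dataclass
class Trajectory:
    object_id: int
    points: List[TrajectoryPoint] = field(default_factory=list)

# Usage example with hypothetical values:
traj = Trajectory(object_id=1)
traj.points.append(TrajectoryPoint(time=0.0, position=(0.12, -0.35)))
traj.points.append(TrajectoryPoint(time=0.1, position=(0.15, -0.30)))
```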
 The feature generation unit 303 generates features according to the context of an object appearing in each of a plurality of frame images extracted from a video capturing an object moving along a specific context, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image. Specifically, the feature generation unit 303 generates features by inputting the time information and position information indicated in the trajectory data 311 to the feature generation model 312.
 The feature generation model 312 is a trained model that has been trained to generate features according to a context. More specifically, the feature generation model 312 is generated by learning the relationship between time information indicating the timing at which an object moving in accordance with a specified context was photographed, and position information indicating the detected position of the object in the image photographed at that timing, and the feature of the object at that timing. The feature generation model 312 may be a function that uses the above-mentioned feature as a response variable and time information and position information as explanatory variables.
 The context that influences the movement of the object used in learning may be the same as or similar to the context that influences the movement of the target object that is the subject of inference. Furthermore, the algorithm of the feature generation model 312 is not particularly limited. For example, the feature generation model 312 may be a model such as a convolutional neural network, a recurrent neural network, or a transformer, or may be a model that combines two or more of these.
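 As one possible concrete form of such a feature generation model, the following is a minimal PyTorch sketch of a small multilayer perceptron that maps a normalized time value and a normalized (x, y) position to a context feature vector; the network shape and the feature dimension are assumptions made for illustration, not a definitive implementation of the model 312.

```python
import torch
from torch import nn

class ContextFeatureModel(nn.Module):
    """Maps (time, x, y) to a context feature vector."""
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 64),            # input: normalized time and (x, y) position
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, feature_dim),  # output: context feature
        )

    def forward(self, time_xy: torch.Tensor) -> torch.Tensor:
        # time_xy has shape (batch, 3) = [t, x, y] per detection
        return self.net(time_xy)

# Usage example with hypothetical inputs:
model = ContextFeatureModel()
inputs = torch.tensor([[0.0, 0.12, -0.35], [0.1, 0.15, -0.30]])
context_features = model(inputs)  # shape (2, 16)
```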
 The integration unit 304 integrates the trajectory data 311 generated by the trajectory data generation unit 302 and the features generated by the feature generation unit 303 to generate integrated data. The integration method is not particularly limited. For example, the integration unit 304 may combine the feature as an additional dimension with respect to each time component in the trajectory data 311 (specifically, position information or an image patch associated with one piece of time information, etc.). Furthermore, for example, the integration unit 304 may generate integrated data reflecting the feature by adding the feature to each time component or multiplying each time component by the feature. Furthermore, for example, the integration unit 304 may reflect the feature in each time component by an attention mechanism.
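 Of the integration methods listed above, concatenating the context feature to each per-time component as additional dimensions is the simplest; the following is a minimal sketch under that choice, continuing the hypothetical tensor shapes of the previous example.

```python
import torch

def integrate_by_concatenation(trajectory_components: torch.Tensor,
                               context_features: torch.Tensor) -> torch.Tensor:
    """Concatenate the context feature to each per-time component.

    trajectory_components: (T, D) tensor, one row per time step of a trajectory
                           (e.g., position information and patch features).
    context_features:      (T, F) tensor of context features for the same steps.
    Returns a (T, D + F) tensor used as integrated data.
    """
    return torch.cat([trajectory_components, context_features], dim=-1)

# Usage example with hypothetical shapes:
components = torch.randn(10, 8)   # 10 time steps, 8-dimensional components
features = torch.randn(10, 16)    # matching context features
integrated = integrate_by_concatenation(components, features)  # (10, 24)
```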
 The inference unit 305 performs a predetermined inference regarding the object based on the features generated by the feature generation unit 303. Specifically, the inference unit 305 inputs integrated data reflecting the features generated by the feature generation unit 303 to the inference model 313, thereby obtaining an inference result, i.e., a determination result as to whether the object is an air bubble or a foreign object.
 The inference model 313 is a model generated by using integrated data generated from an image showing air bubbles or foreign objects to learn whether an object shown in the image is an air bubble or a foreign object. The algorithm of the inference model 313 is not particularly limited. For example, the inference model 313 may be a model such as a convolutional neural network, a recursive neural network, or a transformer, or may be a model that combines two or more of these.
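 As one possible concrete form of such an inference model, the following is a minimal PyTorch sketch that pools the integrated data of one trajectory over time and outputs two-class logits (air bubble vs. foreign matter); the pooling and layer sizes are assumptions made for illustration, and the disclosure equally permits other architectures such as recurrent networks or transformers.

```python
import torch
from torch import nn

class TrajectoryClassifier(nn.Module):
    """Classifies one trajectory's integrated data as bubble or foreign matter."""
    def __init__(self, input_dim: int = 24, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU())
        self.head = nn.Linear(64, num_classes)

    def forward(self, integrated: torch.Tensor) -> torch.Tensor:
        # integrated has shape (T, input_dim) for one trajectory
        encoded = self.encoder(integrated)   # (T, 64)
        pooled = encoded.mean(dim=0)         # average over time steps
        return self.head(pooled)             # logits: [bubble, foreign_matter]

# Usage example continuing the previous hypothetical shapes:
classifier = TrajectoryClassifier(input_dim=24)
logits = classifier(torch.randn(10, 24))
predicted = logits.argmax().item()  # 0: bubble, 1: foreign matter (assumed order)
```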
 As described above, the information processing device 3 includes a feature generating unit 303 that generates features according to the context of an object appearing in a frame image using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, and an inference unit 305 that performs a predetermined inference regarding the object based on the features generated by the feature generating unit 303. This provides the effect of being able to perform inference that takes into account the context while suppressing calculation costs, similar to the information processing device 2 according to the exemplary embodiment 1. Note that still images in a time series obtained by continuously capturing still images are also included in the category of "a plurality of frame images extracted from a video".
 Furthermore, as described above, the feature generation unit 303 may generate features using a feature generation model 312 that has learned the relationship between time information indicating the timing at which an object moving along a context that is the same as or similar to the context in which the target object moves was photographed, position information indicating the detected position of the object in the image photographed at that timing, and features according to the context of the object at that timing. This provides the effect of being able to generate appropriate features based on the learning results, in addition to the effect provided by the information processing device 2 according to exemplary embodiment 1.
 As described above, the information processing device 3 includes a trajectory data generation unit 302 that generates trajectory data 311 indicating the trajectory of the movement of an object based on the detection result of the object from a plurality of frame images, and an integration unit 304 that integrates the trajectory data 311 and the feature amount generated by the feature amount generation unit 303 to generate integrated data, the feature amount generation unit 303 generates the feature amount using the position information and time information extracted from the trajectory data 311, and the inference unit 305 performs inference using the integrated data. As a result, in addition to the effect achieved by the information processing device 2 according to exemplary embodiment 1, an effect is obtained in that inference taking into account the context can be performed within the framework of performing inference using the trajectory data 311 generated based on the frame images.
 (学習について)
 本項目では学習部306による学習について説明する。また、教師データ314と類似度算出部309についても説明する。学習部306は、教師データ314を用いた学習により、特徴量生成モデル312と推論モデル313を更新する。
(About learning)
This section describes learning by the learning unit 306. It also describes the teacher data 314 and the similarity calculation unit 309. The learning unit 306 updates the feature generation model 312 and the inference model 313 by learning using the teacher data 314.
 教師データ314は、ある物体の軌跡データに対して、正解データとして当該物体が気泡であるか異物であるかを示す情報が対応付けられたデータである。また、教師データ314には、当該軌跡データの元になった複数のフレーム画像が含まれていてもよい。上述のように、特徴量生成部303は軌跡データに含まれる時刻情報と位置情報から特徴量を生成し、統合部304は生成された特徴量と軌跡データとを統合して統合データを生成する。そして、推論部305は統合データを用いて推論を行い、これにより教師データ314に含まれる軌跡データに基づく推論の結果が得られる。 The teacher data 314 is data in which information indicating whether an object is an air bubble or a foreign object is associated as correct answer data with the trajectory data of an object. The teacher data 314 may also include multiple frame images that are the source of the trajectory data. As described above, the feature generation unit 303 generates features from the time information and position information included in the trajectory data, and the integration unit 304 integrates the generated features with the trajectory data to generate integrated data. The inference unit 305 then performs inference using the integrated data, thereby obtaining an inference result based on the trajectory data included in the teacher data 314.
 学習部306は、教師データ314に含まれる軌跡データに基づく推論の結果が、その教師データ314に示される所定の正解データに近付くように、推論モデル313と特徴量生成モデル312を更新する。例えば、学習部306は、勾配降下法を用いて、推論結果と正解データとの誤差の総和である損失関数を最小化するように、推論モデル313と特徴量生成モデル312のそれぞれを更新してもよい。 The learning unit 306 updates the inference model 313 and the feature generation model 312 so that the result of inference based on the trajectory data included in the teacher data 314 approaches the predetermined correct answer data indicated in the teacher data 314. For example, the learning unit 306 may use a gradient descent method to update each of the inference model 313 and the feature generation model 312 so as to minimize a loss function that is the sum of the errors between the inference result and the correct answer data.
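As a concrete illustration of this joint update, the following is a minimal sketch of one gradient-descent step in Python, assuming PyTorch-style modules; the names feature_model, inference_model and the batch keys are hypothetical and not taken from the publication, and the optimizer is assumed to hold the parameters of both models.

```python
import torch
import torch.nn.functional as F

def training_step(feature_model, inference_model, optimizer, batch):
    """One joint gradient-descent update of both models (hypothetical interfaces)."""
    times, positions, labels = batch["times"], batch["positions"], batch["labels"]

    # Generate context-dependent features from the time and position information
    features = feature_model(times, positions)

    # Integrate trajectory information with the generated features (here: concatenation)
    integrated = torch.cat([positions, features], dim=-1)

    # Infer bubble vs. foreign matter and compare against the ground-truth labels
    logits = inference_model(integrated)
    loss = F.cross_entropy(logits, labels)

    # Update the parameters of both models so the inference approaches the ground truth
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```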
 ここで、上述のように、推論時には、動画像またはフレーム画像をそのまま用いて特徴量の生成等を行うことはないが、学習時にはフレーム画像を利用してもよい。例えば、類似したフレーム画像においてはコンテキストも類似していると考えられるから、フレーム画像間の類似度を特徴量生成モデル312の更新に利用してもよい。 As described above, during inference, video images or frame images are not used as is to generate features, but frame images may be used during learning. For example, since similar frame images are considered to have similar contexts, the similarity between frame images may be used to update the feature generation model 312.
 類似度算出部309は、フレーム画像間の類似度を算出するものであり、当該類似度を特徴量生成モデル312の更新に利用する場合に用いられる構成である。類似度算出部309が類似度を算出する場合、学習部306は、複数のフレーム画像間の類似度が、当該フレーム画像について特徴量生成モデル312により生成される特徴量間の類似度に反映されるように特徴量生成モデル312を更新する。例えば、上述の損失関数に正規化項を追加することにより、フレーム画像間の類似度が特徴量の類似度に近くなるように特徴量生成モデル312を更新することができる。 The similarity calculation unit 309 calculates the similarity between frame images, and is configured to be used when the similarity is used to update the feature generation model 312. When the similarity calculation unit 309 calculates the similarity, the learning unit 306 updates the feature generation model 312 so that the similarity between multiple frame images is reflected in the similarity between features generated by the feature generation model 312 for the frame images. For example, by adding a normalization term to the loss function described above, the feature generation model 312 can be updated so that the similarity between frame images becomes closer to the similarity of the features.
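One possible way to realize such a regularization term is sketched below. The use of cosine similarity, the mean-squared-error matching, and the weight value are assumptions made for illustration only; the publication does not specify the form of the term.

```python
import torch
import torch.nn.functional as F

def similarity_regularizer(features, frame_crops, weight=0.1):
    """Regularization term that pulls feature similarity toward frame-image similarity.

    features:    (N, D) features produced by the feature generation model
    frame_crops: (N, C, H, W) image patches cut out around the detected positions
    """
    # Pairwise cosine similarity between generated features
    feat_sim = F.cosine_similarity(features.unsqueeze(1), features.unsqueeze(0), dim=-1)

    # Pairwise cosine similarity between flattened image patches
    flat = frame_crops.flatten(start_dim=1).float()
    img_sim = F.cosine_similarity(flat.unsqueeze(1), flat.unsqueeze(0), dim=-1)

    # Penalize mismatch between the two similarity structures
    return weight * F.mse_loss(feat_sim, img_sim)

# Usage: total_loss = task_loss + similarity_regularizer(features, frame_crops)
```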
 (コンテキストの差異の吸収について)
 前述のとおり、学習に用いた物体の移動に影響を与えるコンテキストは、推論の対象となる対象物の移動に影響を与えるコンテキストと同一のものであってもよいし、類似のものであってもよい。また、それらのコンテキストは少なくとも一部分が同一または類似であればよく、全体が同一またはでなくてもよい。
 前述のとおり、学習に用いた物体の移動に影響を与えるコンテキストは、推論の対象となる対象物の移動に影響を与えるコンテキストと同一のものであってもよいし、類似のものであってもよい。また、それらのコンテキストは少なくとも一部分が同一または類似であればよく、全体が同一または類似でなくてもよい。
(Regarding absorption of contextual differences)
As described above, the context that influences the movement of the object used in learning may be the same as or similar to the context that influences the movement of the target object that is the subject of inference. Furthermore, the contexts need only be at least partially the same or similar, and do not have to be entirely the same or similar.
 学習に用いた物体の移動におけるコンテキストと、推論の対象となる対象物の移動におけるコンテキストとの間に差異がある場合に用いられるのが、差異特定部307と調整部308である。 The difference identification unit 307 and adjustment unit 308 are used when there is a difference between the context of the movement of the object used in learning and the context of the movement of the target object that is the subject of inference.
 調整部308は、学習に用いた物体の移動におけるコンテキストと、推論の対象となる対象物の移動におけるコンテキストとの間の差異を吸収するように、特徴量の生成に用いる時刻情報および位置情報の少なくとも何れかを調整する。 The adjustment unit 308 adjusts at least one of the time information and the position information used to generate the features so as to absorb the difference between the context of the movement of the object used in learning and the context of the movement of the target object to be inferred.
 差異特定部307は、学習に用いた物体の周囲の環境と当該対象物の周囲の環境との差異に基づき、当該物体の移動におけるコンテキストと当該対象物の移動におけるコンテキストとの差異を特定する。また、差異特定部307は、学習に用いた物体と推論の対象となる対象物との差異に基づき、当該物体の移動におけるコンテキストと当該対象物の移動におけるコンテキストとの差異を特定してもよい。 The difference identification unit 307 identifies the difference between the context in which an object used for learning moves and the context in which an object moves, based on the difference between the environment surrounding the object and the environment surrounding the target object. The difference identification unit 307 may also identify the difference between the context in which an object used for learning moves and the context in which an object moves, based on the difference between the object used for learning and the target object to be inferred.
 以下、差異特定部307と調整部308について、図6に基づいてさらに詳細に説明する。図6は、コンテキスト間に差異が生じた例を示す図である。図6に示す例EX1では、図3と同様に、回転、静止、回転という制御シーケンスにより異物確認検査を行うことを想定しており、学習時と推論時の各制御シーケンスに含まれる制御とその実行順は同一である。ただし、推論時における静止期間が学習時よりも短くなっている。より具体的には、例EX1では学習時と推論時の何れにおいても、時刻t1に回転を開始して時刻t2に回転を終了して静止状態とし、時刻t3には容器内の液体の動きが定常化している。この後、学習時には時刻t4で2回目の回転を開始して時刻t5に2回目の回転を終了しているのに対し、推論時には時刻t4よりもΔtだけ早い時刻t4’で2回目の回転を開始している。そして、これに伴って、2回目の回転を終了する時刻も時刻t5よりもΔtだけ早い時刻t5’となっている。このように、例EX1では、学習時と推論時とで時刻t4’以降のコンテキストにΔtのずれが生じている。 The difference identification unit 307 and the adjustment unit 308 will be described in more detail below with reference to FIG. 6. FIG. 6 is a diagram showing an example in which a difference occurs between contexts. In the example EX1 shown in FIG. 6, as in FIG. 3, it is assumed that a foreign object confirmation inspection is performed using a control sequence of rotation, rest, and rotation, and the controls included in each control sequence during learning and inference and their execution order are the same. However, the rest period during inference is shorter than during learning. More specifically, in the example EX1, during both learning and inference, rotation starts at time t1 and ends at time t2 to enter a rest state, and the movement of the liquid in the container becomes steady at time t3. After that, during learning, the second rotation starts at time t4 and ends at time t5, whereas during inference, the second rotation starts at time t4', which is Δt earlier than time t4. Accordingly, the time at which the second rotation ends is also time t5', which is Δt earlier than time t5. Thus, in example EX1, there is a difference of Δt in the context from time t4' onwards between learning and inference.
 この場合、調整部308は、特徴量の生成に用いる時刻情報のうち、時刻t4’から時刻t5’の期間に対応する各時刻情報に示される時刻にΔtを加算する調整を行う。これにより、学習時と推論時におけるコンテキストの差異を吸収することができる。なお、例EX1とは逆に、推論時において学習時よりも静止期間をΔtだけ長くした場合には、調整部308は、静止期間の終了後の各時刻情報に示される時刻からΔtを減算する調整を行えばよい。 In this case, the adjustment unit 308 performs an adjustment by adding Δt to the time indicated in each piece of time information corresponding to the period from time t4' to time t5' among the time information used to generate the feature. This makes it possible to absorb the difference in context between learning and inference. Note that, contrary to example EX1, if the still period during inference is made longer by Δt than during learning, the adjustment unit 308 can perform an adjustment by subtracting Δt from the time indicated in each piece of time information after the end of the still period.
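A minimal sketch of this kind of time adjustment, under the assumption that the time information is held as a plain list of time stamps, might look as follows; the function and variable names are hypothetical.

```python
def shift_times_in_period(times, period_start, period_end, delta_t):
    """Shift time stamps falling in [period_start, period_end] by delta_t.

    In example EX1 the times from t4' to t5' would be shifted by +delta_t;
    if the rest period at inference were longer instead, a negative delta_t
    would be passed.
    """
    return [t + delta_t if period_start <= t <= period_end else t for t in times]
```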
 このように、コンテキストに影響を与える制御シーケンスが学習時と推論時とで異なる場合、調整部308は、それらの差異に応じて時刻情報を調整することにより、コンテキストの差異を吸収することができる。 In this way, if the control sequence that affects the context differs between learning and inference, the adjustment unit 308 can absorb the difference in context by adjusting the time information according to that difference.
 また、学習時と推論時とで容器を動かす方向を逆向きにした場合、学習時と推論時とで液体の流れる向きも逆向きになり、この場合にも学習時と推論時とでコンテキストに差異が生じる。このように、学習時と推論時とで位置に関するコンテキストに差異が生じる場合には、調整部308は、その差異を吸収するように、特徴量の生成に用いる位置情報を調整すればよい。 Furthermore, if the direction in which the container is moved is reversed between learning and inference, the direction in which the liquid flows will also be reversed between learning and inference, and in this case too, a difference in context will occur between learning and inference. In this way, when a difference in context related to position occurs between learning and inference, the adjustment unit 308 can adjust the position information used to generate features so as to absorb the difference.
 例えば、容器を動かす方向を逆向きにすることにより、対象物の移動のパターンが、学習時における物体の移動のパターンに対して左右反転したものとなったとする。この場合、調整部308は、対象物の位置情報に対して左右反転させる処理を施すことによりコンテキスト間の差異を吸収してもよい。例えば、位置情報として座標値を用いる場合、調整部308は、移動のパターンが左右反転している期間における各座標値に対し、所定の軸を基準として左右反転させる変換を施せばよい。また、例えば、対象物の移動のパターンと、学習時における物体の移動のパターンとが回転対称の関係にある場合、調整部308は、対象物の位置情報を回転変換することによりコンテキスト間の差異を吸収してもよい。なお、時刻情報を調整する調整部と、位置情報を調整する調整部とをそれぞれ別のブロックとしてもよい。 For example, suppose that by reversing the direction in which the container is moved, the pattern of movement of the object becomes a left-right inversion of the pattern of movement of the object during learning. In this case, the adjustment unit 308 may absorb the difference between contexts by performing a left-right inversion process on the position information of the object. For example, when coordinate values are used as position information, the adjustment unit 308 may perform a conversion that inverts left-right about a specified axis for each coordinate value during the period in which the pattern of movement is inverted left-right. Also, for example, when the pattern of movement of the object and the pattern of movement of the object during learning are in a rotationally symmetric relationship, the adjustment unit 308 may absorb the difference between contexts by rotationally transforming the position information of the object. Note that the adjustment unit that adjusts the time information and the adjustment unit that adjusts the position information may each be separate blocks.
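The left-right flip and the rotational transformation described here could be sketched as follows, assuming position information is a list of (x, y) coordinates; the helper names are hypothetical.

```python
import math

def flip_positions(positions, axis_x):
    """Mirror (x, y) detection positions about the vertical line x = axis_x."""
    return [(2.0 * axis_x - x, y) for x, y in positions]

def rotate_positions(positions, angle_rad, center=(0.0, 0.0)):
    """Rotate (x, y) detection positions by angle_rad around a given center."""
    cx, cy = center
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    return [(cx + (x - cx) * cos_a - (y - cy) * sin_a,
             cy + (x - cx) * sin_a + (y - cy) * cos_a)
            for x, y in positions]
```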
 また、コンテキストに影響を与える制御シーケンスが学習時と推論時とで同じであった場合でも、推論時に用いる各フレーム画像に対応する時刻情報が、学習に用いた各フレーム画像に対応する時刻情報とずれることがあり得る。例えば、図6の例EX1において、学習に用いられたフレーム画像のうち1回目の回転の開始時のフレーム画像の時刻はt1である。このとき、推論に用いられたフレーム画像のうち1回目の回転の開始時のフレーム画像の時刻がt1’(t1’<t1)であれば、コンテキスト間に(t1-t1’)のずれが生じる。このような場合、調整部308は、推論に用いる各時刻情報に示される時刻に(t1-t1’)の値を加算する調整を行えばよい。また、推論に用いられたフレーム画像のうち1回目の回転の開始時のフレーム画像の時刻がt1”(t1”>t1)であれば、調整部308は、推論に用いる各時刻情報に示される時刻から(t1”-t1)の値を減算する調整を行えばよい。このような調整により、時刻情報に示される時刻と制御タイミングとの関係を、学習時と揃えることができ、これにより、特徴量生成モデル312に適切な特徴量を出力させることができる。 In addition, even if the control sequence that affects the context is the same during learning and inference, the time information corresponding to each frame image used during inference may differ from the time information corresponding to each frame image used for learning. For example, in example EX1 of Figure 6, the time of the frame image at the start of the first rotation among the frame images used for learning is t1. In this case, if the time of the frame image at the start of the first rotation among the frame images used for inference is t1' (t1' < t1), a difference of (t1 - t1') will occur between the contexts. In such a case, the adjustment unit 308 can perform an adjustment by adding the value of (t1 - t1') to the time indicated in each piece of time information used for inference. Furthermore, if the time of the frame image at the start of the first rotation among the frame images used in inference is t1" (t1" > t1), the adjustment unit 308 can perform an adjustment by subtracting the value of (t1" - t1) from the time indicated in each piece of time information used in inference. By making such an adjustment, the relationship between the time indicated in the time information and the control timing can be aligned with that during learning, thereby allowing the feature generation model 312 to output appropriate features.
 また、コンテキストに差異を生じさせる要素は制御シーケンスに限られない。例えば、学習に用いた物体と推論の対象物との間に差異がある場合や、学習に用いた物体の周囲の環境と推論の対象物の周囲の環境との間に差異がある場合には、コンテキスト間に差異が生じ得る。 Furthermore, factors that cause differences in context are not limited to control sequences. For example, differences between contexts can arise when there is a difference between the object used in learning and the subject of inference, or when there is a difference between the environment surrounding the object used in learning and the environment surrounding the subject of inference.
 差異特定部307は、上述のような差異、すなわち学習に用いた物体と推論の対象物との差異、および、学習に用いた物体の周囲の環境と当該対象物の周囲の環境との差異の少なくとも何れかに基づき、当該物体の移動におけるコンテキストと当該対象物の移動におけるコンテキストとの差異を特定する。このため、本例示的実施形態に係る情報処理装置3によれば、例示的実施形態1に係る情報処理装置2が奏する効果に加えて、コンテキストの差異を自動で特定することができるという効果が得られる。また、情報処理装置3は、調整部308を備えているため、差異特定部307が特定した差異を吸収する調整を調整部308に行わせることができる。 The difference identification unit 307 identifies the difference between the context in which the object moves and the context in which the object moves, based on at least one of the differences described above, i.e., the difference between the object used in learning and the object to be inferred, and the difference between the environment surrounding the object used in learning and the environment surrounding the object. Therefore, according to the information processing device 3 of this exemplary embodiment, in addition to the effect achieved by the information processing device 2 of exemplary embodiment 1, the effect of being able to automatically identify context differences can be obtained. Furthermore, since the information processing device 3 is equipped with the adjustment unit 308, it is possible to cause the adjustment unit 308 to make adjustments to absorb the differences identified by the difference identification unit 307.
 例えば、図6に示す例EX2には、回転、静止、回転というシーケンスで行われる異物確認検査において、学習時と推論時とで容器に封入する液体の粘度が異なる場合の例を示している。つまり、例EX2では、学習に用いた物体の周囲の環境と当該対象物の周囲の環境とが異なっている。具体的には、推論に用いた容器内の液体は、学習に用いた液体よりも粘度が高く、このため推論時には、容器を静止させてから容器内が定常化するまでの時間が学習時より短くなっている。つまり、学習時において定常化した時刻t3よりも、推論時において定常化した時刻t3’が先の時刻(t3>t3’)となっている。 For example, example EX2 in Figure 6 shows a case where the viscosity of the liquid sealed in the container is different during learning and inference in a foreign object confirmation inspection that is performed in a sequence of rotation, rest, and rotation. That is, in example EX2, the environment surrounding the object used in learning is different from the environment surrounding the target object. Specifically, the liquid in the container used in inference has a higher viscosity than the liquid used in learning, and therefore, during inference, the time from when the container is brought to rest until the inside of the container becomes steady is shorter than during learning. In other words, the time t3' when the container becomes steady during inference is earlier than the time t3 when the container becomes steady during learning (t3 > t3').
 この場合、差異特定部307は、推論に用いた容器内の液体の粘性に基づき、学習時と推論時におけるコンテキスト間の差異である、定常化した時刻t3’を特定する。なお、粘度と定常化までの所要時間との関係を予め特定してモデル化しておけば、差異特定部307は当該モデルと推論に用いた容器内の液体の粘度を用いて時刻t3’を特定することができる。 In this case, the difference identification unit 307 identifies the time t3' at which the liquid stabilized, which is the difference between the contexts at the time of learning and the time of inference, based on the viscosity of the liquid in the container used for inference. If the relationship between the viscosity and the time required for the liquid to stabilize is identified and modeled in advance, the difference identification unit 307 can identify the time t3' using the model and the viscosity of the liquid in the container used for inference.
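As an illustration only, the sketch below assumes the pre-modeled relation between viscosity and settling time is a simple linear fit; the actual form of the model is not specified in the publication, and the coefficient values would come from offline measurements.

```python
def estimate_settling_time(viscosity, coeffs):
    """Estimate the time until the liquid becomes steady from its viscosity.

    coeffs = (a, b) is a relation fitted offline, here assumed to be linear:
    settling_time = a * viscosity + b.
    """
    a, b = coeffs
    return a * viscosity + b

def settling_time_shift(train_viscosity, infer_viscosity, coeffs):
    """Context difference (t3 - t3') between learning and inference."""
    return (estimate_settling_time(train_viscosity, coeffs)
            - estimate_settling_time(infer_viscosity, coeffs))
```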
 そして、調整部308は、差異特定部307の特定結果に基づいて時刻情報を調整することにより上記の差異を吸収させる。具体的には、調整部308は、時刻t3’からt4までの各時刻情報に示される時刻に(t3-t3’)の値を加算する調整を行えばよい。 Then, the adjustment unit 308 absorbs the above-mentioned difference by adjusting the time information based on the result of the identification by the difference identification unit 307. Specifically, the adjustment unit 308 performs an adjustment by adding the value of (t3-t3') to the time indicated in each piece of time information from time t3' to time t4.
 なお、調整部308は、定常状態における時刻を全て同じ値に調整してもよい。この場合、調整部308は、時刻t3’からt4までの各時刻情報に示される時刻を、例えば時刻t3に置換してもよい。このように、調整部308は、推論時においてコンテキストが一定である期間の時刻情報を一定の値としてもよい。また、当該一定の値は、学習時において上記コンテキストと同一または類似のコンテキストに従って物体が移動していた期間(上述の例では時刻t3からt4までの期間)における時刻の値から選択すればよい。 The adjustment unit 308 may adjust all times in the steady state to the same value. In this case, the adjustment unit 308 may replace the times indicated in each piece of time information from time t3' to t4 with, for example, time t3. In this way, the adjustment unit 308 may set the time information for a period in which the context is constant during inference to a constant value. Furthermore, the constant value may be selected from the time values for a period in which an object moved according to a context that was the same as or similar to the above context during learning (the period from time t3 to t4 in the above example).
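A sketch of this constant-value replacement, again assuming list-based time information and hypothetical names, might be:

```python
def clamp_steady_period(times, steady_start, steady_end, constant_value):
    """Replace time stamps in the steady period with one constant value.

    constant_value should be chosen from the steady period observed at
    learning time (e.g. t3), so that the feature generation model receives
    inputs comparable to those it was trained on.
    """
    return [constant_value if steady_start <= t <= steady_end else t for t in times]
```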
 以上のように、情報処理装置3は、学習に用いた物体の移動におけるコンテキストと、推論の対象となる対象物の移動におけるコンテキストとの差異を吸収するように、特徴量の生成に用いる時刻情報および位置情報の少なくとも何れかを調整する調整部308を備える。このため、本例示的実施形態に係る情報処理装置3によれば、対象物の移動におけるコンテキストと、学習に用いた物体の移動におけるコンテキストとの間に差異がある場合であっても、同じ特徴量生成モデル312を用いて妥当な特徴量を生成することができるという効果が得られる。 As described above, the information processing device 3 includes an adjustment unit 308 that adjusts at least one of the time information and the position information used to generate features so as to absorb the difference between the context of the movement of the object used in learning and the context of the movement of the target object to be inferred. Therefore, according to the information processing device 3 of this exemplary embodiment, even if there is a difference between the context of the movement of the target object and the context of the movement of the object used in learning, the effect is obtained that appropriate features can be generated using the same feature generation model 312.
 (学習時の処理の流れ)
 図7は、情報処理装置3が学習時に行う処理の流れを示すフロー図である。なお、学習を行うにあたり、教師データ314と特徴量生成モデル312を予め記憶部31に記憶させておく。記憶部31に記憶させておく特徴量生成モデル312は、パラメータが初期状態のものであってもよいし、ある程度学習が進んだものであってもよい。
(Learning process flow)
7 is a flow diagram showing the flow of processing performed by the information processing device 3 during learning. Note that, when learning is performed, the teacher data 314 and the feature generation model 312 are stored in advance in the storage unit 31. The feature generation model 312 stored in the storage unit 31 may have parameters in an initial state, or may be a model in which learning has progressed to a certain degree.
 S31では、学習部306が記憶部31に記憶されている教師データ314を取得する。上述のように、教師データ314は、ある物体の軌跡データに対して正解データとして当該物体が気泡であるか異物であるかを示す情報が対応付けられたデータである。また、ここでは、教師データ314には、軌跡データの元になった複数のフレーム画像も含まれているとする。 In S31, the learning unit 306 acquires the teacher data 314 stored in the memory unit 31. As described above, the teacher data 314 is data in which information indicating whether an object is an air bubble or a foreign object is associated as correct answer data with respect to the trajectory data of the object. In addition, here, the teacher data 314 also includes multiple frame images that are the basis of the trajectory data.
 S32では、特徴量生成部303が、S31で取得された教師データ314を用いて、学習に用いた物体のコンテキストに応じた特徴量を生成する。より詳細には、特徴量生成部303は、S31で取得された教師データ314に含まれる軌跡データに示される時刻情報および位置情報を特徴量生成モデル312に入力することにより特徴量を生成する。 In S32, the feature generation unit 303 uses the teacher data 314 acquired in S31 to generate features according to the context of the object used in learning. More specifically, the feature generation unit 303 generates features by inputting the time information and position information indicated in the trajectory data included in the teacher data 314 acquired in S31 to the feature generation model 312.
 S33では、統合部304が、S31で取得された教師データ314に含まれる軌跡データと、S32で生成された特徴量とを統合して統合データを生成する。そして、S34では、推論部305が、S33で生成された統合データを用いて所定の推論を行う。具体的には、推論部305は、S33で生成された統合データを推論モデル313に入力することにより、推論結果すなわち物体が気泡であるか異物であるかの判定結果を得る。 In S33, the integration unit 304 integrates the trajectory data included in the teacher data 314 acquired in S31 with the feature amount generated in S32 to generate integrated data. Then, in S34, the inference unit 305 performs a predetermined inference using the integrated data generated in S33. Specifically, the inference unit 305 inputs the integrated data generated in S33 to the inference model 313 to obtain an inference result, i.e., a determination result as to whether the object is an air bubble or a foreign object.
 S35では、類似度算出部309が、S31で取得された教師データ314に含まれるフレーム画像間の類似度を算出する。このフレーム画像は、軌跡データに示される位置情報に該当する場所の周囲を切り出したものを用いてもよい。なお、類似度算出部309は、教師データ314に含まれる複数のフレーム画像(1つの軌跡データに対応するもの)の全ての組み合わせについて、それぞれ類似度を算出してもよいし、一部の組み合わせについて類似度を算出してもよい。また、S35の処理はS36より前に行えばよく、例えばS32より先に行ってもよいし、S32~S34の処理と並行で行ってもよい。 In S35, the similarity calculation unit 309 calculates the similarity between the frame images included in the teacher data 314 acquired in S31. The frame images may be clipped from around a location corresponding to the position information indicated in the trajectory data. The similarity calculation unit 309 may calculate the similarity for each combination of multiple frame images (corresponding to one trajectory data) included in the teacher data 314, or may calculate the similarity for some combinations. The process of S35 may be performed before S36, and may be performed before S32, for example, or in parallel with the processes of S32 to S34.
 S36では、学習部306が、S34における推論の結果が、教師データ314に示される所定の正解データに近付くように特徴量生成モデル312を更新する。この更新において、学習部306は、S35で算出されたフレーム画像間の類似度が、当該フレーム画像について特徴量生成モデル312により生成される特徴量間の類似度に反映されるように特徴量生成モデル312を更新する。 In S36, the learning unit 306 updates the feature generation model 312 so that the result of the inference in S34 approaches the predetermined correct answer data indicated in the teacher data 314. In this update, the learning unit 306 updates the feature generation model 312 so that the similarity between the frame images calculated in S35 is reflected in the similarity between the features generated by the feature generation model 312 for the frame images.
 S37では、学習部306は、学習を終了するか否かを判定する。学習の終了条件は予め定めておけばよく、例えば特徴量生成モデル312の更新回数が所定回数に達したときに学習を終了するようにしてもよい。学習部306は、S37でNOと判定した場合には、S31の処理に戻り、新たな教師データ314を取得する。一方、学習部306は、S37でYESと判定した場合には、更新後の特徴量生成モデル312を記憶部31に記憶させて図7の処理を終える。 In S37, the learning unit 306 determines whether or not to end learning. The condition for ending learning may be determined in advance, and learning may end, for example, when the number of updates of the feature generation model 312 reaches a predetermined number. If the learning unit 306 determines NO in S37, it returns to the process of S31 and acquires new teacher data 314. On the other hand, if the learning unit 306 determines YES in S37, it stores the updated feature generation model 312 in the memory unit 31 and ends the process of FIG. 7.
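Putting S31 to S37 together, the outer loop might look like the sketch below. It reuses the hypothetical training_step function from the earlier sketch and assumes the termination condition is a fixed number of updates; the similarity regularization of S35–S36 is omitted for brevity.

```python
MAX_UPDATES = 10_000  # assumed termination condition: fixed number of updates

def train(feature_model, inference_model, optimizer, data_loader):
    updates = 0
    while updates < MAX_UPDATES:            # S37: check the termination condition
        for batch in data_loader:           # S31: fetch teacher data
            training_step(feature_model, inference_model, optimizer, batch)  # S32-S36
            updates += 1
            if updates >= MAX_UPDATES:
                break
    return feature_model, inference_model
```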
 なお、S31では、教師データ314の代わりに、所定の物体(具体的には気泡および異物の少なくとも何れか)が写った動画像または当該動画像から抽出されたフレーム画像を取得してもよい。この場合、物体検出部301が、取得されたフレーム画像から上記物体を検出し、軌跡データ生成部302が、検出された物体の軌跡データ311を生成する。この軌跡データ311に正解データをラベル付けすることにより、教師データ314が生成される。教師データ314が生成された後の処理は、上述したS32以降の処理と同様である。 In S31, instead of the teacher data 314, a video image showing a predetermined object (specifically, at least one of an air bubble and a foreign object) or a frame image extracted from the video image may be acquired. In this case, the object detection unit 301 detects the object from the acquired frame image, and the trajectory data generation unit 302 generates trajectory data 311 of the detected object. The teacher data 314 is generated by labeling this trajectory data 311 with the correct answer data. The processing after the teacher data 314 is generated is the same as the processing from S32 onwards described above.
 以上のように、本例示的実施形態に係る特徴量生成モデル312の生成方法は、所定のコンテキストに沿って移動する物体を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における物体の検出位置を示す位置情報とを、上記コンテキストに応じた特徴量を生成するための特徴量生成モデル312に入力することにより算出された特徴量に基づいて上記物体に関する所定の推論を行うこと(S34)と、当該推論の結果が所定の正解データに近付くように特徴量生成モデル312を更新すること(S36)と、を含む。 As described above, the method for generating the feature generation model 312 according to this exemplary embodiment includes: for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, inputting time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into the feature generation model 312 for generating features according to the context, performing a predetermined inference regarding the object based on the calculated features (S34); and updating the feature generation model 312 so that the result of the inference approaches predetermined correct answer data (S36).
 上記の構成によれば、コンテキストに応じた特徴量を生成することが可能な特徴量生成モデル312を生成することができる。そして、これにより、時刻情報と位置情報とからコンテキストに応じた特徴量を生成することが可能な特徴量生成モデルを生成することができ、計算コストを抑えつつ、コンテキストを加味した推論を行うことが可能になるという効果が得られる。 The above configuration makes it possible to generate a feature generation model 312 capable of generating features according to a context. This makes it possible to generate a feature generation model capable of generating features according to a context from time information and location information, and has the effect of making it possible to perform inference that takes into account the context while suppressing computational costs.
 また、以上のように、本例示的実施形態に係る特徴量生成モデル312の生成方法は、複数のフレーム画像間の類似度を算出すること(S35)を含み、特徴量生成モデル312の更新において、複数のフレーム画像間の類似度が、当該フレーム画像について特徴量生成モデル312により生成される特徴量間の類似度に反映されるように特徴量生成モデル312を更新する。類似したフレーム画像においてはコンテキストも類似していると考えられるから、上記の構成によれば、フレーム画像間の類似度を考慮したより妥当性の高い特徴量を生成可能な特徴量生成モデル312を生成することができる。 As described above, the method for generating the feature generation model 312 according to this exemplary embodiment includes calculating the similarity between a plurality of frame images (S35), and in updating the feature generation model 312, the feature generation model 312 is updated so that the similarity between a plurality of frame images is reflected in the similarity between the features generated by the feature generation model 312 for the frame images. Since similar frame images are considered to have similar contexts, the above configuration makes it possible to generate a feature generation model 312 that can generate more valid features that take into account the similarity between frame images.
 (推論時の処理の流れ)
 図8は、情報処理装置3が推論時に行う処理(推論方法)の流れを示すフロー図である。なお、図8には、推論の対象となる動画像から抽出された複数のフレーム画像が情報処理装置3に入力された後の処理を示している。この動画像には、気泡であるか異物であるかを判定する対象となる対象物が写っている。なお、動画像からフレーム画像を抽出する処理も情報処理装置3が行うようにしてもよい。
(Processing flow during inference)
Fig. 8 is a flow diagram showing the flow of processing (inference method) performed by the information processing device 3 during inference. Fig. 8 shows processing after a plurality of frame images extracted from a moving image to be inferred are input to the information processing device 3. The moving image shows an object to be determined as being an air bubble or a foreign object. The information processing device 3 may also perform the processing of extracting frame images from the moving image.
 S41では、物体検出部301が、上記の各フレーム画像から対象物を検出する。続いて、S42では、軌跡データ生成部302が、S41における対象物の検出結果に基づいて当該対象物が移動した軌跡を示す軌跡データ311を生成する。なお、以下では、1つの対象物と、その対象物が移動した軌跡を示す1つの軌跡データ311が生成された場合の処理を説明する。複数の軌跡データ311が生成された場合には各軌跡データ311について以下説明するS43~S47の処理が行われる。 In S41, the object detection unit 301 detects an object from each of the frame images. Then, in S42, the trajectory data generation unit 302 generates trajectory data 311 indicating the trajectory of the object movement based on the object detection result in S41. Note that the following describes the process when one object and one trajectory data 311 indicating the trajectory of the object movement are generated. When multiple trajectory data 311 are generated, the processes of S43 to S47 described below are performed for each trajectory data 311.
 S43では、差異特定部307が、学習に用いた物体と推論の対象物との差異、および、学習に用いた物体の周囲の環境と当該対象物の周囲の環境との差異の少なくとも何れかに基づき、当該物体の移動におけるコンテキストと当該対象物の移動におけるコンテキストとの差異を特定する。例えば、学習時と推論時とで容器に封入する液体の粘度が異なる場合、差異特定部307は、液体の粘度の差異に基づいて、容器内の液体が定常化する時刻を算出し、当該時刻と学習時において容器内の液体が定常化した時刻との差を算出してもよい。 In S43, the difference identification unit 307 identifies the difference between the context in which the object moves and the context in which the object moves, based on at least one of the difference between the object used in learning and the object to be inferred, and the difference between the environment surrounding the object used in learning and the environment surrounding the object. For example, if the viscosity of the liquid sealed in the container is different during learning and during inference, the difference identification unit 307 may calculate the time at which the liquid in the container becomes steady based on the difference in the viscosity of the liquid, and calculate the difference between this time and the time at which the liquid in the container becomes steady during learning.
 S44では、調整部308が、S43で特定されたコンテキスト間の差異を吸収するように、特徴量の生成に用いる時刻情報および位置情報の少なくとも何れかを調整する。例えば、S43において上記のように定常化する時刻の差が算出された場合、調整部308は、その時刻の差を吸収するように時刻情報を調整する。なお、コンテキスト間に差異がない場合にはS43およびS44の処理は省略される。また、学習時に時刻情報および位置情報の少なくとも一方が正規化されていた場合、調整部308は、特徴量の生成に用いる時刻情報および位置情報についても同様に正規化する。 In S44, the adjustment unit 308 adjusts at least one of the time information and the location information used to generate the features so as to absorb the difference between the contexts identified in S43. For example, if a difference in the time at which the features become stable is calculated as described above in S43, the adjustment unit 308 adjusts the time information so as to absorb the time difference. Note that if there is no difference between the contexts, the processes of S43 and S44 are omitted. Furthermore, if at least one of the time information and the location information was normalized during learning, the adjustment unit 308 similarly normalizes the time information and the location information used to generate the features.
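If min–max normalization had been applied at learning time, the corresponding adjustment at inference time could be sketched as follows; min–max scaling is an assumed example, since the publication does not specify the normalization method.

```python
def min_max_normalize(values, v_min, v_max):
    """Scale time or position values to [0, 1] using the bounds from learning time."""
    span = v_max - v_min
    if span == 0:
        span = 1.0
    return [(v - v_min) / span for v in values]
```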
 S45では、特徴量生成部303が、コンテキストに応じた特徴量を生成する。具体的には、特徴量生成部303は、1つの軌跡データ311に対応する複数のフレーム画像のそれぞれについて、当該フレーム画像に写る対象物の位置情報および時刻情報を当該軌跡データ311から抽出する。そして、特徴量生成部303は、抽出した時刻情報および位置情報を特徴量生成モデル312に入力して特徴量を生成する。これにより、各フレーム画像について、当該フレーム画像に写る対象物のコンテキストに応じた特徴量が生成される。 In S45, the feature generation unit 303 generates features according to the context. Specifically, for each of a plurality of frame images corresponding to one piece of trajectory data 311, the feature generation unit 303 extracts position information and time information of an object appearing in the frame image from the trajectory data 311. The feature generation unit 303 then inputs the extracted time information and position information into the feature generation model 312 to generate features. As a result, for each frame image, features according to the context of the object appearing in the frame image are generated.
 S46では、統合部304が、S42で生成された軌跡データ311と、S45で生成された特徴量とを統合して統合データを生成する。そして、S47では、推論部305が、S45で生成された特徴量に基づいて対象物に関する所定の推論を行う。具体的には、推論部305は、S45で生成された特徴量が反映された統合データを推論モデル313に入力することにより推論結果を得て、これにより図8の処理は終了する。なお、推論部305は、推論結果を出力部34等に出力させてもよいし、記憶部31等に記憶させてもよい。 In S46, the integration unit 304 integrates the trajectory data 311 generated in S42 with the feature quantities generated in S45 to generate integrated data. Then, in S47, the inference unit 305 performs a predetermined inference regarding the object based on the feature quantities generated in S45. Specifically, the inference unit 305 obtains an inference result by inputting the integrated data reflecting the feature quantities generated in S45 to the inference model 313, and the processing in FIG. 8 ends. Note that the inference unit 305 may output the inference result to the output unit 34, etc., or may store it in the memory unit 31, etc.
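The overall inference flow S41 to S47 can be pictured as in the sketch below; the detector, feature_model and inference_model interfaces are assumptions made for illustration, and the integration step is shown as a simple pairing of trajectory entries with the generated features.

```python
def run_inference(frames, detector, feature_model, inference_model):
    """Sketch of the inference flow of Fig. 8 (S41-S47) under assumed interfaces.

    detector(frame)                  -> (x, y) detected position of the object
    feature_model(times, positions)  -> per-frame context-dependent features
    inference_model(integrated)      -> e.g. "bubble" or "foreign matter"
    """
    # S41-S42: detect the object in each frame and build trajectory data
    times = list(range(len(frames)))
    positions = [detector(frame) for frame in frames]

    # S45: generate context-dependent features from time and position information
    features = feature_model(times, positions)

    # S46: integrate the trajectory data with the generated features
    integrated = list(zip(times, positions, features))

    # S47: perform the predetermined inference on the integrated data
    return inference_model(integrated)
```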
 以上のように、本例示的実施形態に係る推論方法は、所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像における対象物の検出位置を示す位置情報と、当該フレーム画像が撮影されたタイミングを示す時刻情報とを用いて、当該フレーム画像に写る対象物のコンテキストに応じた特徴量を生成すること(S45)と、生成された特徴量に基づいて対象物に関する所定の推論を行うこと(S47)と、を含む。これにより、計算コストを抑えつつ、コンテキストを加味した推論を行うことができるという効果が得られる。 As described above, the inference method according to this exemplary embodiment includes generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, features corresponding to the context of the object appearing in the frame images using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured (S45), and making a predetermined inference about the object based on the generated features (S47). This has the effect of making it possible to perform inference that takes into account the context while keeping computational costs down.
 〔変形例〕
 上述の例示的実施形態で説明した各処理の実行主体は任意であり、上述の例に限られない。つまり、相互に通信可能な複数の装置により情報処理装置1~3と同様の機能を備えた情報処理システムを構築することができる。例えば、図7のフロー図における処理と図8のフロー図における処理とをそれぞれ別の情報処理装置(あるいはプロセッサ)に実行させてもよい。また、図7あるいは図8に示すフロー図における各処理を複数の情報処理装置(あるいはプロセッサ)に分担させて実行させることもできる。
[Modifications]
The execution subject of each process described in the above exemplary embodiment is arbitrary and is not limited to the above example. In other words, an information processing system having the same functions as the information processing devices 1 to 3 can be constructed by a plurality of devices that can communicate with each other. For example, the process in the flow chart of FIG. 7 and the process in the flow chart of FIG. 8 may be executed by different information processing devices (or processors). Also, each process in the flow chart shown in FIG. 7 or FIG. 8 can be shared and executed by a plurality of information processing devices (or processors).
 また、推論部11、22、305が実行する所定の推論の内容は対象物に関するものであればよく、特に限定されない。例えば、例示的実施形態2で説明したような分類あるいは識別の他、予測、変換等であってもよい。 Furthermore, the content of the predetermined inference executed by the inference units 11, 22, and 305 is not particularly limited as long as it is related to the object. For example, in addition to classification or identification as described in the exemplary embodiment 2, it may be prediction, conversion, etc.
 また、コンテキストを生じさせる要因も任意である。例えば、情報処理装置2または3によれば、所定の周期で動作が変化する各種機器や、所定の周期で変化する自然現象などに起因して生じるコンテキストに沿って移動する対象物について、当該コンテキストを加味した推論を行うことができる。また、情報処理装置1または3によれば、上記のコンテキストを加味した推論を行うことを可能にする特徴量生成モデルを生成することができる。 Furthermore, the factors that give rise to a context are also arbitrary. For example, with information processing device 2 or 3, it is possible to perform inference that takes into account the context for an object that moves in accordance with a context that arises due to various devices whose operations change at a predetermined cycle, or due to natural phenomena that change at a predetermined cycle, etc. Furthermore, with information processing device 1 or 3, it is possible to generate a feature generation model that makes it possible to perform inference that takes into account the above-mentioned context.
 例えば、交通信号機の周囲の移動体(車両や人等)の移動は、交通信号機における周期的な発光制御の影響を受ける。つまり、上記移動体は、交通信号機の発光制御に起因するコンテキストに従って移動するといえる。 For example, the movement of moving objects (vehicles, people, etc.) around a traffic light is affected by the periodic light emission control of the traffic light. In other words, it can be said that the moving objects move according to the context resulting from the light emission control of the traffic light.
 このため、情報処理装置1または3は、上記コンテキストに沿って移動する移動体を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、時刻情報と位置情報とを特徴量生成モデルに入力することにより算出された特徴量に基づいて移動体に関する所定の推論を行い、推論の結果が所定の正解データに近付くように特徴量生成モデルを更新する、という処理を繰り返すことにより、上記コンテキストに応じた特徴量生成モデルを生成することができる。そして、情報処理装置2または3は、このようにして生成された特徴量生成モデルを用いて生成された特徴量に基づき、移動体に関する所定の推論を行うことにより、上記コンテキストを加味した妥当性の高い推論結果を得ることができる。推論内容は特に限定されず、例えば、所定時間後の移動体の位置予測、移動体の行動分類、あるいは移動体の異常行動の検知等であってもよい。なお、これらの推論においては、車両と歩行者間、車両間等の相互作用についても考慮することが好ましい。 For this reason, the information processing device 1 or 3 performs a predetermined inference regarding the moving object based on the feature calculated by inputting time information and position information into the feature generation model for each of a plurality of frame images extracted from a video image capturing a moving object moving along the context, and updates the feature generation model so that the inference result approaches predetermined correct answer data. By repeating this process, it is possible to generate a feature generation model according to the context. Then, the information processing device 2 or 3 performs a predetermined inference regarding the moving object based on the feature generated using the feature generation model thus generated, thereby obtaining a highly valid inference result that takes the context into account. The content of the inference is not particularly limited, and may be, for example, a position prediction of the moving object after a predetermined time, a behavior classification of the moving object, or detection of abnormal behavior of the moving object. It is preferable that these inferences also take into account interactions between vehicles and pedestrians, between vehicles, etc.
 〔ソフトウェアによる実現例〕
 情報処理装置1~3の一部又は全部の機能は、集積回路(ICチップ)等のハードウェアによって実現してもよいし、ソフトウェアによって実現してもよい。
[Software implementation example]
Some or all of the functions of the information processing devices 1 to 3 may be realized by hardware such as an integrated circuit (IC chip), or may be realized by software.
 後者の場合、情報処理装置1~3は、例えば、各機能を実現するソフトウェアであるプログラム(推論プログラム/学習プログラム)の命令を実行するコンピュータによって実現される。このようなコンピュータの一例(以下、コンピュータCと記載する)を図9に示す。コンピュータCは、少なくとも1つのプロセッサC1と、少なくとも1つのメモリC2と、を備えている。メモリC2には、コンピュータCを情報処理装置1~3の何れかとして動作させるためのプログラムPが記録されている。コンピュータCにおいて、プロセッサC1は、プログラムPをメモリC2から読み取って実行することにより、情報処理装置1~3の何れかの機能が実現される。 In the latter case, information processing devices 1 to 3 are realized, for example, by a computer that executes instructions of a program (inference program/learning program), which is software that realizes each function. An example of such a computer (hereinafter referred to as computer C) is shown in Figure 9. Computer C has at least one processor C1 and at least one memory C2. Memory C2 stores program P for operating computer C as any one of information processing devices 1 to 3. In computer C, processor C1 reads and executes program P from memory C2, thereby realizing the function of any one of information processing devices 1 to 3.
 プロセッサC1としては、例えば、CPU(Central Processing Unit)、GPU(Graphic Processing Unit)、DSP(Digital Signal Processor)、MPU(Micro Processing Unit)、FPU(Floating point number Processing Unit)、PPU(Physics Processing Unit)、マイクロコントローラ、又は、これらの組み合わせなどを用いることができる。メモリC2としては、例えば、フラッシュメモリ、HDD(Hard Disk Drive)、SSD(Solid State Drive)、又は、これらの組み合わせなどを用いることができる。 The processor C1 may be, for example, a CPU (Central Processing Unit), GPU (Graphic Processing Unit), DSP (Digital Signal Processor), MPU (Micro Processing Unit), FPU (Floating point number Processing Unit), PPU (Physics Processing Unit), microcontroller, or a combination of these. The memory C2 may be, for example, a flash memory, HDD (Hard Disk Drive), SSD (Solid State Drive), or a combination of these.
 なお、コンピュータCは、プログラムPを実行時に展開したり、各種データを一時的に記憶したりするためのRAM(Random Access Memory)を更に備えていてもよい。また、コンピュータCは、他の装置との間でデータを送受信するための通信インタフェースを更に備えていてもよい。また、コンピュータCは、キーボードやマウス、ディスプレイやプリンタなどの入出力機器を接続するための入出力インタフェースを更に備えていてもよい。 Computer C may further include a RAM (Random Access Memory) for expanding program P during execution and for temporarily storing various data. Computer C may further include a communications interface for sending and receiving data to and from other devices. Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.
 また、プログラムPは、コンピュータCが読み取り可能な、一時的でない有形の記録媒体Mに記録することができる。このような記録媒体Mとしては、例えば、テープ、ディスク、カード、半導体メモリ、又はプログラマブルな論理回路などを用いることができる。コンピュータCは、このような記録媒体Mを介してプログラムPを取得することができる。また、プログラムPは、伝送媒体を介して伝送することができる。このような伝送媒体としては、例えば、通信ネットワーク、又は放送波などを用いることができる。コンピュータCは、このような伝送媒体を介してプログラムPを取得することもできる。 The program P can also be recorded on a non-transitory, tangible recording medium M that can be read by the computer C. Such a recording medium M can be, for example, a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit. The computer C can obtain the program P via such a recording medium M. The program P can also be transmitted via a transmission medium. Such a transmission medium can be, for example, a communications network or broadcast waves. The computer C can also obtain the program P via such a transmission medium.
 〔付記事項1〕
 本発明は、上述した実施形態に限定されるものでなく、請求項に示した範囲で種々の変更が可能である。例えば、上述した実施形態に開示された技術的手段を適宜組み合わせて得られる実施形態についても、本発明の技術的範囲に含まれる。
[Additional Note 1]
The present invention is not limited to the above-described embodiment, and various modifications are possible within the scope of the claims. For example, embodiments obtained by appropriately combining the technical means disclosed in the above-described embodiment are also included in the technical scope of the present invention.
 〔付記事項2〕
 上述した実施形態の一部又は全部は、以下のようにも記載され得る。ただし、本発明は、以下の記載する態様に限定されるものではない。
[Additional Note 2]
Some or all of the above-described embodiments can be described as follows. However, the present invention is not limited to the aspects described below.
 (付記1)
 所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記対象物の検出位置を示す位置情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成する特徴量生成手段と、前記特徴量に基づいて前記対象物に関する所定の推論を行う推論手段と、を備える情報処理装置。
(Appendix 1)
An information processing device comprising: a feature generation means for generating features corresponding to the context of an object appearing in a plurality of frame images extracted from a video capturing an object moving along a specified context, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image; and an inference means for making a specified inference regarding the object based on the features.
 (付記2)
 前記特徴量生成手段は、前記コンテキストと同一または類似のコンテキストに沿って移動する物体が撮影されたタイミングを示す時刻情報、および、当該タイミングで撮影された画像における前記物体の検出位置を示す位置情報と、当該タイミングにおける前記物体の前記コンテキストに応じた特徴量との関係を学習した特徴量生成モデルを用いて前記特徴量を生成する、付記1に記載の情報処理装置。
(Appendix 2)
The information processing device described in Appendix 1, wherein the feature generation means generates the feature using a feature generation model that learns the relationship between time information indicating a timing at which an object moving in accordance with a context identical or similar to the context was photographed, position information indicating the detection position of the object in the image photographed at that timing, and feature values corresponding to the context of the object at that timing.
 (付記3)
 前記複数のフレーム画像からの前記対象物の検出結果に基づいて当該対象物が移動した軌跡を示す軌跡データを生成する軌跡データ生成手段と、前記軌跡データと前記特徴量生成手段が生成した特徴量とを統合して統合データを生成する統合手段と、を備え、前記特徴量生成手段は、前記軌跡データから抽出した前記位置情報と前記時刻情報とを用いて前記特徴量を生成し、前記推論手段は、前記統合データを用いて前記推論を行う、付記1または2に記載の情報処理装置。
(Appendix 3)
3. The information processing device according to claim 1, further comprising: a trajectory data generation means for generating trajectory data indicating a trajectory of movement of an object based on a detection result of the object from the plurality of frame images; and an integration means for integrating the trajectory data and a feature generated by the feature generation means to generate integrated data, wherein the feature generation means generates the feature using the position information and the time information extracted from the trajectory data, and the inference means performs the inference using the integrated data.
 (付記4)
 前記物体の移動における前記コンテキストと、前記対象物の移動における前記コンテキストとの差異を吸収するように、前記特徴量の生成に用いる時刻情報および位置情報の少なくとも何れかを調整する調整手段を備える、付記2に記載の情報処理装置。
(Appendix 4)
3. The information processing device according to claim 2, further comprising an adjustment means for adjusting at least one of time information and position information used in generating the feature amount so as to absorb a difference between the context in the movement of the object and the context in the movement of the target object.
 (付記5)
 前記物体と前記対象物との差異、および、前記物体の周囲の環境と前記対象物の周囲の環境との差異の少なくとも何れかに基づき、前記物体の移動における前記コンテキストと、前記対象物の移動における前記コンテキストとの差異を特定する差異特定手段を備える、付記4に記載の情報処理装置。
(Appendix 5)
An information processing device as described in Appendix 4, comprising a difference identification means for identifying a difference between the context in the movement of the object and the context in the movement of the target object based on at least one of a difference between the object and the target object and a difference between an environment surrounding the object and an environment surrounding the target object.
 (付記6)
 少なくとも1つのプロセッサが、所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像における前記対象物の検出位置を示す位置情報と、当該フレーム画像が撮影されたタイミングを示す時刻情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成することと、前記特徴量に基づいて前記対象物に関する所定の推論を行うことと、を含む推論方法。
(Appendix 6)
An inference method comprising: at least one processor generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, features of the object appearing in the frame images corresponding to the context, using position information indicating the detection position of the object in the frame images and time information indicating the timing when the frame images were captured; and making a predetermined inference regarding the object based on the features.
 (付記7)
 コンピュータを所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記対象物の検出位置を示す位置情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成する特徴量生成手段、および前記特徴量に基づいて前記対象物に関する所定の推論を行う推論手段、として機能させる推論プログラム。
(Appendix 7)
An inference program that causes a computer to function as a feature generation means that generates features corresponding to the context of an object appearing in a plurality of frame images extracted from a video of an object moving along a specified context, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image, and an inference means that makes a specified inference about the object based on the features.
 (付記8)
 少なくとも1つのプロセッサが、所定のコンテキストに沿って移動する物体を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記物体の検出位置を示す位置情報とを、前記コンテキストに応じた特徴量を生成するための特徴量生成モデルに入力することにより算出された前記特徴量に基づいて前記物体に関する所定の推論を行うことと、前記推論の結果が所定の正解データに近付くように前記特徴量生成モデルを更新することと、を含む特徴量生成モデルの生成方法。
(Appendix 8)
A method for generating a feature generation model, comprising: at least one processor inputting, for each of a plurality of frame images extracted from a video capturing an object moving in accordance with a predetermined context, time information indicating the timing at which the frame image was captured and positional information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, thereby making a predetermined inference about the object based on the calculated features; and updating the feature generation model so that the result of the inference approaches predetermined correct answer data.
 (付記9)
 少なくとも1つのプロセッサが前記複数のフレーム画像間の類似度を算出することを含み、前記特徴量生成モデルの更新において、前記複数のフレーム画像間の類似度が、当該フレーム画像について前記特徴量生成モデルにより生成される特徴量間の類似度に反映されるように前記特徴量生成モデルを更新する、付記8に記載の特徴量生成モデルの生成方法。
(Appendix 9)
9. The method for generating a feature generation model described in Appendix 8, further comprising: at least one processor calculating a similarity between the plurality of frame images; and updating the feature generation model such that the similarity between the plurality of frame images is reflected in a similarity between features generated by the feature generation model for the frame images, in updating the feature generation model.
 〔付記事項3〕
 上述した実施形態の一部又は全部は、更に、以下のように表現することもできる。少なくとも1つのプロセッサを備え、前記プロセッサは、所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記対象物の検出位置を示す位置情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成する処理と、前記特徴量に基づいて前記対象物に関する所定の推論を行う処理とを実行する情報処理装置。
[Additional Note 3]
Some or all of the above-described embodiments can also be expressed as follows: An information processing device including at least one processor, which executes, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a process of generating a feature amount of the object appearing in the frame image according to the context using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image, and a process of making a predetermined inference regarding the object based on the feature amount.
 なお、この情報処理装置は、更にメモリを備えていてもよく、このメモリには、前記特徴量を生成する処理と、前記所定の推論を行う処理とを前記プロセッサに実行させるための推論プログラムが記憶されていてもよい。また、この推論プログラムは、コンピュータ読み取り可能な一時的でない有形の記録媒体に記録されていてもよい。 The information processing device may further include a memory, and the memory may store an inference program for causing the processor to execute the process of generating the feature amount and the process of performing the predetermined inference. The inference program may also be recorded on a computer-readable, non-transitory, tangible recording medium.
1   情報処理装置
11  推論部
12  学習部
2   情報処理装置
21  特徴量生成部
22  推論部
3   情報処理装置
311 軌跡データ
312 特徴量生成モデル
302 軌跡データ生成部
303 特徴量生成部
304 統合部
305 推論部
306 学習部
307 差異特定部
308 調整部

 
Reference Signs List 1 Information processing device 11 Inference unit 12 Learning unit 2 Information processing device 21 Feature amount generation unit 22 Inference unit 3 Information processing device 311 Trajectory data 312 Feature amount generation model 302 Trajectory data generation unit 303 Feature amount generation unit 304 Integration unit 305 Inference unit 306 Learning unit 307 Difference identification unit 308 Adjustment unit

Claims (9)

  1.  所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記対象物の検出位置を示す位置情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成する特徴量生成手段と、
     前記特徴量に基づいて前記対象物に関する所定の推論を行う推論手段と、を備える情報処理装置。
    a feature generating means for generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image;
    and an inference means for performing a predetermined inference regarding the object based on the feature amount.
  2.  前記特徴量生成手段は、前記コンテキストと同一または類似のコンテキストに沿って移動する物体が撮影されたタイミングを示す時刻情報、および、当該タイミングで撮影された画像における前記物体の検出位置を示す位置情報と、当該タイミングにおける前記物体の前記コンテキストに応じた特徴量との関係を学習した特徴量生成モデルを用いて前記特徴量を生成する、請求項1に記載の情報処理装置。 The information processing device according to claim 1, wherein the feature generation means generates the feature using a feature generation model that learns the relationship between time information indicating the timing at which an object moving along a context identical to or similar to the context was photographed, position information indicating the detection position of the object in the image photographed at that timing, and the feature of the object according to the context at that timing.
  3.  前記複数のフレーム画像からの前記対象物の検出結果に基づいて当該対象物が移動した軌跡を示す軌跡データを生成する軌跡データ生成手段と、
     前記軌跡データと前記特徴量生成手段が生成した特徴量とを統合して統合データを生成する統合手段と、を備え、
     前記特徴量生成手段は、前記軌跡データから抽出した前記位置情報と前記時刻情報とを用いて前記特徴量を生成し、
     前記推論手段は、前記統合データを用いて前記推論を行う、請求項1または2に記載の情報処理装置。
    a trajectory data generating means for generating trajectory data indicating a trajectory of the movement of the object based on a detection result of the object from the plurality of frame images;
    an integration unit that integrates the trajectory data and the feature generated by the feature generating unit to generate integrated data,
    the feature generating means generates the feature using the position information and the time information extracted from the trajectory data;
    The information processing apparatus according to claim 1 , wherein the inference means performs the inference using the integrated data.
  4.  前記物体の移動における前記コンテキストと、前記対象物の移動における前記コンテキストとの差異を吸収するように、前記特徴量の生成に用いる時刻情報および位置情報の少なくとも何れかを調整する調整手段を備える、請求項2に記載の情報処理装置。 The information processing device according to claim 2, further comprising an adjustment means for adjusting at least one of the time information and the position information used to generate the feature amount so as to absorb the difference between the context in the movement of the object and the context in the movement of the target object.
  5.  前記物体と前記対象物との差異、および、前記物体の周囲の環境と前記対象物の周囲の環境との差異の少なくとも何れかに基づき、前記物体の移動における前記コンテキストと、前記対象物の移動における前記コンテキストとの差異を特定する差異特定手段を備える、請求項4に記載の情報処理装置。 The information processing device according to claim 4, further comprising a difference identification means for identifying a difference between the context in the movement of the object and the context in the movement of the target object based on at least one of a difference between the object and the target object and a difference between the environment surrounding the object and the environment surrounding the target object.
  6.  少なくとも1つのプロセッサが、
     所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像における前記対象物の検出位置を示す位置情報と、当該フレーム画像が撮影されたタイミングを示す時刻情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成することと、
     前記特徴量に基づいて前記対象物に関する所定の推論を行うことと、を含む推論方法。
    At least one processor
    For each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, generating a feature quantity of the object appearing in the frame image according to the context, using position information indicating a detection position of the object in the frame image and time information indicating a timing when the frame image was captured;
    and making a predetermined inference about the object based on the feature amount.
  7.  コンピュータを
     所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記対象物の検出位置を示す位置情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成する特徴量生成手段、および
     前記特徴量に基づいて前記対象物に関する所定の推論を行う推論手段、として機能させる推論プログラム。
    An inference program that causes a computer to function as: a feature generation means that generates, for each of a plurality of frame images extracted from a video of an object moving along a specified context, features corresponding to the context of the object appearing in the frame images, using time information indicating the timing at which the frame images were captured and position information indicating the detection position of the object in the frame images; and an inference means that makes a specified inference about the object based on the features.
  8.  少なくとも1つのプロセッサが、
     所定のコンテキストに沿って移動する物体を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記物体の検出位置を示す位置情報とを、前記コンテキストに応じた特徴量を生成するための特徴量生成モデルに入力することにより算出された前記特徴量に基づいて前記物体に関する所定の推論を行うことと、
     前記推論の結果が所定の正解データに近付くように前記特徴量生成モデルを更新することと、を含む特徴量生成モデルの生成方法。
    At least one processor
    For each of a plurality of frame images extracted from a video capturing an object moving according to a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image are input to a feature generation model for generating features according to the context, thereby performing a predetermined inference regarding the object based on the calculated features;
    updating the feature generation model so that the result of the inference approaches predetermined correct answer data.
  9.  少なくとも1つのプロセッサが前記複数のフレーム画像間の類似度を算出することを含み、
     前記特徴量生成モデルの更新において、前記複数のフレーム画像間の類似度が、当該フレーム画像について前記特徴量生成モデルにより生成される特徴量間の類似度に反映されるように前記特徴量生成モデルを更新する、請求項8に記載の特徴量生成モデルの生成方法。

     
    at least one processor calculates a similarity between the plurality of frame images;
    9. The method for generating a feature generation model according to claim 8, further comprising updating the feature generation model so that a similarity between the plurality of frame images is reflected in a similarity between features generated by the feature generation model for the frame images.

PCT/JP2022/046044 2022-12-14 2022-12-14 Information processing device, inference method, inference program, and method for generating feature value generation model WO2024127554A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/046044 WO2024127554A1 (en) 2022-12-14 2022-12-14 Information processing device, inference method, inference program, and method for generating feature value generation model


Publications (1)

Publication Number Publication Date
WO2024127554A1 true WO2024127554A1 (en) 2024-06-20

Family

ID=91484702


Country Status (1)

Country Link
WO (1) WO2024127554A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021214994A1 (en) * 2020-04-24 2021-10-28 日本電気株式会社 Inspection system
JP7138264B1 (en) * 2021-10-08 2022-09-15 楽天グループ株式会社 Information processing device, information processing method, information processing system, and program



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22968466

Country of ref document: EP

Kind code of ref document: A1