
WO2024127554A1 - Information processing device, inference method, inference program, and method for generating feature value generation model


Info

Publication number
WO2024127554A1
WO2024127554A1 (PCT/JP2022/046044)
Authority
WO
WIPO (PCT)
Prior art keywords
inference
context
feature
frame images
information indicating
Prior art date
Application number
PCT/JP2022/046044
Other languages
French (fr)
Japanese (ja)
Inventor
あずさ 澤田
尚司 谷内田
Original Assignee
日本電気株式会社
Priority date
Filing date
Publication date
Application filed by 日本電気株式会社
Priority to PCT/JP2022/046044
Publication of WO2024127554A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion

Definitions

  • One aspect of the present invention aims to realize an information processing device or the like that is capable of performing inference that takes into account context while keeping computational costs down.
  • An information processing device includes a feature generating means for generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference means for making a predetermined inference regarding the object based on the feature.
  • An inference method includes at least one processor generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, features of the object appearing in the frame images according to the context using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured, and performing a predetermined inference regarding the object based on the features.
  • An inference program causes a computer to function as a feature generating means for generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference means for making a predetermined inference regarding the object based on the feature.
  • A method for generating a feature generation model includes: at least one processor inputting, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, and performing a predetermined inference regarding the object based on the calculated features; and updating the feature generation model so that the result of the inference approaches predetermined ground truth data.
  • FIG. 1 is a block diagram showing the configuration of an information processing device according to a first exemplary embodiment of the present invention.
  • FIG. 2 is a flow diagram showing the flow of a method for generating a feature generation model and an inference method according to the first exemplary embodiment of the present invention.
  • FIG. 3 is a diagram for explaining a method of inspection for checking for foreign matter.
  • FIG. 4 is a diagram for explaining an overview of an inference method according to a second exemplary embodiment of the present invention.
  • FIG. 5 is a block diagram showing an example of the configuration of an information processing device according to the second exemplary embodiment of the present invention.
  • FIG. 6 is a diagram showing an example in which a difference occurs between contexts during learning and inference.
  • FIG. 7 is a flowchart showing the flow of processing performed by an information processing device according to the second exemplary embodiment of the present invention during learning.
  • FIG. 8 is a flowchart showing the flow of processing performed during inference by an information processing device according to the second exemplary embodiment of the present invention.
  • FIG. 9 is a diagram showing an example of a computer that executes instructions of a program, which is software that realizes the functions of each device according to each exemplary embodiment of the present invention.
  • Exemplary Embodiment 1: A first exemplary embodiment of the present invention will be described in detail with reference to the drawings. This exemplary embodiment is a basic form of the exemplary embodiments described below. First, information processing devices 1 and 2 according to this exemplary embodiment will be described with reference to Fig. 1. Fig. 1 is a block diagram showing the configuration of the information processing devices 1 and 2.
  • the information processing device 1 includes an inference unit 11 and a learning unit 12.
  • the inference unit 11 performs a predetermined inference regarding an object moving according to a predetermined context, based on features calculated by inputting time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context.
  • the learning unit 12 updates the feature generation model so that the result of the inference by the inference unit 11 approaches the predetermined correct answer data.
  • the information processing device 1 includes an inference unit 11 that performs a predetermined inference regarding the object based on features calculated by inputting, for each of a plurality of frame images extracted from a video capturing an object moving according to a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, and a learning unit 12 that updates the feature generation model so that the result of the inference by the inference unit 11 approaches predetermined correct answer data.
  • the above configuration makes it possible to generate a feature generation model that can generate features according to the context from time information and location information. This has the effect of making it possible to perform inference that takes into account the context while reducing computational costs compared to conventional techniques that generate context features from the entire video.
  • the information processing device 2 includes a feature generating unit 21 and an inference unit 22. For each of a plurality of frame images extracted from a moving image capturing an object moving along a predetermined context, the feature generating unit 21 generates a feature corresponding to the context of the object captured in the frame image by using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image.
  • the inference unit 22 performs a predetermined inference regarding the object based on the features generated by the feature generation unit 21.
  • the information processing device 2 includes a feature generation unit 21 that generates features according to the context of an object appearing in a plurality of frame images extracted from a video capturing an object moving along a predetermined context, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference unit 22 that performs a predetermined inference regarding the object based on the features generated by the feature generation unit 21.
  • An object that moves according to a specific context is affected differently at each point in time based on that context. Therefore, with the above configuration, it is possible to generate features according to the context of the object and perform inference based on those features.
  • the information processing device 2 has the advantage of being able to perform inference that takes context into account while keeping computational costs down compared to conventional techniques that generate context features from the entire video image.
  • the above-mentioned functions of the information processing device 1 can also be realized by a program.
  • the learning program according to this exemplary embodiment causes a computer to function as an inference unit 11 that performs a predetermined inference regarding the object based on a feature calculated by inputting time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image into a feature generation model for generating a feature according to the context, for each of a plurality of frame images extracted from a video image capturing an object moving along a predetermined context, and a learning unit 12 that updates the feature generation model so that the result of the inference by the inference unit 11 approaches a predetermined correct answer data.
  • a feature generation model capable of generating a feature according to a context from time information and position information can be generated, and thus an effect is obtained in which it is possible to perform inference taking into account the context while suppressing calculation costs.
  • the inference program causes a computer to function as a feature amount generating unit 21 that generates a feature amount according to the context of an object appearing in a plurality of frame images extracted from a video in which an object moving along a predetermined context is captured, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image, and an inference unit 22 that performs a predetermined inference regarding the object based on the feature amount generated by the feature amount generating unit 21.
  • This inference program provides the effect of being able to perform inference taking into account the context while suppressing calculation costs.
  • Fig. 2 is a flow diagram showing the flow of the method for generating a feature generation model and the inference method. Note that the execution subject of each step in the methods shown in Fig. 2 may be a processor provided in the information processing device 1 or 2 or a processor provided in another device, and each step may be executed by a processor provided in a different device.
  • the flow diagram shown on the left side of FIG. 2 illustrates a method for generating a feature generation model according to this exemplary embodiment.
  • In S11, at least one processor inputs, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, and performs a predetermined inference regarding the object based on the calculated features.
  • In S12, at least one processor updates the feature generation model so that the result of the inference in S11 approaches the predetermined ground truth data.
  • As described above, the method for generating a feature generation model includes at least one processor inputting, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, thereby performing a predetermined inference regarding the object based on the calculated features (S11), and updating the feature generation model so that the result of the inference in S11 approaches predetermined ground truth data (S12).
  • This makes it possible to generate a feature generation model capable of generating features according to the context from the time information and position information, which has the effect of making it possible to perform inference taking into account the context while suppressing calculation costs.
  • the flow diagram shown on the right side of FIG. 2 illustrates an inference method according to this exemplary embodiment.
  • In S21, at least one processor generates, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a feature quantity corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image.
  • In S22, at least one processor performs a predetermined inference regarding the object based on the features generated in S21.
  • the inference method includes at least one processor generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, feature values of the object appearing in the frame images according to the above-mentioned context using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured (S21), and making a predetermined inference about the object based on the feature values generated in S21 (S22).
  • This provides the effect of being able to make inferences that take into account the context while keeping computational costs down.
  • Exemplary Embodiment 2: A second exemplary embodiment of the present invention will be described in detail with reference to the drawings.
  • an inference method according to this exemplary embodiment (hereinafter, referred to as this inference method) is used in an inspection to check whether a foreign object is present in a liquid (e.g., medicine, beverage, etc.) sealed in a transparent container (hereinafter, referred to as a foreign object confirmation inspection).
  • FIG. 3 is a diagram for explaining the method of foreign object inspection.
  • A container filled with a predetermined liquid, which is the object to be inspected, is fixed in a device (not shown in FIG. 3), and the device is used to rock the container.
  • a control sequence for rocking the container is determined in advance.
  • the control sequence in the example of FIG. 3 is such that the container is rotated in a vertical plane for a predetermined time, then stopped for a predetermined time, and then rotated in a horizontal plane for a predetermined time. This control sequence may be repeated multiple times.
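  • As a rough illustration of how such a control sequence might be represented in software, the sketch below (a hypothetical Python example, not part of the disclosed embodiment) encodes the sequence as (action, duration) pairs and looks up the control phase at a given shooting time; the action names and durations are placeholders.

```python
# Hypothetical sketch: representing the rocking control sequence as (action, duration) pairs
# and looking up which control phase a given shooting time falls into.

CONTROL_SEQUENCE = [
    ("rotate_vertical", 5.0),    # rotate in a vertical plane for a predetermined time (placeholder: 5 s)
    ("rest", 3.0),               # stop for a predetermined time (placeholder: 3 s)
    ("rotate_horizontal", 5.0),  # rotate in a horizontal plane for a predetermined time (placeholder: 5 s)
]

def phase_at(t: float, sequence=CONTROL_SEQUENCE, repeat: bool = True) -> str:
    """Return the control phase active at time t (seconds from the start of the sequence)."""
    total = sum(duration for _, duration in sequence)
    if repeat:
        t = t % total  # the control sequence may be repeated multiple times
    elapsed = 0.0
    for action, duration in sequence:
        if t < elapsed + duration:
            return action
        elapsed += duration
    return sequence[-1][0]

if __name__ == "__main__":
    for t in (1.0, 6.0, 9.0, 14.0):
        print(t, phase_at(t))
```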
  • the container is rocked using the above-mentioned control while a moving image of the liquid inside the container is captured.
  • Frame images are extracted from the captured moving image, and object detection is performed on each frame image.
  • Then, for each object detected by this object detection, it is determined whether the object is an air bubble or a foreign body. If there is no object determined to be a foreign body, the inspected item is determined to be a good product, and if there is even one object determined to be a foreign body, it is determined to be a defective product.
  • The device rocks the container according to a specific control pattern, so that the object inside the container (an air bubble or a foreign body) moves along a specific context based on this pattern.
  • For example, the flow of liquid inside the container accelerates for a while after the container starts to rotate, so the object also accelerates along this flow.
  • Also, the object moves in the direction of the container's rotation.
  • After the container stops rotating, the flow rate of the liquid gradually slows down and stabilizes in a steady state.
  • The speed and direction of the object's movement during this time also follow the flow rate and direction of the liquid. The same applies to subsequent controls.
  • the inference in this inference method is to determine whether the object is an air bubble or a foreign object.
  • this inference method makes it possible to perform inference that takes context into account while keeping computational costs down compared to conventional techniques that generate context features from the entire video image.
  • Fig. 4 is a diagram for explaining an overview of this inference method.
  • Prior to execution of this inference method, frame images are first extracted from a video image capturing an object.
  • In the example of Fig. 4, n frame images FR1 to FRn are extracted.
  • In the frame image FR1, a container filled with liquid is captured, as well as objects OB1 and OB2. The same is true for the other frame images.
  • the objects are air bubbles and foreign objects. Since both are small in size and have similar appearances, it is difficult to accurately identify whether a detected object is an air bubble or a foreign object based on only one frame image.
  • trajectory data is generated that indicates the trajectory of the movement of the objects.
  • Fig. 4 shows schematic diagrams of trajectory data A1 of the object OB1 and trajectory data A2 of the object OB2.
  • The trajectory data A1 includes time information indicating the timing at which the frame image in which the object OB1 was detected was shot, and position information indicating the detected position of the object OB1 in the frame image. For example, assume that the object OB1 was detected in each of the frame images FR1 to FR10. In this case, the trajectory data A1 includes time information indicating the shooting timing of each of the frame images FR1 to FR10, and position information indicating the detected position of the object OB1 in the frame images FR1 to FR10.
  • the trajectory data A1 may also include information indicating the characteristics of the detected object OB1.
  • the trajectory data A1 may include an image patch that is an image cut out of an area in the frame image that includes the object OB1, or feature amounts extracted from the image patch, information indicating the size of the detected object, and information indicating the moving speed of the detected object.
  • normalized position information may be used to avoid being affected by differences in container size and liquid volume.
  • normalized position information can be generated by applying at least one of translation, rotation, and scale transformation to position information indicating the position of the object on the frame image.
  • the time information may also be normalized.
  • The time information of the frame image FR1 may be set to 0, the time information of the frame image FRn captured at the timing when a series of control sequences ends may be set to 1, and the time information of the frame images FR2 to FRn-1 may be determined based on these values.
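  • The following sketch illustrates one possible way to hold such trajectory data and to apply the normalization described above; the data layout, field names, and transformation parameters are illustrative assumptions, not the format used by the embodiment.

```python
# Hypothetical sketch of trajectory data for one detected object, with normalization of
# position information (translation / rotation / scale) and of time information (mapped to [0, 1]).
from dataclasses import dataclass, field
from typing import List, Tuple
import math

@dataclass
class Trajectory:
    times: List[float]                    # shooting timing of each frame in which the object was detected
    positions: List[Tuple[float, float]]  # detected (x, y) position of the object in each frame
    patches: List[object] = field(default_factory=list)  # optional: image patches or other appearance features

def normalize_positions(traj, center=(0.0, 0.0), angle=0.0, scale=1.0):
    """Apply translation, rotation and scale so results do not depend on container size or liquid volume."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for x, y in traj.positions:
        x, y = x - center[0], y - center[1]                   # translation
        x, y = cos_a * x - sin_a * y, sin_a * x + cos_a * y   # rotation
        out.append((x / scale, y / scale))                    # scale transformation
    return out

def normalize_times(traj, t_start, t_end):
    """Map shooting times so that the start of the control sequence is 0 and its end is 1."""
    return [(t - t_start) / (t_end - t_start) for t in traj.times]

# usage example with made-up values
traj = Trajectory(times=[0.0, 0.5, 1.0], positions=[(120, 80), (125, 78), (131, 77)])
print(normalize_positions(traj, center=(100, 100), scale=50.0))
print(normalize_times(traj, t_start=0.0, t_end=10.0))
```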
  • Similarly, trajectory data A2 is generated, which includes time information indicating the timing at which the frame image in which the object OB2 was detected was captured, and position information indicating the detected position of the object OB2 in the frame image.
  • the trajectory data A2 may also include information indicating the characteristics of the detected object OB2.
  • The trajectory data A1 and A2 may differ from each other. Also, only the trajectory data A1 and A2 for the two objects OB1 and OB2 are shown here, but a larger number of objects may be detected in an actual foreign object confirmation inspection.
  • the time information and position information contained in each of the trajectory data A1 and A2 are input into a feature generation model to generate feature quantities B1 and B2 according to the context.
  • These feature quantities are generated for each time. For example, a feature quantity at time t1 is generated from the time information and position information at that time t1.
  • A previously generated judgment function may be used to judge whether the detected position in the frame image is inside or outside the liquid region, and a value indicating the judgment result may be included in the feature quantities B1 and B2. This makes it possible to eliminate the influence of objects and the like detected outside the liquid region and obtain valid inference results.
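  • A minimal sketch of such a judgment function is shown below, assuming for illustration a circular liquid region in normalized coordinates; the region model and its parameters are placeholders.

```python
# Hypothetical sketch: a pre-generated judgment function that decides whether a detected
# position lies inside the liquid region, so that the result can be appended to the feature.
def inside_liquid_region(x: float, y: float,
                         center=(0.5, 0.5), radius=0.45) -> float:
    """Return 1.0 if (x, y) (normalized coordinates) falls inside an assumed circular liquid region."""
    dx, dy = x - center[0], y - center[1]
    return 1.0 if (dx * dx + dy * dy) ** 0.5 <= radius else 0.0

def append_region_flag(feature: list, x: float, y: float) -> list:
    """Append the judgment result as an extra dimension of the feature (B1 / B2 in the text)."""
    return feature + [inside_liquid_region(x, y)]

print(append_region_flag([0.2, -0.1], 0.5, 0.6))    # inside the assumed region -> flag 1.0
print(append_region_flag([0.2, -0.1], 0.98, 0.98))  # outside the assumed region -> flag 0.0
```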
  • the feature generation model makes it possible to generate features according to the context using time and position information, which are significantly smaller in data size than video images.
  • the trajectory data and features generated as described above are integrated to generate integrated data, and the integrated data is input to the inference model to output the inference result.
  • the inference result indicates whether each of the objects OB1 and OB2 is an air bubble or a foreign body.
  • the objects OB1 and OB2 detected in the above-mentioned foreign body confirmation inspection are both small in size and similar in appearance. For this reason, it is difficult to accurately determine whether a detected object is an air bubble or a foreign body.
  • this inference method uses integrated data that reflects features generated using a feature generation model to perform estimation that takes into account the context, making it possible to perform difficult estimations with high accuracy.
  • Fig. 5 is a block diagram showing an example of the configuration of the information processing device 3.
  • the information processing device 3 includes a control unit 30 that controls each unit of the information processing device 3, and a storage unit 31 that stores various data used by the information processing device 3.
  • the information processing device 3 also includes a communication unit 32 that allows the information processing device 3 to communicate with other devices, an input unit 33 that accepts input of various data to the information processing device 3, and an output unit 34 that allows the information processing device 3 to output various data.
  • the control unit 30 also includes an object detection unit 301, a trajectory data generation unit 302, a feature generation unit 303, an integration unit 304, an inference unit 305, a learning unit 306, a difference identification unit 307, an adjustment unit 308, and a similarity calculation unit 309.
  • the memory unit 31 stores trajectory data 311, a feature generation model 312, an inference model 313, and teacher data 314.
  • The learning unit 306, the similarity calculation unit 309, and the teacher data 314 will be described later in the section "About learning", and the difference identification unit 307 and the adjustment unit 308 will be described later in the section "About absorbing context differences".
  • The object detection unit 301 detects a specific object from each of multiple frame images extracted from a video. If the target video has been shot at a high frame rate, the object detection unit 301 can detect the object from each frame image with relatively lightweight image processing by utilizing the positional continuity. There are no particular limitations on the method of detecting the object. For example, the object detection unit 301 may detect the object using a detection model that has been trained to detect the object using an image of the object as training data. There are no particular limitations on the algorithm of the detection model. For example, the object detection unit 301 may use a detection model such as a convolutional neural network, a recurrent neural network, or a transformer, or a detection model that combines a plurality of these.
  • the trajectory data generation unit 302 generates trajectory data indicating the trajectory of the movement of an object based on the detection result of the object from multiple frame images by the object detection unit 301.
  • the trajectory data includes time information indicating the timing when the frame image in which the object is detected was captured, and position information indicating the detected position of the object in the frame image, and may also include image patches or the like as information indicating the characteristics of the detected object.
  • the time information may also include information indicating the time difference between the frame images.
  • the generated trajectory data is stored in the memory unit 31 as trajectory data 311.
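  • One simple way to build such trajectory data is to link detections in consecutive frames by nearest-neighbour association, as in the sketch below; this association rule and its distance threshold are illustrative assumptions, not the method actually used by the trajectory data generation unit 302.

```python
# Hypothetical sketch: linking per-frame detections into trajectories by nearest-neighbour
# association between consecutive frames. Each detection is (x, y); each frame has a time stamp.
from math import hypot

def link_detections(frame_times, detections_per_frame, max_dist=0.1):
    """detections_per_frame[i] is a list of (x, y) detections for the frame captured at frame_times[i].
    Returns a list of trajectories, each a list of (t, x, y)."""
    trajectories = []   # all trajectories built so far
    active = []         # indices of trajectories that may still be extended
    for t, dets in zip(frame_times, detections_per_frame):
        used = set()
        next_active = []
        for ti in active:
            last_t, last_x, last_y = trajectories[ti][-1]
            # pick the closest unused detection in the current frame
            best, best_d = None, max_dist
            for j, (x, y) in enumerate(dets):
                d = hypot(x - last_x, y - last_y)
                if j not in used and d < best_d:
                    best, best_d = j, d
            if best is not None:
                used.add(best)
                trajectories[ti].append((t, dets[best][0], dets[best][1]))
                next_active.append(ti)
        for j, (x, y) in enumerate(dets):
            if j not in used:                       # start a new trajectory for unmatched detections
                trajectories.append([(t, x, y)])
                next_active.append(len(trajectories) - 1)
        active = next_active
    return trajectories

print(link_detections([0.0, 0.1], [[(0.2, 0.2)], [(0.22, 0.21), (0.8, 0.8)]]))
```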
  • the feature generation unit 303 generates features according to the context of an object appearing in each of a plurality of frame images extracted from a video capturing an object moving along a specific context, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image. Specifically, the feature generation unit 303 generates features by inputting the time information and position information indicated in the trajectory data 311 to the feature generation model 312.
  • the feature generation model 312 is a trained model that has been trained to generate features according to a context. More specifically, the feature generation model 312 is generated by learning the relationship between time information indicating the timing at which an object moving in accordance with a specified context was photographed, and position information indicating the detected position of the object in the image photographed at that timing, and the feature of the object at that timing.
  • the feature generation model 312 may be a function that uses the above-mentioned feature as a response variable and time information and position information as explanatory variables.
  • the context that influences the movement of the object used in learning may be the same as or similar to the context that influences the movement of the target object that is the subject of inference.
  • the algorithm of the feature generation model 312 is not particularly limited.
  • the feature generation model 312 may be a model such as a convolutional neural network, a recurrent neural network, or a transformer, or may be a model that combines two or more of these.
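  • As a concrete illustration only, the sketch below defines such a feature generation model as a small multilayer perceptron that maps time and position information for one frame to a feature vector; the framework (PyTorch), layer sizes, and feature dimension are assumptions, and the embodiment may equally use a recurrent network or a transformer.

```python
# Hypothetical sketch of a feature generation model: a small MLP that maps
# (time information, position information) -> a feature vector per time step.
import torch
import torch.nn as nn

class FeatureGenerationModel(nn.Module):
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 32),            # input: (t, x, y) for one frame
            nn.ReLU(),
            nn.Linear(32, 32),
            nn.ReLU(),
            nn.Linear(32, feature_dim),  # output: feature according to the context at that time
        )

    def forward(self, time_info: torch.Tensor, position_info: torch.Tensor) -> torch.Tensor:
        # time_info: (N, 1), position_info: (N, 2); N = number of frames in which the object was detected
        return self.net(torch.cat([time_info, position_info], dim=-1))

# usage example with made-up trajectory values
model = FeatureGenerationModel()
t = torch.tensor([[0.0], [0.1], [0.2]])
pos = torch.tensor([[0.40, 0.55], [0.42, 0.53], [0.45, 0.52]])
features = model(t, pos)   # one feature vector per time step
print(features.shape)      # torch.Size([3, 16])
```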
  • the integration unit 304 integrates the trajectory data 311 generated by the trajectory data generation unit 302 and the features generated by the feature generation unit 303 to generate integrated data.
  • the integration method is not particularly limited.
  • the integration unit 304 may combine the feature as an additional dimension with respect to each time component in the trajectory data 311 (specifically, position information or an image patch associated with one piece of time information, etc.).
  • the integration unit 304 may generate integrated data reflecting the feature by adding the feature to each time component or multiplying the feature by each time component.
  • the integration unit 304 may reflect the feature in each time component by an attention mechanism.
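  • The sketch below illustrates, with placeholder tensor shapes, the three integration options mentioned above: concatenating the feature as additional dimensions, adding it to each time component, and reflecting it through a simple attention weighting.

```python
# Hypothetical sketch of the integration unit: combining per-time trajectory components
# with per-time features by concatenation, addition, or a simple attention weighting.
import torch
import torch.nn.functional as F

N, D_TRAJ, D_FEAT = 3, 4, 4            # placeholder sizes: frames, trajectory dims, feature dims
trajectory = torch.randn(N, D_TRAJ)    # e.g. (x, y, speed, size) per time step
features = torch.randn(N, D_FEAT)      # context features generated per time step

# (a) concatenate the feature as additional dimensions of each time component
integrated_concat = torch.cat([trajectory, features], dim=-1)        # (N, D_TRAJ + D_FEAT)

# (b) add the feature to each time component (requires matching dimensions)
integrated_add = trajectory + features                               # (N, D_TRAJ)

# (c) reflect the feature in each time component via a simple attention mechanism:
# weights computed from feature/trajectory similarity re-weight the trajectory components
attn = F.softmax(features @ trajectory.T / D_TRAJ ** 0.5, dim=-1)    # (N, N)
integrated_attn = attn @ trajectory                                  # (N, D_TRAJ)

print(integrated_concat.shape, integrated_add.shape, integrated_attn.shape)
```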
  • the inference unit 305 performs a predetermined inference regarding the object based on the features generated by the feature generation unit 303. Specifically, the inference unit 305 inputs integrated data reflecting the features generated by the feature generation unit 303 to the inference model 313, thereby obtaining an inference result, i.e., a determination result as to whether the object is an air bubble or a foreign object.
  • the inference model 313 is a model generated by using integrated data generated from an image showing air bubbles or foreign objects to learn whether an object shown in the image is an air bubble or a foreign object.
  • the algorithm of the inference model 313 is not particularly limited.
  • The inference model 313 may be a model such as a convolutional neural network, a recurrent neural network, or a transformer, or may be a model that combines two or more of these.
  • the information processing device 3 includes a feature generating unit 303 that generates features according to the context of an object appearing in a frame image using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, and an inference unit 305 that performs a predetermined inference regarding the object based on the features generated by the feature generating unit 303.
  • This provides the effect of being able to perform inference that takes into account the context while suppressing calculation costs, similar to the information processing device 2 according to the exemplary embodiment 1.
  • Note that still images in a time series obtained by continuous capturing of still images are also included in the category of "a plurality of frame images extracted from a video".
  • the feature generation unit 303 may generate features using a feature generation model 312 that has learned the relationship between time information indicating the timing at which an object moving along a context that is the same as or similar to the context in which the target object moves was photographed, position information indicating the detected position of the object in the image photographed at that timing, and features according to the context of the object at that timing. This provides the effect of being able to generate appropriate features based on the learning results, in addition to the effect provided by the information processing device 2 according to exemplary embodiment 1.
  • The information processing device 3 includes a trajectory data generation unit 302 that generates trajectory data 311 indicating the trajectory of the movement of an object based on the detection result of the object from a plurality of frame images, and an integration unit 304 that integrates the trajectory data 311 and the feature amount generated by the feature amount generation unit 303 to generate integrated data. The feature amount generation unit 303 generates the feature amount using the position information and time information extracted from the trajectory data 311, and the inference unit 305 performs inference using the integrated data.
  • (About learning) The teacher data 314 is data in which information indicating whether an object is an air bubble or a foreign object is associated, as correct answer data, with the trajectory data of the object.
  • the teacher data 314 may also include multiple frame images that are the source of the trajectory data.
  • the feature generation unit 303 generates features from the time information and position information included in the trajectory data, and the integration unit 304 integrates the generated features with the trajectory data to generate integrated data.
  • the inference unit 305 then performs inference using the integrated data, thereby obtaining an inference result based on the trajectory data included in the teacher data 314.
  • the learning unit 306 updates the inference model 313 and the feature generation model 312 so that the result of inference based on the trajectory data included in the teacher data 314 approaches the predetermined correct answer data indicated in the teacher data 314.
  • the learning unit 306 may use a gradient descent method to update each of the inference model 313 and the feature generation model 312 so as to minimize a loss function that is the sum of the errors between the inference result and the correct answer data.
  • In this inference method, video images and frame images are not used as they are to generate features, but frame images may be used during learning.
  • The similarity between frame images may be used to update the feature generation model 312.
  • The similarity calculation unit 309 calculates the similarity between frame images, and is used when the similarity is reflected in updating the feature generation model 312.
  • In this case, the learning unit 306 updates the feature generation model 312 so that the similarity between multiple frame images is reflected in the similarity between features generated by the feature generation model 312 for the frame images. For example, by adding a regularization term to the loss function described above, the feature generation model 312 can be updated so that the similarity of the features becomes closer to the similarity between the frame images.
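  • A rough sketch of such a loss is shown below, assuming PyTorch and cosine similarity between features; the task loss, similarity measures, and weighting are placeholders rather than the loss actually used by the learning unit 306.

```python
# Hypothetical sketch: loss with a regularization term so that the similarity between
# frame images is reflected in the similarity between the generated features.
import torch
import torch.nn.functional as F

def training_loss(logits, labels, features, frame_similarity, reg_weight=0.1):
    """logits: (B, 2) bubble/foreign-object scores, labels: (B,) correct answer data,
    features: (B, D) generated features, frame_similarity: (B, B) precomputed image similarities."""
    task_loss = F.cross_entropy(logits, labels)              # error between inference result and correct answer data
    feat = F.normalize(features, dim=-1)
    feature_similarity = feat @ feat.T                       # cosine similarity between generated features
    reg = F.mse_loss(feature_similarity, frame_similarity)   # pull feature similarity toward frame-image similarity
    return task_loss + reg_weight * reg

# usage example with made-up tensors (requires_grad so the loss can be backpropagated)
B, D = 4, 16
logits = torch.randn(B, 2, requires_grad=True)
features = torch.randn(B, D, requires_grad=True)
loss = training_loss(logits, torch.randint(0, 2, (B,)), features, torch.rand(B, B))
loss.backward()   # during training, gradients would flow back into the feature generation model
```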
  • (About absorbing context differences) As mentioned above, the context that influences the movement of the object used in learning may be the same as or similar to the context that influences the movement of the target object that is the subject of inference. Furthermore, the contexts need only be at least partially the same or similar, and do not have to be entirely the same or similar.
  • the difference identification unit 307 and adjustment unit 308 are used when there is a difference between the context of the movement of the object used in learning and the context of the movement of the target object that is the subject of inference.
  • the adjustment unit 308 adjusts at least one of the time information and the position information used to generate the features so as to absorb the difference between the context of the movement of the object used in learning and the context of the movement of the target object to be inferred.
  • The difference identification unit 307 identifies the difference between the context in which the object used for learning moves and the context in which the target object moves, based on the difference between the environment surrounding the object used for learning and the environment surrounding the target object.
  • The difference identification unit 307 may also identify the difference between the context in which the object used for learning moves and the context in which the target object moves, based on the difference between the object used for learning and the target object to be inferred.
  • FIG. 6 is a diagram showing an example in which a difference occurs between contexts.
  • In the example EX1 shown in FIG. 6, as in FIG. 3, it is assumed that a foreign object confirmation inspection is performed using a control sequence of rotation, rest, and rotation, and the controls included in each control sequence during learning and inference and their execution order are the same. However, the rest period during inference is shorter than during learning. More specifically, in the example EX1, during both learning and inference, rotation starts at time t1 and ends at time t2 to enter a rest state, and the movement of the liquid in the container becomes steady at time t3.
  • During learning, the second rotation starts at time t4 and ends at time t5.
  • During inference, on the other hand, the second rotation starts at time t4', which is Δt earlier than time t4.
  • The time at which the second rotation ends is also Δt earlier than time t5, at time t5'.
  • In this case, the adjustment unit 308 performs an adjustment by adding Δt to the time indicated in each piece of time information corresponding to the period from time t4' to time t5' among the time information used to generate the features. This makes it possible to absorb the difference in context between learning and inference. Note that, contrary to example EX1, if the rest period during inference is made longer by Δt than during learning, the adjustment unit 308 can perform an adjustment by subtracting Δt from the time indicated in each piece of time information after the end of the rest period.
  • the adjustment unit 308 can absorb the difference in context by adjusting the time information accordingly.
  • the adjustment unit 308 can adjust the position information used to generate features so as to absorb the difference.
  • For example, the adjustment unit 308 may absorb the difference between contexts by performing a left-right inversion process on the position information of the target object. For example, when coordinate values are used as position information, the adjustment unit 308 may perform a conversion that inverts left and right about a specified axis for each coordinate value during the period in which the pattern of movement is inverted left-right. Also, for example, when the pattern of movement of the target object and the pattern of movement of the object used during learning are in a rotationally symmetric relationship, the adjustment unit 308 may absorb the difference between contexts by rotationally transforming the position information of the target object. Note that the adjustment unit that adjusts the time information and the adjustment unit that adjusts the position information may be provided as separate blocks.
  • the time information corresponding to each frame image used during inference may differ from the time information corresponding to each frame image used for learning.
  • For example, assume that the time of the frame image at the start of the first rotation among the frame images used for learning is t1.
  • In this case, if the time of the frame image at the start of the first rotation among the frame images used for inference is t1' (t1' < t1), a difference of (t1 - t1') will occur between the contexts.
  • the adjustment unit 308 can perform an adjustment by adding the value of (t1 - t1') to the time indicated in each piece of time information used for inference. Furthermore, if the time of the frame image at the start of the first rotation among the frame images used in inference is t1" (t1" > t1), the adjustment unit 308 can perform an adjustment by subtracting the value of (t1" - t1) from the time indicated in each piece of time information used in inference. By making such an adjustment, the relationship between the time indicated in the time information and the control timing can be aligned with that during learning, thereby allowing the feature generation model 312 to output appropriate features.
  • factors that cause differences in context are not limited to control sequences.
  • differences between contexts can arise when there is a difference between the object used in learning and the subject of inference, or when there is a difference between the environment surrounding the object used in learning and the environment surrounding the subject of inference.
  • The difference identification unit 307 identifies the difference between the context in which the object used in learning moves and the context in which the target object moves, based on at least one of the differences described above, i.e., the difference between the object used in learning and the target object to be inferred, and the difference between the environment surrounding the object used in learning and the environment surrounding the target object. Therefore, according to the information processing device 3 of this exemplary embodiment, in addition to the effect achieved by the information processing device 2 of exemplary embodiment 1, the effect of being able to automatically identify context differences can be obtained. Furthermore, since the information processing device 3 is equipped with the adjustment unit 308, it is possible to cause the adjustment unit 308 to make adjustments to absorb the differences identified by the difference identification unit 307.
  • example EX2 in Figure 6 shows a case where the viscosity of the liquid sealed in the container is different during learning and inference in a foreign object confirmation inspection that is performed in a sequence of rotation, rest, and rotation. That is, in example EX2, the environment surrounding the object used in learning is different from the environment surrounding the target object. Specifically, the liquid in the container used in inference has a higher viscosity than the liquid used in learning, and therefore, during inference, the time from when the container is brought to rest until the inside of the container becomes steady is shorter than during learning. In other words, the time t3' when the container becomes steady during inference is earlier than the time t3 when the container becomes steady during learning (t3 > t3').
  • the difference identification unit 307 identifies the time t3' at which the liquid stabilized, which is the difference between the contexts at the time of learning and the time of inference, based on the viscosity of the liquid in the container used for inference. If the relationship between the viscosity and the time required for the liquid to stabilize is identified and modeled in advance, the difference identification unit 307 can identify the time t3' using the model and the viscosity of the liquid in the container used for inference.
  • the adjustment unit 308 absorbs the above-mentioned difference by adjusting the time information based on the result of the identification by the difference identification unit 307. Specifically, the adjustment unit 308 performs an adjustment by adding the value of (t3-t3') to the time indicated in each piece of time information from time t3' to time t4.
  • the adjustment unit 308 may adjust all times in the steady state to the same value. In this case, the adjustment unit 308 may replace the times indicated in each piece of time information from time t3' to t4 with, for example, time t3. In this way, the adjustment unit 308 may set the time information for a period in which the context is constant during inference to a constant value. Furthermore, the constant value may be selected from the time values for a period in which an object moved according to a context that was the same as or similar to the above context during learning (the period from time t3 to t4 in the above example).
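  • The sketch below illustrates, with placeholder time values, the kinds of time-information adjustments described for examples EX1 and EX2: shifting the times in a given period by a fixed offset, and replacing the times in a steady-state period with a constant value.

```python
# Hypothetical sketch of the adjustment unit: absorbing context differences by adjusting time information.
def shift_times(times, start, end, delta):
    """Add delta to every time in [start, end] (e.g. +dt for the period from t4' to t5' in example EX1)."""
    return [t + delta if start <= t <= end else t for t in times]

def clamp_steady_period(times, steady_start, steady_end, constant):
    """Replace times in a period where the context is constant (steady state) with a constant value,
    e.g. replace times in [t3', t4] with t3 as in example EX2."""
    return [constant if steady_start <= t <= steady_end else t for t in times]

# usage example with made-up times (seconds)
times = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(shift_times(times, start=6.0, end=10.0, delta=1.5))        # align a rotation that started early
print(clamp_steady_period(times, steady_start=4.0, steady_end=8.0, constant=4.0))
```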
  • The information processing device 3 includes an adjustment unit 308 that adjusts at least one of the time information and the position information used to generate features so as to absorb the difference between the context of the movement of the object used in learning and the context of the movement of the target object to be inferred. Therefore, according to the information processing device 3 according to this exemplary embodiment, even if there is a difference between the context of the movement of the target object and the context of the movement of the object used in learning, it is possible to obtain the effect that appropriate features can be generated using the same feature generation model 312.
  • (Learning process flow) FIG. 7 is a flow diagram showing the flow of processing performed by the information processing device 3 during learning. Note that, when learning is performed, the teacher data 314 and the feature generation model 312 are stored in advance in the storage unit 31.
  • the feature generation model 312 stored in the storage unit 31 may have parameters in an initial state, or may be a model in which learning has progressed to a certain degree.
  • In S31, the learning unit 306 acquires the teacher data 314 stored in the storage unit 31.
  • the teacher data 314 is data in which information indicating whether an object is an air bubble or a foreign object is associated as correct answer data with respect to the trajectory data of the object.
  • the teacher data 314 also includes multiple frame images that are the basis of the trajectory data.
  • In S32, the feature generation unit 303 uses the teacher data 314 acquired in S31 to generate features according to the context of the object used in learning. More specifically, the feature generation unit 303 generates features by inputting the time information and position information indicated in the trajectory data included in the teacher data 314 acquired in S31 to the feature generation model 312.
  • In S33, the integration unit 304 integrates the trajectory data included in the teacher data 314 acquired in S31 with the features generated in S32 to generate integrated data. Then, in S34, the inference unit 305 performs a predetermined inference using the integrated data generated in S33. Specifically, the inference unit 305 inputs the integrated data generated in S33 to the inference model 313 to obtain an inference result, i.e., a determination result as to whether the object is an air bubble or a foreign object.
  • In S35, the similarity calculation unit 309 calculates the similarity between the frame images included in the teacher data 314 acquired in S31.
  • the frame images may be clipped from around a location corresponding to the position information indicated in the trajectory data.
  • the similarity calculation unit 309 may calculate the similarity for each combination of multiple frame images (corresponding to one trajectory data) included in the teacher data 314, or may calculate the similarity for some combinations.
  • The process of S35 only needs to be performed before S36; for example, it may be performed before S32, or in parallel with the processes of S32 to S34.
  • In S36, the learning unit 306 updates the feature generation model 312 so that the result of the inference in S34 approaches the predetermined correct answer data indicated in the teacher data 314.
  • In addition, the learning unit 306 updates the feature generation model 312 so that the similarity between the frame images calculated in S35 is reflected in the similarity between the features generated by the feature generation model 312 for the frame images.
  • In S37, the learning unit 306 determines whether or not to end learning.
  • the condition for ending learning may be determined in advance, and learning may end, for example, when the number of updates of the feature generation model 312 reaches a predetermined number. If the learning unit 306 determines NO in S37, it returns to the process of S31 and acquires new teacher data 314. On the other hand, if the learning unit 306 determines YES in S37, it stores the updated feature generation model 312 in the memory unit 31 and ends the process of FIG. 7.
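  • Assuming the PyTorch-style placeholders used in the earlier sketches, the learning flow of S31 to S37 could be outlined as below; the models, teacher data, and stopping condition are stand-ins rather than the concrete implementation of the information processing device 3.

```python
# Hypothetical outline of the learning flow (S31 to S37), with placeholder models and data.
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 16))        # feature generation model 312
inference_model = nn.Sequential(nn.Linear(3 + 16, 32), nn.ReLU(), nn.Linear(32, 2))  # inference model 313
optimizer = torch.optim.SGD(list(feature_model.parameters()) + list(inference_model.parameters()), lr=0.01)

def get_teacher_batch(batch=8, frames=5):
    """S31: acquire teacher data 314 (placeholder random data: trajectories plus bubble/foreign labels)."""
    traj = torch.rand(batch, frames, 3)              # (t, x, y) per frame
    labels = torch.randint(0, 2, (batch,))           # 0 = air bubble, 1 = foreign object
    frame_sim = torch.rand(batch, frames, frames)    # similarity between frame images (for S35)
    return traj, labels, frame_sim

for step in range(100):                              # S37: end after a predetermined number of updates
    traj, labels, frame_sim = get_teacher_batch()    # S31
    feats = feature_model(traj)                      # S32: features from time and position information
    integrated = torch.cat([traj, feats], dim=-1)    # S33: integrate trajectory data and features
    logits = inference_model(integrated).mean(dim=1) # S34: inference (pooled over frames)
    f = F.normalize(feats, dim=-1)
    feat_sim = f @ f.transpose(1, 2)                 # similarity between per-frame features (for S36)
    loss = F.cross_entropy(logits, labels) + 0.1 * F.mse_loss(feat_sim, frame_sim)  # S36
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

torch.save(feature_model.state_dict(), "feature_generation_model.pt")  # store the updated model
```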
  • Note that, instead of acquiring the teacher data 314 in S31, a video image showing a predetermined object (specifically, at least one of an air bubble and a foreign object) or frame images extracted from the video image may be acquired.
  • the object detection unit 301 detects the object from the acquired frame image, and the trajectory data generation unit 302 generates trajectory data 311 of the detected object.
  • the teacher data 314 is generated by labeling this trajectory data 311 with the correct answer data.
  • the processing after the teacher data 314 is generated is the same as the processing from S32 onwards described above.
  • the method for generating the feature generation model 312 includes: for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, inputting time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into the feature generation model 312 for generating features according to the context, performing a predetermined inference regarding the object based on the calculated features (S34); and updating the feature generation model 312 so that the result of the inference approaches predetermined correct answer data (S36).
  • the above configuration makes it possible to generate a feature generation model 312 capable of generating features according to a context.
  • This makes it possible to generate a feature generation model capable of generating features according to a context from time information and location information, and has the effect of making it possible to perform inference that takes into account the context while suppressing computational costs.
  • the method for generating the feature generation model 312 includes calculating the similarity between a plurality of frame images (S35), and in updating the feature generation model 312, the feature generation model 312 is updated so that the similarity between a plurality of frame images is reflected in the similarity between the features generated by the feature generation model 312 for the frame images. Since similar frame images are considered to have similar contexts, the above configuration makes it possible to generate a feature generation model 312 that can generate more valid features that take into account the similarity between frame images.
  • Fig. 8 is a flow diagram showing the flow of processing (inference method) performed by the information processing device 3 during inference.
  • Fig. 8 shows processing after a plurality of frame images extracted from a moving image to be inferred are input to the information processing device 3.
  • the moving image shows an object to be determined as being an air bubble or a foreign object.
  • the information processing device 3 may also perform the processing of extracting frame images from the moving image.
  • the object detection unit 301 detects an object from each of the frame images. Then, in S42, the trajectory data generation unit 302 generates trajectory data 311 indicating the trajectory of the object movement based on the object detection result in S41. Note that the following describes the process when one object and one trajectory data 311 indicating the trajectory of the object movement are generated. When multiple trajectory data 311 are generated, the processes of S43 to S47 described below are performed for each trajectory data 311.
  • In S43, the difference identification unit 307 identifies the difference between the context in which the object used in learning moves and the context in which the target object moves, based on at least one of the difference between the object used in learning and the target object to be inferred, and the difference between the environment surrounding the object used in learning and the environment surrounding the target object. For example, if the viscosity of the liquid sealed in the container is different during learning and during inference, the difference identification unit 307 may calculate the time at which the liquid in the container becomes steady based on the difference in the viscosity of the liquid, and calculate the difference between this time and the time at which the liquid in the container became steady during learning.
  • In S44, the adjustment unit 308 adjusts at least one of the time information and the position information used to generate the features so as to absorb the difference between the contexts identified in S43. For example, if a difference in the time at which the liquid becomes steady is calculated as described above in S43, the adjustment unit 308 adjusts the time information so as to absorb the time difference. Note that if there is no difference between the contexts, the processes of S43 and S44 are omitted. Furthermore, if at least one of the time information and the position information was normalized during learning, the adjustment unit 308 similarly normalizes the time information and the position information used to generate the features.
  • In S45, the feature generation unit 303 generates features according to the context. Specifically, for each of a plurality of frame images corresponding to one piece of trajectory data 311, the feature generation unit 303 extracts position information and time information of the object appearing in the frame image from the trajectory data 311. The feature generation unit 303 then inputs the extracted time information and position information into the feature generation model 312 to generate features. As a result, for each frame image, features according to the context of the object appearing in the frame image are generated.
  • In S46, the integration unit 304 integrates the trajectory data 311 generated in S42 with the features generated in S45 to generate integrated data. Then, in S47, the inference unit 305 performs a predetermined inference regarding the object based on the features generated in S45. Specifically, the inference unit 305 obtains an inference result by inputting the integrated data reflecting the features generated in S45 to the inference model 313, and the processing in FIG. 8 ends. Note that the inference unit 305 may output the inference result via the output unit 34 or the like, or may store it in the storage unit 31 or the like.
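  • Under the same placeholder assumptions, the inference flow of S41 to S47 could be sketched as follows; object detection, context-difference adjustment, and the trained models are stubbed out for illustration.

```python
# Hypothetical outline of the inference flow (S41 to S47) with stubbed detection and trained models.
import torch
import torch.nn as nn

feature_model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 16))     # trained model 312 (placeholder)
inference_model = nn.Sequential(nn.Linear(19, 32), nn.ReLU(), nn.Linear(32, 2))   # trained model 313 (placeholder)

def detect_objects(frame_images):
    """S41: object detection stub; returns one (x, y) detection per frame."""
    return [(0.4 + 0.01 * i, 0.5) for i, _ in enumerate(frame_images)]

def run_inference(frame_images, frame_times, time_offset=0.0):
    detections = detect_objects(frame_images)                                        # S41
    traj = torch.tensor([[t, x, y] for t, (x, y) in zip(frame_times, detections)])   # S42: trajectory data 311
    traj[:, 0] += time_offset                                                        # S43/S44: absorb context difference (time shift)
    with torch.no_grad():
        feats = feature_model(traj)                                                  # S45: features according to the context
        integrated = torch.cat([traj, feats], dim=-1)                                # S46: integrated data
        logits = inference_model(integrated).mean(dim=0)                             # S47: inference over the whole trajectory
    return "foreign object" if logits.argmax().item() == 1 else "air bubble"

print(run_inference(frame_images=[None] * 5, frame_times=[0.0, 0.1, 0.2, 0.3, 0.4]))
```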
  • the inference method includes generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, features corresponding to the context of the object appearing in the frame images using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured (S45), and making a predetermined inference about the object based on the generated features (S47).
  • This has the effect of making it possible to perform inference that takes into account the context while keeping computational costs down.
  • The specifics of each process described in the above exemplary embodiments are arbitrary and are not limited to the above examples.
  • an information processing system having the same functions as the information processing devices 1 to 3 can be constructed by a plurality of devices that can communicate with each other.
  • the process in the flow chart of FIG. 7 and the process in the flow chart of FIG. 8 may be executed by different information processing devices (or processors).
  • each process in the flow chart shown in FIG. 7 or FIG. 8 can be shared and executed by a plurality of information processing devices (or processors).
  • the content of the predetermined inference executed by the inference units 11, 22, and 305 is not particularly limited as long as it is related to the object.
  • it may be prediction, conversion, etc.
  • the factors that give rise to a context are also arbitrary.
  • For example, according to the information processing device 2 or 3, it is possible to perform inference that takes into account the context for an object that moves in accordance with a context that arises due to various devices whose operations change at a predetermined cycle, natural phenomena that change at a predetermined cycle, or the like.
  • Furthermore, according to the information processing device 1 or 3, it is possible to generate a feature generation model that makes it possible to perform inference that takes into account the above-mentioned context.
  • the movement of moving objects (vehicles, people, etc.) around a traffic light is affected by the periodic light emission control of the traffic light.
  • the moving objects move according to the context resulting from the light emission control of the traffic light.
  • the information processing device 1 or 3 performs a predetermined inference regarding the moving object based on the feature calculated by inputting time information and position information into the feature generation model for each of a plurality of frame images extracted from a video image capturing a moving object moving along the context, and updates the feature generation model so that the inference result approaches predetermined correct answer data.
  • the information processing device 2 or 3 performs a predetermined inference regarding the moving object based on the feature generated using the feature generation model thus generated, thereby obtaining a highly valid inference result that takes the context into account.
  • the content of the inference is not particularly limited, and may be, for example, a position prediction of the moving object after a predetermined time, a behavior classification of the moving object, or detection of abnormal behavior of the moving object. It is preferable that these inferences also take into account interactions between vehicles and pedestrians, between vehicles, etc.
  • Some or all of the functions of the information processing devices 1 to 3 may be realized by hardware such as an integrated circuit (IC chip), or may be realized by software.
  • information processing devices 1 to 3 are realized, for example, by a computer that executes instructions of a program (inference program/learning program), which is software that realizes each function.
  • An example of such a computer (hereinafter referred to as computer C) is shown in Figure 9.
  • Computer C has at least one processor C1 and at least one memory C2.
  • Memory C2 stores program P for operating computer C as any one of information processing devices 1 to 3.
  • processor C1 reads and executes program P from memory C2, thereby realizing the function of any one of information processing devices 1 to 3.
  • the processor C1 may be, for example, a CPU (Central Processing Unit), GPU (Graphics Processing Unit), DSP (Digital Signal Processor), MPU (Micro Processing Unit), FPU (Floating-Point Processing Unit), PPU (Physics Processing Unit), microcontroller, or a combination of these.
  • the memory C2 may be, for example, a flash memory, HDD (Hard Disk Drive), SSD (Solid State Drive), or a combination of these.
  • Computer C may further include a RAM (Random Access Memory) for expanding program P during execution and for temporarily storing various data.
  • Computer C may further include a communications interface for sending and receiving data to and from other devices.
  • Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.
  • the program P can also be recorded on a non-transitory, tangible recording medium M that can be read by the computer C.
  • a recording medium M can be, for example, a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit.
  • the computer C can obtain the program P via such a recording medium M.
  • the program P can also be transmitted via a transmission medium.
  • a transmission medium can be, for example, a communications network or broadcast waves.
  • the computer C can also obtain the program P via such a transmission medium.
  • (Appendix 1) An information processing device comprising: a feature generation means for generating features corresponding to the context of an object appearing in a plurality of frame images extracted from a video capturing an object moving along a specified context, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image; and an inference means for making a specified inference regarding the object based on the features.
  • (Appendix 2) The information processing device described in Appendix 1, wherein the feature generation means generates the feature using a feature generation model that learns the relationship between time information indicating a timing at which an object moving in accordance with a context identical or similar to the context was photographed, position information indicating the detection position of the object in the image photographed at that timing, and feature values corresponding to the context of the object at that timing.
  • (Appendix 3) The information processing device described in Appendix 1, further comprising: a trajectory data generation means for generating trajectory data indicating a trajectory of movement of an object based on a detection result of the object from the plurality of frame images; and an integration means for integrating the trajectory data and a feature generated by the feature generation means to generate integrated data, wherein the feature generation means generates the feature using the position information and the time information extracted from the trajectory data, and the inference means performs the inference using the integrated data.
  • (Appendix 5) An information processing device as described in Appendix 4, comprising a difference identification means for identifying a difference between the context in the movement of the object and the context in the movement of the target object based on at least one of a difference between the object and the target object and a difference between an environment surrounding the object and an environment surrounding the target object.
  • An inference method comprising: at least one processor generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, features of the object appearing in the frame images corresponding to the context, using position information indicating the detection position of the object in the frame images and time information indicating the timing when the frame images were captured; and making a predetermined inference regarding the object based on the features.
  • An inference program that causes a computer to function as a feature generation means that generates features corresponding to the context of an object appearing in a plurality of frame images extracted from a video of an object moving along a specified context, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image, and an inference means that makes a specified inference about the object based on the features.
  • (Appendix 8) A method for generating a feature generation model comprising: at least one processor inputting, for each of a plurality of frame images extracted from a video capturing an object moving in accordance with a predetermined context, time information indicating the timing at which the frame image was captured and positional information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, thereby making a predetermined inference about the object based on the calculated features; and updating the feature generation model so that the result of the inference approaches predetermined correct answer data.
  • (Appendix 9) The method for generating a feature generation model described in Appendix 8, further comprising: at least one processor calculating a similarity between the plurality of frame images; and updating the feature generation model such that the similarity between the plurality of frame images is reflected in a similarity between features generated by the feature generation model for the frame images, in updating the feature generation model.
  • An information processing device including at least one processor, which executes, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a process of generating a feature amount of the object appearing in the frame image according to the context using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image, and a process of making a predetermined inference regarding the object based on the feature amount.
  • the information processing device may further include a memory, and the memory may store an inference program for causing the processor to execute the process of generating the feature amount and the process of performing the predetermined inference.
  • the inference program may also be recorded on a computer-readable, non-transitory, tangible recording medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

In order to enable context-based inference while suppressing calculation cost, this information processing device (2) is provided with: a feature value generation unit (21) that uses time information indicating timing at which a frame image was captured and position information indicating the detection position of an object in the frame image to generate a feature value according to the context; and an inference unit (22) that makes an inference on the basis of the generated feature value.

Description

Information processing device, inference method, inference program, and method for generating feature quantity generation model
 The present invention relates to an information processing device that performs inferences about an object using video images of the object.
 In recent years, by using technologies such as deep learning, it has become possible to perform object detection and object identification (hereafter collectively referred to as inference) in images with a very high degree of accuracy. Research into inference for video images is also progressing.
 One known method for improving the accuracy of inferences made on video images is to perform inferences taking into account the context of the video images. For example, Non-Patent Document 1 below discloses a technology that uses a trained convolutional neural network to extract context features from video images, and then uses the context features to identify the actions of people appearing in the video images.
 However, the above-mentioned conventional techniques have a problem in that the computational costs involved in inputting video images into a convolutional neural network and extracting context features are very high. One aspect of the present invention aims to realize an information processing device or the like that is capable of performing inference that takes into account context while keeping computational costs down.
 An information processing device according to one aspect of the present invention includes a feature generating means for generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference means for making a predetermined inference regarding the object based on the feature.
 An inference method according to one aspect of the present invention includes at least one processor generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, features of the object appearing in the frame images according to the context using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured, and performing a predetermined inference regarding the object based on the features.
 An inference program according to one aspect of the present invention causes a computer to function as a feature generating means for generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference means for making a predetermined inference regarding the object based on the feature.
 A method for generating a feature generation model according to one aspect of the present invention includes: at least one processor inputs, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, thereby performing a predetermined inference regarding the object based on the calculated features; and updating the feature generation model so that the result of the inference approaches predetermined ground truth data.
 According to one aspect of the present invention, it is possible to perform inference that takes into account context while keeping computational costs down.
FIG. 1 is a block diagram showing the configuration of an information processing device according to a first exemplary embodiment of the present invention.
FIG. 2 is a flow diagram showing the flow of a method for generating a feature generation model and an inference method according to the first exemplary embodiment of the present invention.
FIG. 3 is a diagram for explaining a method of inspection for checking for foreign matter.
FIG. 4 is a diagram for explaining an overview of an inference method according to a second exemplary embodiment of the present invention.
FIG. 5 is a block diagram showing an example of the configuration of an information processing device according to the second exemplary embodiment of the present invention.
FIG. 6 is a diagram showing an example in which a difference occurs between contexts during learning and inference.
FIG. 7 is a flowchart showing the flow of processing performed during learning by the information processing device according to the second exemplary embodiment of the present invention.
FIG. 8 is a flowchart showing the flow of processing performed during inference by the information processing device according to the second exemplary embodiment of the present invention.
FIG. 9 is a diagram showing an example of a computer that executes instructions of a program, which is software that realizes the functions of each device according to each exemplary embodiment of the present invention.
 [Example embodiment 1]
 A first exemplary embodiment of the present invention will be described in detail with reference to the drawings. This exemplary embodiment is a basic form of the exemplary embodiments described below. First, information processing devices 1 and 2 according to this exemplary embodiment will be described with reference to Fig. 1. Fig. 1 is a block diagram showing the configuration of the information processing devices 1 and 2.
 (Configuration of information processing device 1)
 As shown in the figure, the information processing device 1 includes an inference unit 11 and a learning unit 12. For each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, the inference unit 11 performs a predetermined inference regarding the object based on features calculated by inputting, into a feature generation model for generating features according to the context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image.
 The learning unit 12 updates the feature generation model so that the result of the inference by the inference unit 11 approaches the predetermined correct answer data.
 In this manner, the information processing device 1 according to this exemplary embodiment includes an inference unit 11 that performs a predetermined inference regarding the object based on features calculated by inputting, for each of a plurality of frame images extracted from a video capturing an object moving according to a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, and a learning unit 12 that updates the feature generation model so that the result of the inference by the inference unit 11 approaches predetermined correct answer data.
 The above configuration makes it possible to generate a feature generation model that can generate features according to the context from time information and location information. This has the effect of making it possible to perform inference that takes into account the context while reducing computational costs compared to conventional techniques that generate context features from the entire video.
 (Configuration of information processing device 2)
 The information processing device 2 includes a feature generating unit 21 and an inference unit 22. For each of a plurality of frame images extracted from a moving image capturing an object moving along a predetermined context, the feature generating unit 21 generates a feature corresponding to the context of the object captured in the frame image by using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image.
 The inference unit 22 performs a predetermined inference regarding the object based on the features generated by the feature generation unit 21.
 In this way, the information processing device 2 according to this exemplary embodiment includes a feature generation unit 21 that generates features according to the context of an object appearing in a plurality of frame images extracted from a video capturing an object moving along a predetermined context, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference unit 22 that performs a predetermined inference regarding the object based on the features generated by the feature generation unit 21.
 An object that moves according to a specific context is affected differently at each point in time based on that context. Therefore, with the above configuration, it is possible to generate features according to the context of the object and perform inference based on those features.
 Furthermore, the data size of location information and time information is significantly smaller than that of video images. Therefore, the information processing device 2 according to this exemplary embodiment has the advantage of being able to perform inference that takes context into account while keeping computational costs down compared to conventional techniques that generate context features from the entire video image.
 (Learning program)
 The functions of the information processing device 1 described above can also be realized by a program. The learning program according to this exemplary embodiment causes a computer to function as an inference unit 11 that performs a predetermined inference regarding the object based on a feature calculated by inputting time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image into a feature generation model for generating a feature according to the context, for each of a plurality of frame images extracted from a video image capturing an object moving along a predetermined context, and a learning unit 12 that updates the feature generation model so that the result of the inference by the inference unit 11 approaches predetermined correct answer data. According to this learning program, a feature generation model capable of generating a feature according to a context from time information and position information can be generated, and thus an effect is obtained in which it is possible to perform inference taking into account the context while suppressing calculation costs.
 (Inference program)
 Similarly, the functions of the information processing device 2 described above can also be realized by a program. The inference program according to this exemplary embodiment causes a computer to function as a feature amount generating unit 21 that generates a feature amount according to the context of an object appearing in a plurality of frame images extracted from a video in which an object moving along a predetermined context is captured, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image, and an inference unit 22 that performs a predetermined inference regarding the object based on the feature amount generated by the feature amount generating unit 21. This inference program provides the effect of being able to perform inference taking into account the context while suppressing calculation costs.
 (Flow of the feature generation model generation method and the inference method)
 The flow of the method for generating a feature generation model and the inference method according to this exemplary embodiment will be described with reference to Fig. 2. Fig. 2 is a flow diagram showing the flow of the method for generating the feature generation model and the inference method. Note that the execution subject of each step in the methods shown in Fig. 2 may be a processor provided in the information processing device 1 or 2, or a processor provided in another device, and the steps may each be executed by processors provided in different devices.
 The flow diagram shown on the left side of FIG. 2 illustrates a method for generating a feature generation model according to this exemplary embodiment. In S11, at least one processor inputs time information indicating the timing at which each of a plurality of frame images extracted from a video of an object moving along a predetermined context was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, and performs a predetermined inference regarding the object based on the calculated features.
 In S12, at least one processor updates the feature generation model so that the result of the inference in S11 approaches the predetermined ground truth data.
 As described above, the method for generating a feature generation model according to this exemplary embodiment includes at least one processor inputting, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, thereby performing a predetermined inference regarding the object based on the calculated features (S11), and updating the feature generation model so that the result of the inference in S11 approaches predetermined ground truth data (S12). Thus, it is possible to generate a feature generation model capable of generating features according to the context from the time information and position information, which has the effect of making it possible to perform inference taking into account the context while suppressing calculation costs.
 On the other hand, the flow diagram shown on the right side of FIG. 2 illustrates an inference method according to this exemplary embodiment. In S21, at least one processor generates, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a feature quantity corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image.
 In S22, at least one processor performs a predetermined inference regarding the object based on the features generated in S21.
 As described above, the inference method according to this exemplary embodiment includes at least one processor generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, feature values of the object appearing in the frame images according to the above-mentioned context using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured (S21), and making a predetermined inference about the object based on the feature values generated in S21 (S22). This provides the effect of being able to make inferences that take into account the context while keeping computational costs down.
 [Exemplary embodiment 2]
 A second exemplary embodiment of the present invention will be described in detail with reference to the drawings. In the following, an example will be described in which an inference method according to this exemplary embodiment (hereinafter, referred to as this inference method) is used in an inspection to check whether a foreign object is present in a liquid (e.g., medicine, beverage, etc.) sealed in a transparent container (hereinafter, referred to as a foreign object confirmation inspection).
 (Method of inspection for foreign matter)
 Prior to the description of this inference method, a method of foreign object inspection will be described with reference to FIG. 3. FIG. 3 is a diagram for explaining the method of foreign object inspection. In the foreign object inspection, a container filled with a predetermined liquid, which is an object to be inspected, is fixed in a device (not shown in FIG. 3), and the device is used to rock the container. A control sequence for rocking the container is determined in advance. For example, the control sequence in the example of FIG. 3 is such that the container is rotated in a vertical plane for a predetermined time, then stopped for a predetermined time, and then rotated in a horizontal plane for a predetermined time. This control sequence may be repeated multiple times.
 In this foreign body inspection, the container is rocked using the above-mentioned control while a moving image of the liquid inside the container is captured. Next, frame images are extracted from the captured moving image, and object detection is performed on each frame image. Then, for each object detected by this object detection, it is determined whether the object is an air bubble or a foreign body, and if there is no object determined to be a foreign body, the inspected item is determined to be a good product, and if there is even one object determined to be a foreign body, it is determined to be a defective product.
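 As a concrete illustration of the pass/fail rule described in the preceding paragraph, the following is a minimal sketch in Python; the function name and the form of the per-object classification labels are assumptions introduced for illustration and are not part of the original disclosure.

```python
from typing import Iterable

def judge_container(object_labels: Iterable[str]) -> str:
    """Apply the pass/fail rule of the foreign-matter confirmation inspection.

    object_labels: classification result for each detected object,
    assumed to be either "bubble" or "foreign_matter".
    Returns "good" when no object is judged to be foreign matter,
    and "defective" when at least one object is.
    """
    for label in object_labels:
        if label == "foreign_matter":
            return "defective"
    return "good"

# Usage example with hypothetical labels:
print(judge_container(["bubble", "bubble"]))          # -> "good"
print(judge_container(["bubble", "foreign_matter"]))  # -> "defective"
```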
 In the foreign body inspection described above, the device rocks the container by controlling a specific pattern, so that the object inside the container (air bubbles or foreign body) moves along a specific context based on this pattern. For example, the flow of liquid inside the container accelerates for a while after the container starts to rotate, so the object also accelerates along this flow. In addition, the object moves in the direction of the container's rotation. After that, when the container stops rotating, the flow rate of the liquid gradually slows down and stabilizes in a steady state. The speed and direction of the object's movement during this time also follow the flow rate and direction of the liquid. The same applies to subsequent controls.
 In this inference method, for each of multiple frame images extracted from a video that captures an object moving in accordance with the above-mentioned context, a feature amount corresponding to the above-mentioned context of the object appearing in the frame image is generated. Then, based on the generated feature amount, it is determined whether the object is an air bubble or a foreign object. In other words, the inference in this inference method is to determine whether the object is an air bubble or a foreign object.
 The details will be described later, but this inference method makes it possible to perform inference that takes context into account while keeping computational costs down compared to conventional techniques that generate context features from the entire video image.
 (Overview of the inference method)
 Next, an overview of this inference method will be described with reference to Fig. 4. Fig. 4 is a diagram for explaining an overview of this inference method. Prior to execution of this inference method, frame images are first extracted from a video image capturing an object. In the example of Fig. 4, n frame images from FR1 to FRn are extracted. In the illustrated frame image FR1, a container filled with liquid is captured, as well as objects OB1 and OB2. The same is true for the other frame images.
 In this inference method, first, objects are detected from each frame image. As mentioned above, the objects are air bubbles and foreign objects. Since both are small in size and have similar appearances, it is difficult to accurately identify whether a detected object is an air bubble or a foreign object based on only one frame image.
 Next, in this inference method, based on the detection results of the objects OB1 and OB2 from the frame images FR1 to FRn, trajectory data is generated that indicates the trajectory of the movement of the objects. Fig. 4 schematically shows trajectory data A1 of the object OB1 and trajectory data A2 of the object OB2.
 The trajectory data A1 includes time information indicating the timing at which the frame image in which the object OB1 was detected was shot, and position information indicating the detected position of the object OB1 in the frame image. For example, assume that the object OB1 was detected in each of the frame images FR1 to FR10. In this case, the trajectory data A1 includes time information indicating the shooting timing of each of the frame images FR1 to FR10, and position information indicating the detected position of the object OB1 in the frame images FR1 to FR10.
 The trajectory data A1 may also include information indicating the characteristics of the detected object OB1. For example, the trajectory data A1 may include an image patch that is an image cut out of an area in the frame image that includes the object OB1, or feature amounts extracted from the image patch, information indicating the size of the detected object, and information indicating the moving speed of the detected object.
 In addition, in this inference method, normalized position information may be used to avoid being affected by differences in container size and liquid volume. In this case, for example, normalized position information can be generated by applying at least one of translation, rotation, and scale transformation to position information indicating the position of the object on the frame image.
 The time information may also be normalized. In this case, for example, the time information of the frame image FR1 may be set to 0, the time information of the frame image FRn captured at the timing when a series of control sequences ends may be set to 1, and the time information of the frame images FR2 to FRn-1 may be determined based on these values.
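 The normalization of position and time information described above can be sketched as follows in Python; the specific affine parameters and the linear time scaling are assumptions made for illustration, since the disclosure only states that translation, rotation, scale transformation, and time normalization may be applied.

```python
import numpy as np

def normalize_positions(xy, center, angle_rad, scale):
    """Apply translation, rotation, and scale transformation to detected
    positions (N x 2 array of pixel coordinates) so that the result does
    not depend on the container size or its placement in the frame."""
    xy = np.asarray(xy, dtype=float) - np.asarray(center, dtype=float)  # translation
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s], [s, c]])
    return (xy @ rot.T) / scale                                         # rotation + scale

def normalize_times(t, t_start, t_end):
    """Map capture times so that the first frame becomes 0 and the frame
    captured when the control sequence ends becomes 1."""
    t = np.asarray(t, dtype=float)
    return (t - t_start) / (t_end - t_start)

# Usage example with hypothetical values:
pos = normalize_positions([[320, 240], [330, 250]], center=(320, 240),
                          angle_rad=0.0, scale=100.0)
ts = normalize_times([0.0, 0.5, 1.5, 3.0], t_start=0.0, t_end=3.0)
```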
 The same is true for the trajectory data A2, which includes time information indicating the timing at which the frame image in which the object OB2 was detected was captured, and position information indicating the detected position of the object OB2 in the frame image. The trajectory data A2 may also include information indicating the characteristics of the detected object OB2.
 Note that, for example, if the object is an air bubble, it may appear and disappear during the oscillation, so the range of time information in the trajectory data A1 and A2 may differ. Also, only the trajectory data A1 and A2 for two objects OB1 and OB2 are shown here, but a larger number of objects may be detected in an actual foreign object confirmation inspection.
 Next, in this inference method, the time information and position information contained in each of the trajectory data A1 and A2 are input into a feature generation model to generate feature quantities B1 and B2 according to the context. These feature quantities are generated for each time. For example, a feature quantity at time t1 is generated from the time information and position information at that time t1. Also, for example, a previously generated judgment function may be used to judge whether a position is inside or outside the liquid region in the frame image, and a value indicating the judgment result may be included in the feature quantities B1 and B2. This makes it possible to eliminate the influence of objects, etc. detected outside the liquid region and obtain valid inference results.
 The feature generation model makes it possible to generate features according to the context using time and position information, which are significantly smaller in data size than video images. In this inference method, the trajectory data and features generated as described above are integrated to generate integrated data, and the integrated data is input to the inference model to output the inference result. This completes the inference method. Specifically, the inference result indicates whether each of the objects OB1 and OB2 is an air bubble or a foreign body.
 The objects OB1 and OB2 detected in the above-mentioned foreign body confirmation inspection are both small in size and similar in appearance. For this reason, it is difficult to accurately determine whether a detected object is an air bubble or a foreign body. However, this inference method uses integrated data that reflects features generated using a feature generation model to perform estimation that takes into account the context, making it possible to perform difficult estimations with high accuracy.
 (Configuration of information processing device 3)
 The configuration of the information processing device 3 will be described with reference to Fig. 5. Fig. 5 is a block diagram showing an example of the configuration of the information processing device 3. The information processing device 3 includes a control unit 30 that controls each unit of the information processing device 3, and a storage unit 31 that stores various data used by the information processing device 3. The information processing device 3 also includes a communication unit 32 that allows the information processing device 3 to communicate with other devices, an input unit 33 that accepts input of various data to the information processing device 3, and an output unit 34 that allows the information processing device 3 to output various data.
 The control unit 30 also includes an object detection unit 301, a trajectory data generation unit 302, a feature generation unit 303, an integration unit 304, an inference unit 305, a learning unit 306, a difference identification unit 307, an adjustment unit 308, and a similarity calculation unit 309. The storage unit 31 stores trajectory data 311, a feature generation model 312, an inference model 313, and teacher data 314. The learning unit 306, the similarity calculation unit 309, and the teacher data 314 will be described later in the section "(About learning)", and the difference identification unit 307 and the adjustment unit 308 will be described later in the section "(Regarding absorption of contextual differences)".
 The object detection unit 301 detects a specific object from each of multiple frame images extracted from a video. If the target video has been shot at a high frame rate, the object detection unit 301 can detect the object from each frame image with relatively lightweight image processing by utilizing the positional continuity. There are no particular limitations on the method of detecting the object. For example, the object detection unit 301 may detect the object using a detection model that has been trained to detect the object using an image of the object as training data. There are no particular limitations on the algorithm of the detection model. For example, the object detection unit 301 may use a detection model such as a convolutional neural network, a recursive neural network, or a transformer, or a detection model that combines a plurality of these.
 The trajectory data generation unit 302 generates trajectory data indicating the trajectory of the movement of an object based on the detection result of the object from multiple frame images by the object detection unit 301. As described above, the trajectory data includes time information indicating the timing when the frame image in which the object is detected was captured, and position information indicating the detected position of the object in the frame image, and may also include image patches or the like as information indicating the characteristics of the detected object. The time information may also include information indicating the time difference between the frame images. The generated trajectory data is stored in the storage unit 31 as trajectory data 311.
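 As a concrete illustration of the kind of per-object record the trajectory data 311 could hold, the following is a minimal sketch in Python; the field names and types are assumptions for illustration and are not specified in the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TrajectoryPoint:
    time: float                            # normalized capture time of the frame
    position: Tuple[float, float]          # normalized (x, y) detection position
    patch_feature: Optional[list] = None   # optional features extracted from an image patch
    size: Optional[float] = None           # optional detected object size
    speed: Optional[float] = None          # optional moving speed

@dataclass
class Trajectory:
    object_id: int
    points: List[TrajectoryPoint] = field(default_factory=list)

# Usage example with hypothetical values:
traj = Trajectory(object_id=1)
traj.points.append(TrajectoryPoint(time=0.0, position=(0.12, -0.35)))
traj.points.append(TrajectoryPoint(time=0.1, position=(0.15, -0.30)))
```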
 The feature generation unit 303 generates features according to the context of an object appearing in each of a plurality of frame images extracted from a video capturing an object moving along a specific context, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image. Specifically, the feature generation unit 303 generates features by inputting the time information and position information indicated in the trajectory data 311 to the feature generation model 312.
 The feature generation model 312 is a trained model that has been trained to generate features according to a context. More specifically, the feature generation model 312 is generated by learning the relationship between time information indicating the timing at which an object moving in accordance with a specified context was photographed, and position information indicating the detected position of the object in the image photographed at that timing, and the feature of the object at that timing. The feature generation model 312 may be a function that uses the above-mentioned feature as a response variable and time information and position information as explanatory variables.
 The context that influences the movement of the object used in learning may be the same as or similar to the context that influences the movement of the target object that is the subject of inference. Furthermore, the algorithm of the feature generation model 312 is not particularly limited. For example, the feature generation model 312 may be a model such as a convolutional neural network, a recurrent neural network, or a transformer, or may be a model that combines two or more of these.
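 As one possible concrete form of such a feature generation model, the following is a minimal PyTorch sketch of a small multilayer perceptron that maps a normalized time value and a normalized (x, y) position to a context feature vector; the network shape and the feature dimension are assumptions made for illustration, not a definitive implementation of the model 312.

```python
import torch
from torch import nn

class ContextFeatureModel(nn.Module):
    """Maps (time, x, y) to a context feature vector."""
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 64),            # input: normalized time and (x, y) position
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, feature_dim),  # output: context feature
        )

    def forward(self, time_xy: torch.Tensor) -> torch.Tensor:
        # time_xy has shape (batch, 3) = [t, x, y] per detection
        return self.net(time_xy)

# Usage example with hypothetical inputs:
model = ContextFeatureModel()
inputs = torch.tensor([[0.0, 0.12, -0.35], [0.1, 0.15, -0.30]])
context_features = model(inputs)  # shape (2, 16)
```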
 The integration unit 304 integrates the trajectory data 311 generated by the trajectory data generation unit 302 and the features generated by the feature generation unit 303 to generate integrated data. The integration method is not particularly limited. For example, the integration unit 304 may combine the feature as an additional dimension with respect to each time component in the trajectory data 311 (specifically, position information or an image patch associated with one piece of time information, etc.). Furthermore, for example, the integration unit 304 may generate integrated data reflecting the feature by adding the feature to each time component or multiplying each time component by the feature. Furthermore, for example, the integration unit 304 may reflect the feature in each time component by an attention mechanism.
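 Of the integration methods listed above, concatenating the context feature to each per-time component as additional dimensions is the simplest; the following is a minimal sketch under that choice, continuing the hypothetical tensor shapes of the previous example.

```python
import torch

def integrate_by_concatenation(trajectory_components: torch.Tensor,
                               context_features: torch.Tensor) -> torch.Tensor:
    """Concatenate the context feature to each per-time component.

    trajectory_components: (T, D) tensor, one row per time step of a trajectory
                           (e.g., position information and patch features).
    context_features:      (T, F) tensor of context features for the same steps.
    Returns a (T, D + F) tensor used as integrated data.
    """
    return torch.cat([trajectory_components, context_features], dim=-1)

# Usage example with hypothetical shapes:
components = torch.randn(10, 8)   # 10 time steps, 8-dimensional components
features = torch.randn(10, 16)    # matching context features
integrated = integrate_by_concatenation(components, features)  # (10, 24)
```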
 The inference unit 305 performs a predetermined inference regarding the object based on the features generated by the feature generation unit 303. Specifically, the inference unit 305 inputs integrated data reflecting the features generated by the feature generation unit 303 to the inference model 313, thereby obtaining an inference result, i.e., a determination result as to whether the object is an air bubble or a foreign object.
 The inference model 313 is a model generated by using integrated data generated from an image showing air bubbles or foreign objects to learn whether an object shown in the image is an air bubble or a foreign object. The algorithm of the inference model 313 is not particularly limited. For example, the inference model 313 may be a model such as a convolutional neural network, a recursive neural network, or a transformer, or may be a model that combines two or more of these.
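 As one possible concrete form of such an inference model, the following is a minimal PyTorch sketch that pools the integrated data of one trajectory over time and outputs two-class logits (air bubble vs. foreign matter); the pooling and layer sizes are assumptions made for illustration, and the disclosure equally permits other architectures such as recurrent networks or transformers.

```python
import torch
from torch import nn

class TrajectoryClassifier(nn.Module):
    """Classifies one trajectory's integrated data as bubble or foreign matter."""
    def __init__(self, input_dim: int = 24, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU())
        self.head = nn.Linear(64, num_classes)

    def forward(self, integrated: torch.Tensor) -> torch.Tensor:
        # integrated has shape (T, input_dim) for one trajectory
        encoded = self.encoder(integrated)   # (T, 64)
        pooled = encoded.mean(dim=0)         # average over time steps
        return self.head(pooled)             # logits: [bubble, foreign_matter]

# Usage example continuing the previous hypothetical shapes:
classifier = TrajectoryClassifier(input_dim=24)
logits = classifier(torch.randn(10, 24))
predicted = logits.argmax().item()  # 0: bubble, 1: foreign matter (assumed order)
```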
 As described above, the information processing device 3 includes a feature generating unit 303 that generates features according to the context of an object appearing in a frame image using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, and an inference unit 305 that performs a predetermined inference regarding the object based on the features generated by the feature generating unit 303. This provides the effect of being able to perform inference that takes into account the context while suppressing calculation costs, similar to the information processing device 2 according to the exemplary embodiment 1. Note that still images in a time series obtained by continuously capturing still images are also included in the category of "a plurality of frame images extracted from a video".
 Furthermore, as described above, the feature generation unit 303 may generate features using a feature generation model 312 that has learned the relationship between time information indicating the timing at which an object moving along a context that is the same as or similar to the context in which the target object moves was photographed, position information indicating the detected position of the object in the image photographed at that timing, and features according to the context of the object at that timing. This provides the effect of being able to generate appropriate features based on the learning results, in addition to the effect provided by the information processing device 2 according to exemplary embodiment 1.
 As described above, the information processing device 3 includes a trajectory data generation unit 302 that generates trajectory data 311 indicating the trajectory of the movement of an object based on the detection result of the object from a plurality of frame images, and an integration unit 304 that integrates the trajectory data 311 and the feature amount generated by the feature amount generation unit 303 to generate integrated data, the feature amount generation unit 303 generates the feature amount using the position information and time information extracted from the trajectory data 311, and the inference unit 305 performs inference using the integrated data. As a result, in addition to the effect achieved by the information processing device 2 according to exemplary embodiment 1, an effect is obtained in that inference taking into account the context can be performed within the framework of performing inference using the trajectory data 311 generated based on the frame images.
 (学習について)
 本項目では学習部306による学習について説明する。また、教師データ314と類似度算出部309についても説明する。学習部306は、教師データ314を用いた学習により、特徴量生成モデル312と推論モデル313を更新する。
(About learning)
This section describes learning by the learning unit 306. It also describes the teacher data 314 and the similarity calculation unit 309. The learning unit 306 updates the feature generation model 312 and the inference model 313 by learning using the teacher data 314.
 教師データ314は、ある物体の軌跡データに対して、正解データとして当該物体が気泡であるか異物であるかを示す情報が対応付けられたデータである。また、教師データ314には、当該軌跡データの元になった複数のフレーム画像が含まれていてもよい。上述のように、特徴量生成部303は軌跡データに含まれる時刻情報と位置情報から特徴量を生成し、統合部304は生成された特徴量と軌跡データとを統合して統合データを生成する。そして、推論部305は統合データを用いて推論を行い、これにより教師データ314に含まれる軌跡データに基づく推論の結果が得られる。 The teacher data 314 is data in which information indicating whether an object is an air bubble or a foreign object is associated as correct answer data with the trajectory data of an object. The teacher data 314 may also include multiple frame images that are the source of the trajectory data. As described above, the feature generation unit 303 generates features from the time information and position information included in the trajectory data, and the integration unit 304 integrates the generated features with the trajectory data to generate integrated data. The inference unit 305 then performs inference using the integrated data, thereby obtaining an inference result based on the trajectory data included in the teacher data 314.
 学習部306は、教師データ314に含まれる軌跡データに基づく推論の結果が、その教師データ314に示される所定の正解データに近付くように、推論モデル313と特徴量生成モデル312を更新する。例えば、学習部306は、勾配降下法を用いて、推論結果と正解データとの誤差の総和である損失関数を最小化するように、推論モデル313と特徴量生成モデル312のそれぞれを更新してもよい。 The learning unit 306 updates the inference model 313 and the feature generation model 312 so that the result of inference based on the trajectory data included in the teacher data 314 approaches the predetermined correct answer data indicated in the teacher data 314. For example, the learning unit 306 may use a gradient descent method to update each of the inference model 313 and the feature generation model 312 so as to minimize a loss function that is the sum of the errors between the inference result and the correct answer data.
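As a concrete illustration of this joint update, the following is a minimal sketch of one gradient-descent step in Python, assuming PyTorch-style modules; the names feature_model, inference_model and the batch keys are hypothetical and not taken from the publication, and the optimizer is assumed to hold the parameters of both models.

```python
import torch
import torch.nn.functional as F

def training_step(feature_model, inference_model, optimizer, batch):
    """One joint gradient-descent update of both models (hypothetical interfaces)."""
    times, positions, labels = batch["times"], batch["positions"], batch["labels"]

    # Generate context-dependent features from the time and position information
    features = feature_model(times, positions)

    # Integrate trajectory information with the generated features (here: concatenation)
    integrated = torch.cat([positions, features], dim=-1)

    # Infer bubble vs. foreign matter and compare against the ground-truth labels
    logits = inference_model(integrated)
    loss = F.cross_entropy(logits, labels)

    # Update the parameters of both models so the inference approaches the ground truth
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```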
 ここで、上述のように、推論時には、動画像またはフレーム画像をそのまま用いて特徴量の生成等を行うことはないが、学習時にはフレーム画像を利用してもよい。例えば、類似したフレーム画像においてはコンテキストも類似していると考えられるから、フレーム画像間の類似度を特徴量生成モデル312の更新に利用してもよい。 As described above, during inference, video images or frame images are not used as is to generate features, but frame images may be used during learning. For example, since similar frame images are considered to have similar contexts, the similarity between frame images may be used to update the feature generation model 312.
 類似度算出部309は、フレーム画像間の類似度を算出するものであり、当該類似度を特徴量生成モデル312の更新に利用する場合に用いられる構成である。類似度算出部309が類似度を算出する場合、学習部306は、複数のフレーム画像間の類似度が、当該フレーム画像について特徴量生成モデル312により生成される特徴量間の類似度に反映されるように特徴量生成モデル312を更新する。例えば、上述の損失関数に正規化項を追加することにより、フレーム画像間の類似度が特徴量の類似度に近くなるように特徴量生成モデル312を更新することができる。 The similarity calculation unit 309 calculates the similarity between frame images, and is configured to be used when the similarity is used to update the feature generation model 312. When the similarity calculation unit 309 calculates the similarity, the learning unit 306 updates the feature generation model 312 so that the similarity between multiple frame images is reflected in the similarity between features generated by the feature generation model 312 for the frame images. For example, by adding a normalization term to the loss function described above, the feature generation model 312 can be updated so that the similarity between frame images becomes closer to the similarity of the features.
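One possible way to realize such a regularization term is sketched below. The use of cosine similarity, the mean-squared-error matching, and the weight value are assumptions made for illustration only; the publication does not specify the form of the term.

```python
import torch
import torch.nn.functional as F

def similarity_regularizer(features, frame_crops, weight=0.1):
    """Regularization term that pulls feature similarity toward frame-image similarity.

    features:    (N, D) features produced by the feature generation model
    frame_crops: (N, C, H, W) image patches cut out around the detected positions
    """
    # Pairwise cosine similarity between generated features
    feat_sim = F.cosine_similarity(features.unsqueeze(1), features.unsqueeze(0), dim=-1)

    # Pairwise cosine similarity between flattened image patches
    flat = frame_crops.flatten(start_dim=1).float()
    img_sim = F.cosine_similarity(flat.unsqueeze(1), flat.unsqueeze(0), dim=-1)

    # Penalize mismatch between the two similarity structures
    return weight * F.mse_loss(feat_sim, img_sim)

# Usage: total_loss = task_loss + similarity_regularizer(features, frame_crops)
```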
 (コンテキストの差異の吸収について)
 前述のとおり、学習に用いた物体の移動に影響を与えるコンテキストは、推論の対象となる対象物の移動に影響を与えるコンテキストと同一のものであってもよいし、類似のものであってもよい。また、それらのコンテキストは少なくとも一部分が同一または類似であればよく、全体が同一またはでなくてもよい。
 前述のとおり、学習に用いた物体の移動に影響を与えるコンテキストは、推論の対象となる対象物の移動に影響を与えるコンテキストと同一のものであってもよいし、類似のものであってもよい。また、それらのコンテキストは少なくとも一部分が同一または類似であればよく、全体が同一または類似でなくてもよい。
(Regarding absorption of contextual differences)
As described above, the context that influences the movement of the object used in learning may be the same as or similar to the context that influences the movement of the target object that is the subject of inference. Furthermore, the contexts need only be at least partially the same or similar, and do not have to be entirely the same or similar.
 学習に用いた物体の移動におけるコンテキストと、推論の対象となる対象物の移動におけるコンテキストとの間に差異がある場合に用いられるのが、差異特定部307と調整部308である。 The difference identification unit 307 and adjustment unit 308 are used when there is a difference between the context of the movement of the object used in learning and the context of the movement of the target object that is the subject of inference.
 調整部308は、学習に用いた物体の移動におけるコンテキストと、推論の対象となる対象物の移動におけるコンテキストとの間の差異を吸収するように、特徴量の生成に用いる時刻情報および位置情報の少なくとも何れかを調整する。 The adjustment unit 308 adjusts at least one of the time information and the position information used to generate the features so as to absorb the difference between the context of the movement of the object used in learning and the context of the movement of the target object to be inferred.
 差異特定部307は、学習に用いた物体の周囲の環境と当該対象物の周囲の環境との差異に基づき、当該物体の移動におけるコンテキストと当該対象物の移動におけるコンテキストとの差異を特定する。また、差異特定部307は、学習に用いた物体と推論の対象となる対象物との差異に基づき、当該物体の移動におけるコンテキストと当該対象物の移動におけるコンテキストとの差異を特定してもよい。 The difference identification unit 307 identifies the difference between the context in which an object used for learning moves and the context in which an object moves, based on the difference between the environment surrounding the object and the environment surrounding the target object. The difference identification unit 307 may also identify the difference between the context in which an object used for learning moves and the context in which an object moves, based on the difference between the object used for learning and the target object to be inferred.
 以下、差異特定部307と調整部308について、図6に基づいてさらに詳細に説明する。図6は、コンテキスト間に差異が生じた例を示す図である。図6に示す例EX1では、図3と同様に、回転、静止、回転という制御シーケンスにより異物確認検査を行うことを想定しており、学習時と推論時の各制御シーケンスに含まれる制御とその実行順は同一である。ただし、推論時における静止期間が学習時よりも短くなっている。より具体的には、例EX1では学習時と推論時の何れにおいても、時刻t1に回転を開始して時刻t2に回転を終了して静止状態とし、時刻t3には容器内の液体の動きが定常化している。この後、学習時には時刻t4で2回目の回転を開始して時刻t5に2回目の回転を終了しているのに対し、推論時には時刻t4よりもΔtだけ早い時刻t4’で2回目の回転を開始している。そして、これに伴って、2回目の回転を終了する時刻も時刻t5よりもΔtだけ早い時刻t5’となっている。このように、例EX1では、学習時と推論時とで時刻t4’以降のコンテキストにΔtのずれが生じている。 The difference identification unit 307 and the adjustment unit 308 will be described in more detail below with reference to FIG. 6. FIG. 6 is a diagram showing an example in which a difference occurs between contexts. In the example EX1 shown in FIG. 6, as in FIG. 3, it is assumed that a foreign object confirmation inspection is performed using a control sequence of rotation, rest, and rotation, and the controls included in each control sequence during learning and inference and their execution order are the same. However, the rest period during inference is shorter than during learning. More specifically, in the example EX1, during both learning and inference, rotation starts at time t1 and ends at time t2 to enter a rest state, and the movement of the liquid in the container becomes steady at time t3. After that, during learning, the second rotation starts at time t4 and ends at time t5, whereas during inference, the second rotation starts at time t4', which is Δt earlier than time t4. Accordingly, the time at which the second rotation ends is also time t5', which is Δt earlier than time t5. Thus, in example EX1, there is a difference of Δt in the context from time t4' onwards between learning and inference.
 この場合、調整部308は、特徴量の生成に用いる時刻情報のうち、時刻t4’から時刻t5’の期間に対応する各時刻情報に示される時刻にΔtを加算する調整を行う。これにより、学習時と推論時におけるコンテキストの差異を吸収することができる。なお、例EX1とは逆に、推論時において学習時よりも静止期間をΔtだけ長くした場合には、調整部308は、静止期間の終了後の各時刻情報に示される時刻からΔtを減算する調整を行えばよい。 In this case, the adjustment unit 308 performs an adjustment by adding Δt to the time indicated in each piece of time information corresponding to the period from time t4' to time t5' among the time information used to generate the feature. This makes it possible to absorb the difference in context between learning and inference. Note that, contrary to example EX1, if the still period during inference is made longer by Δt than during learning, the adjustment unit 308 can perform an adjustment by subtracting Δt from the time indicated in each piece of time information after the end of the still period.
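A minimal sketch of this kind of time adjustment, under the assumption that the time information is held as a plain list of time stamps, might look as follows; the function and variable names are hypothetical.

```python
def shift_times_in_period(times, period_start, period_end, delta_t):
    """Shift time stamps falling in [period_start, period_end] by delta_t.

    In example EX1 the times from t4' to t5' would be shifted by +delta_t;
    if the rest period at inference were longer instead, a negative delta_t
    would be passed.
    """
    return [t + delta_t if period_start <= t <= period_end else t for t in times]
```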
 このように、コンテキストに影響を与える制御シーケンスが学習時と推論時とで異なる場合、調整部308は、それらの差異に応じて時刻情報を調整することにより、コンテキストの差異を吸収することができる。 In this way, if the control sequence that affects the context differs between learning and inference, the adjustment unit 308 can absorb the difference in context by adjusting the time information according to that difference.
 また、学習時と推論時とで容器を動かす方向を逆向きにした場合、学習時と推論時とで液体の流れる向きも逆向きになり、この場合にも学習時と推論時とでコンテキストに差異が生じる。このように、学習時と推論時とで位置に関するコンテキストに差異が生じる場合には、調整部308は、その差異を吸収するように、特徴量の生成に用いる位置情報を調整すればよい。 Furthermore, if the direction in which the container is moved is reversed between learning and inference, the direction in which the liquid flows will also be reversed between learning and inference, and in this case too, a difference in context will occur between learning and inference. In this way, when a difference in context related to position occurs between learning and inference, the adjustment unit 308 can adjust the position information used to generate features so as to absorb the difference.
 例えば、容器を動かす方向を逆向きにすることにより、対象物の移動のパターンが、学習時における物体の移動のパターンに対して左右反転したものとなったとする。この場合、調整部308は、対象物の位置情報に対して左右反転させる処理を施すことによりコンテキスト間の差異を吸収してもよい。例えば、位置情報として座標値を用いる場合、調整部308は、移動のパターンが左右反転している期間における各座標値に対し、所定の軸を基準として左右反転させる変換を施せばよい。また、例えば、対象物の移動のパターンと、学習時における物体の移動のパターンとが回転対称の関係にある場合、調整部308は、対象物の位置情報を回転変換することによりコンテキスト間の差異を吸収してもよい。なお、時刻情報を調整する調整部と、位置情報を調整する調整部とをそれぞれ別のブロックとしてもよい。 For example, suppose that by reversing the direction in which the container is moved, the pattern of movement of the object becomes a left-right inversion of the pattern of movement of the object during learning. In this case, the adjustment unit 308 may absorb the difference between contexts by performing a left-right inversion process on the position information of the object. For example, when coordinate values are used as position information, the adjustment unit 308 may perform a conversion that inverts left-right about a specified axis for each coordinate value during the period in which the pattern of movement is inverted left-right. Also, for example, when the pattern of movement of the object and the pattern of movement of the object during learning are in a rotationally symmetric relationship, the adjustment unit 308 may absorb the difference between contexts by rotationally transforming the position information of the object. Note that the adjustment unit that adjusts the time information and the adjustment unit that adjusts the position information may each be separate blocks.
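The left-right flip and the rotational transformation described here could be sketched as follows, assuming position information is a list of (x, y) coordinates; the helper names are hypothetical.

```python
import math

def flip_positions(positions, axis_x):
    """Mirror (x, y) detection positions about the vertical line x = axis_x."""
    return [(2.0 * axis_x - x, y) for x, y in positions]

def rotate_positions(positions, angle_rad, center=(0.0, 0.0)):
    """Rotate (x, y) detection positions by angle_rad around a given center."""
    cx, cy = center
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    return [(cx + (x - cx) * cos_a - (y - cy) * sin_a,
             cy + (x - cx) * sin_a + (y - cy) * cos_a)
            for x, y in positions]
```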
 また、コンテキストに影響を与える制御シーケンスが学習時と推論時とで同じであった場合でも、推論時に用いる各フレーム画像に対応する時刻情報が、学習に用いた各フレーム画像に対応する時刻情報とずれることがあり得る。例えば、図6の例EX1において、学習に用いられたフレーム画像のうち1回目の回転の開始時のフレーム画像の時刻はt1である。このとき、推論に用いられたフレーム画像のうち1回目の回転の開始時のフレーム画像の時刻がt1’(t1’<t1)であれば、コンテキスト間に(t1-t1’)のずれが生じる。このような場合、調整部308は、推論に用いる各時刻情報に示される時刻に(t1-t1’)の値を加算する調整を行えばよい。また、推論に用いられたフレーム画像のうち1回目の回転の開始時のフレーム画像の時刻がt1”(t1”>t1)であれば、調整部308は、推論に用いる各時刻情報に示される時刻から(t1”-t1)の値を減算する調整を行えばよい。このような調整により、時刻情報に示される時刻と制御タイミングとの関係を、学習時と揃えることができ、これにより、特徴量生成モデル312に適切な特徴量を出力させることができる。 In addition, even if the control sequence that affects the context is the same during learning and inference, the time information corresponding to each frame image used during inference may differ from the time information corresponding to each frame image used for learning. For example, in example EX1 of Figure 6, the time of the frame image at the start of the first rotation among the frame images used for learning is t1. In this case, if the time of the frame image at the start of the first rotation among the frame images used for inference is t1' (t1' < t1), a difference of (t1 - t1') will occur between the contexts. In such a case, the adjustment unit 308 can perform an adjustment by adding the value of (t1 - t1') to the time indicated in each piece of time information used for inference. Furthermore, if the time of the frame image at the start of the first rotation among the frame images used in inference is t1" (t1" > t1), the adjustment unit 308 can perform an adjustment by subtracting the value of (t1" - t1) from the time indicated in each piece of time information used in inference. By making such an adjustment, the relationship between the time indicated in the time information and the control timing can be aligned with that during learning, thereby allowing the feature generation model 312 to output appropriate features.
 また、コンテキストに差異を生じさせる要素は制御シーケンスに限られない。例えば、学習に用いた物体と推論の対象物との間に差異がある場合や、学習に用いた物体の周囲の環境と推論の対象物の周囲の環境との間に差異がある場合には、コンテキスト間に差異が生じ得る。 Furthermore, factors that cause differences in context are not limited to control sequences. For example, differences between contexts can arise when there is a difference between the object used in learning and the subject of inference, or when there is a difference between the environment surrounding the object used in learning and the environment surrounding the subject of inference.
 差異特定部307は、上述のような差異、すなわち学習に用いた物体と推論の対象物との差異、および、学習に用いた物体の周囲の環境と当該対象物の周囲の環境との差異の少なくとも何れかに基づき、当該物体の移動におけるコンテキストと当該対象物の移動におけるコンテキストとの差異を特定する。このため、本例示的実施形態に係る情報処理装置3によれば、例示的実施形態1に係る情報処理装置2が奏する効果に加えて、コンテキストの差異を自動で特定することができるという効果が得られる。また、情報処理装置3は、調整部308を備えているため、差異特定部307が特定した差異を吸収する調整を調整部308に行わせることができる。 The difference identification unit 307 identifies the difference between the context in which the object moves and the context in which the object moves, based on at least one of the differences described above, i.e., the difference between the object used in learning and the object to be inferred, and the difference between the environment surrounding the object used in learning and the environment surrounding the object. Therefore, according to the information processing device 3 of this exemplary embodiment, in addition to the effect achieved by the information processing device 2 of exemplary embodiment 1, the effect of being able to automatically identify context differences can be obtained. Furthermore, since the information processing device 3 is equipped with the adjustment unit 308, it is possible to cause the adjustment unit 308 to make adjustments to absorb the differences identified by the difference identification unit 307.
 例えば、図6に示す例EX2には、回転、静止、回転というシーケンスで行われる異物確認検査において、学習時と推論時とで容器に封入する液体の粘度が異なる場合の例を示している。つまり、例EX2では、学習に用いた物体の周囲の環境と当該対象物の周囲の環境とが異なっている。具体的には、推論に用いた容器内の液体は、学習に用いた液体よりも粘度が高く、このため推論時には、容器を静止させてから容器内が定常化するまでの時間が学習時より短くなっている。つまり、学習時において定常化した時刻t3よりも、推論時において定常化した時刻t3’が先の時刻(t3>t3’)となっている。 For example, example EX2 in Figure 6 shows a case where the viscosity of the liquid sealed in the container is different during learning and inference in a foreign object confirmation inspection that is performed in a sequence of rotation, rest, and rotation. That is, in example EX2, the environment surrounding the object used in learning is different from the environment surrounding the target object. Specifically, the liquid in the container used in inference has a higher viscosity than the liquid used in learning, and therefore, during inference, the time from when the container is brought to rest until the inside of the container becomes steady is shorter than during learning. In other words, the time t3' when the container becomes steady during inference is earlier than the time t3 when the container becomes steady during learning (t3 > t3').
 この場合、差異特定部307は、推論に用いた容器内の液体の粘性に基づき、学習時と推論時におけるコンテキスト間の差異である、定常化した時刻t3’を特定する。なお、粘度と定常化までの所要時間との関係を予め特定してモデル化しておけば、差異特定部307は当該モデルと推論に用いた容器内の液体の粘度を用いて時刻t3’を特定することができる。 In this case, the difference identification unit 307 identifies the time t3' at which the liquid stabilized, which is the difference between the contexts at the time of learning and the time of inference, based on the viscosity of the liquid in the container used for inference. If the relationship between the viscosity and the time required for the liquid to stabilize is identified and modeled in advance, the difference identification unit 307 can identify the time t3' using the model and the viscosity of the liquid in the container used for inference.
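As an illustration only, the sketch below assumes the pre-modeled relation between viscosity and settling time is a simple linear fit; the actual form of the model is not specified in the publication, and the coefficient values would come from offline measurements.

```python
def estimate_settling_time(viscosity, coeffs):
    """Estimate the time until the liquid becomes steady from its viscosity.

    coeffs = (a, b) is a relation fitted offline, here assumed to be linear:
    settling_time = a * viscosity + b.
    """
    a, b = coeffs
    return a * viscosity + b

def settling_time_shift(train_viscosity, infer_viscosity, coeffs):
    """Context difference (t3 - t3') between learning and inference."""
    return (estimate_settling_time(train_viscosity, coeffs)
            - estimate_settling_time(infer_viscosity, coeffs))
```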
 そして、調整部308は、差異特定部307の特定結果に基づいて時刻情報を調整することにより上記の差異を吸収させる。具体的には、調整部308は、時刻t3’からt4までの各時刻情報に示される時刻に(t3-t3’)の値を加算する調整を行えばよい。 Then, the adjustment unit 308 absorbs the above-mentioned difference by adjusting the time information based on the result of the identification by the difference identification unit 307. Specifically, the adjustment unit 308 performs an adjustment by adding the value of (t3-t3') to the time indicated in each piece of time information from time t3' to time t4.
 なお、調整部308は、定常状態における時刻を全て同じ値に調整してもよい。この場合、調整部308は、時刻t3’からt4までの各時刻情報に示される時刻を、例えば時刻t3に置換してもよい。このように、調整部308は、推論時においてコンテキストが一定である期間の時刻情報を一定の値としてもよい。また、当該一定の値は、学習時において上記コンテキストと同一または類似のコンテキストに従って物体が移動していた期間(上述の例では時刻t3からt4までの期間)における時刻の値から選択すればよい。 The adjustment unit 308 may adjust all times in the steady state to the same value. In this case, the adjustment unit 308 may replace the times indicated in each piece of time information from time t3' to t4 with, for example, time t3. In this way, the adjustment unit 308 may set the time information for a period in which the context is constant during inference to a constant value. Furthermore, the constant value may be selected from the time values for a period in which an object moved according to a context that was the same as or similar to the above context during learning (the period from time t3 to t4 in the above example).
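A sketch of this constant-value replacement, again assuming list-based time information and hypothetical names, might be:

```python
def clamp_steady_period(times, steady_start, steady_end, constant_value):
    """Replace time stamps in the steady period with one constant value.

    constant_value should be chosen from the steady period observed at
    learning time (e.g. t3), so that the feature generation model receives
    inputs comparable to those it was trained on.
    """
    return [constant_value if steady_start <= t <= steady_end else t for t in times]
```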
 以上のように、情報処理装置3は、学習に用いた物体の移動におけるコンテキストと、推論の対象となる対象物の移動におけるコンテキストとの差異を吸収するように、特徴量の生成に用いる時刻情報および位置情報の少なくとも何れかを調整する調整部308を備える。このため、本例示的実施形態に係る情報処理装置3によれば、対象物の移動におけるコンテキストと、学習に用いた物体の移動におけるコンテキストとの間に差異がある場合であっても、同じ特徴量生成モデル312を用いて妥当な特徴量を生成することができるという効果が得られる。 As described above, the information processing device 3 includes an adjustment unit 308 that adjusts at least one of the time information and the position information used to generate features so as to absorb the difference between the context of the movement of the object used in learning and the context of the movement of the target object to be inferred. Therefore, according to the information processing device 3 of this exemplary embodiment, even if there is a difference between the context of the movement of the target object and the context of the movement of the object used in learning, the effect is obtained that appropriate features can be generated using the same feature generation model 312.
 (学習時の処理の流れ)
 図7は、情報処理装置3が学習時に行う処理の流れを示すフロー図である。なお、学習を行うにあたり、教師データ314と特徴量生成モデル312を予め記憶部31に記憶させておく。記憶部31に記憶させておく特徴量生成モデル312は、パラメータが初期状態のものであってもよいし、ある程度学習が進んだものであってもよい。
(Learning process flow)
7 is a flow diagram showing the flow of processing performed by the information processing device 3 during learning. Note that, when learning is performed, the teacher data 314 and the feature generation model 312 are stored in advance in the storage unit 31. The feature generation model 312 stored in the storage unit 31 may have parameters in an initial state, or may be a model in which learning has progressed to a certain degree.
 S31では、学習部306が記憶部31に記憶されている教師データ314を取得する。上述のように、教師データ314は、ある物体の軌跡データに対して正解データとして当該物体が気泡であるか異物であるかを示す情報が対応付けられたデータである。また、ここでは、教師データ314には、軌跡データの元になった複数のフレーム画像も含まれているとする。 In S31, the learning unit 306 acquires the teacher data 314 stored in the memory unit 31. As described above, the teacher data 314 is data in which information indicating whether an object is an air bubble or a foreign object is associated as correct answer data with respect to the trajectory data of the object. In addition, here, the teacher data 314 also includes multiple frame images that are the basis of the trajectory data.
 S32では、特徴量生成部303が、S31で取得された教師データ314を用いて、学習に用いた物体のコンテキストに応じた特徴量を生成する。より詳細には、特徴量生成部303は、S31で取得された教師データ314に含まれる軌跡データに示される時刻情報および位置情報を特徴量生成モデル312に入力することにより特徴量を生成する。 In S32, the feature generation unit 303 uses the teacher data 314 acquired in S31 to generate features according to the context of the object used in learning. More specifically, the feature generation unit 303 generates features by inputting the time information and position information indicated in the trajectory data included in the teacher data 314 acquired in S31 to the feature generation model 312.
 S33では、統合部304が、S31で取得された教師データ314に含まれる軌跡データと、S32で生成された特徴量とを統合して統合データを生成する。そして、S34では、推論部305が、S33で生成された統合データを用いて所定の推論を行う。具体的には、推論部305は、S33で生成された統合データを推論モデル313に入力することにより、推論結果すなわち物体が気泡であるか異物であるかの判定結果を得る。 In S33, the integration unit 304 integrates the trajectory data included in the teacher data 314 acquired in S31 with the feature amount generated in S32 to generate integrated data. Then, in S34, the inference unit 305 performs a predetermined inference using the integrated data generated in S33. Specifically, the inference unit 305 inputs the integrated data generated in S33 to the inference model 313 to obtain an inference result, i.e., a determination result as to whether the object is an air bubble or a foreign object.
 S35では、類似度算出部309が、S31で取得された教師データ314に含まれるフレーム画像間の類似度を算出する。このフレーム画像は、軌跡データに示される位置情報に該当する場所の周囲を切り出したものを用いてもよい。なお、類似度算出部309は、教師データ314に含まれる複数のフレーム画像(1つの軌跡データに対応するもの)の全ての組み合わせについて、それぞれ類似度を算出してもよいし、一部の組み合わせについて類似度を算出してもよい。また、S35の処理はS36より前に行えばよく、例えばS32より先に行ってもよいし、S32~S34の処理と並行で行ってもよい。 In S35, the similarity calculation unit 309 calculates the similarity between the frame images included in the teacher data 314 acquired in S31. The frame images may be clipped from around a location corresponding to the position information indicated in the trajectory data. The similarity calculation unit 309 may calculate the similarity for each combination of multiple frame images (corresponding to one trajectory data) included in the teacher data 314, or may calculate the similarity for some combinations. The process of S35 may be performed before S36, and may be performed before S32, for example, or in parallel with the processes of S32 to S34.
 S36では、学習部306が、S34における推論の結果が、教師データ314に示される所定の正解データに近付くように特徴量生成モデル312を更新する。この更新において、学習部306は、S35で算出されたフレーム画像間の類似度が、当該フレーム画像について特徴量生成モデル312により生成される特徴量間の類似度に反映されるように特徴量生成モデル312を更新する。 In S36, the learning unit 306 updates the feature generation model 312 so that the result of the inference in S34 approaches the predetermined correct answer data indicated in the teacher data 314. In this update, the learning unit 306 updates the feature generation model 312 so that the similarity between the frame images calculated in S35 is reflected in the similarity between the features generated by the feature generation model 312 for the frame images.
 S37では、学習部306は、学習を終了するか否かを判定する。学習の終了条件は予め定めておけばよく、例えば特徴量生成モデル312の更新回数が所定回数に達したときに学習を終了するようにしてもよい。学習部306は、S37でNOと判定した場合には、S31の処理に戻り、新たな教師データ314を取得する。一方、学習部306は、S37でYESと判定した場合には、更新後の特徴量生成モデル312を記憶部31に記憶させて図7の処理を終える。 In S37, the learning unit 306 determines whether or not to end learning. The condition for ending learning may be determined in advance, and learning may end, for example, when the number of updates of the feature generation model 312 reaches a predetermined number. If the learning unit 306 determines NO in S37, it returns to the process of S31 and acquires new teacher data 314. On the other hand, if the learning unit 306 determines YES in S37, it stores the updated feature generation model 312 in the memory unit 31 and ends the process of FIG. 7.
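Putting S31 to S37 together, the outer loop might look like the sketch below. It reuses the hypothetical training_step function from the earlier sketch and assumes the termination condition is a fixed number of updates; the similarity regularization of S35–S36 is omitted for brevity.

```python
MAX_UPDATES = 10_000  # assumed termination condition: fixed number of updates

def train(feature_model, inference_model, optimizer, data_loader):
    updates = 0
    while updates < MAX_UPDATES:            # S37: check the termination condition
        for batch in data_loader:           # S31: fetch teacher data
            training_step(feature_model, inference_model, optimizer, batch)  # S32-S36
            updates += 1
            if updates >= MAX_UPDATES:
                break
    return feature_model, inference_model
```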
 なお、S31では、教師データ314の代わりに、所定の物体(具体的には気泡および異物の少なくとも何れか)が写った動画像または当該動画像から抽出されたフレーム画像を取得してもよい。この場合、物体検出部301が、取得されたフレーム画像から上記物体を検出し、軌跡データ生成部302が、検出された物体の軌跡データ311を生成する。この軌跡データ311に正解データをラベル付けすることにより、教師データ314が生成される。教師データ314が生成された後の処理は、上述したS32以降の処理と同様である。 In S31, instead of the teacher data 314, a video image showing a predetermined object (specifically, at least one of an air bubble and a foreign object) or a frame image extracted from the video image may be acquired. In this case, the object detection unit 301 detects the object from the acquired frame image, and the trajectory data generation unit 302 generates trajectory data 311 of the detected object. The teacher data 314 is generated by labeling this trajectory data 311 with the correct answer data. The processing after the teacher data 314 is generated is the same as the processing from S32 onwards described above.
 以上のように、本例示的実施形態に係る特徴量生成モデル312の生成方法は、所定のコンテキストに沿って移動する物体を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における物体の検出位置を示す位置情報とを、上記コンテキストに応じた特徴量を生成するための特徴量生成モデル312に入力することにより算出された特徴量に基づいて上記物体に関する所定の推論を行うこと(S34)と、当該推論の結果が所定の正解データに近付くように特徴量生成モデル312を更新すること(S36)と、を含む。 As described above, the method for generating the feature generation model 312 according to this exemplary embodiment includes: for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, inputting time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into the feature generation model 312 for generating features according to the context, performing a predetermined inference regarding the object based on the calculated features (S34); and updating the feature generation model 312 so that the result of the inference approaches predetermined correct answer data (S36).
 上記の構成によれば、コンテキストに応じた特徴量を生成することが可能な特徴量生成モデル312を生成することができる。そして、これにより、時刻情報と位置情報とからコンテキストに応じた特徴量を生成することが可能な特徴量生成モデルを生成することができ、計算コストを抑えつつ、コンテキストを加味した推論を行うことが可能になるという効果が得られる。 The above configuration makes it possible to generate a feature generation model 312 capable of generating features according to a context. This makes it possible to generate a feature generation model capable of generating features according to a context from time information and location information, and has the effect of making it possible to perform inference that takes into account the context while suppressing computational costs.
 また、以上のように、本例示的実施形態に係る特徴量生成モデル312の生成方法は、複数のフレーム画像間の類似度を算出すること(S35)を含み、特徴量生成モデル312の更新において、複数のフレーム画像間の類似度が、当該フレーム画像について特徴量生成モデル312により生成される特徴量間の類似度に反映されるように特徴量生成モデル312を更新する。類似したフレーム画像においてはコンテキストも類似していると考えられるから、上記の構成によれば、フレーム画像間の類似度を考慮したより妥当性の高い特徴量を生成可能な特徴量生成モデル312を生成することができる。 As described above, the method for generating the feature generation model 312 according to this exemplary embodiment includes calculating the similarity between a plurality of frame images (S35), and in updating the feature generation model 312, the feature generation model 312 is updated so that the similarity between a plurality of frame images is reflected in the similarity between the features generated by the feature generation model 312 for the frame images. Since similar frame images are considered to have similar contexts, the above configuration makes it possible to generate a feature generation model 312 that can generate more valid features that take into account the similarity between frame images.
 (推論時の処理の流れ)
 図8は、情報処理装置3が推論時に行う処理(推論方法)の流れを示すフロー図である。なお、図8には、推論の対象となる動画像から抽出された複数のフレーム画像が情報処理装置3に入力された後の処理を示している。この動画像には、気泡であるか異物であるかを判定する対象となる対象物が写っている。なお、動画像からフレーム画像を抽出する処理も情報処理装置3が行うようにしてもよい。
(Processing flow during inference)
Fig. 8 is a flow diagram showing the flow of processing (inference method) performed by the information processing device 3 during inference. Fig. 8 shows processing after a plurality of frame images extracted from a moving image to be inferred are input to the information processing device 3. The moving image shows an object to be determined as being an air bubble or a foreign object. The information processing device 3 may also perform the processing of extracting frame images from the moving image.
 S41では、物体検出部301が、上記の各フレーム画像から対象物を検出する。続いて、S42では、軌跡データ生成部302が、S41における対象物の検出結果に基づいて当該対象物が移動した軌跡を示す軌跡データ311を生成する。なお、以下では、1つの対象物と、その対象物が移動した軌跡を示す1つの軌跡データ311が生成された場合の処理を説明する。複数の軌跡データ311が生成された場合には各軌跡データ311について以下説明するS43~S47の処理が行われる。 In S41, the object detection unit 301 detects an object from each of the frame images. Then, in S42, the trajectory data generation unit 302 generates trajectory data 311 indicating the trajectory of the object movement based on the object detection result in S41. Note that the following describes the process when one object and one trajectory data 311 indicating the trajectory of the object movement are generated. When multiple trajectory data 311 are generated, the processes of S43 to S47 described below are performed for each trajectory data 311.
 S43では、差異特定部307が、学習に用いた物体と推論の対象物との差異、および、学習に用いた物体の周囲の環境と当該対象物の周囲の環境との差異の少なくとも何れかに基づき、当該物体の移動におけるコンテキストと当該対象物の移動におけるコンテキストとの差異を特定する。例えば、学習時と推論時とで容器に封入する液体の粘度が異なる場合、差異特定部307は、液体の粘度の差異に基づいて、容器内の液体が定常化する時刻を算出し、当該時刻と学習時において容器内の液体が定常化した時刻との差を算出してもよい。 In S43, the difference identification unit 307 identifies the difference between the context in which the object moves and the context in which the object moves, based on at least one of the difference between the object used in learning and the object to be inferred, and the difference between the environment surrounding the object used in learning and the environment surrounding the object. For example, if the viscosity of the liquid sealed in the container is different during learning and during inference, the difference identification unit 307 may calculate the time at which the liquid in the container becomes steady based on the difference in the viscosity of the liquid, and calculate the difference between this time and the time at which the liquid in the container becomes steady during learning.
 S44では、調整部308が、S43で特定されたコンテキスト間の差異を吸収するように、特徴量の生成に用いる時刻情報および位置情報の少なくとも何れかを調整する。例えば、S43において上記のように定常化する時刻の差が算出された場合、調整部308は、その時刻の差を吸収するように時刻情報を調整する。なお、コンテキスト間に差異がない場合にはS43およびS44の処理は省略される。また、学習時に時刻情報および位置情報の少なくとも一方が正規化されていた場合、調整部308は、特徴量の生成に用いる時刻情報および位置情報についても同様に正規化する。 In S44, the adjustment unit 308 adjusts at least one of the time information and the location information used to generate the features so as to absorb the difference between the contexts identified in S43. For example, if a difference in the time at which the features become stable is calculated as described above in S43, the adjustment unit 308 adjusts the time information so as to absorb the time difference. Note that if there is no difference between the contexts, the processes of S43 and S44 are omitted. Furthermore, if at least one of the time information and the location information was normalized during learning, the adjustment unit 308 similarly normalizes the time information and the location information used to generate the features.
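If min–max normalization had been applied at learning time, the corresponding adjustment at inference time could be sketched as follows; min–max scaling is an assumed example, since the publication does not specify the normalization method.

```python
def min_max_normalize(values, v_min, v_max):
    """Scale time or position values to [0, 1] using the bounds from learning time."""
    span = v_max - v_min
    if span == 0:
        span = 1.0
    return [(v - v_min) / span for v in values]
```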
 S45では、特徴量生成部303が、コンテキストに応じた特徴量を生成する。具体的には、特徴量生成部303は、1つの軌跡データ311に対応する複数のフレーム画像のそれぞれについて、当該フレーム画像に写る対象物の位置情報および時刻情報を当該軌跡データ311から抽出する。そして、特徴量生成部303は、抽出した時刻情報および位置情報を特徴量生成モデル312に入力して特徴量を生成する。これにより、各フレーム画像について、当該フレーム画像に写る対象物のコンテキストに応じた特徴量が生成される。 In S45, the feature generation unit 303 generates features according to the context. Specifically, for each of a plurality of frame images corresponding to one piece of trajectory data 311, the feature generation unit 303 extracts position information and time information of an object appearing in the frame image from the trajectory data 311. The feature generation unit 303 then inputs the extracted time information and position information into the feature generation model 312 to generate features. As a result, for each frame image, features according to the context of the object appearing in the frame image are generated.
 S46では、統合部304が、S42で生成された軌跡データ311と、S45で生成された特徴量とを統合して統合データを生成する。そして、S47では、推論部305が、S45で生成された特徴量に基づいて対象物に関する所定の推論を行う。具体的には、推論部305は、S45で生成された特徴量が反映された統合データを推論モデル313に入力することにより推論結果を得て、これにより図8の処理は終了する。なお、推論部305は、推論結果を出力部34等に出力させてもよいし、記憶部31等に記憶させてもよい。 In S46, the integration unit 304 integrates the trajectory data 311 generated in S42 with the feature quantities generated in S45 to generate integrated data. Then, in S47, the inference unit 305 performs a predetermined inference regarding the object based on the feature quantities generated in S45. Specifically, the inference unit 305 obtains an inference result by inputting the integrated data reflecting the feature quantities generated in S45 to the inference model 313, and the processing in FIG. 8 ends. Note that the inference unit 305 may output the inference result to the output unit 34, etc., or may store it in the memory unit 31, etc.
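The overall inference flow S41 to S47 can be pictured as in the sketch below; the detector, feature_model and inference_model interfaces are assumptions made for illustration, and the integration step is shown as a simple pairing of trajectory entries with the generated features.

```python
def run_inference(frames, detector, feature_model, inference_model):
    """Sketch of the inference flow of Fig. 8 (S41-S47) under assumed interfaces.

    detector(frame)                  -> (x, y) detected position of the object
    feature_model(times, positions)  -> per-frame context-dependent features
    inference_model(integrated)      -> e.g. "bubble" or "foreign matter"
    """
    # S41-S42: detect the object in each frame and build trajectory data
    times = list(range(len(frames)))
    positions = [detector(frame) for frame in frames]

    # S45: generate context-dependent features from time and position information
    features = feature_model(times, positions)

    # S46: integrate the trajectory data with the generated features
    integrated = list(zip(times, positions, features))

    # S47: perform the predetermined inference on the integrated data
    return inference_model(integrated)
```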
 以上のように、本例示的実施形態に係る推論方法は、所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像における対象物の検出位置を示す位置情報と、当該フレーム画像が撮影されたタイミングを示す時刻情報とを用いて、当該フレーム画像に写る対象物のコンテキストに応じた特徴量を生成すること(S45)と、生成された特徴量に基づいて対象物に関する所定の推論を行うこと(S47)と、を含む。これにより、計算コストを抑えつつ、コンテキストを加味した推論を行うことができるという効果が得られる。 As described above, the inference method according to this exemplary embodiment includes generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, features corresponding to the context of the object appearing in the frame images using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured (S45), and making a predetermined inference about the object based on the generated features (S47). This has the effect of making it possible to perform inference that takes into account the context while keeping computational costs down.
 〔変形例〕
 上述の例示的実施形態で説明した各処理の実行主体は任意であり、上述の例に限られない。つまり、相互に通信可能な複数の装置により情報処理装置1~3と同様の機能を備えた情報処理システムを構築することができる。例えば、図7のフロー図における処理と図8のフロー図における処理とをそれぞれ別の情報処理装置(あるいはプロセッサ)に実行させてもよい。また、図7あるいは図8に示すフロー図における各処理を複数の情報処理装置(あるいはプロセッサ)に分担させて実行させることもできる。
[Modifications]
The execution subject of each process described in the above exemplary embodiment is arbitrary and is not limited to the above example. In other words, an information processing system having the same functions as the information processing devices 1 to 3 can be constructed by a plurality of devices that can communicate with each other. For example, the process in the flow chart of FIG. 7 and the process in the flow chart of FIG. 8 may be executed by different information processing devices (or processors). Also, each process in the flow chart shown in FIG. 7 or FIG. 8 can be shared and executed by a plurality of information processing devices (or processors).
 また、推論部11、22、305が実行する所定の推論の内容は対象物に関するものであればよく、特に限定されない。例えば、例示的実施形態2で説明したような分類あるいは識別の他、予測、変換等であってもよい。 Furthermore, the content of the predetermined inference executed by the inference units 11, 22, and 305 is not particularly limited as long as it is related to the object. For example, in addition to classification or identification as described in the exemplary embodiment 2, it may be prediction, conversion, etc.
 また、コンテキストを生じさせる要因も任意である。例えば、情報処理装置2または3によれば、所定の周期で動作が変化する各種機器や、所定の周期で変化する自然現象などに起因して生じるコンテキストに沿って移動する対象物について、当該コンテキストを加味した推論を行うことができる。また、情報処理装置1または3によれば、上記のコンテキストを加味した推論を行うことを可能にする特徴量生成モデルを生成することができる。 Furthermore, the factors that give rise to a context are also arbitrary. For example, with information processing device 2 or 3, it is possible to perform inference that takes into account the context for an object that moves in accordance with a context that arises due to various devices whose operations change at a predetermined cycle, or due to natural phenomena that change at a predetermined cycle, etc. Furthermore, with information processing device 1 or 3, it is possible to generate a feature generation model that makes it possible to perform inference that takes into account the above-mentioned context.
 例えば、交通信号機の周囲の移動体(車両や人等)の移動は、交通信号機における周期的な発光制御の影響を受ける。つまり、上記移動体は、交通信号機の発光制御に起因するコンテキストに従って移動するといえる。 For example, the movement of moving objects (vehicles, people, etc.) around a traffic light is affected by the periodic light emission control of the traffic light. In other words, it can be said that the moving objects move according to the context resulting from the light emission control of the traffic light.
 このため、情報処理装置1または3は、上記コンテキストに沿って移動する移動体を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、時刻情報と位置情報とを特徴量生成モデルに入力することにより算出された特徴量に基づいて移動体に関する所定の推論を行い、推論の結果が所定の正解データに近付くように特徴量生成モデルを更新する、という処理を繰り返すことにより、上記コンテキストに応じた特徴量生成モデルを生成することができる。そして、情報処理装置2または3は、このようにして生成された特徴量生成モデルを用いて生成された特徴量に基づき、移動体に関する所定の推論を行うことにより、上記コンテキストを加味した妥当性の高い推論結果を得ることができる。推論内容は特に限定されず、例えば、所定時間後の移動体の位置予測、移動体の行動分類、あるいは移動体の異常行動の検知等であってもよい。なお、これらの推論においては、車両と歩行者間、車両間等の相互作用についても考慮することが好ましい。 For this reason, the information processing device 1 or 3 performs a predetermined inference regarding the moving object based on the feature calculated by inputting time information and position information into the feature generation model for each of a plurality of frame images extracted from a video image capturing a moving object moving along the context, and updates the feature generation model so that the inference result approaches predetermined correct answer data. By repeating this process, it is possible to generate a feature generation model according to the context. Then, the information processing device 2 or 3 performs a predetermined inference regarding the moving object based on the feature generated using the feature generation model thus generated, thereby obtaining a highly valid inference result that takes the context into account. The content of the inference is not particularly limited, and may be, for example, a position prediction of the moving object after a predetermined time, a behavior classification of the moving object, or detection of abnormal behavior of the moving object. It is preferable that these inferences also take into account interactions between vehicles and pedestrians, between vehicles, etc.
 〔ソフトウェアによる実現例〕
 情報処理装置1~3の一部又は全部の機能は、集積回路(ICチップ)等のハードウェアによって実現してもよいし、ソフトウェアによって実現してもよい。
[Software implementation example]
Some or all of the functions of the information processing devices 1 to 3 may be realized by hardware such as an integrated circuit (IC chip), or may be realized by software.
 後者の場合、情報処理装置1~3は、例えば、各機能を実現するソフトウェアであるプログラム(推論プログラム/学習プログラム)の命令を実行するコンピュータによって実現される。このようなコンピュータの一例(以下、コンピュータCと記載する)を図9に示す。コンピュータCは、少なくとも1つのプロセッサC1と、少なくとも1つのメモリC2と、を備えている。メモリC2には、コンピュータCを情報処理装置1~3の何れかとして動作させるためのプログラムPが記録されている。コンピュータCにおいて、プロセッサC1は、プログラムPをメモリC2から読み取って実行することにより、情報処理装置1~3の何れかの機能が実現される。 In the latter case, information processing devices 1 to 3 are realized, for example, by a computer that executes instructions of a program (inference program/learning program), which is software that realizes each function. An example of such a computer (hereinafter referred to as computer C) is shown in Figure 9. Computer C has at least one processor C1 and at least one memory C2. Memory C2 stores program P for operating computer C as any one of information processing devices 1 to 3. In computer C, processor C1 reads and executes program P from memory C2, thereby realizing the function of any one of information processing devices 1 to 3.
 プロセッサC1としては、例えば、CPU(Central Processing Unit)、GPU(Graphic Processing Unit)、DSP(Digital Signal Processor)、MPU(Micro Processing Unit)、FPU(Floating point number Processing Unit)、PPU(Physics Processing Unit)、マイクロコントローラ、又は、これらの組み合わせなどを用いることができる。メモリC2としては、例えば、フラッシュメモリ、HDD(Hard Disk Drive)、SSD(Solid State Drive)、又は、これらの組み合わせなどを用いることができる。 The processor C1 may be, for example, a CPU (Central Processing Unit), GPU (Graphic Processing Unit), DSP (Digital Signal Processor), MPU (Micro Processing Unit), FPU (Floating point number Processing Unit), PPU (Physics Processing Unit), microcontroller, or a combination of these. The memory C2 may be, for example, a flash memory, HDD (Hard Disk Drive), SSD (Solid State Drive), or a combination of these.
 なお、コンピュータCは、プログラムPを実行時に展開したり、各種データを一時的に記憶したりするためのRAM(Random Access Memory)を更に備えていてもよい。また、コンピュータCは、他の装置との間でデータを送受信するための通信インタフェースを更に備えていてもよい。また、コンピュータCは、キーボードやマウス、ディスプレイやプリンタなどの入出力機器を接続するための入出力インタフェースを更に備えていてもよい。 Computer C may further include a RAM (Random Access Memory) for expanding program P during execution and for temporarily storing various data. Computer C may further include a communications interface for sending and receiving data to and from other devices. Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.
 また、プログラムPは、コンピュータCが読み取り可能な、一時的でない有形の記録媒体Mに記録することができる。このような記録媒体Mとしては、例えば、テープ、ディスク、カード、半導体メモリ、又はプログラマブルな論理回路などを用いることができる。コンピュータCは、このような記録媒体Mを介してプログラムPを取得することができる。また、プログラムPは、伝送媒体を介して伝送することができる。このような伝送媒体としては、例えば、通信ネットワーク、又は放送波などを用いることができる。コンピュータCは、このような伝送媒体を介してプログラムPを取得することもできる。 The program P can also be recorded on a non-transitory, tangible recording medium M that can be read by the computer C. Such a recording medium M can be, for example, a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit. The computer C can obtain the program P via such a recording medium M. The program P can also be transmitted via a transmission medium. Such a transmission medium can be, for example, a communications network or broadcast waves. The computer C can also obtain the program P via such a transmission medium.
 〔付記事項1〕
 本発明は、上述した実施形態に限定されるものでなく、請求項に示した範囲で種々の変更が可能である。例えば、上述した実施形態に開示された技術的手段を適宜組み合わせて得られる実施形態についても、本発明の技術的範囲に含まれる。
[Additional Note 1]
The present invention is not limited to the above-described embodiment, and various modifications are possible within the scope of the claims. For example, embodiments obtained by appropriately combining the technical means disclosed in the above-described embodiment are also included in the technical scope of the present invention.
 〔付記事項2〕
 上述した実施形態の一部又は全部は、以下のようにも記載され得る。ただし、本発明は、以下の記載する態様に限定されるものではない。
[Additional Note 2]
Some or all of the above-described embodiments can be described as follows. However, the present invention is not limited to the aspects described below.
 (付記1)
 所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記対象物の検出位置を示す位置情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成する特徴量生成手段と、前記特徴量に基づいて前記対象物に関する所定の推論を行う推論手段と、を備える情報処理装置。
(Appendix 1)
An information processing device comprising: a feature generation means for generating features corresponding to the context of an object appearing in a plurality of frame images extracted from a video capturing an object moving along a specified context, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image; and an inference means for making a specified inference regarding the object based on the features.
 (付記2)
 前記特徴量生成手段は、前記コンテキストと同一または類似のコンテキストに沿って移動する物体が撮影されたタイミングを示す時刻情報、および、当該タイミングで撮影された画像における前記物体の検出位置を示す位置情報と、当該タイミングにおける前記物体の前記コンテキストに応じた特徴量との関係を学習した特徴量生成モデルを用いて前記特徴量を生成する、付記1に記載の情報処理装置。
(Appendix 2)
The information processing device described in Appendix 1, wherein the feature generation means generates the feature using a feature generation model that learns the relationship between time information indicating a timing at which an object moving in accordance with a context identical or similar to the context was photographed, position information indicating the detection position of the object in the image photographed at that timing, and feature values corresponding to the context of the object at that timing.
 (付記3)
 前記複数のフレーム画像からの前記対象物の検出結果に基づいて当該対象物が移動した軌跡を示す軌跡データを生成する軌跡データ生成手段と、前記軌跡データと前記特徴量生成手段が生成した特徴量とを統合して統合データを生成する統合手段と、を備え、前記特徴量生成手段は、前記軌跡データから抽出した前記位置情報と前記時刻情報とを用いて前記特徴量を生成し、前記推論手段は、前記統合データを用いて前記推論を行う、付記1または2に記載の情報処理装置。
(Appendix 3)
3. The information processing device according to claim 1, further comprising: a trajectory data generation means for generating trajectory data indicating a trajectory of movement of an object based on a detection result of the object from the plurality of frame images; and an integration means for integrating the trajectory data and a feature generated by the feature generation means to generate integrated data, wherein the feature generation means generates the feature using the position information and the time information extracted from the trajectory data, and the inference means performs the inference using the integrated data.
 (付記4)
 前記物体の移動における前記コンテキストと、前記対象物の移動における前記コンテキストとの差異を吸収するように、前記特徴量の生成に用いる時刻情報および位置情報の少なくとも何れかを調整する調整手段を備える、付記2に記載の情報処理装置。
(Appendix 4)
3. The information processing device according to claim 2, further comprising an adjustment means for adjusting at least one of time information and position information used in generating the feature amount so as to absorb a difference between the context in the movement of the object and the context in the movement of the target object.
 (付記5)
 前記物体と前記対象物との差異、および、前記物体の周囲の環境と前記対象物の周囲の環境との差異の少なくとも何れかに基づき、前記物体の移動における前記コンテキストと、前記対象物の移動における前記コンテキストとの差異を特定する差異特定手段を備える、付記4に記載の情報処理装置。
(Appendix 5)
An information processing device as described in Appendix 4, comprising a difference identification means for identifying a difference between the context in the movement of the object and the context in the movement of the target object based on at least one of a difference between the object and the target object and a difference between an environment surrounding the object and an environment surrounding the target object.
 (付記6)
 少なくとも1つのプロセッサが、所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像における前記対象物の検出位置を示す位置情報と、当該フレーム画像が撮影されたタイミングを示す時刻情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成することと、前記特徴量に基づいて前記対象物に関する所定の推論を行うことと、を含む推論方法。
(Appendix 6)
An inference method comprising: at least one processor generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, features of the object appearing in the frame images corresponding to the context, using position information indicating the detection position of the object in the frame images and time information indicating the timing when the frame images were captured; and making a predetermined inference regarding the object based on the features.
 (付記7)
 コンピュータを所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記対象物の検出位置を示す位置情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成する特徴量生成手段、および前記特徴量に基づいて前記対象物に関する所定の推論を行う推論手段、として機能させる推論プログラム。
(Appendix 7)
An inference program that causes a computer to function as a feature generation means that generates features corresponding to the context of an object appearing in a plurality of frame images extracted from a video of an object moving along a specified context, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image, and an inference means that makes a specified inference about the object based on the features.
 (付記8)
 少なくとも1つのプロセッサが、所定のコンテキストに沿って移動する物体を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記物体の検出位置を示す位置情報とを、前記コンテキストに応じた特徴量を生成するための特徴量生成モデルに入力することにより算出された前記特徴量に基づいて前記物体に関する所定の推論を行うことと、前記推論の結果が所定の正解データに近付くように前記特徴量生成モデルを更新することと、を含む特徴量生成モデルの生成方法。
(Appendix 8)
A method for generating a feature generation model, comprising: at least one processor inputting, for each of a plurality of frame images extracted from a video capturing an object moving in accordance with a predetermined context, time information indicating the timing at which the frame image was captured and positional information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, thereby making a predetermined inference about the object based on the calculated features; and updating the feature generation model so that the result of the inference approaches predetermined correct answer data.
 (付記9)
 少なくとも1つのプロセッサが前記複数のフレーム画像間の類似度を算出することを含み、前記特徴量生成モデルの更新において、前記複数のフレーム画像間の類似度が、当該フレーム画像について前記特徴量生成モデルにより生成される特徴量間の類似度に反映されるように前記特徴量生成モデルを更新する、付記8に記載の特徴量生成モデルの生成方法。
(Appendix 9)
9. The method for generating a feature generation model described in Appendix 8, further comprising: at least one processor calculating a similarity between the plurality of frame images; and updating the feature generation model such that the similarity between the plurality of frame images is reflected in a similarity between features generated by the feature generation model for the frame images, in updating the feature generation model.
 〔付記事項3〕
 上述した実施形態の一部又は全部は、更に、以下のように表現することもできる。少なくとも1つのプロセッサを備え、前記プロセッサは、所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記対象物の検出位置を示す位置情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成する処理と、前記特徴量に基づいて前記対象物に関する所定の推論を行う処理とを実行する情報処理装置。
[Additional Note 3]
Some or all of the above-described embodiments can also be expressed as follows: An information processing device including at least one processor, which executes, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a process of generating a feature amount of the object appearing in the frame image according to the context using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image, and a process of making a predetermined inference regarding the object based on the feature amount.
 なお、この情報処理装置は、更にメモリを備えていてもよく、このメモリには、前記特徴量を生成する処理と、前記所定の推論を行う処理とを前記プロセッサに実行させるための推論プログラムが記憶されていてもよい。また、この推論プログラムは、コンピュータ読み取り可能な一時的でない有形の記録媒体に記録されていてもよい。 The information processing device may further include a memory, and the memory may store an inference program for causing the processor to execute the process of generating the feature amount and the process of performing the predetermined inference. The inference program may also be recorded on a computer-readable, non-transitory, tangible recording medium.
1   情報処理装置
11  推論部
12  学習部
2   情報処理装置
21  特徴量生成部
22  推論部
3   情報処理装置
311 軌跡データ
312 特徴量生成モデル
302 軌跡データ生成部
303 特徴量生成部
304 統合部
305 推論部
306 学習部
307 差異特定部
308 調整部

 
Reference Signs List 1 Information processing device 11 Inference unit 12 Learning unit 2 Information processing device 21 Feature amount generation unit 22 Inference unit 3 Information processing device 311 Trajectory data 312 Feature amount generation model 302 Trajectory data generation unit 303 Feature amount generation unit 304 Integration unit 305 Inference unit 306 Learning unit 307 Difference identification unit 308 Adjustment unit

Claims (9)

  1.  所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記対象物の検出位置を示す位置情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成する特徴量生成手段と、
     前記特徴量に基づいて前記対象物に関する所定の推論を行う推論手段と、を備える情報処理装置。
    a feature generating means for generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image;
    and an inference means for performing a predetermined inference regarding the object based on the feature amount.
  2.  前記特徴量生成手段は、前記コンテキストと同一または類似のコンテキストに沿って移動する物体が撮影されたタイミングを示す時刻情報、および、当該タイミングで撮影された画像における前記物体の検出位置を示す位置情報と、当該タイミングにおける前記物体の前記コンテキストに応じた特徴量との関係を学習した特徴量生成モデルを用いて前記特徴量を生成する、請求項1に記載の情報処理装置。 The information processing device according to claim 1, wherein the feature generation means generates the feature using a feature generation model that learns the relationship between time information indicating the timing at which an object moving along a context identical to or similar to the context was photographed, position information indicating the detection position of the object in the image photographed at that timing, and the feature of the object according to the context at that timing.
  3.  前記複数のフレーム画像からの前記対象物の検出結果に基づいて当該対象物が移動した軌跡を示す軌跡データを生成する軌跡データ生成手段と、
     前記軌跡データと前記特徴量生成手段が生成した特徴量とを統合して統合データを生成する統合手段と、を備え、
     前記特徴量生成手段は、前記軌跡データから抽出した前記位置情報と前記時刻情報とを用いて前記特徴量を生成し、
     前記推論手段は、前記統合データを用いて前記推論を行う、請求項1または2に記載の情報処理装置。
    a trajectory data generating means for generating trajectory data indicating a trajectory of the movement of the object based on a detection result of the object from the plurality of frame images;
    an integration unit that integrates the trajectory data and the feature generated by the feature generating unit to generate integrated data,
    the feature generating means generates the feature using the position information and the time information extracted from the trajectory data;
    The information processing apparatus according to claim 1 , wherein the inference means performs the inference using the integrated data.
  4.  前記物体の移動における前記コンテキストと、前記対象物の移動における前記コンテキストとの差異を吸収するように、前記特徴量の生成に用いる時刻情報および位置情報の少なくとも何れかを調整する調整手段を備える、請求項2に記載の情報処理装置。 The information processing device according to claim 2, further comprising an adjustment means for adjusting at least one of the time information and the position information used to generate the feature amount so as to absorb the difference between the context in the movement of the object and the context in the movement of the target object.
  5.  前記物体と前記対象物との差異、および、前記物体の周囲の環境と前記対象物の周囲の環境との差異の少なくとも何れかに基づき、前記物体の移動における前記コンテキストと、前記対象物の移動における前記コンテキストとの差異を特定する差異特定手段を備える、請求項4に記載の情報処理装置。 The information processing device according to claim 4, further comprising a difference identification means for identifying a difference between the context in the movement of the object and the context in the movement of the target object based on at least one of a difference between the object and the target object and a difference between the environment surrounding the object and the environment surrounding the target object.
  6.  少なくとも1つのプロセッサが、
     所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像における前記対象物の検出位置を示す位置情報と、当該フレーム画像が撮影されたタイミングを示す時刻情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成することと、
     前記特徴量に基づいて前記対象物に関する所定の推論を行うことと、を含む推論方法。
    At least one processor
    For each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, generating a feature quantity of the object appearing in the frame image according to the context, using position information indicating a detection position of the object in the frame image and time information indicating a timing when the frame image was captured;
    and making a predetermined inference about the object based on the feature amount.
  7.  コンピュータを
     所定のコンテキストに沿って移動する対象物を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記対象物の検出位置を示す位置情報とを用いて、当該フレーム画像に写る前記対象物の前記コンテキストに応じた特徴量を生成する特徴量生成手段、および
     前記特徴量に基づいて前記対象物に関する所定の推論を行う推論手段、として機能させる推論プログラム。
    An inference program that causes a computer to function as: a feature generation means that generates, for each of a plurality of frame images extracted from a video of an object moving along a specified context, features corresponding to the context of the object appearing in the frame images, using time information indicating the timing at which the frame images were captured and position information indicating the detection position of the object in the frame images; and an inference means that makes a specified inference about the object based on the features.
  8.  少なくとも1つのプロセッサが、
     所定のコンテキストに沿って移動する物体を撮影した動画像から抽出された複数のフレーム画像のそれぞれについて、当該フレーム画像が撮影されたタイミングを示す時刻情報と、当該フレーム画像における前記物体の検出位置を示す位置情報とを、前記コンテキストに応じた特徴量を生成するための特徴量生成モデルに入力することにより算出された前記特徴量に基づいて前記物体に関する所定の推論を行うことと、
     前記推論の結果が所定の正解データに近付くように前記特徴量生成モデルを更新することと、を含む特徴量生成モデルの生成方法。
    At least one processor
    For each of a plurality of frame images extracted from a video capturing an object moving according to a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image are input to a feature generation model for generating features according to the context, thereby performing a predetermined inference regarding the object based on the calculated features;
    updating the feature generation model so that the result of the inference approaches predetermined correct answer data.
  9.  少なくとも1つのプロセッサが前記複数のフレーム画像間の類似度を算出することを含み、
     前記特徴量生成モデルの更新において、前記複数のフレーム画像間の類似度が、当該フレーム画像について前記特徴量生成モデルにより生成される特徴量間の類似度に反映されるように前記特徴量生成モデルを更新する、請求項8に記載の特徴量生成モデルの生成方法。

     
    at least one processor calculates a similarity between the plurality of frame images;
    9. The method for generating a feature generation model according to claim 8, further comprising updating the feature generation model so that a similarity between the plurality of frame images is reflected in a similarity between features generated by the feature generation model for the frame images.

PCT/JP2022/046044 2022-12-14 2022-12-14 Information processing device, inference method, inference program, and method for generating feature value generation model WO2024127554A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/046044 WO2024127554A1 (en) 2022-12-14 2022-12-14 Information processing device, inference method, inference program, and method for generating feature value generation model


Publications (1)

Publication Number Publication Date
WO2024127554A1 true WO2024127554A1 (en) 2024-06-20

Family

ID=91484702


Country Status (1)

Country Link
WO (1) WO2024127554A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021214994A1 (en) * 2020-04-24 2021-10-28 日本電気株式会社 Inspection system
JP7138264B1 (en) * 2021-10-08 2022-09-15 楽天グループ株式会社 Information processing device, information processing method, information processing system, and program



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22968466

Country of ref document: EP

Kind code of ref document: A1