CN112149503A - Target event detection method and device, electronic equipment and readable medium
- Publication number: CN112149503A
- Application number: CN202010845538.8A
- Authority: CN (China)
- Prior art keywords: target event, target, category, event, image
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/35—Categorising the entire scene, e.g. birthday party or wedding scene
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The embodiments of the present application disclose a target event detection method and device, electronic equipment, and a readable medium. An embodiment of the method comprises: acquiring an image to be detected; inputting the image to be detected into a pre-trained target event detection model to obtain the position of a target event area in the image to be detected and the category of the target event, wherein the target event detection model is obtained by training based on a single-stage full convolution target detection network; and determining whether the target event is a target event to be processed based on at least one of the position of the target event area and the category of the target event. This implementation realizes automatic detection of target events, reduces the labor cost of searching for target events to be processed, and improves the efficiency of that search.
Description
Technical Field
The embodiments of the present application relate to the field of computer technology, and in particular to a target event detection method and device, electronic equipment, and a readable medium.
Background
In recent years, with the rapid economic development of China, urbanization has accelerated, and phenomena such as drying goods along streets and stores illegally operating outside their premises can be seen everywhere in cities, causing great difficulties for city appearance, resident travel, and standardized commercial management. Therefore, such target events to be processed need to be actively found and stopped in order to ensure that residents can travel conveniently.
In the prior art, urban management law enforcement officers usually search manually for events such as drying goods along the street and operating outside the store, by means of on-site patrols, viewing surveillance video, and the like. This approach consumes a large amount of manpower and has low detection efficiency. In addition, conventional detection algorithms are generally only capable of detecting objects with a specific form, such as fruits or vegetables, whereas an outdoor business event generally has a complex composition and no specific form. Therefore, for the detection of target events such as business events, traditional detection algorithms usually cannot obtain accurate detection results.
Disclosure of Invention
The embodiments of the present application provide a target event detection method and device, electronic equipment, and a readable medium, so as to reduce the labor cost of searching for target events to be processed and improve the efficiency of that search.
In a first aspect, an embodiment of the present application provides a target event detection method, where the method includes: acquiring an image to be detected; inputting an image to be detected into a pre-trained target event detection model to obtain the position of a target event area and the category of a target event in the image to be detected, wherein the target event detection model is obtained based on single-stage full convolution target detection network training; determining whether the target event is a target event to be processed based on at least one of a location of the target event area and a category of the target event.
In a second aspect, an embodiment of the present application provides a target event detection apparatus, including: an acquisition unit configured to acquire an image to be measured; the input unit is configured to input the image to be detected into a pre-trained target event detection model, and obtain the position of a target event area and the category of a target event in the image to be detected, wherein the target event detection model is obtained based on single-stage full convolution target detection network training; a determination unit configured to determine whether the target event is a target event to be processed based on at least one of a location of the target event area and a category of the target event.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; storage means having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to carry out the method as described in the first aspect above.
In a fourth aspect, embodiments of the present application provide a computer-readable medium on which a computer program is stored, which when executed by a processor, implements the method as described in the first aspect above.
According to the target event detection method and device, the electronic equipment and the computer-readable medium provided by the embodiments, an image to be detected is input into a pre-trained target event detection model to obtain the position of the target event area in the image and the category of the target event, and then whether the target event is a target event to be processed is determined based on at least one of the position of the target event area and the category of the target event. On the one hand, this realizes automatic detection of target events, reduces the labor cost of searching for target events to be processed, and improves the search efficiency. On the other hand, the target event detection model is obtained by training based on a single-stage full convolution target detection network, and the full convolution target detection network is an anchor-free network, so positive and negative samples do not need to be judged according to the IoU (Intersection over Union) between anchor boxes and labeling boxes. Smaller sub-regions inside the target event region are therefore not mistaken for negative samples during training, which avoids conflicting and imbalanced positive and negative samples, allows the characteristics of the target event to be learned comprehensively, and improves the accuracy of target event detection.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow diagram of one embodiment of a target event detection method of the present application;
FIG. 2 is a schematic diagram of the comparison of the irregular polygon labeling of the present application with the conventional rectangular box labeling;
FIG. 3 is a schematic diagram of the architecture of a single-stage full convolution target detection network of the present application;
FIG. 4a is a schematic diagram of a positive and negative sample decision rule of a conventional target detection network;
FIG. 4b is a schematic diagram of an improved full convolution target detection network positive and negative sample decision rule of the present application;
FIG. 5 is a schematic block diagram of one embodiment of a target event detection apparatus according to the present application;
FIG. 6 is a schematic structural diagram of a computer system for implementing an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Referring to FIG. 1, a flow 100 of one embodiment of a target event detection method according to the present application is shown. The target event detection method can be applied to various electronic devices, such as mobile phones, tablet computers, laptop computers, vehicle-mounted computers, desktop computers, and wearable devices.
The target event detection method comprises the following steps:

Step 101, acquiring an image to be detected.
In this embodiment, the execution body of the target event detection method (the electronic device described above) may acquire an image to be detected. The image to be detected is an image on which target event detection is to be performed, such as a street image captured by an image acquisition device (e.g., a camera), an image of the area outside a store, and so on.
In this embodiment, the target event may include, but is not limited to, an event of conducting business, an event of drying goods along a street, and the like. Taking an event of conducting business as an example, it can include, but is not limited to, setting up a stall, setting up a fruit stall, setting up a vegetable stall, selling small commodities, applying protective films, and the like. Some target events are illegal and need further handling (these may be referred to as target events to be processed). Taking an event of conducting business as the target event, a target event to be processed may refer to conducting business in an area where business is not allowed, for example (but not limited to) setting up a stall, a vegetable stall, or a fruit stall, selling small commodities, or applying protective films outside a store, at the roadside, on a sidewalk, and so on.
In some scenarios, the relevant law enforcement officers can install an image acquisition device in advance near an area that needs to be monitored for illegal behaviors (such as illegal business operations or drying goods along the street), so that the image acquisition device can capture images of that area. The execution body can be communicatively connected with the image acquisition device, so that the images to be detected captured by the image acquisition device can be acquired in real time.
Step 102, inputting the image to be detected into a pre-trained target event detection model to obtain the position of a target event area in the image to be detected and the category of the target event.

In this embodiment, the execution body may run a pre-trained target event detection model. The target event detection model can identify the position of a target event area in an image and the category of the target event. The execution body may input the image to be detected obtained in step 101 into the target event detection model to obtain the position of the target event area in the image and the category of the target event. The position of the target event area may be a rectangular box surrounding the target event area.
The target event area in the image under test may refer to an area in which a feature of the target event is present. For example, when the target event is a business event, the characteristics of the target event may include, but are not limited to, selling fruits, selling vegetables, selling small commodities, and the like, and accordingly, the target event area may be a rectangular area where a fruit stall is located, a rectangular area where a vegetable stall is located, a rectangular area where a small commodity stall is located, and the like.
The categories of target events may be divided in advance as needed. For example, when the target event is a business event, it may be classified according to the category of goods being sold, for example into fruit-stall target events, vegetable-stall target events, small-commodity target events, and the like, without being limited to this classification manner.
In this embodiment, the target event detection model may be obtained by training based on a single-stage fully convolutional object detection network (FCOS), specifically by training the fully convolutional detection network with a machine learning method (such as supervised learning). The fully convolutional detection network is an anchor-free network: it performs regression directly on the feature points of a feature map, maps each feature point back into the original image, and takes the distances from the mapped position to the left, top, right, and bottom borders of the labeling box (Ground Truth, GT) as the regression targets for the bounding box (Bounding Box, BB). If the position mapped into the original image falls within the labeling box, the feature point is treated as a positive sample; otherwise it is treated as a negative sample. In this embodiment, the labeling box may be the circumscribed rectangle of the target event outline.
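To make the regression target concrete, the following Python sketch (an illustration of the description above, not the patent's actual implementation) assigns one mapped point: the point is a positive sample if it falls inside the labeling box, and its regression target is the four distances to the box borders.

```python
def fcos_regression_target(px, py, box):
    """Simplified FCOS-style target assignment for one mapped point.

    px, py : coordinates of a feature point mapped back into the original image
    box    : labeling box (Ground Truth) given as (x1, y1, x2, y2)

    Returns (is_positive, (l, t, r, b)): the point is a positive sample if it
    lies inside the labeling box, and its regression target is the distance
    from the point to the left, top, right and bottom borders of the box.
    """
    x1, y1, x2, y2 = box
    l, t = px - x1, py - y1
    r, b = x2 - px, y2 - py
    is_positive = l > 0 and t > 0 and r > 0 and b > 0
    return is_positive, (l, t, r, b)
```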
Existing conventional target detection models are usually anchor-based. These models typically place dense anchor boxes over the input image. During training, the IoU between every anchor box and every labeling box must be computed, which is computationally expensive. In addition, during training of a conventional target detection model, an anchor box whose IoU is below a certain threshold (e.g., 0.3) is marked as a negative sample, and an anchor box whose IoU is above a certain threshold (e.g., 0.7) is marked as a positive sample. However, this way of dividing positive and negative samples is not suitable for target event detection. A target event area has a complex composition and is generally not an object with a fixed form. For example, a local sub-area of a fruit stall area in an image (i.e., a local part of the fruit stall) also matches the characteristics of a fruit-selling event, but the IoU between an anchor box covering that sub-area and the labeling box is generally small, so the sub-area is usually mistaken for a negative sample, and the sample category used during learning conflicts with the true sample category. Meanwhile, because such local sub-areas are usually mistaken for negative samples, most anchor boxes are marked as negative samples and only a few as positive samples, resulting in an imbalance of positive and negative samples. As a result, the model cannot accurately and comprehensively learn the characteristics of the target event region, and the trained model cannot accurately detect target events.
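For contrast, the conventional anchor-based assignment described above can be sketched as follows; the 0.3 and 0.7 thresholds are the examples from the text, and the function names are illustrative. An anchor box covering only a small sub-area of a fruit stall yields a low IoU with the stall's labeling box and is therefore marked as a negative sample.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def anchor_label(anchor, labeling_box, neg_thresh=0.3, pos_thresh=0.7):
    """Conventional anchor-based assignment: positive above pos_thresh,
    negative below neg_thresh, otherwise ignored during training."""
    overlap = iou(anchor, labeling_box)
    if overlap > pos_thresh:
        return "positive"
    if overlap < neg_thresh:
        return "negative"
    return "ignore"
```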
In the present application, the target event detection model is trained using a fully convolutional target detection network. Since this network is anchor-free and does not depend on anchor boxes, complex anchor-related computation, such as computing the IoU (Intersection over Union) between anchor boxes and labeling boxes, is avoided, and training memory is significantly reduced. At the same time, because no anchor boxes are used, anchor-related hyper-parameters, such as the IoU thresholds for judging positive and negative samples, do not need to be set, which reduces the influence of hyper-parameters on detection performance. Specifically, positive and negative samples do not need to be judged according to the IoU between anchor boxes and labeling boxes: if the region in the original image corresponding to a feature point falls within a labeling box, it is marked as a positive sample, and its category is marked as the category of that labeling box. Therefore, small sub-areas inside the target event area are not mistaken for negative samples during training, conflicts and imbalance between positive and negative samples are avoided, the characteristics of the target event can be learned comprehensively, and the accuracy of target event detection is improved.
In some optional implementations of this embodiment, the target event detection model may be pre-trained through sub-steps S11 to S14 as follows:
in sub-step S11, a sample set is obtained.
Here, the sample set may include a large number of sample images. The sample image may carry an annotation indicating the outline of the target event and an annotation indicating the category of the target event.
Alternatively, the annotation used to indicate the contour of the target event may be an irregular polygon annotation. As an example, FIG. 2 shows a comparison between irregular polygon labeling and conventional rectangular box labeling. As shown in FIG. 2, with conventional rectangular box labeling, because the target has an irregular shape, many regions inside the rectangular box that do not belong to the target are erroneously determined as positive samples (i.e., false positive samples), such as the regions located inside the conventional rectangular label but outside the irregular polygon label in FIG. 2. Irregular polygon labeling avoids this problem.
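When irregular polygon labels are used, whether a mapped point belongs to the labeled target event can be decided with an ordinary point-in-polygon test. The sketch below uses standard ray casting and is only one possible way to consume such labels; the patent does not prescribe a specific test.

```python
def point_in_polygon(px, py, polygon):
    """Ray-casting test: returns True if (px, py) lies inside the polygon,
    where polygon is a list of (x, y) vertices of the irregular label."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        crosses = (y1 > py) != (y2 > py)          # edge straddles the horizontal ray
        if crosses and px < (x2 - x1) * (py - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside
```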
And a substep S12, constructing a single-stage full convolution target detection network.
The structure of an existing full convolution target detection network can be improved to obtain the single-stage full convolution target detection network, i.e., an improved full convolution target detection network.
As an example, an existing fully convolutional target detection network may include a backbone network, a position regression network, and a classification network that contains a center-ness branch structure. Here, the center-ness (center point prediction) branch of the full convolution target detection network may be removed to obtain the single-stage full convolution target detection network. The center-ness branch makes the model down-weight points near the boundary of a region, so that non-central parts of the target event area produce relatively low responses, which harms the overall consistency and local consistency of the target event area. Removing the center-ness branch for event detection therefore improves the performance of the model.
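A minimal PyTorch-style sketch of the modified detection head is shown below. It follows the description of two 1 × 1 convolution layers per branch and omits the center-ness branch; the channel count, activation placement, and class count are illustrative assumptions rather than values given by the patent.

```python
import torch.nn as nn

class EventDetectionHead(nn.Module):
    """Classification + box-regression head without a center-ness branch."""

    def __init__(self, in_channels=256, num_classes=3):
        super().__init__()
        # classification branch: two 1x1 convolution layers
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, num_classes, kernel_size=1),
        )
        # position-regression branch: two 1x1 convolution layers,
        # predicting the (l, t, r, b) distances for every feature point
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4, kernel_size=1),
        )

    def forward(self, feature_map):
        cls_logits = self.cls_branch(feature_map)   # per-point category scores
        bbox_reg = self.reg_branch(feature_map)     # per-point (l, t, r, b)
        return cls_logits, bbox_reg
```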
Optionally, the backbone network may include a deep convolutional neural network. The position regression network is a convolutional neural network comprising two layers of convolution kernels of 1 × 1 size. The classification network is a convolutional neural network comprising two layers of convolution kernels of 1 x 1 size.
As an example, a schematic structural diagram of the single-stage full convolution target detection network is shown in FIG. 3, which includes a backbone network, a position regression network, and a classification network. After the image to be detected is input into the backbone network, a target event feature map is obtained. The target event feature map is then input into the position regression network and the classification network respectively, to obtain the position of the target event area in the image to be detected and the category of the target event.
In practice, the target event feature map may be a feature map with multiple scales, such as 5-scale feature maps, which are respectively denoted as P3, P4, P5, P6, and P7. Wherein, P3, P4, and P5 may be obtained by performing convolution with a size of 1 × 1 on three feature maps (which may be respectively referred to as C3, C4, and C5) generated by a deep convolutional neural network, P6 may be obtained by performing convolution operation with a step size of 2 on P5 (which may be referred to as downsampling), and P7 may be obtained by performing convolution operation with a step size of 2 on P6. The number and the scale of the feature maps are not limited in this embodiment.
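The construction of the multi-scale feature maps can be sketched roughly as follows. The patent only specifies 1 × 1 convolutions on C3-C5 and stride-2 convolutions for P6 and P7; the output channel count and the 3 × 3 kernel of the downsampling convolutions are assumptions.

```python
import torch.nn as nn

class EventFeaturePyramid(nn.Module):
    """Builds P3-P7 from backbone feature maps C3-C5 (illustrative sketch)."""

    def __init__(self, c3_ch, c4_ch, c5_ch, out_ch=256):
        super().__init__()
        # 1x1 convolutions applied to C3, C4 and C5 give P3, P4 and P5
        self.lateral3 = nn.Conv2d(c3_ch, out_ch, kernel_size=1)
        self.lateral4 = nn.Conv2d(c4_ch, out_ch, kernel_size=1)
        self.lateral5 = nn.Conv2d(c5_ch, out_ch, kernel_size=1)
        # stride-2 convolutions (downsampling) give P6 from P5 and P7 from P6
        self.down6 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.down7 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        p3, p4, p5 = self.lateral3(c3), self.lateral4(c4), self.lateral5(c5)
        p6 = self.down6(p5)
        p7 = self.down7(p6)
        return p3, p4, p5, p6, p7
```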
It should be noted that, when improving the existing full convolution target detection network, the division rule for positive and negative samples during training and the feature map mapping rule used in that division can also be modified.
Specifically, the prior art is generally only applicable to object detection, which takes a complete object (e.g., a vehicle) in an image as a positive sample, while an internal region of the object (e.g., a wheel) is not taken as a positive sample; in other words, the complete object is a positive sample and a local part of it is a negative sample. Using an existing network directly for event detection would therefore make only the whole event (e.g., the whole booth) a positive sample and a local part of the event (e.g., a basket of fruit in the booth) a negative sample. However, for event detection, not only the whole event (e.g., the whole booth) but also a local part of the event (e.g., a basket of fruit in the booth) needs to be treated as a positive sample, so the division rule for positive and negative samples needs to be reset to satisfy this requirement.
In the existing fully convolutional target detection network, large objects are generally regressed on small-scale feature maps (i.e., deep feature maps, such as P7) and small objects on large-scale feature maps (i.e., shallow feature maps, such as P3), so for event detection positive samples would only be detected on the deep feature maps, and it would be difficult to determine a local part of the whole event (e.g., a basket of fruit in the booth) as a positive sample. To solve this problem, in this embodiment the rule is modified so that events are assigned to, and regressed on, feature maps of all scales (e.g., P3, P4, P5, P6, and P7). Positive sample detection can then be performed using feature maps of different scales, and not only the whole event but also a local region of the event can be determined as a positive sample. Meanwhile, for a given event, the corresponding points in every feature map can be regarded as positive samples, which increases the number of positive samples, further avoids imbalance of positive and negative samples, and improves the accuracy of the event detection result.
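The modified assignment rule can be illustrated with the sketch below: instead of routing a labeled event to a single pyramid level according to its size, the mapped points inside the labeling box are collected as positive samples on every level. The stride values and helper name are illustrative assumptions.

```python
def collect_positive_samples(label_box, feature_shapes, strides=(8, 16, 32, 64, 128)):
    """Collects positive samples for one labeled event on all pyramid levels
    (P3-P7), rather than only on the level matched to the event's size.

    label_box      : (x1, y1, x2, y2) labeling box in the original image
    feature_shapes : list of (height, width), one per feature map
    strides        : downsampling factor of each feature map
    """
    positives = []
    for level, ((h, w), stride) in enumerate(zip(feature_shapes, strides)):
        for fy in range(h):
            for fx in range(w):
                # map the feature point back to the original image
                # (center of its receptive field, as described in the text)
                px = fx * stride + stride // 2
                py = fy * stride + stride // 2
                if label_box[0] <= px <= label_box[2] and label_box[1] <= py <= label_box[3]:
                    positives.append((level, fx, fy))
    return positives
```

Collecting positives on every level is what allows both the whole booth and a basket of fruit inside it to be learned as positive samples.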
Compared with a traditional anchor-based target detection network, the single-stage full convolution target detection network obtained through the above improvement not only generates more positive samples but also effectively prevents the conflict between whole-event and local-event sample labels. As an example, FIG. 4a is a schematic diagram of the positive and negative sample decision rule of a conventional target detection network, and FIG. 4b is a schematic diagram of the decision rule of the improved full convolution target detection network. As shown in FIG. 4a, when the conventional target detection network decides positive and negative samples, the IoU between an anchor box on a local target event area (i.e., a local part of the fruit stall) and the labeling box of the whole target event area (i.e., the whole fruit stall area) is small, so the local target event area may be erroneously determined as a negative sample, which conflicts with the definition of the target event. As shown in FIG. 4b, when the improved full convolution target detection network decides positive and negative samples, the local target event area (i.e., the local part of the fruit stall) is still determined as a positive sample, thereby avoiding the above conflict.
In substep S13, the following training steps are iteratively performed: taking the sample image as the input of a single-stage full convolution target detection network, and determining a loss value based on the label in the sample image; and adjusting parameters of the single-stage full convolution target detection network based on the loss value.
During training, the sample images can be input one by one into the single-stage full convolution target detection network to obtain the detection result it outputs. The detection result may include the detected position of the target event area and the detected category of the target event. A loss value may then be determined based on the detection result and the annotations of the input sample image. The loss value characterizes the difference between the detection result and the actual result: the larger the loss value, the larger the difference. The loss value consists of two parts: the first part is determined based on the detected position of the target event area and the annotation indicating the contour of the target event, and the second part is determined based on the detected category of the target event and the annotation indicating the category of the target event. Both parts can be computed with conventional loss functions, and their sum is the loss value. The loss value is then used to update the parameters of the single-stage full convolution target detection network. Thus, each time a sample image is input, the parameters of the network can be updated once based on the loss value corresponding to that sample image, until training is completed.
Specifically, the loss value may be determined by:
in the first step, the sample image is input to a single-stage full-convolution target detection network to obtain feature maps of multiple scales (such as the above-mentioned P3, P4, P5, P6 and P7).
And secondly, mapping each feature point in each feature map to the sample image to obtain a corresponding point of each feature point in the sample image, and determining the corresponding point in the target event contour as a positive sample.
Since a feature map is the result of downsampling the sample image, each feature point, when mapped into the sample image, corresponds to the receptive field of a rectangular region. The corresponding point obtained by mapping a feature point into the sample image is the center point of that rectangular receptive field.
Because the feature points of the feature maps of every scale are mapped into the sample image for positive and negative sample division, compared with the prior-art approach of dividing positive and negative samples based on the feature points of a feature map of one specific scale, this method increases the number of positive samples and avoids imbalance between positive and negative samples.
And thirdly, acquiring the prediction type of each corresponding point, taking a positive sample of which the prediction type is any type indicated by the label as a target positive sample, and acquiring the prediction distance from each target positive sample to each frame of the target event prediction frame.
Here, after the sample image is input to the single-stage full-convolution target detection network, each feature point in the feature map of each scale can obtain one prediction category. For each feature point, the prediction category corresponding to the feature point is the prediction category of the corresponding point mapped to the sample image by the feature point. If the corresponding point is a positive sample, the target type (i.e. real type) should be the type labeled by the target event contour region in the sample image. In the initial stage of the training process, the prediction class and the target class usually have differences, and the differences can be gradually reduced in the training process.
Here, a positive sample whose prediction category is any of the categories indicated by the annotations may be taken as a target positive sample. For example, suppose the target event is a business event and three categories of business events A, B, and C are labeled in the sample image. If a feature point is mapped to a point in the sample image that is a positive sample, and its prediction category is any of A, B, or C, then that positive sample can be taken as a target positive sample. If the prediction category of the positive sample is none of A, B, or C, then the positive sample is not a target positive sample.
After each target positive sample is determined, the predicted distance from the target positive sample to each frame (including the upper frame, the lower frame, the left frame, and the right frame) of the target event prediction box can be obtained. In the initial stage of training, the target event prediction box and the actual labeling box (e.g., the circumscribed rectangle of the target event outline, which may be labeled when labeling the target event outline or may be automatically generated according to the target event outline) generally have a difference, and the difference may be gradually reduced in the training process.
And fourthly, determining a loss value based on the determined prediction type, the prediction distance and the label in the sample image.
Specifically, a circumscribed rectangle of the target event outline may be first used as a regression labeling box, and a target distance between each target positive sample and each frame (including an upper frame, a lower frame, a left frame, and a right frame) of the regression labeling box may be determined. The target distances are regression targets.
Then, a first loss value is determined based on the predicted distance and the target distance corresponding to each target positive sample. Here, the predicted distance and the target distance may be input to a predetermined loss function to obtain a first loss value.
Thereafter, a second loss value may be determined based on the prediction category for each corresponding point and the label for the category indicative of the target event in the sample image. Here, the prediction class and the label of the class indicating the target event can both be represented by a vector, so that both can be directly input to another preset loss function to obtain a second loss value.
Finally, a final loss value may be determined based on the first loss value and the second loss value. For example, the first loss value may be added to or weighted with the second loss value to obtain a final loss value.
It can be understood that if a certain feature point is not a target positive sample, that is, the corresponding point obtained by mapping it into the sample image is not a positive sample, or its prediction category is none of A, B, or C, then no first loss value is computed from the predicted distance and target distance for that feature point, and the loss associated with that feature point in the final loss value only includes its corresponding second loss value.
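Putting the four steps together, the loss for one sample image could be sketched as follows. The concrete loss functions (smooth L1 for regression, cross entropy for classification) are illustrative choices; the text only states that conventional loss functions may be used and that the two parts are summed.

```python
import torch.nn.functional as F

def detection_loss(pred_distances, target_distances, pred_logits, target_classes):
    """Combined loss for one batch of corresponding points (illustrative sketch).

    pred_distances / target_distances : (N_pos, 4) predicted and target (l, t, r, b)
                                        for the target positive samples
    pred_logits                       : (N, C) predicted category scores for all points
    target_classes                    : (N,) labeled category indices (background included)
    """
    # first loss: regression loss between predicted and target distances
    # (smooth L1 is used here as one conventional choice)
    first_loss = F.smooth_l1_loss(pred_distances, target_distances)
    # second loss: classification loss between predictions and category labels
    second_loss = F.cross_entropy(pred_logits, target_classes)
    # final loss: sum of the two parts (a weighted sum would also fit the description)
    return first_loss + second_loss
```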
And a substep S14, determining the single-stage full convolution target detection network after training as a target event detection model.
In practice, whether training is complete may be determined in a number of ways. As an example, training may be determined to be complete when the accuracy of the detection result output by the single-stage full convolution target detection network reaches a preset value (e.g., 99%). As yet another example, training may be determined to be complete if the number of times of training of the single-stage full convolution target detection network is equal to a preset number of times. Here, when it is determined that the training is completed, the trained single-stage full convolution target detection network may be determined as the target event detection model.
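The stop criteria can be expressed as a simple loop condition, as in the hedged sketch below; the accuracy threshold, iteration cap, and helper callables are placeholders, not details given by the patent.

```python
def train_until_done(network, sample_loader, evaluate, update,
                     max_epochs=100, target_accuracy=0.99):
    """Iterates the training step until either stop criterion from the text is met.
    `update` adjusts parameters based on the loss value for one sample image;
    `evaluate` measures detection accuracy on a validation set."""
    for epoch in range(max_epochs):                      # preset number of training rounds
        for sample_image, annotations in sample_loader:
            update(network, sample_image, annotations)   # one parameter update per sample
        if evaluate(network) >= target_accuracy:          # accuracy reaches the preset value
            break
    return network   # trained network, used as the target event detection model
```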
Step 103, determining whether the target event is a target event to be processed based on at least one of the position of the target event area and the category of the target event.
In this embodiment, the execution body may determine whether the target event is a target event to be processed based on the position of the target event area, based on the category of the target event, or based on both. If it is a target event to be processed, further handling can be applied to the persons involved, such as imposing public security penalties. If it is not a target event to be processed, the event can be regarded as legal and compliant, and no further processing is performed.
In one scenario, the image to be detected may cover multiple geographic areas (such as the area inside a store and the pedestrian area outside it), and in some of those areas (such as the pedestrian area outside the store) all target events are prohibited, so any target event occurring in such an area can be regarded as a target event to be processed. In this case, whether the target event is a target event to be processed can be determined based on the position of the target event area in the image. Specifically, a target area in the image to be detected may first be determined. For example, the image may show a store and the pedestrian area outside it, and the pedestrian area may be set as the target area. Then, the positional relationship between the target event area and the target area is detected, for example by checking whether the target event area is located inside the target area. In response to the target event area being located inside the target area, the target event is determined to be a target event to be processed; otherwise, it may be determined not to be a target event to be processed.
In another scenario, the image to be detected covers only one area, and certain categories of target events are allowed there, for example selling vegetables and fruits but not small commodities. In this case, whether the target event is a target event to be processed can be determined based on the detected category of the target event. For example, a set of to-be-processed target event categories corresponding to the image acquisition area may first be obtained; if this set includes the category of the target event determined in step 102, the target event may be determined to be a target event to be processed, and otherwise it may be determined to be a legal target event. Alternatively, a set of legal target event categories may be obtained; if this set includes the category of the target event determined in step 102, the target event may be determined to be a legal target event, and otherwise a target event to be processed.
In yet another scenario, the image to be detected may cover multiple geographic areas (such as the area inside a store and the pedestrian area outside it), and some areas (such as the pedestrian area outside the store) allow certain categories of target events, for example selling small commodities but not setting up a barbecue stall. In this case, a target area in the image to be detected (such as the pedestrian area outside the store) may first be determined; then the positional relationship between the target event area and the target area is detected; and in response to the target event area being located inside the target area, whether the target event is a target event to be processed is determined based on the category of the target event.
Optionally, determining whether the target event is a target event to be processed based on the category of the target event may include: acquiring a set of to-be-processed target event categories corresponding to the target area; and determining the target event to be a target event to be processed in response to the set including the category of the target event. Alternatively, it may include: acquiring a set of legal target event categories corresponding to the target area; and determining the target event to be a target event to be processed in response to the set not including the category of the target event.
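The decision logic of the three scenarios above can be summarized in one short sketch; the containment test and the two category sets are illustrative assumptions about how the configuration might be represented.

```python
def is_pending_event(event_box, event_category, target_region=None,
                     pending_categories=None, legal_categories=None):
    """Decides whether a detected target event needs further processing.

    event_box          : (x1, y1, x2, y2) position of the target event area
    event_category     : category output by the detection model
    target_region      : optional (x1, y1, x2, y2) area where events are restricted
    pending_categories : optional set of categories that must be processed
    legal_categories   : optional set of categories that are allowed
    """
    if target_region is not None:
        inside = (event_box[0] >= target_region[0] and event_box[1] >= target_region[1]
                  and event_box[2] <= target_region[2] and event_box[3] <= target_region[3])
        if not inside:
            return False                      # event lies outside the restricted area
        if pending_categories is None and legal_categories is None:
            return True                       # all events in the area are to be processed
    if pending_categories is not None:
        return event_category in pending_categories
    if legal_categories is not None:
        return event_category not in legal_categories
    return False
```

In practice the target area and the category sets would be configured per image acquisition area, as described above.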
According to the method provided by the embodiments of the present application, the image to be detected is input into the pre-trained target event detection model to obtain the position of the target event area in the image and the category of the target event, and then whether the target event is a target event to be processed is determined based on at least one of the position of the target event area and the category of the target event. On the one hand, this realizes automatic detection of target events, reduces the labor cost of searching for target events to be processed, and improves the search efficiency. On the other hand, the target event detection model is obtained by training based on the full convolution target detection network, which is an anchor-free network, so positive and negative samples do not need to be judged according to the IoU between anchor boxes and labeling boxes. Small sub-areas inside the target event area are therefore not mistaken for negative samples during training, conflicts and imbalance between positive and negative samples are avoided, the characteristics of the target event can be learned comprehensively, and the accuracy of target event detection is improved.
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of a target event detection apparatus, which corresponds to the method embodiment shown in fig. 1, and which can be applied in various electronic devices.
As shown in fig. 5, the target event detecting apparatus 500 of the present embodiment includes: an acquisition unit 501 configured to acquire an image to be measured; an input unit 502 configured to input the image to be detected into a pre-trained target event detection model, to obtain a position of a target event area in the image to be detected and a category of a target event, wherein the target event detection model is obtained based on a single-stage full convolution target detection network training; a determining unit 503 configured to determine whether the target event is a target event to be processed based on at least one of a position of the target event area and a category of the target event.
In some optional implementations of this embodiment, the target event detection model is obtained by pre-training through the following steps: obtaining a sample set, wherein the sample set comprises a sample image, and the sample image is provided with an annotation used for indicating the outline of a target event and an annotation used for indicating the category of the target event; constructing a single-stage full convolution target detection network; the following training steps are performed iteratively: determining a loss value based on the label in the sample image by taking the sample image as the input of the single-stage full convolution target detection network; adjusting parameters of the single-stage full convolution target detection network based on the loss value; and determining the single-stage full convolution target detection network after training as a target event detection model.
In some optional implementations of the present embodiment, the label for indicating the contour of the target event is an irregular polygon label.
In some optional implementations of this embodiment, determining the loss value based on the label in the sample image with the sample image as an input of the single-stage full-convolution target detection network includes: taking the sample image as the input of a single-stage full convolution target detection network to obtain characteristic graphs of multiple scales; mapping each feature point in each feature map to a sample image to obtain a corresponding point of each feature point in the sample image, and determining the corresponding point in the target event contour as a positive sample; acquiring the prediction type of each corresponding point, taking a positive sample of which the prediction type is any type indicated by the label as a target positive sample, and acquiring the prediction distance from each target positive sample to each frame of the target event prediction frame; determining a loss value based on the determined prediction category, the prediction distance, and the annotation in the sample image.
In some optional implementations of the embodiment, determining the loss value based on the determined prediction category, the prediction distance, and the label in the sample image includes: taking an external rectangular frame of the target event outline as a regression marking frame, and determining the target distance between each target positive sample and each frame of the regression marking frame; determining a first loss value based on the prediction distance and the target distance corresponding to each target positive sample; determining a second loss value based on the prediction category of each corresponding point and an annotation of a category in the sample image for indicating the target event; a final loss value is determined based on the first loss value and the second loss value.
In some optional implementations of this embodiment, the single-stage full convolution target detection network includes a backbone network, a position regression network, and a classification network, and the classification network does not include a center point prediction (center-ness) branch structure.
In some optional implementations of this embodiment, the backbone network is a deep convolutional neural network, the position regression network is a convolutional neural network including two layers of convolution kernels with a size of 1 × 1, and the classification network is a convolutional neural network including two layers of convolution kernels with a size of 1 × 1.
In some optional implementations of this embodiment, the determining unit 503 is further configured to: determining a target area in the image to be detected; detecting the position relation between the target event area and the target area; and determining whether the target event is a target event to be processed based on the category of the target event in response to the target event area being located in the target area.
In some optional implementations of this embodiment, the determining unit 503 is further configured to: acquire a set of to-be-processed target event categories corresponding to the target area; and determine the target event to be a target event to be processed in response to the set including the category of the target event.
According to the device provided by the embodiments of the present application, the image to be detected is input into the pre-trained target event detection model to obtain the position of the target event area in the image and the category of the target event, and then whether the target event is a target event to be processed is determined based on at least one of the position of the target event area and the category of the target event. On the one hand, this realizes automatic detection of target events, reduces the labor cost of searching for target events to be processed, and improves the search efficiency. On the other hand, the target event detection model is obtained by training based on the full convolution target detection network, which is an anchor-free network, so positive and negative samples do not need to be judged according to the IoU between anchor boxes and labeling boxes. Small sub-areas inside the target event area are therefore not mistaken for negative samples during training, conflicts and imbalance between positive and negative samples are avoided, the characteristics of the target event can be learned comprehensively, and the accuracy of target event detection is improved.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The units described may also be provided in a processor, where the names of the units do not in some cases constitute a limitation of the units themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present separately and not assembled into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquiring an image to be detected; inputting an image to be detected into a pre-trained target event detection model to obtain the position of a target event area and the category of a target event in the image to be detected, wherein the target event detection model is obtained based on full convolution target detection network training; determining whether the target event is a target event to be processed based on at least one of a location of the target event area and a category of the target event.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.
Claims (12)
1. A method for target event detection, the method comprising:
acquiring an image to be detected;
inputting the image to be detected into a pre-trained target event detection model to obtain the position of a target event area and the category of a target event in the image to be detected, wherein the target event detection model is obtained based on single-stage full convolution target detection network training;
determining whether the target event is a target event to be processed based on at least one of a location of the target event area and a category of the target event.
2. The method of claim 1, wherein the target event detection model is pre-trained by:
obtaining a sample set, wherein the sample set comprises a sample image with an annotation for indicating the outline of a target event and an annotation for indicating the category of the target event;
constructing a single-stage full convolution target detection network;
the following training steps are performed iteratively: taking the sample image as the input of the single-stage full convolution target detection network, and determining a loss value based on the label in the sample image; adjusting parameters of the single-stage full convolution target detection network based on the loss value;
and determining the trained single-stage fully convolutional target detection network as the target event detection model.
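A minimal sketch of the iterative training steps of claim 2 is given below. `build_detection_network` and `detection_loss` are assumed helpers standing in for the network construction and the annotation-based loss of claims 4-5; the epoch count, optimizer, and learning rate are illustrative choices, not limitations.

```python
# Minimal sketch of the training loop of claim 2. `build_detection_network` and
# `detection_loss` are assumed helpers; epoch count and learning rate are illustrative.
import torch

def train_target_event_detector(dataset, num_epochs: int = 12, lr: float = 1e-3):
    network = build_detection_network()                    # single-stage fully convolutional detector
    optimizer = torch.optim.SGD(network.parameters(), lr=lr, momentum=0.9)

    for _ in range(num_epochs):
        for images, annotations in dataset:                # annotations: event contours and categories
            outputs = network(images)                      # predicted categories and box regressions
            loss = detection_loss(outputs, annotations)    # loss value determined from the annotations
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                               # adjust parameters based on the loss value
    return network                                         # trained network used as the detection model
```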
3. The method of claim 2, wherein the annotation indicating the contour of the target event is an irregular-polygon annotation.
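As a purely illustrative example, an irregular-polygon annotation of this kind could be stored as follows; the field names and the category value are hypothetical and are not part of the claims.

```python
# Hypothetical annotation record: the target event contour is an irregular polygon
# (a list of vertices) stored together with the target event category.
sample_annotation = {
    "image": "sample_0001.jpg",
    "events": [
        {
            "category": "roadside_stall",                    # example category value
            "contour": [[120, 340], [205, 310], [260, 380],  # polygon vertices (x, y) in pixels
                        [230, 455], [140, 430]],
        }
    ],
}
```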
4. The method of claim 2 or 3, wherein taking the sample image as the input of the single-stage fully convolutional target detection network and determining a loss value based on the annotations in the sample image comprises:
taking the sample image as the input of the single-stage fully convolutional target detection network to obtain feature maps at multiple scales;
mapping each feature point in each feature map to the sample image to obtain a corresponding point of each feature point in the sample image, and determining the corresponding points located within the contour of the target event as positive samples;
acquiring the prediction category of each corresponding point, taking as a target positive sample each positive sample whose prediction category is any category indicated by the annotation, and acquiring the predicted distance from each target positive sample to each side of the target event prediction box;
and determining a loss value based on the determined prediction categories, the predicted distances, and the annotations in the sample image.
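The mapping of feature points to corresponding points and the selection of positive samples inside the contour, as recited above, can be sketched as follows. The cell-centre mapping with a per-level stride and the use of matplotlib's polygon test are assumptions made for illustration.

```python
# Sketch of claim 4's positive-sample selection: map each feature-map cell back to the
# sample image and keep the corresponding points that fall inside the annotated contour.
import numpy as np
from matplotlib.path import Path

def positive_points(feature_h: int, feature_w: int, stride: int, contour_xy):
    """Return the image-space corresponding points of one feature map lying in the contour."""
    ys, xs = np.meshgrid(np.arange(feature_h), np.arange(feature_w), indexing="ij")
    points = np.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], axis=-1).reshape(-1, 2)
    inside = Path(contour_xy).contains_points(points)       # irregular-polygon membership test
    return points[inside]                                   # these corresponding points are positives
```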
5. The method of claim 4, wherein determining a loss value based on the determined prediction categories, the predicted distances, and the annotations in the sample image comprises:
taking the circumscribed rectangle of the contour of the target event as a regression annotation box, and determining the target distance from each target positive sample to each side of the regression annotation box;
determining a first loss value based on the predicted distance and the target distance corresponding to each target positive sample;
determining a second loss value based on the prediction category of each corresponding point and the annotation indicating the category of the target event in the sample image;
determining a final loss value based on the first loss value and the second loss value.
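An illustrative composite loss in the spirit of claim 5 is sketched below: an IoU-style regression loss on the predicted distances to the four sides of the regression annotation box, plus a classification loss, summed into the final loss. The IoU formulation, the cross-entropy classification term, and the equal weighting are assumptions, not limitations of the claims.

```python
# Illustrative composite loss for claim 5. pred_dist / target_dist are (N, 4) distances
# (left, top, right, bottom) from each target positive sample to the box sides; the IoU
# regression loss, cross-entropy classification term, and equal weighting are assumptions.
import torch
import torch.nn.functional as F

def final_loss(pred_dist, target_dist, pred_logits, target_labels):
    pred_area = (pred_dist[:, 0] + pred_dist[:, 2]) * (pred_dist[:, 1] + pred_dist[:, 3])
    tgt_area = (target_dist[:, 0] + target_dist[:, 2]) * (target_dist[:, 1] + target_dist[:, 3])
    inter_w = torch.min(pred_dist[:, 0], target_dist[:, 0]) + torch.min(pred_dist[:, 2], target_dist[:, 2])
    inter_h = torch.min(pred_dist[:, 1], target_dist[:, 1]) + torch.min(pred_dist[:, 3], target_dist[:, 3])
    iou = inter_w * inter_h / (pred_area + tgt_area - inter_w * inter_h + 1e-6)

    first_loss = (-torch.log(iou + 1e-6)).mean()               # regression loss on predicted distances
    second_loss = F.cross_entropy(pred_logits, target_labels)  # classification loss on predicted categories
    return first_loss + second_loss                            # final loss value
```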
6. The method of any of claims 1-5, wherein the single-stage fully convolutional target detection network comprises a backbone network, a position regression network, and a classification network, and wherein the classification network does not include a center point prediction branch.
7. The method of claim 6, wherein the backbone network comprises a deep convolutional neural network, the position regression network is a convolutional neural network comprising two convolutional layers with 1 × 1 kernels, and the classification network is a convolutional neural network comprising two convolutional layers with 1 × 1 kernels.
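The head structure of claim 7 can be illustrated with the sketch below; the hidden channel width, the activation between the two layers, and the number of output categories are assumptions, since the claim only specifies two 1 × 1 convolutional layers per head.

```python
# Sketch of the claim 7 heads: the position regression network and the classification
# network as two stacked 1x1 convolutions on top of a deep CNN backbone. Channel widths,
# the ReLU between the layers, and the category count are illustrative assumptions.
import torch.nn as nn

def two_layer_1x1_head(in_channels: int, out_channels: int, hidden: int = 256) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, hidden, kernel_size=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(hidden, out_channels, kernel_size=1),
    )

regression_head = two_layer_1x1_head(256, 4)        # predicts distances to the four box sides
classification_head = two_layer_1x1_head(256, 20)   # per-category scores; no center point branch
```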
8. The method according to any one of claims 1-7, wherein determining whether the target event is a target event to be processed based on at least one of the position of the target event area and the category of the target event comprises:
determining a target area in the image to be detected;
detecting the positional relationship between the target event area and the target area;
and in response to the target event area being located within the target area, determining whether the target event is a target event to be processed based on the category of the target event.
9. The method of claim 8, wherein determining whether the target event is a target event to be processed based on the category of the target event comprises:
acquiring a set of to-be-processed target event categories corresponding to the target area;
and determining the target event as a target event to be processed in response to the category of the target event being included in the set of to-be-processed target event categories.
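The decision logic of claims 8 and 9 can be sketched as follows; representing both the target event area and the target area as axis-aligned boxes is an assumption made only for this example, since the target area could equally be an arbitrary region.

```python
# Sketch of claims 8-9: the target event is "to be processed" only if its area lies within
# the target area and its category is in the to-be-processed category set of that area.
# Representing both areas as axis-aligned (x1, y1, x2, y2) boxes is an assumption.
def is_event_to_be_processed(event_box, event_category, target_area_box, pending_categories):
    x1, y1, x2, y2 = event_box
    ax1, ay1, ax2, ay2 = target_area_box
    inside = x1 >= ax1 and y1 >= ay1 and x2 <= ax2 and y2 <= ay2   # positional relationship check
    return inside and event_category in pending_categories         # category membership check
```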
10. An apparatus for target event detection, the apparatus comprising:
an acquisition unit configured to acquire an image to be detected;
an input unit configured to input the image to be detected into a pre-trained target event detection model to obtain the position of a target event area and the category of a target event in the image to be detected, wherein the target event detection model is obtained by training based on a single-stage fully convolutional target detection network;
a determination unit configured to determine whether the target event is a target event to be processed based on at least one of the position of the target event area and the category of the target event.
11. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010845538.8A CN112149503A (en) | 2020-08-20 | 2020-08-20 | Target event detection method and device, electronic equipment and readable medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112149503A (en) | 2020-12-29 |
Family
ID=73887960
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010845538.8A Pending CN112149503A (en) | 2020-08-20 | 2020-08-20 | Target event detection method and device, electronic equipment and readable medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112149503A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909797A (en) * | 2019-11-22 | 2020-03-24 | 北京深睿博联科技有限责任公司 | Image detection method and device, equipment and storage medium |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113159209A (en) * | 2021-04-29 | 2021-07-23 | 深圳市商汤科技有限公司 | Target detection method, device, equipment and computer readable storage medium |
CN113159209B (en) * | 2021-04-29 | 2024-05-24 | 深圳市商汤科技有限公司 | Object detection method, device, equipment and computer readable storage medium |
CN113436141A (en) * | 2021-05-14 | 2021-09-24 | 紫东信息科技(苏州)有限公司 | Gastroscope image target detection method and device, electronic equipment and storage medium |
CN113609948A (en) * | 2021-07-29 | 2021-11-05 | 华侨大学 | Method, device and equipment for detecting video time sequence action |
CN113609948B (en) * | 2021-07-29 | 2023-09-05 | 华侨大学 | Method, device and equipment for detecting video time sequence action |
WO2023045602A1 (en) * | 2021-09-27 | 2023-03-30 | 杭州海康威视系统技术有限公司 | Image recognition method and electronic device |
CN114187488A (en) * | 2021-12-10 | 2022-03-15 | 北京百度网讯科技有限公司 | Image processing method, apparatus, device, medium, and program product |
CN114187488B (en) * | 2021-12-10 | 2023-11-17 | 北京百度网讯科技有限公司 | Image processing method, device, equipment and medium |
CN114913470A (en) * | 2022-07-11 | 2022-08-16 | 浙江大华技术股份有限公司 | Event detection method and device |
CN114913470B (en) * | 2022-07-11 | 2022-10-28 | 浙江大华技术股份有限公司 | Event detection method and device |
CN115115911A (en) * | 2022-07-13 | 2022-09-27 | 维沃移动通信(杭州)有限公司 | Image model training method and device and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112149503A (en) | Target event detection method and device, electronic equipment and readable medium | |
CN109977191B (en) | Problem map detection method, device, electronic equipment and medium | |
EP3637310A1 (en) | Method and apparatus for generating vehicle damage information | |
US20130226667A1 (en) | Methods and apparatus to analyze markets based on aerial images | |
CN111126252A (en) | Stall behavior detection method and related device | |
WO2022217630A1 (en) | Vehicle speed determination method and apparatus, device, and medium | |
CN112613569B (en) | Image recognition method, training method and device for image classification model | |
KC | Enhanced pothole detection system using YOLOX algorithm | |
CN113255580A (en) | Method and device for identifying sprinkled objects and vehicle sprinkling and leaking | |
CN114462469B (en) | Training method of target detection model, target detection method and related device | |
CN113627229A (en) | Object detection method, system, device and computer storage medium | |
CN111523558A (en) | Ship shielding detection method and device based on electronic purse net and electronic equipment | |
US10599946B2 (en) | System and method for detecting change using ontology based saliency | |
CN110909656B (en) | Pedestrian detection method and system integrating radar and camera | |
CN112270671A (en) | Image detection method, image detection device, electronic equipment and storage medium | |
CN113887455B (en) | Face mask detection system and method based on improved FCOS | |
CN116486635A (en) | Road dough fog detection and early warning method, system, storage medium and terminal | |
CN111931721B (en) | Method and device for detecting color and number of annual inspection label and electronic equipment | |
CN109598712A (en) | Quality determining method, device, server and the storage medium of plastic foam cutlery box | |
CN113869427A (en) | Scene analysis method and device, electronic equipment and storage medium | |
CN111428567B (en) | Pedestrian tracking system and method based on affine multitask regression | |
CN114120160B (en) | Object space distinguishing method and device based on fast-RCNN, computer equipment and storage medium | |
CN116843983A (en) | Pavement disease recognition method, model training method, electronic equipment and medium | |
Łubkowski et al. | Assessment of quality of identification of data in systems of automatic licence plate recognition | |
CN113269433B (en) | Tax risk prediction method, apparatus, medium and computer program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |