WO2022047342A1 - System and method for using deep neural networks for adding value to video streams - Google Patents
- Publication number: WO2022047342A1 (PCT/US2021/048300)
- Authority: WO / WIPO (PCT)
- Prior art keywords: video stream, information, video, neural network, server
- Prior art date
Classifications
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
- G06V20/597—Recognising the driver's state or behaviour, e.g. attention or drowsiness
- G06V40/161—Human faces: Detection; Localisation; Normalisation
- G06N3/08—Learning methods
Description
- the field of the invention relates generally to video image analysis and using deep neural networks for automatically adding value added annotations to the video streams.
- Video feeds for monitoring environments are commonly utilized in personal and commercial settings in a large variety of applications. For example, video feeds are used in monitoring warehouses to make sure that the premises are secured.
- Existing security systems for homes and commercial properties have been disclosed which provide a video feed from multiple video cameras, typically connected to a manual monitoring station, to facilitate observation by security personnel.
- Some home or commercial settings utilize cameras for monitoring the surroundings for the purposes of recording suspicious activities.
- What is needed is a system that can continually monitor the conditions in a video stream, for example one generated from the monitoring of a commercial restaurant kitchen or of a delivery vehicle’s occupants, and add value to the video stream by adding annotations that are meant to serve as an alert if certain protocols for observing basic hygiene are violated during food preparation or delivery.
- What is also needed is a system for rating establishments that conform to the health and safety guidelines in food preparation and delivery, using the rating as a mechanism for promoting these establishments and using market competitive forces to bring others in line, so they are also motivated to improve their compliance score. Additionally, what is needed is an ability for customers to have a system detect and report any noncompliance in real time and thereby empower consumers to force compliance by using the captured value-added video streams from the kitchen and delivery vehicles and reporting non-compliance to the establishment personnel.
- What is also needed are client applications that are further configured to make clips of the value added video streams, add comments about non-compliance, and forward these clips as emails or upload them to social media sites as a method of enhancing transparency and encouraging compliance. And what is needed is the ability to assign a score indicative of the extent to which a video stream complies with a set of predefined conditions, so customers can monitor trends.
- What is needed is a system that uses machine learning and artificial intelligence models for detecting at-risk behavior and annotates the video feed to alert the consumers and stakeholders.
- the disclosed application for utilizing a deep learning system for generating value added video streams has several practical applications.
- the system and method are disclosed for performing the steps of capturing video feeds, analyzing each frame with a deep neural network, and annotating the result of the analysis back onto the frame and the stream.
- One specific application of the invention uses video streams from a commercial kitchen establishment where the video stream is generated from collecting optical information from cameras monitoring the cooking area.
- field cameras traveling with the delivery are used by the establishment to monitor the transporting of prepared orders of food to the consumers.
- any establishment may also include video feeds from the food preparation facilities located remotely such as those used for serving specific needs like a bakery, or confectionary wing of the establishment.
- the client applications are further configured to make clips of the value added video streams and forward these clips as emails or upload to social media sites.
- Another aspect of the system and method disclosed is its ability to assign a score indicative of the extent to which a video stream complies with a set of predefined conditions.
- Embodiments utilize a multitude of deep neural network models, with each neural network dedicated to detecting a specific condition or a set of closely related conditions. Furthermore, the deep neural network models begin with a specific set of parameters which generally perform well at detecting a broad set of conditions, such as objects in a video frame. These deep neural network models can be fine-tuned by training on a specialized set of examples encountered in a specific setting. Embodiments of the invention begin with a general model and tune it using additional examples from the situations or scenarios being monitored.
- the system and method are used for performing the steps of capturing video feeds and adding annotations indicating noncompliance or other aspects of the video stream and making the annotated video streams available as “value added video streams” to subscribers over the Internet.
- the motivation of the system and processes disclosed is to create transparency from the monitored activities, like having transparency from kitchen to the consumer.
- video feeds from an end point such as a restaurant are collected and processed by an end point monitor or aggregator and transmitted to a server over wired or wireless network.
- the end point monitor or aggregator performs the function of combining several feeds into a single stream and using compression, error correction, and encryption to communicate the video stream securely and efficiently to the server for processing.
- the server is a computer system using deep neural network learning model for detecting if the personnel in the kitchen and the field, including the delivery personnel, are observing health protocols, such as wearing of face masks and hand gloves.
- the video stream, plus the information detected, is referred to as a “value added video stream,” or simply a “value added stream.”
- the server delivers the value added stream to a recipient subscriber who can take any necessary actions such as informing the establishment about the breach or non-compliance.
- application used by the client is used for searching, viewing, saving, and uploading the value added video streams.
- An embodiment further uses triangulation of information obtained from a plurality of cameras to detect if the workers are observing social distancing protocols while working in the commercial kitchen.
- the disclosed application helps enforce health protocols and is designed to be yet another precautionary measure in humanity’s fight against communicable infections like COVID-19 and to help protect against epidemiological outbreaks.
- An embodiment of the system further empowers consumers by allowing them to use client viewing application to report concerning behavior to the restaurant owners by capturing clips or plurality of snapshots of any of the value-added video streams attached to an electronic message sent to the establishment. Further, the client application enables the completion of a survey which is used to update a compliance score indicative of the user perspective of the extent to which the restaurant is complying with health protocols.
- An embodiment of the invention comprises a system for capturing video and adding annotations to the video stream to create a value added video stream which is then streamed to a plurality of subscribing client applications.
- the client applications are further configured to make clips of the value added video streams, save and forward these clips as emails, or upload to social media sites.
- Another aspect of the system and method disclosed is its ability to assign a score indicative of the extent to which a video stream complies with a set of predefined conditions.
- the deep learning neural network creates a value added stream by adding annotations on sections of video streams that pertain to personnel violating or adhering to protocols for health, safety, and hygiene.
- An embodiment measures the distance between the personnel on the floor to determine whether social distancing is being observed.
- An embodiment further alerts a human to confirm the model’s predicted violations for further validation.
- An embodiment of the system comprises a process where a plurality of input data is compiled into a calculation of a safety score where the safety score is reflective of the viewers’ perception of the level to which an establishment is adhering to a relevant set of health, safety and hygiene protocols.
- a process is disclosed where the list of establishments is searched from a database in a predefined order dependent upon the safety scores recorded for the establishments.
- the invention accordingly comprises several steps and the relation of one or more of such steps with respect to each other, and the apparatus embodying features of construction, combinations of elements and arrangement of parts that are adapted to effect such steps, all as exemplified in the following detailed disclosure, and the scope of the invention will be indicated in the claims.
- FIG. 1 depicts a system architecture of an embodiment showing the connectivity between the various components, with the transmittal of video streams by the endpoint monitor to a cloud server comprising processing components for annotating and adding value to the video stream and the streaming of the annotated video stream to client applications;
- FIG. 2 shows the internal processing steps leading to the annotation of a video stream, the steps comprising splitting the video stream into its frames, applying a plurality of pretrained models to annotate individual frames, and streaming the frames and the associated annotations to the client application;
- FIG. 3 depicts the flow of processing steps performed on the cloud server for detecting faces and masks on a video frame
- FIG. 4 shows an environmental diagram for an embodiment being used for monitoring a kitchen of a commercial restaurant and depicts the use of station and environment camera, such as the fish-eye cameras shown, being in live communication with an end-point monitor that collects, processes, and transmits the video feeds to a server;
- FIG. 5 shows an environmental view with a field camera being used for monitoring a delivery vehicle with delivery personnel and packages where the field camera is connected to a portable end-point monitor that collects, processes, and transmits the video feed from the delivery vehicle over the cloud;
- FIG. 6 shows an environmental view where a portable phone including a processor, a camera and a network interface serves as a device for capturing information, with the phone camera serving as an optical sensor and the processor executing software instructions to process and transmit the video to the server;
- FIG. 7 shows the processing steps performed by the client application allowing for a selection of the value-added video stream, viewing of the selected stream, and reporting any concerning behavior;
- FIG. 8(A) depicts the block diagrams for the computer vision model used for detecting the bounding box over a human face;
- FIG. 8(B) shows the flow chart and the software procedure corresponding to the implementation of the face detection, producing a bounding box around the face for setting up the next stage of detecting a mask on the face;
- FIG. 9 shows the architecture of a Convolutional Deep Neural Network utilizing a plurality of filters and sampling layers, followed by a plurality of dense layers of neurons, which is ultimately followed by a single output layer generating the determination of the monitored condition;
- FIG. 10 (A) shows the architecture of a series of LSTM cells that are configured to examine each frame and provide an output to a fully connected hidden layer of neurons that is then fed to an output softmax function in the shown embodiment
- FIG. 10 (B) shows the inner architecture of each of the LSTM cells used in the autoregressive chain shown above
- FIG. 11 depicts the architecture of a deep neural network for detecting the presence of a mask within the bounding box of the image containing a human face;
- FIG. 12 shows a swim lane diagram for the three concurrent activities in progress, i.e. collection of video feeds from the end-points, processing of video feeds to overlay the value-added annotations, and the receiving and commenting on the value-added feeds by the client application user;
- FIG. 13 shows an activity diagram depicting the ability of the client application to sort the feeds by The Real Meal Score, or the TRM Score, assigned by the provider to each of the value-added video feeds streamed from the cloud server;
- FIG. 14 depicts the inclusion of a database on the server to manage a plurality of restaurant information
- FIG. 15(A) shows a GUI providing the search and display capabilities of the client application providing capabilities of searching for a specific restaurant using a variety of criteria including location, name, TRM score and the like;
- FIG. 15(B) depicts the capability of the client application to drill down and view all the camera feeds provided by a specific restaurant including the plurality of value-added video streams from the kitchen monitoring stations and fish eye camera, and value added streams from the plurality of delivery vehicles;
- FIG. 16 depicts a Graphical User Interface for a viewing of a specific value-added video stream on the client application and the ability to send an electronic message with attached frames depicting the concerning behavior;
- FIG. 17 shows a component and packaging diagram of an embodiment of the system.
- FIG. 18(A) shows an example of a frame where five faces are recognized, none of which are seen wearing masks, receiving a TRM score of 0; and FIG. 18(B) shows an example where two faces are recognized and both faces are annotated with a checkmark indicating that masks were detected, receiving a score of 10.
- FIG. 1 depicts a system architecture of an embodiment showing the connectivity between the various components, with the transmittal of video streams by the endpoint monitor to a cloud server comprising processing components for annotating and adding value to the video stream and the streaming of the annotated video stream to client applications.
- the Video Acquisition Process 22 is configured to receive the plurality of video streams from End Point Monitor 16.
- the Video Acquisition Process 22 is a software process configured to capture the incoming video streams. Upon receiving the streams, Video Acquisition Process 22 forwards them over to the process for Video Analysis and Annotation 26.
- the Video Analysis and Annotation 26 is a process configured to perform an analysis of the frames of a video stream and generate a predefined set of annotations.
- the Video Analysis and Annotation 26 results in creating the annotations where the annotations are specific to the problem being solved such as whether the images depict people wearing face masks or not.
- the annotated video streams are subsequently conveyed to Annotated Video Steaming Process 28.
- the Annotated Video Steaming Process 28 is a process configured to make the value added streams available to client applications.
- An embodiment uses a deep neural network based system comprising an optical sensor configured to capture information where the optical sensor is in communication with a network interface; the network interface configured to receive the information that is captured and transmit said information to a server; the server configured to execute a deep learning neural network based computer implemented method to perform an analysis of said information to detect a presence of a plurality of monitored conditions, and label said information with the presence that is detected of the plurality of monitored conditions.
- the labeling of said information comprises adding a visual artifact, or adding an audio artifact, or adding both the visual artifact and the audio artifact to the captured information.
- the end clients use Web Client 30 and Mobile Client 32 and obtain a value-added video stream from the Annotated Video Steaming Process 28.
- the Video Analysis and Annotation 26 process receives frames and processes them using a high performance computing server and upon completing the processing merges the annotation with the existing frame.
- the annotation on a stream may be somewhat lagging in phase from the video stream being disseminated to the client application. It will be appreciated by one skilled in the art that with high performance capability of the computing server being utilized this delay will be minimized and any lag in phase will be imperceptible to the consumer of the value-added annotated video stream.
- FIG. 2 shows the internal processing steps leading to the annotation of a video stream, the steps comprising splitting the video stream into its frames, applying a plurality of pretrained models to annotate individual frames, and streaming the frames and the associated annotations to the client application.
- the Video Splitting Process 34 is configured to receive a video feed from Video Acquisition Process 22.
- the Video Splitting Process 34 is a process to split an incoming video stream into individual frames or a small group of frames for the purpose of analysis.
- the Predefined Trained Model 36 is a predefined computational model that, given input data, such as an image frame in an embodiment, causes the production of either a discrete value or a classification value, where the discrete or classification value is a function of the input data.
- Predefined Trained Model 36 would be configured to perform a function that detects for the presence of a face mask in the output of Video Splitting Process 34.
- Predefined Trained Model 36 would be configured to detect the distance between objects and individuals in the output of Video Splitting Process 34.
- the Video Splitting Process 34 generates input data as frames for applying a Predefined Trained Model 36 through the process Apply Model 42, which is a process that takes the frame from Video Splitting Process 34, performs the function of the computational model provided by the Predefined Trained Model 36, and causes the production of a discrete or classification value.
- the output of Apply Model 42 is then merged with the output of Video Splitting Process 34 by the process Overlay Classification on Video 38.
- Overlay Classification on Video 38 is a process that takes the classification or discrete value provided by the Apply Model 42 process and overlays a visual representation of the classification or discrete value over the video, where the visual representation stays overlaid for a predetermined number of frames or time duration.
- the output of Apply Model 42 is a binary classification of whether an individual in the output of Video Splitting Process 34 is wearing a mask;
- the output of Overlay Classification on Video 38 would be a visual annotation with a combination of text and pictorial feedback indicating whether an individual in the output of Video Splitting Process 34 is wearing a mask.
- Overlay Classification on Video 38 can provide annotation dependent on the output of Apply Model 42.
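- As an illustrative sketch (not the patent's listing), the overlay step can be realized with OpenCV drawing primitives; the function below is hypothetical and assumes the binary classification and the face bounding box are already available from Apply Model 42:

```python
import cv2

def overlay_classification(frame, box, wearing_mask):
    """Hypothetical Overlay Classification on Video 38 step: draw a mask/no-mask annotation.

    frame: BGR image from Video Splitting Process 34; box: (x, y, w, h) face bounding box.
    """
    x, y, w, h = box
    color = (0, 200, 0) if wearing_mask else (0, 0, 255)    # green for compliance, red for violation
    label = "Mask detected" if wearing_mask else "No mask"
    cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)  # box around the detected face
    cv2.putText(frame, label, (x, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)    # text annotation above the box
    return frame
```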
- Annotated Video Steaming Process 28 receives the output of Overlay Classification on Video 38.
- Annotated Video Steaming Process 28 is a process configured to make the value added streams available to client applications by communicating this stream to client applications, such as a plurality of Web Clients 30 and Mobile Clients 32.
- the term client application will be used to refer to any of the plurality of applications designed to access the restaurants and observe their respective value-added video streams.
- the client applications include but are not limited to Web Client 30 and Mobile Client 32.
- a skilled artisan will be able to envision other dedicated applications and appliances for accessing the Cloud Server 20, searching for restaurants or other establishments, and observing their corresponding value added video feeds delivered through the Annotated Video Steaming Process 28.
- a client application will utilize at least a display surface - such as a screen - as an output device for rendering the value-added video stream. It will further utilize a processor, a memory, a network interface, and input devices to enable selection of an establishment to view the stream, complete a survey, and report concerning behavior.
- the client application is a software process executing on any general-purpose computing device.
- FIG. 3 depicts the flow of processing steps performed on the cloud server for detecting faces and masks on a video frame. This process is being continually performed on the Cloud Server 20, specifically the Apply Model 42 process on the Cloud Server 20, on all the video streams received by the Video Acquisition Process 22.
- Apply Model 42 is applying a Predefined Trained Model 36 to detect the presence of a face mask on individuals located in a frame.
- Preprocess Model Input Frame 76 receives input from Video Splitting Process 34.
- Preprocess Model Input Frame 76 is a component to pre-process the output of Video Splitting Process 34, that will render the frame as an acceptable input to the model application processes within Apply Model 42.
- Preprocess Model Input Frame 76 would convert the image to a black-and-white image from a color image.
- Preprocess Model Input Frame 76 would reduce the size of the image, which enhances the ability of specific AI algorithms to process the image.
- the output of Preprocess Model Input Frame 76 is then passed to Apply Face Detector 78.
- Apply Face Detector 78 is a component that will apply a face detector from Predefined Trained Model 36, to the output of Preprocess Model Input Frame 76.
- the Predefined Trained Model 36 that is used to Apply Face Detector 78 would be a Convolutional Neural Network that is trained on a dataset of open-source images.
- the Predefined Trained Model 36 that is used to Apply Face Detector 78 would be a Convolutional Neural Network that is trained on images from Video Splitting Process 34 that have been annotated by a practitioner in the art.
- the Predefined Trained Model 36 that is used to Apply Face Detector 78 would be a model that has been pretrained to detect faces.
- the output of Apply Face Detector 78 is the location of the face or faces detected in the output of Preprocess Model Input Frame 76.
- Apply Mask Detector 80 detects if a mask is present given the location of a face.
- Apply Mask Detector 80 is a component that will apply a mask detector from Predefined Trained Model 36, to the output of Apply Face Detector 78.
- the Predefined Trained Model 36 that is used to Apply Mask Detector 80 would be a Convolutional Neural Network that is trained on a dataset of open-source images.
- the Predefined Trained Model 36 that is used to Apply Mask Detector 80 would be a Convolutional Neural Network that is trained on images from Video Splitting Process 34 that have been annotated by a practitioner in the art.
- the Predefined Trained Model 36 that is used to Apply Mask Detector 80 would be a Convolutional Neural Network that has been pretrained to detect masks.
- Model Output 82 receives the output from Apply Mask Detector 80.
- Model Output 82 is a component that will output a binary classification or a discrete value with the results of the application of Predefined Trained Model 36.
- where Predefined Trained Model 36 is detecting the presence of a face mask, Model Output 82 will output a binary value as to whether a person in the frame is wearing a face mask.
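- For illustration, the FIG. 3 flow can be sketched as a short Python pipeline; the detector and classifier objects below stand in for Predefined Trained Model 36 and are assumptions rather than the patent's implementation:

```python
import cv2

def apply_model(frame, face_detector, mask_classifier):
    """Hypothetical Apply Model 42: preprocess the frame, detect faces, classify masks.

    Returns a list of (bounding_box, wearing_mask) tuples, one per detected face.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)          # Preprocess Model Input Frame 76
    faces = face_detector.detectMultiScale(gray, 1.1, 5)    # Apply Face Detector 78
    results = []
    for (x, y, w, h) in faces:
        face_crop = frame[y:y + h, x:x + w]
        wearing_mask = mask_classifier(face_crop)           # Apply Mask Detector 80 (assumed callable)
        results.append(((x, y, w, h), bool(wearing_mask)))  # Model Output 82: binary value per face
    return results
```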
- FIG. 4 shows an environmental diagram for an embodiment being used for monitoring a kitchen of a commercial restaurant and depicts the use of station and environment camera, such as the fish-eye cameras shown, being in live communication with an end-point monitor that collects, processes, and transmits the video feeds to a server.
- a plurality of Station Cameras 12 are installed at predetermined locations.
- a Station Camera 12 is a camera installed for observing a full view at a station configured to monitor the activities in the close proximity of the station.
- the location of each Station Camera 12 is configured to get a clear view of the personnel working at that specific station to ensure compliance with health or other hygiene related protocols being annotated by the system.
- the Station Cameras 12 are configured to detect whether the personnel preparing food in a commercial kitchen are complying with the requirements of wearing masks while working at their station.
- a Fish Eye Camera 14 is one of a plurality of cameras installed on the ceiling or a similar location to monitor the area at an environmental level, such as the entire commercial kitchen facility.
- a plurality of Fish Eye Cameras 14 can help establish the distance between each of the personnel working in the kitchen and use this information to annotate the video stream with a level of social distancing being observed by the personnel.
- computer vision algorithms can be used to detect a face in a given image from Fish Eye Cameras 14.
- the distance between multiple faces located in an image can be computed by taking the pixel differential between multiple faces and applying a general heuristic or scaling it using an object with known dimensions.
- fixed bounding boxes are placed on the regions with the stream processing ensuring that the personnel working in the kitchen remain within the confines of the bounding boxes and triggering a non-compliance when personnel step outside of the bounding box for a period greater than a predefined threshold.
- the focal length of a plurality of cameras is used to compute the global coordinates of each of the personnel, and the Euclidean distance between each pair of locations is used to ensure observance of social distancing.
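- As a minimal sketch of the pixel-differential heuristic described above (an assumption for illustration, not code from the patent), the distance between two detected faces can be scaled using a reference object of known width visible in the frame:

```python
import math

def estimated_distance_m(face_a, face_b, ref_pixel_width, ref_real_width_m):
    """Estimate the real-world distance between two face bounding boxes.

    face_a, face_b: (x, y, w, h) boxes; ref_pixel_width and ref_real_width_m describe
    an object of known size in the frame (assumed calibration values).
    """
    ax, ay, aw, ah = face_a
    bx, by, bw, bh = face_b
    pixel_dist = math.hypot((ax + aw / 2) - (bx + bw / 2),
                            (ay + ah / 2) - (by + bh / 2))   # distance between box centers
    metres_per_pixel = ref_real_width_m / ref_pixel_width
    return pixel_dist * metres_per_pixel

# Example: flag a social-distancing concern below an assumed 2-metre threshold
if estimated_distance_m((100, 80, 60, 60), (400, 90, 58, 58), 150, 1.0) < 2.0:
    print("social distancing threshold not met")
```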
- the information from the plurality of Station Cameras 12 and the plurality of Fish Eye Cameras 14 is processed by the End Point Monitor 16.
- the End Point Monitor 16 is a system designed to collect and preprocess the feeds from each of the plurality of Station Cameras 12 and each of the plurality of Fish Eye Cameras 14 and transmit the consolidated feed over the network to a server for further processing.
- the End Point Monitor 16 communicates the video streams from the plurality of cameras (from both the plurality of Station Cameras 12 and the plurality of Fish Eye Cameras 14) to the Cloud Server 20.
- the Cloud Server 20 is a high throughput performance computing server with integrated database and processing for streaming in live videos, running processes for annotating video streams, and making the value added streams available for client devices.
- End Point Monitor 16 communicates the physical locations and other information about the cameras, including but not limited to focal length, aperture setting, geolocation, contrast, and enhancement settings, to the Cloud Server 20.
- the cameras also communicate shutter speeds of any still photographs taken to the End Point Monitor 16 which in turn communicates this information to the Cloud Server 20.
- FIG. 5 shows an environmental view with a field camera being used for monitoring a delivery vehicle with delivery personnel and packages where the field camera is connected to a portable end-point monitor that collects, processes, and transmits the video feed from the delivery vehicle over the cloud.
- a Field Camera 11 is located inside of a delivery vehicle and is recording the delivery personnel.
- Field Camera 11 is a camera installed for observing the activities inside of the delivery vehicle.
- the location of a Field Camera 11 is configured to get a clear view of the delivery personnel to ensure compliance with health or other hygiene related protocols being annotated by the system.
- In an embodiment, the Field Camera 11 is configured to detect whether the personnel delivering food in a delivery vehicle are complying with the requirements of wearing masks.
- the information from the Field Camera 11 is processed by Field Endpoint 13.
- the Field Endpoint 13 is a system designed to collect and preprocess the feeds from each of the plurality of Field Cameras 11 and transmit the consolidated feed over the network to a server for further processing.
- the Field Endpoint 13 communicates the video streams from the Field Camera 11 to the Cloud Server 20 over a wireless data network provisioned for the use by the Field Endpoint 13.
- An embodiment has the Field Endpoint 13 communicating with End Point Monitor 16 over the wireless data network provisioned for the use by the Field Endpoint 13 where the End Point Monitor 16 consolidates video feeds from the Station Cameras 12, Fish Eye Cameras 14 and Field Cameras 11, and uploads the consolidated stream to Cloud Server 20.
- An embodiment uses a deep neural network based system comprising a plurality of cameras or optical sensors connected to an end-point monitor where the plurality of cameras or optical sensors capture information and communicate the information to the end-point monitor; the endpoint monitor collects the information that is captured to create a video stream and further communicates the video stream to a server, wherein the server is configured to execute a deep neural network based computer implemented method to perform an analysis of the video stream to detect a presence of a plurality of monitored conditions, and annotate the video stream with the presence that is detected of the plurality of monitored conditions to produce an annotated video stream.
- An embodiment of the system uses a plurality of cameras or optical sensors configured to monitor an interior of a food preparation facility, wherein the computer implemented method is configured to annotate the video stream with the analysis of the video stream for monitored conditions including a detecting of a presence of a human face in the video stream, and upon detecting the presence of the human face, further detecting whether the human face includes a mask or a face covering.
- An embodiment of the system uses a computer implementation that is configured to annotate the video stream with the analysis of the video stream for monitored conditions including an indication of presence of a plurality of humans in the video stream, and upon the indication of presence of the plurality of humans, further indicates whether said plurality of humans satisfy a predefined separation condition from each other.
- FIG. 6 shows an environmental view where a portable phone including a processor, a camera and a network interface serves as a device for capturing information, with the phone camera serving as an optical sensor and the processor executing software instructions to process and transmit the video to the server.
- a Mobile Phone 15 is located inside of a delivery vehicle and is recording the delivery personnel.
- Mobile Phone 15 is a portable cellular device which includes a camera and is executing an application that combines the functionality of Field Camera 11 and Field Endpoint 13.
- the location of Mobile Phone 15 is configured to get a clear view of the delivery personnel to ensure compliance with health or other hygiene related protocols being annotated by the system.
- In an embodiment, the optical sensor and the network interface are configured to monitor conditions inside of a vehicle.
- the deep learning neural network based computer implemented method is configured for detecting a presence of a human face from the information that is captured, and upon detecting the presence of the human face, further detecting whether the human face includes a mask or a face covering.
- the Mobile Phone 15 is configured to detect whether the personnel delivering food in a delivery vehicle are complying with the requirements of wearing masks.
- the Mobile Phone 15 records and communicates video streams to Cloud Server 20 over a wireless data network provisioned for the use by the Mobile Phone 15.
- An embodiment has the Mobile Phone 15 communicating with End Point Monitor 16 over the wireless data network provisioned for the use by the Mobile Phone 15 where the End Point Monitor 16 consolidates video feeds from the Station Cameras 12, Fish Eye Cameras 14 and the Mobile Phone 15, and uploads the consolidated stream to Cloud Server 20.
- the system is also configured to view the packages being delivered in plain view.
- the video feeds for the Field Cameras 11 and Mobile Phone 15 are configured to keep the food packages within the field of view.
- the food packages being in plain view serves as a deterrent since the delivery personnel actions are being recorded in plain view.
- the consumers watching any of the video feeds, including the feeds from within the delivery vehicles can observe any non-compliant behavior and report it to the establishment. This further helps in achieving the application goal of maintaining transparency from kitchen to the consumer.
- An embodiment of the video stream annotation system further including a mobile cellular device including a camera and a wireless networking interface adapted to communicate over a wireless network, with the camera on the mobile cellular device serving as an optical sensor, the wireless networking interface serving as the network interface, where information of the camera is configured to be received by the wireless networking interface and further transmitted to the server over the wireless network.
- the server is in communication with a client system, and the server is further configured to communicate to the client system said information and the labeling of said information with the presence of the plurality of monitored conditions.
- the client system further communicates the presence of the monitored conditions to a plurality of subscribers and further alerts the subscribers about the presence of a predefined set of monitored conditions.
- FIG. 7 shows the processing steps performed by the client application allowing for a selection of the value-added video stream, viewing of the selected stream, and reporting any concerning behavior.
- the genesis of the process begins with a Login 44 step wherein an authentication of the client application is performed with the Cloud Server 20 to establish the privilege level of the client application in continuing with further processing.
- the Login 44 step is further supported by Authenticate Login 46, which is a subordinate process supporting the authentication of Login 44 with the help of local authentication or biometric keys, alleviating the need to authenticate by communicating with Cloud Server 20.
- Select Restaurant 48 represents a selection step configured to enable a selection of one of the many value-added video streams provided by the Cloud Server 20.
- Select Restaurant 48 is configured to enable selection of a restaurant by options to filter by geolocation or other location specifiers of the restaurant, by the name, or a part of the name, of the restaurant, by the type of cuisine served by the restaurant, by the Cloud Server 20 assigned score of the restaurant, or by the hours of operation of the restaurant.
- the client application proceeds to the next step of Display Value Added Stream 50.
- Display Value Added Stream 50 is a step of the client application that connects to the Annotated Video Steaming Process 28 on the Cloud Server 20 to obtain and display the value-added video stream for the restaurant selected in the Select Restaurant 48 step. While the Display Value Added Stream 50 is ongoing, the client application further offers the capability to report concerning behavior with the step of Report Behavior 58, which is a step in the client application configured to capture input and attachments and send a report to the restaurant. The idea here is that upon observing concerning behavior on the value-added video feed, a report with attachments of the frames depicting the concerning behavior, together with further text-based input gathered, should be sent to the restaurant. In an embodiment, concerning behavior messages and attachments will be delivered to the restaurant by the client application using an electronic messaging system.
- Report Behavior 58 can be generated by rendering a one-time prompt on the client-device. In another embodiment of the invention, Report Behavior 58 can be generated by rendering a button which would continuously prompt feedback from the Mobile Client 32 or the Web Client 30.
- Report to Restaurant 60 is a step on the client application that enables sending a report of concerning behavior to the restaurant.
- Report to Restaurant 60 will render a prompt on the client application confirming that a report would be sent to the establishment regarding the incident.
- the client application will provide data input fields where detailed information regarding the incident could be entered.
- Attach Concerning Behavior Frame 62 is a step that provides a capability of the client application to attach a single or plurality of frames of the video to the report of concerning behavior to further substantiate and provide additional information with the report being sent to the restaurant.
- the business would receive the message from Attach Concerning Behavior Frame 62 in the form of an email.
- the business would receive the message from Attach Concerning Behavior Frame 62 in the form of a message on the platform directly.
- Exit Stream 64 is a step in which the client application stops the feed being received from the Annotated Video Steaming Process 28.
- the client application proceeds to the Survey 66 step.
- Survey 66 is a step of conducting a survey through a series of questions related to the stream.
- the Survey 66 would elicit feedback from the operator of the client application regarding the contents of Display Value Added Stream 50 and adherence to predefined protocols understood to enhance safety.
- Send to Business 68 is a step that sends the data collected in Survey 66 to the restaurant. Additionally, upon submission of Survey 66, the client application also performs Send to Provider 70, a step that sends the data collected in Survey 66 to the Cloud Server 20 and to the value-added stream service provider.
- the cumulative data collected from all instances of Survey 66 triggering Send to Provider 70 would allow the value added stream provider to perform Update Restaurant Score 72, which is a computation to update a score for each establishment on the platform, given data from Survey 66. The result of Update Restaurant Score 72 then becomes available as a search criterion for the Select Restaurant 48 step in the client application.
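- The patent does not publish the scoring formula; as a hedged illustration, Update Restaurant Score 72 could be realized as a running average that folds each new Survey 66 into the establishment's existing score:

```python
def update_restaurant_score(current_score, survey_count, survey_responses):
    """Hypothetical Update Restaurant Score 72: fold a new survey into a running average.

    current_score: existing score on an assumed 0-10 scale; survey_count: surveys received
    so far; survey_responses: list of 0-10 answers from the latest Survey 66.
    """
    new_survey_score = sum(survey_responses) / len(survey_responses)
    updated = (current_score * survey_count + new_survey_score) / (survey_count + 1)
    return round(updated, 1), survey_count + 1

# Example: an establishment at 7.5 after 20 surveys receives a survey averaging 9.0
score, count = update_restaurant_score(7.5, 20, [9, 9, 9, 9])
```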
- FIG. 8(A) depicts the block diagrams for the computer vision model used for detecting the bounding box over a human face.
- the resultant output of Video Splitting Process 34 is used to Apply Model 42 by Cloud Server 20.
- FIG. 8(A) depicts the embodiment of the invention where Apply Model 42 is used to detect a face in the output of Video Splitting Process 34.
- Preprocess Model Input Frame 76 is a process to convert the output of Video Splitting Process 34 to grayscale.
- the output of Preprocess Model Input Frame 76, which in the embodiment shown converts a colored image to a grayscale image, is next processed through Apply Face Detector 78, which is a component that will apply a face detector from Predefined Trained Model 36 to the output of Preprocess Model Input Frame 76.
- the result of Apply Face Detector 78 produces the Face Detector Output 84 as its output which delineates the location of faces detected in a video frame.
- FIG. 8(B) shows the flow chart and the software procedure corresponding to the implementation of the face detection, producing a bounding box around the face for setting up the next stage of detecting a mask on the face.
- Python is an interpreted programming language commonly used for Machine Learning Applications. Additional information can be found on Python’s official website, www.python.org. The specific libraries used in building the models are included in attached sequence listing and incorporated herein by reference.
- An embodiment of the flow chart shown uses functions from the OpenCV library for receiving the output of Video Splitting Process 34 in FIG. 8(B) and uses the Predefined Trained Model 36 as OpenCV’s prebuilt face detection model using Haar-cascade preprocessing for detecting faces.
- This model is incorporated herein by reference.
- An embodiment uses an OpenCV method to convert the image to grayscale within the flowchart block Preprocess Model Input Frame 76.
- an embodiment applies Predefined Trained Model 36 - utilizing the Haar-cascade model of OpenCV applied by the Apply Face Detector 78 block - to search for faces in the processed frame produced by Preprocess Model Input Frame 76.
- the output of the Apply Face Detector 78 is the Face Detector Output 84 which is a set of coordinates corresponding to the location of the face or faces in the image analyzed by Apply Face Detector 78.
- the Haar-cascade model being used in an embodiment can be found in OpenCV’s official repository at https://github.com/opencv/opencv/blob/master/data/haarcascades/haarcascade_frontalface_default.xml. This model is incorporated herein by reference.
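- A minimal sketch of this face-detection stage using OpenCV’s bundled Haar cascade (consistent with the flow chart, though not the patent’s exact listing):

```python
import cv2

# Load OpenCV's prebuilt Haar-cascade frontal-face model (used as Predefined Trained Model 36)
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_detector_output(frame):
    """Return Face Detector Output 84: a list of (x, y, w, h) face bounding boxes."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # Preprocess Model Input Frame 76
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return list(faces)
```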
- FIG. 9 shows the architecture of a Convolutional Deep Neural Network utilizing a plurality of filters and sampling layers, followed by a plurality of dense layers of neurons, which is ultimately followed by a single output layer generating the determination of the monitored condition.
- Each of the Filter Layer 91 performs a plurality of preconfigured filtering operations where each is adapted to capture a specific image property.
- the Filter Layer 91 is followed by Pooling Layer 92 which performs an aggregation of the filtered image layers to create coarser image layers which in turn capture occurrence of higher level features.
- An embodiment uses two instances of Filter Layer 91, the first and second instances using 64 and 128 filters respectively, each filter using a 5 by 5 mask for convolving with the input.
- in traditional image processing, the set of filters is chosen a-priori, including blurring filters, edge detection filters, or sharpening filters;
- in contrast, the actual weights of the filters used in a convolutional neural network are not defined a-priori but learned during the training process.
- the deep convolution neural network can find features in the image that pre-configured filters may not find.
- Each Filter Layer 91 is followed by a Pooling Layer 92.
- the Pooling Layer 92 simply replaces a 4x4 neighborhood in an image with the maximum value. Since the Pooling Layer 92 is done with tiling, the output of the Pooling Layer 92 results in cutting the input image size by half. Thus, the image after two instances of the Pooling Layer 92 is reduced to one quarter of the original size.
- the size of input image of a human face which was standardized to 255 by 255 will be reduced to 253 by 253 after the first instance of Filter Layer 91, further gets reduced to 126 by 126 after the first instance of Pooling Layer 92, which gets reduced to 124 by 124 after the second instance of Filter Layer 91, which ultimately is reduced to 62 by 62 by the second instance of Pooling Layer 92.
- This 62 by 62 image is next flattened by Flatten Layer 96 which essentially reorganizes the two or higher dimensional data set, such as an image, to a single dimension vector making it possible to be fed to a dense neural network layer.
- the feed forward process works just like it would in a multi-layer perceptron with the additional characteristic that a deep neural network will involve a plurality of hidden layers.
- Each of these layers is an instance of Dense Layer 93, which comprises neurons that are fully connected to all neurons in the previous layer, with a weight associated with each of these connections.
- the last dense layer is often labelled as the Output Layer 94, it being the final layer of neurons that is followed by a function such as a sigmoid or a softmax function for classification.
- The embodiment shown has three neurons in the Output Layer 94 and would therefore utilize a softmax function to convert the output of the three neurons into a probability distribution using the formula below, where the probability of the i-th output is calculated given that the neurons in the Output Layer 94 produced outputs O_1, O_2, ..., O_n: P_i = e^(O_i) / (e^(O_1) + e^(O_2) + ... + e^(O_n)).
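- A minimal Keras sketch consistent with the FIG. 9 description; the dense layer width, pooling size, and other hyperparameters are assumptions for illustration, since the patent does not publish exact code:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_monitoring_cnn(input_shape=(255, 255, 1), num_outputs=3):
    """Sketch of FIG. 9: two filter/pooling stages, flatten, dense layers, softmax output."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(64, (5, 5), activation="relu"),       # first Filter Layer 91 (64 filters)
        layers.MaxPooling2D((2, 2)),                        # first Pooling Layer 92
        layers.Conv2D(128, (5, 5), activation="relu"),      # second Filter Layer 91 (128 filters)
        layers.MaxPooling2D((2, 2)),                        # second Pooling Layer 92
        layers.Flatten(),                                   # Flatten Layer 96
        layers.Dense(128, activation="relu"),               # Dense Layer 93 (assumed width)
        layers.Dense(num_outputs, activation="softmax"),    # Output Layer 94 with softmax
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```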
- An embodiment of the system used for detecting a plurality of conditions where said computer implemented method further comprises a plurality of deep neural networks, each for detecting one or more of the plurality of conditions, and each of the deep neural networks comprising a convolutional neural network including a plurality of filtering layers where each of the filtering layers has an associated set of filtering layer parameters, a plurality of pooling layers where each of the pooling layers has an associated set of pooling layer parameters, a plurality of dense layers where each of the dense layers has an associated set of dense layer parameters, and an output layer where the output layer has an associated set of output layer parameters, and where the convolutional neural network is configured to detect and the output layer is configured to report a status of one or more of the predefined monitored conditions.
- a base set of deep neural network models obtained from a source like ImageNet is utilized and further fine-tuned, where an inductive tuning of the deep neural network is performed by training the deep neural network on a plurality of examples of the predefined monitored condition, where each example is used for updating the filtering parameters, the pooling parameters, the dense layer parameters, and the output layer parameters; and the inductive tuning of the deep neural network is configured to cause the deep neural network to recognize the presence of the predefined monitored condition with improved accuracy.
- FIGS. 10 (A) and (B) show the architecture of a Long Short Term Memory or LSTM Deep Neural Network, which offers the advantage of using autoregressive memory and captures dependencies between the consecutive video frame sequences, which are modeled as a time series where the monitored conditions will also be determined by the condition’s value in prior frames.
- FIG. 10 (A) shows the architecture of a series of LSTM cells that are configured to examine each frame and provide an output to a fully connected hidden layer of neurons that is then fed to an output softmax function in the shown embodiment
- FIG. 10 (B) shows the inner architecture of each of the LSTM cells used in the autoregressive chain shown above.
- This architecture of deep neural network is configured to capture changes in subsequent images which may not be readily perceptible.
- a deep neural network for learning time dependent variability in the image sequences is utilized in an embodiment.
- the embodiment shows the architecture of a Long Short Term Memory or LSTM Deep Neural Networks which offers the advantage of using autoregressive memory and captures dependencies between the consecutive video frame sequences which are modeled as a time series where the monitored conditions will also be determined by the condition’s value in prior frames.
- the LSTM belongs to the class of Recurrent Neural Networks which retain the state of learning from one time step to the next and the results from the previous frame influence the interpretation of the subsequent frame.
- the entire “video sequence” is used for detection of a monitored condition, where the deep learning neural network utilizes an autoregression of a time series of said captured information, where said autoregression is performed using a computer implemented method utilizing a deep recurrent neural network based on a Long Short Term Memory (LSTM) architecture.
- a sequence of frames is fed into an LSTM deep neural network, where each LSTM Cell 95 is a component of the LSTM chain; the LSTM cell also includes a memory and uses a neural network to combine the memory with the image frame and the previous state to produce an output, which is then fed to the next LSTM cell and to a dense layer.
- a five second sequence will typically comprise anywhere between 50 and 150 frames, corresponding to a frame capture rate of 10 fps to 30 fps respectively.
- Embodiments using 100 LSTM Cells 95 offer a sufficiently long span of time for detection of monitored conditions.
- Each of the LSTM Cells 95 is fed a portion of the video frame comprising a detected region of interest.
- the region of interest is the face of any humans in the video frame which is detected by Face Detector Output 84.
- the subsection detected by Face Detector Output 84 from the video frame is standardized to a size of 255 by 255 and then passed through an embedding process which converts the 2D image data into a single dimensional vector which is then supplied as an input to LSTM Cells 95.
- the output of the chain of LSTM Cells 95 is fed to a plurality of neurons of a Dense Layer 93.
- the embodiment shown uses a single Dense Layer 93 which is also the Output Layer 94.
- Other embodiments use multiple instances of Dense Layer 93.
- the output of the series of LSTM Cells 95 comprises a flattened set of inputs that are fed into a multi-layer perceptron having a plurality of hidden layers.
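- A hedged Keras sketch of the FIG. 10 arrangement; the sequence length, embedding size, and layer widths are assumptions for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_lstm_monitor(seq_len=100, embed_dim=1024, num_classes=2):
    """Sketch of FIG. 10: a chain of LSTM Cells 95 feeding a Dense Layer 93 and a softmax output.

    Each timestep is assumed to be a flattened embedding of the detected face region.
    """
    model = models.Sequential([
        layers.Input(shape=(seq_len, embed_dim)),           # one embedded face crop per frame
        layers.LSTM(128),                                   # autoregressive chain of LSTM Cells 95
        layers.Dense(64, activation="relu"),                # fully connected hidden Dense Layer 93
        layers.Dense(num_classes, activation="softmax"),    # Output Layer 94 (softmax)
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```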
- FIG. 11 depicts the architecture of a deep neural network for detecting the presence of a mask within the bounding box of the image containing a human face.
- a Deep Neural Network (DNN) is used to classify if a mask is present or absent within a given bounding box containing a human face.
- FIG. 11 is depicting an embodiment of the invention such that Predefined Trained Model 36 is classifying the presence or absence of a mask within a given bounding box containing a human face.
- the input into this DNN is Face Detector Output 84.
- the DNN consists of a base model, Mobile Net V2 86.
- the Mobile Net V2 model used in an embodiment is included in the attached sequence listing and is incorporated herein by reference.
- Mobile Net V2 86 is a general and stable open-source model architecture used for multiple applications in computer vision use cases. Additional information is available at https://www.tensorflow.org/api_docs/python/tf/keras/applications/MobileNetV2. The model information and parameters are incorporated herein by reference. Face Detector Output 84 is standardized to an image size of 255 by 255 pixels before it is fed to the Mobile Net V2 86.
- Head Model 88 is the portion of the DNN architecture that is added to the underlying model to accomplish the specialized training of the base model as will be appreciated by a practitioner in the art.
- Head Model 88 consists of three layers. The first layer flattens the output of Mobile Net V2 86. The second layer is a layer of 128 dense neurons. The third layer is a layer of 2 dense neurons. The two neurons in this layer correspond to the binary output.
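- A hedged Keras sketch of this head-on-base arrangement; freezing the base weights and resizing the face crop to the base network's expected input are assumptions, not details given by the patent:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

def build_mask_detector(input_shape=(224, 224, 3)):
    """Sketch of FIG. 11: a MobileNetV2 base (Mobile Net V2 86) plus the three-layer Head Model 88."""
    base = MobileNetV2(input_shape=input_shape, include_top=False, weights="imagenet")
    base.trainable = False                                  # assumed: only the head is trained
    model = models.Sequential([
        base,
        layers.Flatten(),                                   # first head layer: flatten base output
        layers.Dense(128, activation="relu"),               # second head layer: 128 dense neurons
        layers.Dense(2, activation="softmax"),              # third head layer: 2 neurons, binary output
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```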
- the output of the DNN is Mask Detector Output 90.
- Mask Detector Output 90 is the output of Apply Mask Detector 80 which is a binary classification of the presence of a mask within a given bounding box containing a human face.
- the DNN that is used to generate the model for Apply Mask Detector 80 would be trained on a dataset of open-source images. In another embodiment of the invention, the DNN that is used to generate the model for Apply Mask Detector 80 would be trained on images from Video Splitting Process 34 that have been annotated by a practitioner in the art.
- FIG. 12 shows a swim lane diagram for the three concurrent activities in progress, i.e. collection of video feeds from the end-points, processing of video feeds to overlay the value-added annotations, and the receiving and commenting on the value-added feeds by the client application user.
- a client application facilitates Login 44, Authenticate Login 46, and Select Restaurant 48 processes.
- the client device selects to Display Value Added Stream 50 by establishing a connection between the client application and the Cloud Server 20.
- Shown in FIG. 12 is an embodiment where the streaming process commences with End Point Monitor 16 capturing the video feed in the Acquire Video 52 step.
- Acquire Video 52 is a subsystem in which the End Point Monitor 16 acquires video feed from a Station Camera 12 or a Fish Eye Camera 14.
- the End Point Monitor 16 connects to the cloud in the Connect to Cloud 54 step.
- Connect to Cloud 54 is a process to connect the End Point Monitor 16 to the Cloud Server 20.
- Connect to Cloud 54 feeds into Stream to Cloud 56.
- Stream to Cloud 56 is a process to transmit video streams from the End Point Monitor 16 to the Cloud Server 20.
- Stream to Cloud 56 connects to the Cloud Server 20, specifically to Video Acquisition Process 22.
- Video Acquisition Process 22 inputs into Video Splitting Process 34.
- the output of Video Splitting Process 34 is the input in Apply Model 42.
- the output of Apply Model 42 is the input to Annotated Video Steaming Process 28. This process is further detailed in FIG. 3.
- the output of Annotated Video Steaming Process 28 is made available to be displayed on a Web Client 30 or Mobile Client 32.
- FIG. 13 shows an activity diagram depicting the ability of the client application to sort the feeds by The Real Meal Score, or the TRM Score, as assigned by the provider of each of the value-added video feeds streamed from the cloud server.
- the client application enables the specification of searching using a plurality of Search Criteria 49.
- Each Search Criteria 49 is a search criterion to be used for finding establishments of interest, including geographic location, establishment name, and the like.
- the client application communicates the search criteria to the Cloud Server 20 and receives a list of restaurants that meet the Search Criteria 49.
- TRM Score 71 is a score assigned by considering the answers to Survey 66 received by the client application indicative of a perception of health or other concerns pertaining to an establishment.
- the client application further offers the ability to present the list of restaurants ordered by TRM Score 71.
- the display of establishments is further facilitated by a TRM Sort 74, which is a list of establishments sorted by TRM Score 71, with any establishments falling below a predefined TRM Score 71 threshold suppressed from the listing, as sketched below.
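- Purely as an illustration of the sort-and-suppress behavior of TRM Sort 74 described above, a minimal Python sketch follows. The record fields, the function name, and the threshold value are assumptions not specified in the text.

```python
# Hypothetical sketch of TRM Sort 74: order establishments by TRM Score 71 and
# suppress any that fall below a predefined threshold.
from typing import Dict, List

TRM_THRESHOLD = 5.0  # hypothetical cut-off; the text does not give a value

def trm_sort(establishments: List[Dict], threshold: float = TRM_THRESHOLD) -> List[Dict]:
    """Return establishments sorted by descending TRM score, dropping low scorers."""
    visible = [e for e in establishments if e["trm_score"] >= threshold]
    return sorted(visible, key=lambda e: e["trm_score"], reverse=True)

# Example usage with hypothetical search results
results = [
    {"name": "Cafe A", "trm_score": 9.1},
    {"name": "Diner B", "trm_score": 3.2},   # below threshold: suppressed
    {"name": "Grill C", "trm_score": 7.4},
]
print(trm_sort(results))   # Cafe A first, then Grill C; Diner B is suppressed
```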
- the server further assigns a score to each of the plurality of video streams where the score is a representation of the extent to which the video stream achieves a compliance status with all the plurality of monitored conditions, wherein the score is further annotated to the video stream.
- FIG. 14 depicts the inclusion of a database on the server to manage a plurality of restaurant information.
- This allows the Cloud Server 20 to use the Database 24 in managing information about the locations of cameras, the survey results, and other pertinent information about the establishments.
- This database is accessed by the client application while running a Select Restaurant 48 step in response to the Search Criteria 49 specified in Search Box 47. Additionally, Cloud Server 20 uses information stored in the Database 24 to provide additional details to the client application related to health measures or other important attributes.
- FIG. 15(A) shows a GUI providing the search and display capabilities of the client application, providing capabilities of searching for a specific restaurant using a variety of criteria including location, name, TRM score and the like. Shown here is a plurality of Establishment Pages 19 where each Establishment Page 19 corresponds to one of the many establishments that meet the selection criteria specified in the Search Box 47. As illustrated, the Search Box 47 is a search criteria input box allowing users to specify a plurality of Search Criteria 49.
- the Establishment Page 19 is a component of the GUI that refers to the dedicated page of an establishment on the Real Meal Platform. Furthermore, the client application also displays a TRM or the Real Meal Score, computed by the Cloud Server 20 and associated with each Value Added Stream 18. In an embodiment of the invention, TRM Score 71 is computed based on the input received by the client application in response to a survey questionnaire, Survey 66. The client application further provides the capability of presenting the search results in a sorted manner.
- the server further assigns an identifier to the annotated video stream and saves the identifier to a database wherein the database is configured to search and retrieve the annotated video stream by the identifier; the server further accepts a request from a client software wherein the request includes the identifier of the annotated video stream; and the server is configured to retrieve and deliver the annotated video stream to the client software.
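- A minimal sketch, assuming SQLite, of the identifier-based storage and retrieval described above: the server assigns an identifier to an annotated stream, saves it to a database, and later retrieves the stream record by that identifier for a client request. The table name, column names, and use of UUIDs are illustrative assumptions.

```python
# Hypothetical sketch: assign an identifier to an annotated stream, persist it,
# and retrieve it by identifier on a client request.
import sqlite3
import uuid

conn = sqlite3.connect("streams.db")
conn.execute("""CREATE TABLE IF NOT EXISTS annotated_streams (
                    stream_id TEXT PRIMARY KEY,
                    establishment TEXT,
                    stream_url TEXT)""")

def save_annotated_stream(establishment: str, stream_url: str) -> str:
    """Assign an identifier to an annotated stream and save it to the database."""
    stream_id = str(uuid.uuid4())
    conn.execute("INSERT INTO annotated_streams VALUES (?, ?, ?)",
                 (stream_id, establishment, stream_url))
    conn.commit()
    return stream_id

def fetch_annotated_stream(stream_id: str):
    """Retrieve the annotated stream record requested by a client, by identifier."""
    return conn.execute(
        "SELECT establishment, stream_url FROM annotated_streams WHERE stream_id = ?",
        (stream_id,)).fetchone()
```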
- FIG. 15(B) depicts the capability of the client application to drill down and view all the camera feeds provided by a specific restaurant including the plurality of value-added video streams from the kitchen monitoring stations and fish eye camera, and value added streams from the plurality of delivery vehicles. Shown here are a plurality of Value Added Streams 18 where each Value Added Stream 18 is one of many Value Added Streams 18 for a selected Establishment Page 19. A Value Added Stream 18 is a raw video stream obtained from the End Point Monitor 16 or the Field Endpoint 13 that has been annotated with additional information by overlaying informational items, including text and color codes to convey a specific message to the recipient. In the embodiment of the invention shown, there are four Value Added Streams 18 originating from End Point Monitor 16 and two Value Added Streams 18 originating from a Field Endpoint 13.
- an embodiment has the Field Endpoint 13 send the video feed to the End Point Monitor 16, which sends a consolidated feed comprising all feeds to the Cloud Server 20.
- the feeds from the Field Endpoint 13 are generated by a Mobile Phone 15 mounted inside a delivery vehicle configured to monitor delivery personnel complying with the requirement of wearing a face mask, for example, or monitoring that food packets are visible in plain view.
- FIG. 16 depicts a Graphical User Interface for a viewing of a specific value-added video stream on the client application and the ability to send an electronic message with attached frames depicting the concerning behavior.
- Illustrated herein is a Snap Frame 63 input on the client application.
- the Snap Frame 63 is an input on the client application adapted to allow the instantaneous snapping of the frame being displayed in the Value Added Stream 18 section of the application.
- the client application enables the sending of a Report Behavior 58 message by attaching a Snap Frame 63 as evidence to the message and sending an electronic message with Report to Restaurant 60.
- the Snap Frame 63 also allows the user to post the clip with their own message to one of the social media platforms.
- FIG. 17 shows a component and packaging diagram of an embodiment of the system. This figure depicts the various components used for the implementation of the disclosed system. As illustrated the Station Camera 12 and Fish Eye Camera 14 are in communication with End Point Monitor 16 which in turn streams the video to the Cloud Server 20.
- the Video Acquisition Process 22 manages all the streams, correlates the streams with the establishments, and conveys the streams for analysis to Video Analysis and Annotation 26, which has a plurality of processes for splitting the video, analyzing it using deep convolutional neural networks, and adding the result of the analysis as a value added annotation on the respective video streams.
- the annotated video streams, or Value Added Streams 18, are then streamed to the Web Client 30 or Mobile Client 32 when they request a specific stream satisfying their search of the Database 24 for further information on specific establishments.
- FIG. 18 shows examples of video streams that have been annotated.
- FIG. 18(A) shows an example of a frame where five faces are recognized, none of which are seen wearing masks, receiving a TRM score of 0, and
- FIG. 18(B) shows an example where two faces are recognized and both faces are annotated with a checkmark indicating that the masks were detected and receives a score of 10.
- the TRM or The Real Meal Score is assigned based on the number of monitored conditions that are met. In an embodiment, when none of the monitored conditions are met the score assigned is zero, and a maximum score of 10 is assigned when all the monitored conditions are met.
- the server further assigns a score to each of the plurality of video streams where the score is a representation of the extent to which the video stream achieves a compliance status with all the plurality of monitored conditions, wherein the score is further annotated to the video stream.
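- A minimal sketch of this scoring rule: the score scales with the fraction of monitored conditions met, from 0 when none are met to 10 when all are met. The linear interpolation between the stated endpoints, and treating each detected face's mask status as one monitored condition, are assumptions.

```python
# Hypothetical sketch of the 0-to-10 scoring rule described above.
from typing import Iterable

def trm_frame_score(conditions_met: Iterable[bool]) -> float:
    """Map per-frame compliance flags (e.g. one per detected face) to a 0-10 score."""
    flags = list(conditions_met)
    if not flags:
        return 0.0
    return 10.0 * sum(flags) / len(flags)

print(trm_frame_score([False] * 5))    # FIG. 18(A): five unmasked faces -> 0.0
print(trm_frame_score([True, True]))   # FIG. 18(B): two masked faces -> 10.0
```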
- the client applications are further configured to make clips of the value added video streams and forward these clips as emails or upload to social media sites. Another aspect of the system and method disclosed is its ability to assign a score indicative of the extent to which a video stream complies with a set of predefined conditions.
- An embodiment is a process of using a deep neural network comprising: having a plurality of cameras or optical sensors connected to an end-point aggregator, where the plurality of cameras or optical sensors capture information and communicate the information to the end-point aggregator; having the end-point aggregator collect the information that is captured to create a video stream and having the end-point aggregator further communicate the video stream to a server; having the server further configured to execute a deep neural network based computer implemented method to perform an analysis of the video stream, wherein the analysis is configured to detect a presence of a plurality of monitored conditions; and having the server annotate the video stream with the presence that is detected of the plurality of monitored conditions to produce an annotated video stream.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Human Computer Interaction (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Image Analysis (AREA)
- Closed-Circuit Television Systems (AREA)
Abstract
A system and method for analyzing frames of a video stream is disclosed. The analysis is performed using a computer implemented deep neural network method that yields one or more of predefined labels based on the presence or absence of specific elements in the video stream. The computer generated labels are added to the video stream to create an annotated or "value-added" video stream. In some embodiments, the value added video streams from cameras or optical sensors in commercial kitchens or delivery vehicles indicate whether the workers or drivers are wearing face coverings and adhering to health and safety protocols. Other embodiments include generating value added streams from retirement homes, children's daycare facilities, food trucks, or other establishments. The value added video streams are made accessible over the network and configured to be streamed over to subscribing client applications that can clip, email, save, or upload to social media.
Description
SYSTEM AND METHOD FOR USING DEEP NEURAL NETWORKS FOR
ADDING VALUE TO VIDEO STREAMS
CROSS REFERENCE AND RELATED APPLICATION
This application claims the benefit of U.S. Provisional Patent Application Ser. No.
63/071,365 filed on August 28, 2020, by Inventor Rogelio Aguilera Jr., which is incorporated herein by reference.
TECHNICAL FIELD
The field of the invention relates generally to video image analysis and using deep neural networks for automatically adding value added annotations to the video streams.
BACKGROUND OF THE INVENTION
Video feeds for monitoring environments are commonly utilized in personal and commercial settings in a large variety of applications. For example, video feeds are used in monitoring warehouses to make sure that the premises are secured. Existing security systems for homes and commercial properties have been disclosed which provide a video feed from multiple video cameras typically connected to a manual monitoring station to facilitate observation by security personnel. Some home or commercial settings utilize cameras for monitoring the surroundings for the purposes of recording suspicious activities.
What is needed is a system that can continually monitor the conditions in a video stream, for example one being generated from the monitoring of a commercial restaurant kitchen or the monitoring of a delivery vehicle’s occupants, and add value to the video stream by adding annotations that are meant to serve as alerts if certain protocols for observing basic hygiene are violated in the preparation of the food or its delivery.
What is also needed is a system for rating establishments that conform to the health and safety guidelines in food preparation and delivery, using the rating as a mechanism for promoting these establishments and using market competitive forces to bring others in line, so they are also motivated to improve their compliance score. Additionally, what is needed is an ability for customers to have a system detect and report any noncompliance in real time and thereby empower consumers to force compliance by using the captured value-added video streams from the kitchen and delivery vehicles and reporting non-compliance to the establishment personnel.
Such a need exists in a variety of other application domains as cameras are installed in facilities such as elderly care, day care, nursing homes where the safety protocols will need to be observed and non-compliance activities are monitored through deep learning systems that detect non-compliant behavior in real time and automatically alert the establishment authorities when such behavior is observed in the video streams or other types of surveillance.
What are also needed are client applications that are further configured to make clips of the value added video streams, add comments about non-compliance, and forward these clips as emails or upload them to social media sites as a method of enhancing transparency and encouraging compliance. And what is needed is the ability to assign a score indicative of the extent to which a video stream complies with a set of predefined conditions, so customers can monitor trends.
What is needed is a system that uses machine learning and artificial intelligence models for detecting at-risk behavior and annotates the video feed to alert the consumers and stakeholders.
SUMMARY OF THE INVENTION
The disclosed application for utilizing a deep learning system for generating value added video streams has several practical applications. The system and method for performing the steps of capturing video feeds, analyzing each frame with a deep neural network, and annotating the result of the analysis back onto the frame and the stream is disclosed. One specific application of the invention uses video streams from a commercial kitchen establishment where the video stream is generated from collecting optical information from cameras monitoring the cooking area. In another application, field cameras within the delivery vehicles are used by the establishment to monitor the transporting of prepared orders of food to the consumers. In another embodiment, any establishment may also include video feeds from food preparation facilities located remotely, such as those used for serving specific needs like a bakery or confectionary wing of the establishment. The client applications are further configured to make clips of the value added video streams and forward these clips as emails or upload to social media sites. Another aspect of the system and method disclosed is its ability to assign a score indicative of the extent to which a video stream complies with a set of predefined conditions.
One aspect of using deep neural networks is the ability to perform training or inductive learning of surroundings. Embodiments utilize a multitude of deep neural network models with each neural network dedicated to detecting a specific condition or closely related conditions. Furthermore, the deep neural network models begin with a specific set of parameters which generally perform well by detecting a broad set of conditions, such as objects in a video frame. These deep neural network models can be fine-tuned by training on a specialized set of examples encountered in a specific setting. Embodiments of the invention begin with a general model and tune it using additional examples from the situations or scenarios being monitored.
The system and method are used for performing the steps of capturing video feeds and adding annotations indicating noncompliance or other aspects of the video stream and making the annotated video streams available as “value added video streams” to subscribers over the Internet. The motivation of the system and processes disclosed is to create transparency from the monitored activities, like having transparency from kitchen to the consumer.
In an embodiment, video feeds from an end point such as a restaurant are collected and processed by an end point monitor or aggregator and transmitted to a server over a wired or wireless network. The end point monitor or aggregator performs the function of combining several feeds into a single stream and using compression, error correction, and encryption to communicate the video stream securely and efficiently to the server for processing. The server is a computer system using a deep neural network learning model for detecting if the personnel in the kitchen and the field, including the delivery personnel, are observing health protocols, such as wearing of face masks and hand gloves. The video stream, plus the information detected, is referred to as a “value added video stream,” or simply a “value added stream.” The server delivers the value added stream to a recipient subscriber who can take any necessary actions such as informing the establishment about the breach or non-compliance. In an embodiment, an application used by the client is used for searching, viewing, saving, and uploading the value added video streams.
An embodiment further uses triangulation of information obtained from a plurality of cameras to detect if the workers are observing social distancing protocols while working in the commercial kitchen. In this manner, the disclosed application helps enforce health protocols and is designed to be yet another precautionary measure for humanity’s fight against communicable infections like COVID-19 and help protect against epidemiological outbreaks.
An embodiment of the system further empowers consumers by allowing them to use client viewing application to report concerning behavior to the restaurant owners by capturing clips or plurality of snapshots of any of the value-added video streams attached to an electronic message sent to the establishment. Further, the client application enables the completion of a survey which is used to update a compliance score indicative of the user perspective of the extent to which the restaurant is complying with health protocols.
An embodiment of the invention comprises a system for capturing video and adding annotations to the video stream to create a value added video stream which is then streamed to a
plurality of subscribing client applications. The client applications are further configured to make clips of the value added video streams, save and forward these clips as emails, or upload to social media sites. Another aspect of the system and method disclosed is its ability to assign a score indicative of the extent to which a video stream complies with a set of predefined conditions.
The deep learning neural network creates a value added stream by adding annotations on sections of video streams that pertain to personnel violating or adhering to protocols for health, safety, and hygiene. An embodiment measures the distance between the personnel on the floor to determine whether social distancing protocols are being observed. An embodiment further alerts a human to confirm the model’s predicted violations for further validation.
An embodiment of the system comprises a process where a plurality of input data is compiled into a calculation of a safety score where the safety score is reflective of the viewers’ perception of the level to which an establishment is adhering to a relevant set of health, safety and hygiene protocols. A process is disclosed where the list of establishments is searched from a database in a predefined order dependent upon the safety scores recorded for the establishments.
The invention accordingly comprises several steps and the relation of one or more of such steps with respect to each other, and the apparatus embodying features of construction, combinations of elements and arrangement of parts that are adapted to effect such steps, all as exemplified in the following detailed disclosure, and the scope of the invention will be indicated in the claims.
BRIEF DESCRIPTION OF DRAWINGS
The invention will be described in conjunction with the attached drawings in which referenced numerals designate elements. The figures are intended to be illustrative, not limiting. Certain elements in some of the figures may be omitted, or illustrated not-to-scale, for illustrative clarity. Similar elements may be referred to by similar numbers in various figures (FIGs) of the drawing.
FIG. 1 depicts a system architecture of an embodiment showing the connectivity between the various components with the transmittal of video streams by the endpoint monitor to a cloud server comprising of processing components for annotating and adding value to the video stream and streaming of the annotated video stream to client applications;
FIG. 2 shows the internal processing steps leading to the annotation of a video stream, the steps comprising of splitting the video stream into its frames, applying a plurality of pretrained models to annotate individual frames, and streaming the frames and the associated annotations to the client application;
FIG. 3 depicts the flow of processing steps performed on the cloud server for detecting faces and masks on a video frame;
FIG. 4 shows an environmental diagram for an embodiment being used for monitoring a kitchen of a commercial restaurant and depicts the use of station and environment camera, such as the fish-eye cameras shown, being in live communication with an end-point monitor that collects, processes, and transmits the video feeds to a server;
FIG. 5 shows an environmental view with a field camera being used for monitoring a delivery vehicle with delivery personnel and packages where the field camera is connected to a portable end-point monitor that collects, processes, and transmits the video feed from the delivery vehicle over the cloud;
FIG. 6 shows an environmental view where a portable phone including a processor, a camera and a network interface serving as a device for capturing information with phone camera serving as an optical sensor and the processor executing software instructions to process and transmit the video to the server ;
FIG. 7 shows the processing steps performed by the client application allowing for a selection of the value-added video stream, viewing of the selected stream, and reporting any concerning behavior;
FIG. 8(A) depicts the block diagrams for the computer vision model used for detecting the bounding box over a human face; FIG. 8(B) shows the flow chart and the software procedure corresponding to the implementation of the face detection and producing a bounding box around the face for setting up the next stage of detecting a mask on the face;
FIG. 9 shows the architecture of a Convolutional Deep Neural Network utilizing a plurality of filters and sampling layers that is followed by a plurality of dense layers of neurons which is ultimately followed by a single output layer generating the determination of the monitored condition;
FIG. 10 (A) shows the architecture of a series of LSTM cells that are configured to examine each frame and provide an output to a fully connected hidden layer of neurons that is then fed to an output softmax function in the shown embodiment; FIG. 10 (B) shows the inner architecture of each of the LSTM cells used in the autoregressive chain shown above;
FIG. 11 depicts the architecture of a deep neural network for detecting the presence of a mask within the bounding box of the image containing a human face;
FIG. 12 shows a swim lane diagram for the three concurrent activities in progress, i.e. collection of video feeds from the end-points, processing of video feeds to overlay the value-added annotations, and the receiving and commenting on the value-added feeds by the client application user;
FIG. 13 shows an activity diagram depicting the ability of the client application to sort the feeds by The Real Meal Score, or the TRM Score, as assigned by the provider of each of the value-added video feeds streamed from the cloud server;
FIG. 14 depicts the inclusion of a database on the server to manage a plurality of restaurant information;
FIG. 15(A) shows a GUI providing the search and display capabilities of the client application providing capabilities of searching for a specific restaurant using a variety of criteria including location, name, TRM score and the like; FIG. 15(B) depicts the capability of the client application to drill down and view all the camera feeds provided by a specific restaurant including the plurality of value-added video streams from the kitchen monitoring stations and fish eye camera, and value added streams from the plurality of delivery vehicles;
FIG. 16 depicts a Graphical User Interface for a viewing of a specific value-added video stream on the client application and the ability to send an electronic message with attached frames depicting the concerning behavior;
FIG. 17 shows a component and packaging diagram of an embodiment of the system; and
FIG. 18(A) shows an example of a frame where five faces are recognized, none of which are seen wearing masks, receiving a TRM score of 0, and FIG. 18(B) shows an example where two faces are recognized and both faces are annotated with a checkmark indicating that the masks were detected and receives a score of 10.
DETAILED DESCRIPTION
Features, including various novel details of implementation and combination of elements will now be particularly described with reference to the accompanying drawings and pointed out in the claims. It will be understood that the methods and implementations described herein are shown by way of illustration only and not as limitations. As will be understood by those skilled
in the art, the principles and features described herein may be employed in various and numerous embodiments without departing from the scope of the invention.
FIG. 1 depicts a system architecture of an embodiment showing the connectivity between the various components with the transmittal of video streams by the endpoint monitor to a cloud server comprising of processing components for annotating and adding value to the video stream and streaming of the annotated video stream to client applications. As shown the Video Acquisition Process 22 is configured to receive the plurality of video streams from End Point Monitor 16. The Video Acquisition Process 22 is a software process configured to capture the incoming video streams. Upon receiving the streams, Video Acquisition Process 22 forwards them over to the process for Video Analysis and Annotation 26. The Video Analysis and Annotation 26 is a process configured to perform an analysis of the frames of a video stream and generate a predefined set of annotations. The Video Analysis and Annotation 26 results in creating the annotations where the annotations are specific to the problem being solved such as whether the images depict people wearing face masks or not. The annotated video streams are subsequently conveyed to Annotated Video Steaming Process 28. The Annotated Video Steaming Process 28 is a process configured to make the value added streams available to client applications.
An embodiment uses a deep neural network based system comprising an optical sensor configured to capture information where the optical sensor is in communication with a network interface; the network interface configured to receive the information that is captured and transmit said information to a server; the server configured to execute a deep learning neural network based computer implemented method to perform an analysis of said information to detect a presence of a plurality of monitored conditions, and label said information with the presence that is detected of the plurality of monitored conditions. In an embodiment, the labeling of said information comprises adding a visual artifact, or adding an audio artifact, or adding both the visual artifact and the audio artifact to the captured information.
It will be appreciated by a person skilled in the art that the goal of the system is to provide a real time annotation of video streams as value added streams to the end client. In the embodiment shown the end clients use Web Client 30 and Mobile Client 32 and obtain a value-added video stream from the Annotated Video Steaming Process 28. Regarding the real time aspect of providing a value added stream - the value-added stream being the raw stream plus the annotations - the Video Analysis and Annotation 26 process receives frames and processes them using a high performance computing server and upon completing the processing merges the annotation with the existing frame. Thus, the annotation on a stream may be somewhat lagging
in phase from the video stream being disseminated to the client application. It will be appreciated by one skilled in the art that with the high performance capability of the computing server being utilized, this delay will be minimized and any lag in phase will be imperceptible to the consumer of the value-added annotated video stream.
FIG. 2 shows the internal processing steps leading to the annotation of a video stream, the steps comprising of splitting the video stream into its frames, applying a plurality of pretrained models to annotate individual frames, and streaming the frames and the associated annotations to the client application. As shown the Video Splitting Process 34 is configured to receive a video feed from Video Acquisition Process 22. The Video Splitting Process 34 is a process to split an incoming video stream into individual frames or a small group of frames for the purpose of analysis.
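A minimal sketch, assuming OpenCV is used for decoding, of the splitting behavior described for Video Splitting Process 34: read an incoming stream and hand individual frames, or small groups of frames, to downstream analysis. The stream URL parameter and the batch size are illustrative assumptions.

```python
# Hypothetical sketch of Video Splitting Process 34 using OpenCV.
import cv2

def split_video(stream_url: str, batch_size: int = 1):
    """Yield batches of frames read from an incoming video stream."""
    capture = cv2.VideoCapture(stream_url)
    batch = []
    while True:
        ok, frame = capture.read()
        if not ok:                 # end of stream or read failure
            break
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch            # hand a frame (or small group of frames) downstream
            batch = []
    if batch:
        yield batch
    capture.release()
```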
The Predefined Trained Model 36 is a predefined computational model that, given input data, such as an image frame in an embodiment, can cause the production of either a discrete value or a classification value where the discrete or classification value is a function of the input data. In an embodiment of the invention, Predefined Trained Model 36 would be configured to perform a function that detects the presence of a face mask in the output of Video Splitting Process 34. In another embodiment of the invention, Predefined Trained Model 36 would be configured to detect the distance between objects and individuals in the output of Video Splitting Process 34. In an embodiment of the invention, there would be multiple instances of Predefined Trained Model 36. It will be appreciated by a person skilled in the art that Predefined Trained Model 36 can be generalized to solve any computation where the input is the result of Video Splitting Process 34.
As shown in FIG. 2, the Video Splitting Process 34 generates input data as frames for applying a Predefined Trained Model 36 through the process Apply Model 42, which is a process that takes the frame from Video Splitting Process 34, performs the function of the computational model provided by the Predefined Trained Model 36, and causes the production of a discrete or classification value. The output of Apply Model 42 is then merged with the output of Video Splitting Process 34 by the process Overlay Classification on Video 38. In an embodiment Overlay Classification on Video 38 is a process that takes the classification or discrete value provided by the Apply Model 42 process and overlays a visual representation of the classification or discrete value over the video where the visual representation stays overlaid for a predetermined number of frames or time duration.
In an embodiment where the output of Apply Model 42 is a binary classification of whether an individual in the output of Video Splitting Process 34 is wearing a mask, the output of Overlay Classification on Video 38 would be a visual annotation combining text and pictorial feedback indicating whether an individual in the output of Video Splitting Process 34 is wearing a mask.
It will be appreciated by a person skilled in the art that Overlay Classification on Video 38 can provide annotation dependent on the output of Apply Model 42. In an embodiment of the invention, there are multiple instances of Predefined Trained Model 36, which result in multiple visual annotations on the output of the frame which is the result of Video Splitting Process 34 and are applied in Overlay Classification on Video 38.
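An illustrative sketch of Overlay Classification on Video 38, assuming OpenCV drawing primitives: the classifier's verdict for a detected face is rendered as a colored box and text label over the frame. The colors, label text, and function name are assumptions; in the full process the same annotation would be kept overlaid for the predetermined number of frames or time duration noted above.

```python
# Hypothetical sketch of overlaying a mask / no-mask verdict on a frame.
import cv2

def overlay_classification(frame, box, mask_present: bool):
    """Draw a bounding box and a mask/no-mask label on a single frame, in place."""
    x, y, w, h = box
    colour = (0, 255, 0) if mask_present else (0, 0, 255)   # green = compliant, red = not
    label = "MASK" if mask_present else "NO MASK"
    cv2.rectangle(frame, (x, y), (x + w, y + h), colour, 2)
    cv2.putText(frame, label, (x, max(y - 10, 0)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, colour, 2)
    return frame
```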
Annotated Video Steaming Process 28 receives the output of Overlay Classification on Video 38. Annotated Video Steaming Process 28 is a process configured to make the value added streams available to client applications by communicating this stream to client applications, such as a plurality of Web Clients 30 and Mobile Clients 32.
In the following discussion, the term client application will be used to refer to any of the plurality of applications designed to access the restaurants and observe their respective value-added video streams. The client applications include but are not limited to Web Client 30 and Mobile Client 32. A skilled artisan will be able to envision other dedicated applications and appliances for accessing the Cloud Server 20, searching for restaurants or other establishments, and observing their corresponding value added video feeds delivered through the Annotated Video Steaming Process 28. A client application will utilize at least a display surface - such as a screen - as an output device for rendering the value-added video stream. It will further utilize a processor, a memory, a network interface, and input devices to enable the selection of an establishment to view the stream, compose a survey, and report concerning behavior. Generally, the client application is a software process executing on any general-purpose computing device.
FIG. 3 depicts the flow of processing steps performed on the cloud server for detecting faces and masks on a video frame. This process is being continually performed on the Cloud Server 20, specifically the Apply Model 42 process on the Cloud Server 20, on all the video streams received by the Video Acquisition Process 22.
In the embodiment of the invention shown in FIG. 3, Apply Model 42 is applying a Predefined Trained Model 36 to detect the presence of a face mask on individuals located in a frame. Preprocess Model Input Frame 76 receives input from Video Splitting Process 34.
Preprocess Model Input Frame 76 is a component to pre-process the output of Video Splitting Process 34 so that the frame is rendered as an acceptable input to the model application processes within Apply Model 42. In an embodiment of the invention, Preprocess Model Input Frame 76 would convert the image to a black-and-white image from a color image. In another embodiment of the invention, Preprocess Model Input Frame 76 would reduce the size of the image, which enhances the ability for specific AI algorithms to process the image. The output of Preprocess Model Input Frame 76 is then applied to Apply Face Detector 78. Apply Face Detector 78 is a component that will apply a face detector from Predefined Trained Model 36 to the output of Preprocess Model Input Frame 76. In an embodiment of the invention, the Predefined Trained Model 36 that is used to Apply Face Detector 78 would be a Convolutional Neural Network that is trained on a dataset of open-source images. In another embodiment of the invention, the Predefined Trained Model 36 that is used to Apply Face Detector 78 would be a Convolutional Neural Network that is trained on images from Video Splitting Process 34 that have been annotated by a practitioner in the art. In another embodiment of the invention, the Predefined Trained Model 36 that is used to Apply Face Detector 78 would be a model that has been pretrained to detect faces. In an embodiment of the invention, the output of Apply Face Detector 78 is the location of the face or faces detected in the output of Preprocess Model Input Frame 76.
As shown in FIG. 3, Apply Mask Detector 80 detects if a mask is present given the location of a face. Apply Mask Detector 80 is a component that will apply a mask detector from Predefined Trained Model 36 to the output of Apply Face Detector 78. In an embodiment of the invention, the Predefined Trained Model 36 that is used to Apply Mask Detector 80 would be a Convolutional Neural Network that is trained on a dataset of open-source images. In another embodiment of the invention, the Predefined Trained Model 36 that is used to Apply Mask Detector 80 would be a Convolutional Neural Network that is trained on images from Video Splitting Process 34 that have been annotated by a practitioner in the art. In another embodiment of the invention, the Predefined Trained Model 36 that is used to Apply Mask Detector 80 would be a Convolutional Neural Network that has been pretrained to detect masks. Model Output 82 receives the output from Apply Mask Detector 80. Model Output 82 is a component that will output a binary classification or a discrete value with the results of the application of Predefined Trained Model 36. In the embodiment of the invention where Predefined Trained Model 36 is detecting the presence of a face mask, Model Output 82 will output a binary value as to whether a person in the frame is wearing a face mask.
FIG. 4 shows an environmental diagram for an embodiment being used for monitoring a kitchen of a commercial restaurant and depicts the use of station and environment cameras, such as the fish-eye cameras shown, being in live communication with an end-point monitor that collects, processes, and transmits the video feeds to a server. In the embodiment shown, a plurality of Station Cameras 12 are installed at predetermined locations. A Station Camera 12 is a camera installed for observing a full view at a station configured to monitor the activities in the close proximity of the station. The location of each Station Camera 12 is configured to get a clear view of the personnel working at that specific station to ensure compliance with health or other hygiene related protocols being annotated by the system. In an embodiment the Station Cameras 12 are configured to detect whether the personnel preparing food in a commercial kitchen are complying with the requirements of wearing masks while working at their station.
In addition to the wearing of masks, a plurality of Fish Eye Cameras 14 are used in an embodiment. A Fish Eye Camera 14 is a camera installed on the ceiling or a similar location to monitor the area at an environmental level, such as the entire commercial kitchen facility. Within an embodiment of the system being used in a commercial kitchen, a plurality of Fish Eye Cameras 14 can help establish the distance between each of the personnel working in the kitchen and use this information to annotate the video stream with a level of social distancing being observed by the personnel. As described in FIG. 8B, computer vision algorithms can be used to detect a face in a given image from Fish Eye Cameras 14. The distance between multiple faces located in an image can be computed by taking the pixel differential between the faces and applying a general heuristic or scaling it using an object with known dimensions. In an embodiment, fixed bounding boxes are placed on the regions, with the stream processing ensuring that the personnel working in the kitchen remain within the confines of the bounding boxes and triggering a non-compliance when personnel step outside of the bounding box for a period greater than a predefined threshold. In an embodiment, the focal length of a plurality of cameras is used to compute the global coordinates of each of the personnel, and the Euclidean distance between each location is used to ensure observance of social distance.
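A minimal sketch of the pixel-differential distance estimate described above: take the centers of two detected face bounding boxes, compute their pixel separation, and scale it to physical units using a reference object of known size visible in the same view. The reference values and the example boxes are illustrative assumptions.

```python
# Hypothetical sketch: scale the pixel gap between two faces by a reference object.
import math

def face_centre(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def estimated_distance_m(box_a, box_b, ref_pixels: float, ref_metres: float) -> float:
    """Approximate physical separation by scaling the pixel distance between two
    face centres with a reference object that is ref_metres wide and spans
    ref_pixels in the same image."""
    (ax, ay), (bx, by) = face_centre(box_a), face_centre(box_b)
    pixel_gap = math.hypot(ax - bx, ay - by)
    return pixel_gap * (ref_metres / ref_pixels)

# Example: a 1 m work surface spans 200 px; two faces roughly 380 px apart -> ~1.9 m
print(estimated_distance_m((100, 80, 60, 60), (480, 120, 60, 60), 200.0, 1.0))
```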
The information from the plurality of Station Cameras 12 and the plurality of Fish Eye Cameras 14 is processed by the End Point Monitor 16. The End Point Monitor 16 is a system designed to collect and preprocess the feeds from each of the plurality of Station Cameras 12 and each of the plurality of Fish Eye Cameras 14 and transmit the consolidated feed over the network to a server for further processing. In an embodiment, the End Point Monitor 16 communicates the video streams from the plurality of cameras (from both the plurality of Station Cameras 12
and the plurality of Fish Eye Cameras 14) to the Cloud Server 20. The Cloud Server 20 is a high throughput performance computing server with integrated database and processing for streaming in live videos, running processes for annotating video streams, and making the value added streams available for client devices. In addition, End Point Monitor 16 communicates the physical locations and information including but not limited to focal length, aperture setting, geolocation, contrast, and enhancement settings, about the cameras to the Cloud Server 20. In an embodiment of the system, the cameras also communicate shutter speeds of any still photographs taken to the End Point Monitor 16 which in turn communicates this information to the Cloud Server 20.
FIG. 5 shows an environmental view with a field camera being used for monitoring a delivery vehicle with delivery personnel and packages where the field camera is connected to a portable end-point monitor that collects, processes, and transmits the video feed from the delivery vehicle over the cloud. In the embodiment shown in FIG. 5, a Field Camera 11 is located inside of a delivery vehicle and is recording the delivery personnel. Field Camera 11 is a camera installed for observing the activities inside of the delivery vehicle. The location of a Field Camera 11 is configured to get a clear view of the delivery personnel to ensure compliance with health or other hygiene related protocols being annotated by the system. In an embodiment the Field Camera 11 is configured to detect whether the personnel delivering food in a delivery vehicle are complying with the requirements of wearing masks.
The information from the Field Camera 11 is processed by Field Endpoint 13. The Field Endpoint 13 is a system designed to collect and preprocess the feeds from each of the plurality of Field Cameras 11 and transmit the consolidated feed over the network to a server for further processing. In an embodiment, the Field Endpoint 13 communicates the video streams from the Field Camera 11 to the Cloud Server 20 over a wireless data network provisioned for the use by the Field Endpoint 13. An embodiment has the Field Endpoint 13 communicating with End Point Monitor 16 over the wireless data network provisioned for the use by the Field Endpoint 13 where the End Point Monitor 16 consolidates video feeds from the Station Cameras 12, Fish Eye Cameras 14 and Field Cameras 11, and uploads the consolidated stream to Cloud Server 20.
An embodiment uses a deep neural network based system comprising a plurality of cameras or optical sensors connected to an end-point monitor where the plurality of cameras or optical sensors capture information and communicate the information to the end-point monitor; the endpoint monitor collects the information that is captured to create a video stream and further communicates the video stream to a server, wherein the server is configured to execute a deep neural network based computer implemented method to perform an analysis of the video stream
to detect a presence of a plurality of monitored conditions, and annotate the video stream with the presence that is detected of the plurality of monitored conditions to produce an annotated video stream. In an embodiment of the system, the plurality of cameras or optical sensors are configured to monitor an interior of a food preparation facility, wherein the computer implemented method is configured to annotate the video stream with the analysis of the video stream for monitored conditions including a detecting of a presence of a human face in the video stream, and upon detecting the presence of the human face, further detecting whether the human face includes a mask or a face covering. In an embodiment of the system, the computer implemented method is configured to annotate the video stream with the analysis of the video stream for monitored conditions including an indication of a presence of a plurality of humans in the video stream, and upon the indication of the presence of the plurality of humans, further indicates whether said plurality of humans satisfy a predefined separation condition from each other.
FIG. 6 shows an environmental view where a portable phone including a processor, a camera and a network interface serves as a device for capturing information, with the phone camera serving as an optical sensor and the processor executing software instructions to process and transmit the video to the server. In the embodiment shown in FIG. 6, a Mobile Phone 15 is located inside of a delivery vehicle and is recording the delivery personnel. Mobile Phone 15 is a portable cellular device which includes a camera and is executing an application that combines the functionality of Field Camera 11 and Field Endpoint 13. The location of Mobile Phone 15 is configured to get a clear view of the delivery personnel to ensure compliance with health or other hygiene related protocols being annotated by the system. In an embodiment, the optical sensor and the network interface are configured to monitor conditions inside of a vehicle, where the deep learning neural network based computer implemented method is configured for detecting a presence of a human face from the information that is captured, and upon detecting the presence of the human face, further detecting whether the human face includes a mask or a face covering.
In an embodiment the Mobile Phone 15 is configured to detect whether the personnel delivering food in a delivery vehicle are complying with the requirements of wearing masks. In an embodiment, the Mobile Phone 15 records and communicates video streams to Cloud Server 20 over a wireless data network provisioned for the use by the Mobile Phone 15. An embodiment has the Mobile Phone 15 communicating with End Point Monitor 16 over the wireless data network provisioned for the use by the Mobile Phone 15 where the End Point Monitor 16
consolidates video feeds from the Station Cameras 12, Fish Eye Cameras 14 and the Mobile Phone 15, and uploads the consolidated stream to Cloud Server 20.
In an embodiment the system is also configured to view the packages being delivered in plain view. The video feeds for the Field Cameras 11 and Mobile Phone 15 are configured to keep the food packages within the field of view. The food packages being in plain view serves as a deterrent since the delivery personnel actions are being recorded in plain view. Furthermore, the consumers watching any of the video feeds, including the feeds from within the delivery vehicles, can observe any non-compliant behavior and report it to the establishment. This further helps in achieving the application goal of maintaining transparency from kitchen to the consumer.
An embodiment of the video stream annotation system further including a mobile cellular device including a camera and a wireless networking interface adapted to communicate over a wireless network, with the camera on the mobile cellular device serving as an optical sensor, the wireless networking interface serving as the network interface, where information of the camera is configured to be received by the wireless networking interface and further transmitted to the server over the wireless network.
In an embodiment, the server is in communication with a client system, and the server is further configured to communicate to the client system the said information and the labeling of said information with the presence of the plurality of monitored conditions. In an embodiment, the client system further communicates the presence of the monitored conditions to a plurality of subscribers and further alerts the subscribers about the presence of a predefined set of monitored conditions.
FIG. 7 shows the processing steps performed by the client application allowing for a selection of the value-added video stream, viewing of the selected stream, and reporting any concerning behavior. As shown in FIG. 7, the genesis of the process begins with a Login 44 step wherein an authentication of the client application is performed with the Cloud Server 20 to establish the privilege level of the client application in continuing with further processing. In an embodiment the Login 44 step is further performed by Authenticate Login 46, which is a subordinate process supporting the authentication of Login 44 with the help of local authentication or biometric keys, alleviating the need to authenticate by communicating with Cloud Server 20.
After a successful Login 44, the client application proceeds to the next step of Select Restaurant 48 which represents a selection step configured to enable a selection of one of the many value-added video streams provided by the Cloud Server 20. In an embodiment, Select
Restaurant 48 is configured to enable selection of a restaurant by options to filter by geolocation or other location specifiers of the restaurant, options to filter by the name, or a part of the name, of the restaurant, filtering by the type of cuisine served by the restaurant, filtering by the Cloud Server 20 assigned score of the restaurant, or filtering by the hours of operation of the restaurant. Upon the completion of the Select Restaurant 48 step, the client application proceeds to the next step of Display Value Added Stream 50.
Display Value Added Stream 50 is a step of the client application that connects to the Annotated Video Steaming Process 28 on the Cloud Server 20 to obtain and display the value-added video stream for the restaurant selected in the Select Restaurant 48 step. While the Display Value Added Stream 50 is ongoing, the client application further offers the capability to report concerning behavior with the step of Report Behavior 58, which is a step in the client application configured to capture input and attachments and send a report to the restaurant. The idea here is that upon observing concerning behavior on the value-added video feed, a report with attachments of the frames depicting the concerning behavior, together with any further text-based input gathered, should be sent to the restaurant. In an embodiment, concerning behavior messages and attachments will be delivered to the restaurant by the client application using an electronic messaging system.
In an embodiment of the invention, Report Behavior 58 can be generated by rendering a one-time prompt on the client-device. In another embodiment of the invention, Report Behavior 58 can be generated by rendering a button which would continuously prompt feedback from the Mobile Client 32 or the Web Client 30.
Upon submission of a Report Behavior 58, the client application will generate a prompt for Report to Restaurant 60. Report to Restaurant 60 is a step on the client application that enables sending a report of concerning behavior to the restaurant. In an embodiment of the invention, Report to Restaurant 60 will render a prompt on the client application confirming that a report would be sent to the establishment regarding the incident. In an embodiment of the invention, the client application will provide data input fields where detailed information regarding the incident could be entered. Upon receiving confirmation from the prompt in Report to Restaurant 60, the client application is further adapted to enable the attaching of one or more frames thereto by its Attach Concerning Behavior Frame 62 step. Attach Concerning Behavior Frame 62 is a step that provides a capability of the client application to attach a single frame or a plurality of frames of the video to the report of concerning behavior to further substantiate and provide additional information with the report being sent to the restaurant. In an embodiment of the invention, the business would
receive the message from Attach Concerning Behavior Frame 62 in the form of an email. In another embodiment of the invention, the business would receive the message from Attach Concerning Behavior Frame 62 in the form of a message on the platform directly.
As shown in FIG. 7, when the client application exits the stream, it would trigger the Exit Stream 64 step on the client application. Exit Stream 64 is a step in which the client application stops the feed being received from the Annotated Video Steaming Process 28. Next, the client application proceeds to the Survey 66 step. Survey 66 is a step of conducting a survey through a series of questions related to the stream. In an embodiment of the invention, the Survey 66 would elicit feedback from the operator of the client application regarding the contents of Display Value Added Stream 50 regarding adherence to predefined protocols understood to enhance safety.
Upon submission of Survey 66, the client application executes the Send to Business 68 step. Send to Business 68 is a step that sends the data collected in Survey 66 to the restaurant. Additionally, upon submission of Survey 66, the client application also performs Send to Provider 70, a step that sends the data collected in Survey 66 to the Cloud Server 20 and to the value-added stream service provider. The cumulative data collected from all instances of Survey 66 triggering Send to Provider 70 allows the value added stream provider to perform Update Restaurant Score 72 for each establishment on the platform, which is a computation to update a score for each establishment on the platform given data from Survey 66. The result of Update Restaurant Score 72 then becomes available as a search criterion for the Select Restaurant 48 step in the client application.
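A minimal sketch of one way Update Restaurant Score 72 could fold survey data into an establishment score, assuming each Survey 66 yields numeric ratings that are normalized to the 0-10 TRM range and averaged over all surveys received for the establishment. The survey schema and the averaging rule are assumptions; the text only states that cumulative Survey 66 data drives the update.

```python
# Hypothetical sketch of Update Restaurant Score 72 as a running average of
# normalized Survey 66 responses.
from statistics import mean
from typing import Dict, List

survey_results: Dict[str, List[float]] = {}   # establishment -> normalized survey scores

def record_survey(establishment: str, answers: List[int], max_per_answer: int = 5) -> float:
    """Fold one survey (e.g. 1-5 ratings) into the establishment's 0-10 score."""
    normalized = 10.0 * mean(answers) / max_per_answer
    survey_results.setdefault(establishment, []).append(normalized)
    return mean(survey_results[establishment])  # updated score used by Select Restaurant 48

print(record_survey("Cafe A", [5, 4, 5]))   # -> about 9.33
```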
FIG. 8(A) depicts the block diagrams for the computer vision model used for detecting the bounding box over a human face. As shown in FIG. 8(A), the resultant output of Video Splitting Process 34 is used by Apply Model 42 on the Cloud Server 20. FIG. 8(A) depicts the embodiment of the invention where Apply Model 42 is used to detect a face in the output of Video Splitting Process 34. In an embodiment of the invention, Preprocess Model Input Frame 76 is a process to convert the output of Video Splitting Process 34 to grayscale. The output of Preprocess Model Input Frame 76, which in the embodiment shown converts a colored image to a grayscale image, is next processed through Apply Face Detector 78, which is a component that will apply a face detector from Predefined Trained Model 36 to the output of Preprocess Model Input Frame 76. The result of Apply Face Detector 78 produces the Face Detector Output 84 as its output, which delineates the location of faces detected in a video frame.
FIG. 8(B) shows the flow chart and the software procedure corresponding to the implementation of the face detection and producing a bounding box around the face for setting up the next stage of detecting a mask on the face. An embodiment is implemented utilizing the Open Source Computer Vision Library (www.opencv.org) invoked from a Python program. Python is an interpreted programming language commonly used for Machine Learning Applications. Additional information can be found on Python’s official website, www.python.org. The specific libraries used in building the models are included in the attached sequence listing and incorporated herein by reference.
An embodiment of the flow chart shown uses functions from the OpenCV library for receiving the output of Video Splitting Process 34 in FIG. 8(B) and using the Predefined Trained Model 36 as OpenCV’s prebuilt face detection model using Haar-cascade preprocessing for detecting faces. This model is incorporated herein by reference. An embodiment uses an OpenCV method to convert the image to grayscale within the flowchart block Preprocess Model Input Frame 76. Next, an embodiment applies Predefined Trained Model 36 - utilizing the Haar-cascade model of OpenCV applied by the Apply Face Detector 78 block - to search for faces in the processed frame produced by Preprocess Model Input Frame 76. The output of the Apply Face Detector 78 is the Face Detector Output 84, which is a set of coordinates corresponding to the location of the face or faces in the image analyzed by Apply Face Detector 78. The Haar-cascade model being used in an embodiment can be found in OpenCV’s official repository at https://github.com/opencv/opencv/blob/master/data/haarcascades/haarcascade_frontalface_default.xml. This model is incorporated herein by reference.
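A minimal sketch of this FIG. 8(B) flow, assuming the OpenCV Python bindings and the bundled Haar-cascade frontal face model named above: convert a split-out frame to grayscale (Preprocess Model Input Frame 76), run the detector (Apply Face Detector 78), and return bounding-box coordinates (Face Detector Output 84). The detector parameters are illustrative assumptions.

```python
# Sketch of the FIG. 8(B) face detection flow with OpenCV's Haar-cascade model.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame):
    """Return a list of (x, y, w, h) boxes for faces found in a BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)          # Preprocess Model Input Frame 76
    faces = face_cascade.detectMultiScale(gray,              # Apply Face Detector 78
                                          scaleFactor=1.1,   # assumed detector parameters
                                          minNeighbors=5)
    return list(faces)                                       # Face Detector Output 84
```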
FIG. 9 shows the architecture of a convolutional deep neural network utilizing a plurality of filter and sampling layers, followed by a plurality of dense layers of neurons, which is ultimately followed by a single output layer generating the determination of the monitored condition. Each Filter Layer 91 performs a plurality of preconfigured filtering operations, where each filter is adapted to capture a specific image property. The Filter Layer 91 is followed by Pooling Layer 92, which aggregates the filtered image layers to create coarser image layers that in turn capture the occurrence of higher-level features. An embodiment uses two instances of Filter Layer 91, the first and second instances using 64 and 128 filters respectively, each filter using a 5 by 5 mask for convolving with the input. While in some image processing applications the set of filters is chosen a priori - including blurring filters, edge detection filters, or sharpening filters - the actual weights of the filters used in a convolutional neural network are not defined a priori but are learned during the training process. Thus, by using the 'learned' values for filter convolution, the deep convolutional neural network can find features in the image that pre-configured filters may not find.
Each Filter Layer 91 is followed by a Pooling Layer 92. In an embodiment the Pooling Layer 92 simply replaces a 2x2 neighborhood in the image with its maximum value. Since Pooling Layer 92 is applied with tiling, the output of the Pooling Layer 92 cuts the input image size in half. Thus, the image after two instances of Pooling Layer 92 is reduced to one quarter of the original size. In the embodiment of the convolutional neural network shown, the input image of a human face, which was standardized to 255 by 255, is reduced to 253 by 253 after the first instance of Filter Layer 91, further reduced to 126 by 126 after the first instance of Pooling Layer 92, reduced to 124 by 124 after the second instance of Filter Layer 91, and ultimately reduced to 62 by 62 by the second instance of Pooling Layer 92.
This 62 by 62 image is next flattened by Flatten Layer 96, which reorganizes a two- or higher-dimensional data set, such as an image, into a single-dimension vector, making it possible to feed it to a dense neural network layer. After the flattening of the image, the feed-forward process works just as it would in a multi-layer perceptron, with the additional characteristic that a deep neural network involves a plurality of hidden layers. Each of these layers is an instance of Dense Layer 93, which comprises neurons that are fully connected to all neurons in the previous layer, with a weight associated with each of these connections. The last dense layer is often labelled the Output Layer 94, being the final layer of neurons that is followed by a function such as the sigmoid or softmax function for classification. The embodiment shown has three neurons in the Output Layer 94 and would therefore utilize a softmax function to convert the output of the three neurons into a probability distribution using the formula below, where the probability of the i-th output is calculated given that the neurons in the Output Layer 94 produced outputs O1, O2, ..., On:

P(i) = exp(Oi) / (exp(O1) + exp(O2) + ... + exp(On))
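A minimal Keras sketch of this architecture is shown below, assuming a 255 by 255 single-channel face crop, the 5 by 5 kernels and 64/128 filter counts stated above, 2x2 max pooling, a single 128-neuron hidden dense layer (an assumption), and the 3-neuron softmax output layer; exact intermediate sizes depend on kernel size and padding, so they may differ slightly from the values quoted in the text.

```python
# Sketch of the FIG. 9 style convolutional network using TensorFlow/Keras.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(255, 255, 1)),
    layers.Conv2D(64, (5, 5), activation="relu"),   # first Filter Layer 91
    layers.MaxPooling2D((2, 2)),                    # first Pooling Layer 92
    layers.Conv2D(128, (5, 5), activation="relu"),  # second Filter Layer 91
    layers.MaxPooling2D((2, 2)),                    # second Pooling Layer 92
    layers.Flatten(),                               # Flatten Layer 96
    layers.Dense(128, activation="relu"),           # Dense Layer 93 (hidden)
    layers.Dense(3, activation="softmax"),          # Output Layer 94
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```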
In an embodiment of the system used for detecting a plurality of conditions, said computer implemented method further comprises a plurality of deep neural networks, each for detecting one or more of the plurality of conditions, and each of the deep neural networks comprising a convolutional neural network including a plurality of filtering layers where each of the filtering layers has an associated set of filtering layer parameters, a plurality of pooling layers where each of the pooling layers has an associated set of pooling layer parameters, a plurality of dense layers where each of the dense layers has an associated set of dense layer parameters, and an output layer where the output layer has an associated set of output layer parameters, and where the convolutional neural network is configured to detect, and the output layer is configured to report, a status of one or more of the predefined monitored conditions.
In an embodiment, a base set of deep neural network models obtained from a source like ImageNet is utilized and further fine-tuned, where an inductive tuning of the deep neural network is performed by training the deep neural network on a plurality of examples of the predefined monitored condition, where each example is used for updating the filtering parameters, the pooling parameters, the dense layer parameters, and the output layer parameters; and the inductive tuning of the deep neural network is configured to cause the deep neural network to recognize the presence of the predefined monitored condition with improved accuracy.
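One way such inductive tuning is commonly carried out in Keras is sketched below, using an ImageNet-pretrained MobileNetV2 purely as an illustrative stand-in for the base model: the head is trained first with the base frozen, and then all layer parameters are updated at a low learning rate. The input size, head layers, and dataset variables are placeholder assumptions, not values taken from the disclosure.

```python
# Hedged sketch of two-phase fine-tuning on examples of a monitored condition.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False                       # phase 1: keep base parameters fixed

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # condition present / absent
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)   # labeled examples

base.trainable = True                        # phase 2: update all layer parameters
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```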
FIGS. 10(A) and (B) show the architecture of a Long Short Term Memory, or LSTM, deep neural network, which offers the advantage of using autoregressive memory and captures dependencies between consecutive video frame sequences that are modeled as a time series where the monitored conditions are also determined by the condition's value in prior frames. FIG. 10(A) shows the architecture of a series of LSTM cells that are configured to examine each frame and provide an output to a fully connected hidden layer of neurons that is then fed to an output softmax function in the shown embodiment, and FIG. 10(B) shows the inner architecture of each of the LSTM cells used in the autoregressive chain. This deep neural network architecture is configured to capture changes in subsequent images which may not be readily perceptible. This is particularly useful in detecting monitored conditions where the consecutive observations form a time series of observations, or a series of sequential observations that depend on previous values of the monitored conditions. Such observations are often modeled using an autoregressive model - or the regression of a variable against itself. In an autoregressive model, the value of a monitored condition y at time t, denoted as y_t, is modeled as:

y_t = c + w_1*y_(t-1) + w_2*y_(t-2) + ... + w_p*y_(t-p) + e_t

where w_1 through w_p are the autoregression coefficients and e_t is a noise term.
A deep neural network for learning time-dependent variability in the image sequences is utilized in an embodiment. The embodiment uses the architecture of a Long Short Term Memory, or LSTM, deep neural network, which offers the advantage of using autoregressive memory and captures dependencies between consecutive video frame sequences that are modeled as a time series where the monitored conditions are also determined by the condition's value in prior frames. The LSTM belongs to the class of recurrent neural networks, which retain the state of learning from one time step to the next so that the results from the previous frame influence the interpretation of the subsequent frame. In this manner the entire "video sequence" is used for detection of a monitored condition, where the deep learning neural network utilizes an autoregression of a time series of said captured information, and where said autoregression is performed using a computer implemented method utilizing a deep recurrent neural network based on a Long Short Term Memory (LSTM) architecture.
As illustrated in FIG. 10(A), a sequence of frames is fed into an LSTM deep neural network composed of LSTM Cells 95, where each LSTM Cell 95 is a component of the LSTM chain that includes a memory and uses a neural network to combine the memory with the image frame and the previous state to produce an output, which is then fed to the next LSTM cell and to a dense layer. A five second sequence will typically comprise anywhere between 50 and 150 frames, corresponding to a frame capture rate of 10 fps to 30 fps respectively. An embodiment using 100 LSTM Cells 95 offers a sufficiently long span of time for detection of monitored conditions. Each of the LSTM Cells 95 is fed a portion of the video frame comprising a detected region of interest. In an embodiment, the region of interest is the face of any human in the video frame, as delineated by Face Detector Output 84. In an embodiment, the subsection delineated by Face Detector Output 84 is standardized to a size of 255 by 255 and then passed through an embedding process which converts the 2D image data into a single-dimensional vector that is then supplied as an input to the LSTM Cells 95.
The output of the chain of LSTM Cells 95 is fed to a plurality of neurons of a Dense Layer 93. The embodiment shown uses a single Dense Layer 93 which is also the Output Layer 94. Other embodiments use multiple instances of Dense Layer 93. Essentially the output of the series of LSTM Cells 95 comprises a flattened set of inputs that are fed into a multi-layer perceptron having a plurality of hidden layers.
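A minimal Keras sketch of this LSTM arrangement follows, assuming 100 time steps, flattened 255 by 255 face crops, a 256-wide embedding, a 128-unit LSTM, and a 3-neuron softmax output; the embedding and LSTM widths are illustrative assumptions rather than values specified by the disclosure.

```python
# Sketch of a FIG. 10 style sequence model: per-frame embedding -> LSTM chain
# -> dense output layer.
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN = 100            # roughly a five second clip at 20 fps
FRAME_VEC = 255 * 255    # flattened 255x255 face crop

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, FRAME_VEC)),
    # Embedding step: project each flattened frame to a compact vector.
    layers.TimeDistributed(layers.Dense(256, activation="relu")),
    # Chain of LSTM Cells 95; only the final state is passed on.
    layers.LSTM(128),
    # Dense Layer 93, here also serving as Output Layer 94 (softmax).
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```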
FIG. 11 depicts the architecture of a deep neural network for detecting the presence of a mask within the bounding box of the image containing a human face. As with the architecture shown in FIG. 9, a Deep Neural Network (DNN) is used to classify whether a mask is present or absent within a given bounding box containing a human face. FIG. 11 depicts an embodiment of the invention in which Predefined Trained Model 36 classifies the presence or absence of a mask within a given bounding box containing a human face. The input into this DNN is Face Detector Output 84. The DNN consists of a base model, Mobile Net V2 86. The Mobile Net V2 86 model used in an embodiment is included in the attached sequence listing and is incorporated herein by reference.
Mobile Net V2 86 is a general and stable open-source model architecture used for multiple applications in computer vision use cases. Additional information is available at https://www.tensorflow.org/api_docs/python/tf/keras/applications/MobileNetV2. The model information and parameters are incorporated herein by reference. Face Detector Output 84 is standardized to an image size of 255 by 255 pixels before it is fed to the Mobile Net V2 86.
Following the Mobile Net V2 86 is the Head Model 88. Head Model 88 is the portion of the DNN architecture that is added to the underlying model to accomplish the specialized training of the base model, as will be appreciated by a practitioner in the art. In the embodiment of the invention shown, Head Model 88 consists of three layers. The first layer flattens the output of Mobile Net V2 86. The second layer is a layer of 128 dense neurons. The third layer is a layer of 2 dense neurons, the two neurons corresponding to the binary output. The output of the DNN is Mask Detector Output 90. Mask Detector Output 90 is the output of Apply Mask Detector 80, which is a binary classification of the presence of a mask within a given bounding box containing a human face. In an embodiment of the invention, the DNN that is used to generate the model for Apply Mask Detector 80 would be trained on a dataset of open-source images. In another embodiment of the invention, the DNN that is used to generate the model for Apply Mask Detector 80 would be trained on images from Video Splitting Process 34 that have been annotated by a practitioner in the art.
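A hedged Keras sketch of this base-plus-head arrangement follows. The 224 by 224 input size, ImageNet weights, and softmax activation are assumptions made for the sketch (the text above standardizes face crops to 255 by 255), while the head mirrors the three layers described: flatten, 128 dense neurons, and 2 dense output neurons.

```python
# Sketch of the FIG. 11 mask detector: MobileNetV2 base plus Head Model 88.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

mask_detector = models.Sequential([
    base,                                    # base model (Mobile Net V2 86)
    layers.Flatten(),                        # head layer 1: flatten base output
    layers.Dense(128, activation="relu"),    # head layer 2: 128 dense neurons
    layers.Dense(2, activation="softmax"),   # head layer 3: mask / no mask
])
mask_detector.compile(optimizer="adam",
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
```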
FIG. 12 shows a swim lane diagram for the three concurrent activities in progress, i.e. the collection of video feeds from the end-points, the processing of video feeds to overlay the value-added annotations, and the receiving and commenting on the value-added feeds by the client application user. As shown, a client application facilitates the Login 44, Authenticate Login 46, and Select Restaurant 48 processes. The client device then selects Display Value Added Stream 50 by establishing a connection between the client application and the Cloud Server 20. Shown in FIG. 12 is an embodiment where the streaming process commences with End Point Monitor 16 capturing the video feed in Acquire Video 52. Acquire Video 52 is a subsystem in which the End Point Monitor 16 acquires a video feed from a Station Camera 12 or a Fish Eye Camera 14. The End Point Monitor 16 then connects to the cloud in Connect to Cloud 54, a process that connects the End Point Monitor 16 to the Cloud Server 20. Connect to Cloud 54 feeds into Stream to Cloud 56, a process that transmits video streams from the End Point Monitor 16 to the Cloud Server 20. Stream to Cloud 56 connects to the Cloud Server 20, specifically to Video Acquisition Process 22. Video Acquisition Process 22 inputs into Video Splitting Process 34, and the output of Video Splitting Process 34 is the input to Apply Model 42.
The output of Apply Model 42 is the input to Annotated Video Streaming Process 28. This process is further detailed in FIG. 3. The output of Annotated Video Streaming Process 28 is made available to be displayed on a Web Client 30 or Mobile Client 32.
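For illustration, a simplified per-frame loop along the lines of this pipeline might look as follows, reusing the hypothetical detect_faces() helper sketched earlier; the RTSP URL, output file, frame size, and frame rate are assumptions, not values from the disclosure.

```python
# Simplified sketch of acquire -> split -> apply model -> annotate -> stream.
import cv2

capture = cv2.VideoCapture("rtsp://endpoint-monitor/stream")    # Acquire Video 52
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
writer = cv2.VideoWriter("annotated.mp4", fourcc, 20.0, (1280, 720))

while True:
    ok, frame = capture.read()             # per-frame input from video splitting
    if not ok:
        break
    frame = cv2.resize(frame, (1280, 720))
    for (x, y, w, h) in detect_faces(frame):       # Apply Model 42
        # Overlay a value-added annotation: a box and a label on each face.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, "face", (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    writer.write(frame)                    # annotated stream for the clients

capture.release()
writer.release()
```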
FIG. 13 shows an activity diagram depicting the ability of the client application to sort the feeds by The Real Meal Score, or TRM Score, assigned by the provider to each of the value-added video feeds streamed from the cloud server. As illustrated, the client application enables searching using a plurality of Search Criteria 49. Each Search Criteria 49 is a criterion to be used for finding establishments of interest, including geographic location, establishment name, and the like. The client application communicates the search criteria to the Cloud Server 20 and receives a list of restaurants that meet the Search Criteria 49. Associated with each of the restaurants is TRM Score 71. TRM Score 71 is a score assigned by considering the answers to Survey 66 received by the client application, indicative of a perception of health or other concerns pertaining to an establishment. The client application further offers the ability to present the list of restaurants ordered by TRM Score 71. The display of establishments is further facilitated by TRM Sort 74, a listing of establishments sorted by their TRM Score 71, with any establishments falling below a predefined TRM Score 71 threshold being suppressed. In an embodiment, the server further assigns a score to each of the plurality of video streams where the score is a representation of the extent to which the video stream achieves a compliance status with all the plurality of monitored conditions, wherein the score is further annotated to the video stream.
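The TRM Sort 74 behavior lends itself to a short illustration. The following sketch assumes a simple list-of-records representation and an arbitrary threshold of 5; both are illustrative assumptions, not values defined by the disclosure.

```python
# Sketch of TRM Sort 74: suppress low-scoring establishments, sort the rest.
TRM_THRESHOLD = 5   # assumed cutoff for this example

def trm_sort(establishments, threshold=TRM_THRESHOLD):
    """Return establishments at or above the threshold, highest score first."""
    visible = [e for e in establishments if e["trm_score"] >= threshold]
    return sorted(visible, key=lambda e: e["trm_score"], reverse=True)

restaurants = [
    {"name": "Cafe A", "trm_score": 9},
    {"name": "Diner B", "trm_score": 3},   # suppressed: below threshold
    {"name": "Grill C", "trm_score": 7},
]
print(trm_sort(restaurants))   # Cafe A (9), then Grill C (7)
```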
FIG. 14 depicts the inclusion of a database on the server to manage a plurality of restaurant information. This allows the Cloud Server 20 to use the Database 24 in managing information about the locations of cameras, the survey results, and other pertinent information about the establishments. This database is accessed by the client application while running a Select Restaurant 48 step in response to the Search Criteria 49 specified in Search Box 47. Additionally, Cloud Server 20 uses information stored in the Database 24 to provide additional details to the client application related to health measures or other important attributes.
In an embodiment, the system further includes a database wherein a unique identifier is further associated with the said captured information, the server further stores the said captured information and the label for the said information in the database, and the database is configured to make said captured information and the label for the said information retrievable by the unique identifier.
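As a hedged illustration of this database behavior, the sketch below uses SQLite and UUIDs as stand-ins for Database 24 and the unique identifier; the table name, schema, and helper functions are assumptions made for illustration only.

```python
# Sketch of storing captured information and its label under a unique identifier.
import sqlite3
import uuid

conn = sqlite3.connect("real_meal.db")
conn.execute("""CREATE TABLE IF NOT EXISTS captures
                (id TEXT PRIMARY KEY, info BLOB, label TEXT)""")

def store_capture(info_bytes, label):
    capture_id = str(uuid.uuid4())                 # unique identifier
    conn.execute("INSERT INTO captures VALUES (?, ?, ?)",
                 (capture_id, info_bytes, label))
    conn.commit()
    return capture_id

def fetch_capture(capture_id):
    # Retrieve the captured information and its label by the identifier.
    return conn.execute("SELECT info, label FROM captures WHERE id = ?",
                        (capture_id,)).fetchone()

cid = store_capture(b"...frame bytes...", "mask detected")
print(fetch_capture(cid)[1])                       # "mask detected"
```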
FIG. 15(A) shows a GUI providing the search and display capabilities of the client application, allowing a user to search for a specific restaurant using a variety of criteria including location, name, TRM score and the like. Shown here is a plurality of Establishment Pages 19, where each Establishment Page 19 corresponds to one of the many establishments that meet the selection criteria specified in the Search Box 47. As illustrated, the Search Box 47 is a search criteria input box allowing users to specify a plurality of Search Criteria 49. The Establishment Page 19 is a component of the GUI that refers to the dedicated page of an establishment on the Real Meal Platform. Furthermore, the client application also displays a TRM, or The Real Meal Score, computed by the Cloud Server 20 and associated with each Value Added Stream 18. In an embodiment of the invention, TRM Score 71 is computed based on the input received by the client application in response to a survey questionnaire, Survey 66. The client application further provides the capability of presenting the search results in a sorted manner. In an embodiment the server further assigns an identifier to the annotated video stream and saves the identifier to a database wherein the database is configured to search and retrieve the annotated video stream by the identifier; the server further accepts a request from a client software wherein the request includes the identifier of the annotated video stream; and the server is configured to retrieve and deliver the annotated video stream to the client software.
FIG. 15(B) depicts the capability of the client application to drill down and view all the camera feeds provided by a specific restaurant, including the plurality of value-added video streams from the kitchen monitoring stations and fish eye camera, and value-added streams from the plurality of delivery vehicles. Shown here are a plurality of Value Added Streams 18, where each Value Added Stream 18 is one of many Value Added Streams 18 for a selected Establishment Page 19. A Value Added Stream 18 is a raw video stream obtained from the End Point Monitor 16 or the Field Endpoint 13 that has been annotated with additional information by overlaying informational items, including text and color codes, to convey a specific message to the recipient. In the embodiment of the invention shown, there are four Value Added Streams 18 originating from End Point Monitor 16 and two Value Added Streams 18 originating from a Field Endpoint 13.
It will be further understood by a skilled artisan that an embodiment has the Field Endpoint 13 send the video feed to the End Point Monitor 16, which sends a consolidated feed comprising all feeds to the Cloud Server 20. In an embodiment, the feeds from the Field Endpoint 13 are generated by a Mobile Phone 15 mounted inside a delivery vehicle configured to monitor delivery personnel complying with the requirement of wearing a face mask, for example, or to monitor that food packets are visible in plain view.
FIG. 16 depicts a Graphical User Interface for viewing a specific value-added video stream on the client application and the ability to send an electronic message with attached frames depicting the concerning behavior. Illustrated herein is a Snap Frame 63 input on the client application. The Snap Frame 63 is an input on the client application adapted to allow the instantaneous snapping of the frame being displayed in the Value Added Stream 18 section of the application. In this manner the client application enables the sending of a Report Behavior 58 message by attaching a Snap Frame 63 as evidence to the message and sending an electronic message with Report to Restaurant 60. The Snap Frame 63 also allows the user to post the clip with their own message to one of the social media platforms.
FIG. 17 shows a component and packaging diagram of an embodiment of the system. This figure depicts the various components used for the implementation of the disclosed system. As illustrated, the Station Camera 12 and Fish Eye Camera 14 are in communication with End Point Monitor 16, which in turn streams the video to the Cloud Server 20. The Video Acquisition Process 22 manages all the streams, correlates the streams with the establishments, and conveys the streams for analysis to Video Analysis and Annotation 26, which has a plurality of processes for splitting the video, analyzing it using a deep convolutional neural network, and adding the result of the analysis as a value-added annotation on the respective video streams. The annotated video streams, or Value Added Streams 18, are then streamed to the Web Client 30 or Mobile Client 32 when they request a specific stream satisfying their search of the Database 24 for further information on specific establishments.
FIG. 18 shows examples of video streams that have been annotated. FIG. 18(A) shows an example of a frame where five faces are recognized, none of which are seen wearing masks, receiving a TRM score of 0, and FIG. 18(B) shows an example where two faces are recognized and both faces are annotated with a checkmark indicating that masks were detected, receiving a score of 10. In this manner the TRM, or The Real Meal Score, is assigned based on the number of monitored conditions that are met. In an embodiment, when none of the monitored conditions are met the score assigned is zero, and a maximum score of 10 is assigned when all the monitored conditions are met.
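A minimal sketch of this scoring rule follows; the proportional scaling for partially met conditions is an assumption, since the text above only fixes the two endpoints (0 when none are met, 10 when all are met).

```python
# Sketch of assigning a TRM-style score from the monitored conditions met.
def trm_score(conditions_met, conditions_total):
    if conditions_total == 0:
        return 0
    return round(10 * conditions_met / conditions_total)

print(trm_score(0, 5))   # 0  - e.g. five faces, none wearing masks (FIG. 18(A))
print(trm_score(2, 2))   # 10 - e.g. two faces, both wearing masks (FIG. 18(B))
```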
In an embodiment, the server further assigns a score to each of the plurality of video streams where the score is a representation of the extent to which the video stream achieves a compliance status with all the plurality of monitored conditions, wherein the score is further annotated to the
video stream. The client applications are further configured to make clips of the value added video streams and forward these clips as emails or upload to social media sites. Another aspect of the system and method disclosed is its ability to assign a score indicative of the extent to which a video stream complies with a set of predefined conditions.
An embodiment is a process of using a deep neural network comprising: having a plurality of cameras or optical sensors connected to an end-point aggregator where the plurality of cameras or optical sensors capture information and communicate the information to the end-point aggregator; having the end-point aggregator collect the information that is captured to create a video stream and having the end-point aggregator further communicate the video stream to a server; having the server further configured to execute a deep neural network based computer implemented method to perform an analysis of the video stream wherein the analysis is configured to detect a presence of a plurality of monitored conditions; and having the server annotate the video stream with the presence that is detected of the plurality of monitored conditions to produce an annotated video stream.
It will thus be seen that the objects set forth above, among those made apparent from the preceding description, are efficiently attained and, because certain changes may be made in carrying out the above method and in the construction(s) set forth without departing from the spirit and scope of the invention, it is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.
While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of, and not restrictive on, the broad invention, and that this invention is not limited to the specific constructions and arrangements shown and described, since various other changes, combinations, omissions, modifications, and substitutions, in addition to those set forth in the above paragraphs, are possible. Those skilled in the art will appreciate that various adaptations and modifications of the just described embodiments can be configured without departing from the scope and spirit of the invention. Therefore, it is to be understood that, within the scope of the appended claims, the invention may be practiced other than as specifically described herein.
SEQUENCE LISTINGS
Claims
1. A deep neural network based system comprising: an optical sensor configured to capture information where the optical sensor is in communication with a network interface; the network interface configured to receive the information that is captured and transmit said information to a server; the server configured to execute a deep learning neural network based computer implemented method to perform an analysis of said information to detect a presence of a plurality of monitored conditions, and label said information with the presence that is detected of the plurality of monitored conditions.
2. The system of claim 1 wherein the labeling of said information comprises adding a visual artifact, or adding an audio artifact, or adding both the visual artifact and the audio artifact to the captured information.
3. The system of claim 1 wherein the optical sensor and the network interface are configured to monitor conditions inside of a vehicle.
4. The system of claim 1 further including a mobile cellular device including a camera and a wireless networking interface adapted to communicate over a wireless network, with the camera on the mobile cellular device serving as an optical sensor, the wireless networking interface serving as the network interface, where information of the camera is configured to be received by the wireless networking interface and further transmitted to the server over the wireless network.
5. The system of claim 1 where said computer implemented method further comprises a plurality of deep neural networks each comprising a convolutional neural network including a plurality of filtering layers where each of the filtering layers has an associated set of filtering layer parameters, a plurality of pooling layers where each of the pooling layers has an associated set of pooling layer parameters,
a plurality of dense layers where each of the dense layers has an associated set of dense layer parameters, an output layer where the output layer has an associated set of output layer parameters, and where the convolutional neural network is configured to detect and the output layer is configured to report a status of one or more of the predefined monitored condition.
6. The system of claim 5 where an inductive tuning of the deep neural network is further performed by training the deep neural network on a plurality of examples of the predefined monitored condition, where each example is used for updating the filtering parameters, the pooling parameters, the dense layer parameters, and the output layer parameters; and the inductive tuning of the deep neural network is configured to cause the deep neural network to recognize the presence of the predefined monitored condition in a manner of improved accuracy.
7. The system of claim 1 where the deep learning neural network utilizes an autoregression of a time series of said captured information where said autoregression is performed using a computer implemented method utilizing a deep recurrent neural network based on a Long Short Term Memory (LSTM) architecture.
8. The system of claim 1 where the deep learning neural network based computer implemented method is configured for detecting a presence of a human face from the information that is captured, and upon detecting the presence of the human face, further detecting whether the human face includes a mask or a face covering.
9. The system of claim 1 wherein the server is in a communication with a client system, and the server is further configured to communicate to the client system the said information and the label of said information with the presence of the plurality of monitored conditions.
10. The system of claim 9 wherein the client system further communicates the presence of the monitored conditions to a plurality of subscribers and further alerts the subscribers about the presence of a predefined set of monitored conditions.
11. The system of claim 1 further including a database wherein a unique identifier is further associated with the said captured information,
the server further stores the said captured information and the label for the said information in the database, and where the database is configured to make said captured information and the label for the said information retrievable by the unique identifier.
12. A deep neural network based system comprising: a plurality of cameras or optical sensors connected to an end-point monitor where the plurality of cameras or optical sensors capture information and communicate the information to the end-point monitor; the end-point monitor collects the information that is captured to create a video stream and further communicates the video stream to a server, wherein the server is configured to execute a deep neural network based computer implemented method to perform an analysis of the video stream to detect a presence of a plurality of monitored conditions, and annotate the video stream with the presence that is detected of the plurality of monitored conditions to produce an annotated video stream.
13. The system of claim 12 where the plurality of cameras or optical sensors are configured to monitor an interior of a food preparation facility.
14. The system of claim 13 wherein the computer implemented method is configured to annotate the video stream with the analysis of the video stream for monitoring conditions including a detecting of a presence of a human face in the video stream, and upon detecting the presence of the human face further detecting whether the human face includes a mask or a face covering.
15. The system of claim 13 where the computer implementation is configured to annotate the video stream with the analysis of the video stream for monitoring conditions including an indication of presence of detecting a plurality of humans in the video stream, and upon the indication of presence of the plurality of humans further indicates whether said plurality of humans qualify a predefined separation condition from each other.
16. The system of claim 12 wherein the server further assigns a score to each of the plurality of video streams where the score is a representation of the extent to which the video stream achieves a compliance status with all the plurality of monitored conditions, wherein the score is further annotated to the video stream.
17. The system of claim 16 wherein the server further assigns an identifier to the annotated video stream and saves the identifier to a database wherein the database is configured to search and retrieve the annotated video stream by the identifier; the server further accepts a request from a client software wherein the request includes the identifier of the annotated video stream; and the server is configured to retrieve and deliver the annotated video stream to the client software.
18. The system of claim 17 wherein the client software is further configured to receive a plurality of annotated video streams; construct a collection of the plurality of annotated video streams where the collection is searchable by a plurality of search terms including a name, a score, and a location, and present a web interface for a user to connect and search the collection using one or more of the pluralities of search terms.
19. The system of claim 18 wherein the client software is configured to enable a user to create a clip and a save of a portion of one or more of the annotated video streams, with the client software further configured to allow the user to add comments and attach the clip to an electronic mail message, or upload the clip to a social media web site.
20. A process of using a deep neural network comprising having a plurality of cameras or optical sensors connected to an end-point aggregator where the plurality of cameras or optical sensors capture information and communicate the information to the end-point aggregator; having the end-point aggregator collect the information that is captured to create a video stream and having the end-point aggregator further communicate the video stream to a server, having the server further configured to execute a deep neural network based computer implemented method to perform an analysis of the video stream wherein the analysis is configured to detect a presence of a plurality of monitored conditions, and having the server annotate the video stream, with the presence that is detected of the plurality of monitored conditions to produce an annotated video stream.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202063071365P | 2020-08-28 | 2020-08-28 | |
US63/071,365 | 2020-08-28 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2022047342A1 true WO2022047342A1 (en) | 2022-03-03 |
WO2022047342A9 WO2022047342A9 (en) | 2022-06-30 |
Family
ID=80355785
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2021/048300 WO2022047342A1 (en) | 2020-08-28 | 2021-08-30 | System and method for using deep neural networks for adding value to video streams |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2022047342A1 (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6173317B1 (en) * | 1997-03-14 | 2001-01-09 | Microsoft Corporation | Streaming and displaying a video stream with synchronized annotations over a computer network |
US20050232462A1 (en) * | 2004-03-16 | 2005-10-20 | Vallone Robert P | Pipeline architecture for analyzing multiple video streams |
US20090327856A1 (en) * | 2008-06-28 | 2009-12-31 | Mouilleseaux Jean-Pierre M | Annotation of movies |
US20140085501A1 (en) * | 2010-02-26 | 2014-03-27 | Bao Tran | Video processing systems and methods |
US20120150387A1 (en) * | 2010-12-10 | 2012-06-14 | Tk Holdings Inc. | System for monitoring a vehicle driver |
US20120189282A1 (en) * | 2011-01-25 | 2012-07-26 | Youtoo Technologies, Inc. | Generation and Management of Video Blogs |
Non-Patent Citations (3)
Title |
---|
ALEXANDER M. CONWAY, IAN N. DURBACH, ALISTAIR MCINNES, ROBERT N. HARRIS: "Frame-by-frame annotation of video recordings using deep neural networks", BIORXIV, 29 June 2020 (2020-06-29), pages 1 - 21, XP05909096 *
ANONYMOUS: "4 Advantages of Video Surveillance in Food Manufacturing", 12 June 2018 (2018-06-12), pages 1 - 17, XP055909117, Retrieved from the Internet <URL:https://umbrellatech.co/video-surveillance-applications-for-food-processing-and-manufacturing> * |
BRAULIO RIOS, MARCOS TOSCANO, ALAN DESCOINS: "Face mask detection in street camera video streams using AI: behind the curtain", 9 July 2020 (2020-07-09), pages 1 - 29, XP055909110, Retrieved from the Internet <URL:https://tryolabs.com/blog/2020/07/09/face-mask-detection-in-street-camera-video-streams-using-ai-behind-the-curtain> *
Also Published As
Publication number | Publication date |
---|---|
WO2022047342A9 (en) | 2022-06-30 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21862943; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 21862943; Country of ref document: EP; Kind code of ref document: A1