CN118570979A - Digital monitoring on-site prompt system - Google Patents
Digital monitoring on-site prompt system
- Publication number
- CN118570979A (application CN202410747730.1A)
- Authority
- CN
- China
- Prior art keywords
- module
- face
- video stream
- video
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G08—SIGNALLING
- G08B—SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
- G08B21/00—Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
- G08B21/18—Status alarms
- G08B21/24—Reminder alarms, e.g. anti-loss alarms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/01—Customer relationship services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G08—SIGNALLING
- G08B—SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
- G08B25/00—Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems
- G08B25/01—Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems characterised by the transmission medium
- G08B25/08—Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems characterised by the transmission medium using communication transmission lines
-
- G—PHYSICS
- G08—SIGNALLING
- G08B—SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
- G08B7/00—Signalling systems according to more than one of groups G08B3/00 - G08B6/00; Personal calling systems according to more than one of groups G08B3/00 - G08B6/00
- G08B7/06—Signalling systems according to more than one of groups G08B3/00 - G08B6/00; Personal calling systems according to more than one of groups G08B3/00 - G08B6/00 using electric transmission, e.g. involving audible and visible signalling through the use of sound and light sources
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Business, Economics & Management (AREA)
- Artificial Intelligence (AREA)
- Acoustics & Sound (AREA)
- Evolutionary Computation (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Signal Processing (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Emergency Management (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Psychiatry (AREA)
- Hospice & Palliative Care (AREA)
- Accounting & Taxation (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Finance (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Child & Adolescent Psychology (AREA)
- Telephonic Communication Services (AREA)
Abstract
The invention belongs to the technical field of monitoring and provides a digital monitoring on-site prompting system comprising a camera module, a microphone module, a control module, a positioning module, a wireless module, a prompting module, a storage module, an edge video recognition module and a voice recognition module. The control module acquires a video stream from the camera module and an audio stream of the user from the microphone module. The edge video recognition module recognizes faces in the preprocessed video stream in real time based on a locally built-in, pre-trained custom MobileNet network. The voice recognition module performs speech recognition and emotion classification on mixed audio features in real time based on a pre-trained classification model, obtaining and recording the speech content and emotion of the audio. The prompting module is triggered by the control module to interactively remind the user. The invention ensures video security, analyzes in real time, and prompts the user to correct behaviour.
Description
Technical Field
The invention relates to the technical field of monitoring, in particular to a digital monitoring on-site prompting system.
Background
With rising living standards, people expect more convenience in daily life and want to spend less effort when problems arise, so they prefer that the person who can solve the problem comes to them. Door-to-door (on-site) services are therefore becoming increasingly popular.
To improve service quality, operating enterprises impose strict requirements on the professional conduct and scripted dialogue of technicians providing on-site service. At present, enterprises learn about a technician's professionalism mainly from customer feedback, which lags behind the actual service and is unfavourable to the development of the enterprise. Some enterprises instead record the technician's service process with a video recorder so that an administrator can remotely monitor the service and correct the technician's attitude and behaviour in time. This approach requires uploading video to a server; because the recorder may capture the customer's privacy (face), the uploaded video can be leaked for various reasons. Moreover, limited by the bandwidth cost of the wireless network, the video may not be uploaded to the server in real time, so such analysis likewise suffers from lag or even interruption.
Disclosure of Invention
In view of the above technical problems, the invention provides a digital monitoring on-site prompting system, i.e. a monitoring system that ensures video security, analyzes in real time, and prompts the user to correct behaviour.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
The invention discloses a digital monitoring on-site prompting system, which comprises a camera module, a microphone module, a control module, a positioning module, a wireless module, a prompting module, a storage module, an edge video recognition module and a voice recognition module, wherein the positioning module is used for acquiring the current positioning, and the wireless module is used for being connected with an external server, wherein:
The control module is used for acquiring a video stream from the camera module and an audio stream of a user from the microphone module, preprocessing the video stream and the audio stream and then respectively sending the preprocessed video stream and the preprocessed audio stream to the edge video recognition module and the voice recognition module;
the storage module is used for locally storing the video stream and the audio stream;
The edge video identification module is used for: recognizing, in real time, the human face in the preprocessed video stream based on a locally built-in pre-trained custom MobileNet network, wherein the custom MobileNet network comprises a mixed depth convolution unit, a full convolution layer, a global average pooling layer, a full connection layer and a classification layer formed by an activation function, the mixed depth convolution unit consists of two point-by-point convolutions and one depth convolution, and the activation functions of the classification layer comprise Mish, TanhExp and ReLU;
The voice recognition module is used for: performing voice recognition and emotion classification on mixed audio features in real time based on a pre-trained classification model, obtaining and recording the voice content and emotion of the audio, wherein the mixed audio features are formed by fusing MFCC features extracted from the preprocessed audio with the corresponding time-domain features, the MFCC features are obtained by performing frame segmentation, Hamming-window processing and discrete Fourier transformation on the preprocessed audio, obtaining a Mel spectrum and applying a discrete cosine transformation, and the time-domain features are statistics (maximum, minimum, mean, median, mode, standard deviation, variance, covariance, root mean square and percentiles) of partial regions of the matrix formed by the MFCC features;
The prompting module is used for: interactively reminding the user when triggered by the control module after the edge video recognition module recognizes a face and the voice recognition module fails to recognize a first specific voice, recognizes a second specific voice, or detects a bad emotion.
Further, the preprocessing of the video stream by the control module includes: adjusting the frame rate of the video stream, adjusting each frame of the video stream to n x m pixels, and normalizing the pixel values of each frame of the video stream.
Further, the preprocessing of the audio stream by the control module includes:
Removing a mute portion in the audio stream;
Enhancing high frequency components in the audio stream based on a finite impulse response high pass filter;
and carrying out signal normalization on the audio stream.
Further, the number of layers of the custom MobileNet network is less than that of the standard MobileNet-V1 network.
Further, the system also comprises an encryption module, wherein the encryption module is used for extracting frames in the video stream with the face identified, encrypting the extracted frames of the video stream with the face and returning the encrypted frames to the video stream to obtain the encrypted video stream.
Further, the encryption module specifically includes:
A face region detection unit that performs face detection on frames of the video stream in which faces are recognized, and sequentially selects the video streams in which faces are detected to form an image dataset;
the virtual face feature generation unit is used for generating virtual random numbers for the image data set based on a random number generation function, the virtual random numbers correspond to the recording sequence appointed by the face region detection unit, and the virtual face feature generation unit is used for generating virtual face feature parameters according to original face feature parameters of real face information extracted from a face region by adding noise values based on the virtual random numbers;
The virtual face feature vector and data generation unit is used for searching virtual face information from a database based on principal component analysis and linear discriminant analysis technology after the virtual face feature vector is extracted from the virtual face feature parameters, inserting the virtual face information into a face area of an original image corresponding to the image dataset to obtain an encrypted face image, encrypting the real face information and recording the encrypted real face information in a database of a blockchain.
Further, the system further comprises a decryption module, the decryption module comprising:
The human face feature recovery unit is used for adding the recovery random number obtained based on the recovery random generation function and a predefined similarity value to reversely calculate the noise value used for generating the virtual human face feature parameter, and generating the recovery human face feature parameter corresponding to the original human face feature parameter after deducting the reversely calculated noise value;
The face image restoration unit is used for carrying out vectorization on the restoration face characteristic parameters, generating restoration face characteristic vectors by utilizing the vectorized values, searching the real face information in a database of a block chain according to the restoration face characteristic vectors, and inserting the real face information into a face area of the encrypted face image to obtain the restored original image.
Further, the classification model is a CNN model.
Further, the prompting module is based on one or more of the following modes when interactively prompting the user:
broadcasting by a loudspeaker;
Sending a short message to the mobile phone of the user;
sending an alert to an administrator platform of the external server to cause the administrator to alert the user;
sending prompt information to a headset connected in a wired or wireless way;
sending prompt information to a display screen;
driving a vibrator to vibrate;
the LED is driven to flash.
The technical scheme of the present disclosure has the following beneficial effects:
Based on the edge video recognition module and the voice recognition module, video and audio can be analyzed at the edge locally without being uploaded, which greatly protects video security. Meanwhile, by analyzing both video and audio, face analysis can prevent fraud by technicians and audio analysis can monitor the technician's service attitude, while the prompting module corrects the technician in time, greatly improving service quality. For face recognition, the simplified custom MobileNet network greatly reduces the computational load of the device, so the device carrying the edge video recognition module and the voice recognition module can be smaller and more portable, and is thus better suited to portable mobile on-site monitoring.
Drawings
FIG. 1 is a block diagram of a digital monitoring field prompting system in an embodiment of the present disclosure;
FIG. 2 is a comparison of a depth separable convolution unit and a hybrid depth convolution unit in an embodiment of the present description.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are only schematic illustrations of the present disclosure. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
As shown in fig. 1, an embodiment of the present disclosure provides a digital monitoring on-site reminder system, the system including: the device comprises a camera module 101, a microphone module 102, a control module 103, a positioning module 104, a wireless module 105, a prompt module 106, a storage module 107, an edge video recognition module 108 and a voice recognition module 109, wherein the positioning module 104 is used for acquiring the current positioning, the wireless module 105 is used for being connected with an external server, the control module 103 is used for acquiring a video stream from the camera module 101 and an audio stream from the microphone module 102, preprocessing the video stream and the audio stream and then respectively sending the preprocessed video stream and the preprocessed audio stream to the edge video recognition module 108 and the voice recognition module 109, and the storage module 107 is used for locally storing the video stream and the audio stream.
The above-mentioned camera module 101, microphone module 102, control module 103, positioning module 104, wireless module 105, prompt module 106 and storage module 107 may together take the form of a mobile phone, a tablet or another mobile device with a GPU, and the edge video recognition module 108 and the voice recognition module 109 may be implemented in hardware or in software.
The edge video identification module 108 is configured to: recognize, in real time, the human face in the preprocessed video stream based on a locally built-in pre-trained custom MobileNet network, wherein the custom MobileNet network comprises a mixed depth convolution unit, a full convolution layer, a global average pooling layer, a full connection layer and a classification layer formed by an activation function, the mixed depth convolution unit consists of two point-by-point convolutions and one depth convolution, and the activation functions of the classification layer comprise Mish, TanhExp and ReLU.
The MobileNet network is an optimized, compact CNN implemented with depth separable convolution units (DSC), a full convolution layer, a global average pooling layer, a fully connected layer and a classification layer using a softmax function. A depth separable convolution unit is composed of a depthwise convolution (Dwcv) and a one-by-one pointwise convolution (Pwcv). In a MobileNet-V1 network, each input channel is first filtered separately by Dwcv and then filtered again by Pwcv, whose purpose is to linearly combine all the outputs of Dwcv. The MobileNet-V1 model has two hyperparameters, a width multiplier and a resolution multiplier, both valued between 0 and 1; the model size can be adjusted with these two hyperparameters, but the recognition rate then drops considerably. In this embodiment, the mixed (hybrid) depth convolution unit is used in place of the depth separable convolution unit, which improves accuracy while reducing the number of layers. Specifically, as shown in FIG. 2, the hybrid depth convolution unit combines the two Pwcv at the edges and the Dwcv in the middle into one layer, rather than defining Pwcv and Dwcv as separate layers; the function of the DSC is kept unchanged, but replacing the depth separable convolution units with hybrid depth convolution units reduces the number of layers to 14, compared with the 28 layers of the standard MobileNet-V1 network, so the model volume is greatly reduced and the model can be used on a mobile device.
Additionally, the custom MobileNet network uses Mish, TanhExp and ReLU as activation functions. ReLU is the standard activation function commonly used in large architectures, but it suffers from the "dying ReLU" problem: some neurons may stop outputting anything other than 0. Therefore, this embodiment also uses the similarly shaped nonlinear functions Mish and TanhExp as activation functions and switches between them selectively in use, which improves the overall accuracy of the custom MobileNet network.
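For reference, the standard definitions of the two auxiliary activations are shown below as a minimal PyTorch sketch; the exact policy for switching between them at run time is an implementation choice that the description above leaves open.

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish(x) = x * tanh(softplus(x)): smooth and non-monotonic, so neurons
    # are less likely to get stuck at a constant zero output.
    return x * torch.tanh(F.softplus(x))

def tanh_exp(x: torch.Tensor) -> torch.Tensor:
    # TanhExp(x) = x * tanh(exp(x)): a similarly shaped nonlinearity.
    return x * torch.tanh(torch.exp(x))
```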
In the custom MobileNet network, all duplicate and unnecessary layers can also be deleted without affecting accuracy, for example the redundant layers whose output shape is (4,4,512): layers 9-13 of the standard MobileNet network, five layers in total, are removed. Eliminating these layers of negligible effect significantly reduces the computational effort and the number of parameters of the network, and the slight performance loss caused by deleting layers is offset by the hybrid depth convolution units and the added Mish and TanhExp activation functions. Finally, to further reduce the computational effort, the channel depth of the last depth convolution of the hybrid depth convolution unit may be reduced from 1024 to 780 to balance model performance and hardware resource usage.
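A minimal sketch of such a fused unit is given below; the kernel sizes, batch normalization and default ReLU activation are assumptions for illustration rather than values fixed by this description, and Mish or TanhExp can be passed in instead.

```python
import torch
import torch.nn as nn

class HybridDepthConv(nn.Module):
    """Fused pointwise -> depthwise -> pointwise block (one layer instead of
    three separate ones), sketching the hybrid depth convolution unit."""
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int,
                 stride: int = 1, act=None):
        super().__init__()
        self.pw_in = nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False)
        self.dw = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride,
                            padding=1, groups=mid_ch, bias=False)  # depthwise
        self.pw_out = nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = act if act is not None else nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pw_out(self.dw(self.pw_in(x)))))

# Example: one unit with the reduced channel depth mentioned above.
block = HybridDepthConv(in_ch=512, mid_ch=780, out_ch=780, stride=2)
out = block(torch.randn(1, 512, 8, 8))   # -> shape (1, 780, 4, 4)
```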
The edge video recognition module 108 runs on the mobile device itself and performs recognition locally in real time, using a recognition model trained on the custom MobileNet network; it can track a face in the video stream. Training of the recognition model follows ordinary CNN training: training data are annotated with face labels and fed into the custom MobileNet network to obtain the recognition model. A typical application scenario of the edge video recognition module 108 is as follows: after the user arrives at the client's designated work area and the work order is to take effect, the user talks face to face with the client; the camera module 101 records a video stream containing the client's face, the edge video recognition module 108 recognizes the face in the video stream and reports to the control module 103, and the control module 103 then allows the work order to take effect. Alternatively, the edge video recognition module 108 may be configured to recognize the user's own face and allow clocking in once recognition passes.
The speech recognition module 109 is configured to: perform voice recognition and emotion classification on mixed audio features in real time based on a pre-trained classification model, obtaining and recording the voice content and emotion of the audio, wherein the mixed audio features are formed by fusing MFCC features extracted from the preprocessed audio with the corresponding time-domain features, the MFCC features are obtained by performing frame segmentation, Hamming-window processing and discrete Fourier transformation on the preprocessed audio, obtaining a Mel spectrum and applying a discrete cosine transformation, and the time-domain features are statistics (maximum, minimum, mean, median, mode, standard deviation, variance, covariance, root mean square and percentiles) of partial regions of the matrix formed by the MFCC features.
The preferred application scenario of the invention is door-to-door service by a user (technician, housekeeping staff), where the user's service attitude is particularly important. Because the user's service actions are difficult to monitor directly, voice recognition is used to monitor whether the user works according to the standard procedure, for example whether the user announces the start and end of the service according to the speech template, whether the user uses impolite expressions, and whether the conversation with the employer carries negative emotion.
The voice recognition module 109 may be implemented on the mobile device itself and perform voice recognition locally, or it may be implemented on an external server; because an audio stream is far smaller than a video stream, the delay of remote transmission and analysis is very small and real-time performance can still be achieved. The classification model contained in the speech recognition module 109 needs to be pre-trained and may be a CNN. For training, datasets representing different speech templates, different emotions and impolite expressions are used; the speech templates may be phrases such as "hello", "goodbye" and "thank you", and the emotions may be anger, boredom, disgust, fear, happiness, neutrality, sadness and so on. The datasets are preprocessed, MFCC features and time-domain features are extracted and fused, and the model is trained with the corresponding labels to obtain the classification model. During voice recognition, the classification model first extracts the MFCC features and time-domain features from the preprocessed audio, then recognizes the mixed audio features obtained by fusing them, and outputs the labels corresponding to the audio; several labels may be output at the same time, for example a first speech template together with happiness.
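As a hedged sketch of what such a classification model could look like (the layer sizes, the 1-D convolutional backbone and the two output heads are illustrative assumptions, not details given in the text):

```python
import torch
import torch.nn as nn

class AudioClassifier(nn.Module):
    """Toy CNN over the fused (MFCC + time-domain) feature vector with one
    head for speech-template labels and one for emotion labels."""
    def __init__(self, n_templates: int, n_emotions: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.template_head = nn.Linear(32, n_templates)
        self.emotion_head = nn.Linear(32, n_emotions)

    def forward(self, x: torch.Tensor):
        h = self.backbone(x.unsqueeze(1))   # (batch, feat) -> (batch, 1, feat)
        return self.template_head(h), self.emotion_head(h)

model = AudioClassifier(n_templates=5, n_emotions=7)
logits_t, logits_e = model(torch.randn(4, 160))   # batch of 4 feature vectors
```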
Specifically, when extracting the MFCC features, to prevent information loss each piece of audio is divided into frames of n milliseconds with an overlap of m milliseconds between consecutive frames, so the number of frames of a single piece of audio can be expressed as

T = N / (S · (n − m) / 1000),

where T is the total number of frames, N is the number of samples and S is the sampling rate. After frame segmentation is completed, each individual frame is multiplied by a Hamming window to smooth the frame edges, calculated as

w(k) = 0.54 − 0.46 · cos(2πk / (K − 1)), 0 ≤ k ≤ K − 1,

where K is the total number of samples per frame.
The amplitude spectrum of every frame is then computed by the discrete Fourier transform and passed to a Mel filter bank to represent perceived frequency. The frequency warping onto the Mel scale is performed with

f_mel = 2595 · log10(1 + f / 700),

and the Mel spectrum is obtained by weighting the amplitude spectrum with triangular filters:

S(k) = Σ_j H_k(j) · |X(j)|,

where f denotes the actual frequency, f_mel the corresponding perceived (Mel) frequency, H_k the triangular filter that assigns a weight to the k-th region of the spectrum (i.e. its contribution to the k-th output band), |X(j)| the amplitude spectrum, and S(k) the Mel spectrum obtained by multiplying the amplitude spectrum by the triangular filters.
Finally, the MFCC features are obtained by applying a discrete cosine transform to the log-Mel spectrum of each frame:

C_i = Σ_{k=1}^{K} log(S(k)) · cos( i · (k − 0.5) · π / K ),

where K is the number of Mel bands and C is the vector of MFCC features.
After the MFCC features are obtained, the time-domain features are calculated. The MFCC features are first processed with a binning method, each bin containing X rows of every single column. Different time-domain statistics are then extracted from the bins of all MFCC features, including the minimum, maximum, mean, median, mode, standard deviation, variance, covariance, root mean square and the 25%, 50% and 75% quantiles of each bin. The MFCC features and time-domain features are organized into a principal feature vector and used as the input of the CNN-based classification model, which outputs the corresponding emotion classification and speech label.
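A minimal sketch of the binning step, assuming a bin of X = 10 rows (the statistics listed above are computed per bin; mode and covariance are left out here for brevity):

```python
import numpy as np

def binned_time_features(mfcc_matrix: np.ndarray, rows_per_bin: int = 10) -> np.ndarray:
    """Concatenate per-bin statistics of the MFCC matrix into one vector."""
    feats = []
    for start in range(0, mfcc_matrix.shape[0], rows_per_bin):
        b = mfcc_matrix[start:start + rows_per_bin]
        feats += [
            b.min(), b.max(), b.mean(), np.median(b),
            b.std(), b.var(), np.sqrt(np.mean(b ** 2)),       # RMS
            np.percentile(b, 25), np.percentile(b, 50), np.percentile(b, 75),
        ]
    return np.asarray(feats)

# The fused input to the classifier could then be, for example:
# np.concatenate([mfcc_matrix.ravel(), binned_time_features(mfcc_matrix)])
```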
The prompting module 106 is configured to: interactively remind the user when triggered by the control module after the edge video recognition module recognizes a face and the voice recognition module fails to recognize the first specific voice, recognizes the second specific voice, or detects a bad emotion.
Here the first specific voice refers to expressions from the speech template, the second specific voice refers to impolite expressions, and bad emotion refers to anger, sadness, disgust and the like. After the user arrives at the work area (the employer's preset location), the user starts the system and presses an interactive button such as "start service", "start work" or "start recording"; the camera module, the microphone module and the other modules start working, the user talks face to face with the employer, and the employer's face is recognized and tracked. If the user uses the speech template during the conversation, this belongs to the standard procedure and no prompt is needed; if the classification model in the voice recognition module does not recognize the content of the speech template, or recognizes impolite words from the user, or recognizes a bad emotion, the user is prompted to maintain good service quality, until the face disappears or a preset time elapses.
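The trigger logic can be sketched roughly as follows; the label names and the set-based matching are assumptions chosen only to illustrate the three conditions above.

```python
BAD_EMOTIONS = {"anger", "sadness", "disgust"}

def should_prompt(face_present: bool, recognized_labels: set,
                  template_phrases: set, impolite_phrases: set) -> bool:
    """Prompt only while a face is tracked and the standard flow is broken."""
    if not face_present:
        return False
    missing_template = not (recognized_labels & template_phrases)
    impolite = bool(recognized_labels & impolite_phrases)
    bad_mood = bool(recognized_labels & BAD_EMOTIONS)
    return missing_template or impolite or bad_mood
```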
When interactively prompting the user, the prompting module 106 uses one or more of the following modes: broadcasting through a loudspeaker; sending a short message to the user's mobile phone; sending an alert to the administrator platform of the external server so that the administrator reminds the user; sending prompt information to a wired or wirelessly connected headset; sending prompt information to a display screen; driving a vibrator to vibrate; driving an LED to flash.
In an embodiment, the preprocessing of the video stream by the control module includes: adjusting the frame rate of the video stream, resizing each frame of the video stream to n × m pixels, and normalizing the pixel values of each frame. The specific frame rate and pixel values depend on the computing power of the mobile device; for example, the frame size may be 300 × 300 pixels and the frame rate 10-30 frames per second.
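A per-frame sketch, assuming OpenCV is available and using the 300 × 300 size from the example above (frame-rate adjustment would be handled upstream by skipping frames):

```python
import cv2
import numpy as np

def preprocess_frame(frame: np.ndarray, size=(300, 300)) -> np.ndarray:
    """Resize a frame to n x m pixels and scale pixel values into [0, 1]."""
    resized = cv2.resize(frame, size)
    return resized.astype(np.float32) / 255.0
```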
In an embodiment, the preprocessing of the audio stream by the control module includes: removing a mute portion in the audio stream; enhancing high frequency components in the audio stream based on a finite impulse response high pass filter; and carrying out signal normalization on the audio stream.
Speech preprocessing is an extremely important stage, because the classification model cannot tolerate background noise or silence: features are subsequently extracted from the audio, and it is largely the spoken portion of the audio that carries the emotion-related features. To this end, this embodiment uses silence removal and pre-emphasis. Pre-emphasis increases the signal-to-noise ratio by boosting the power of the high-frequency part of the audio while leaving the low frequencies unchanged, and can be realized with a finite-impulse-response high-pass filter:

y(t) = x(t) − α · x(t − 1),

where x is the input signal, y the pre-emphasized signal and α the pre-emphasis coefficient (a value close to 1).
Pre-emphasis using a finite impulse response high pass filter may result in a change in the inter-frequency energy distribution and a change in the overall energy level, which can have a significant impact on the energy-dependent characteristics. Therefore, signal normalization of the audio stream is required to ensure that there is comparability between the audio regardless of amplitude variations, and the normalization formula is as follows:
x̂_i = (x_i − μ) / σ,

where x_i denotes the i-th part of the audio, μ and σ denote its mean and standard deviation respectively, and x̂_i is the normalized i-th part of the audio.
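The three preprocessing steps can be sketched as follows; the silence threshold and the pre-emphasis coefficient are illustrative assumptions.

```python
import numpy as np

def preprocess_audio(x: np.ndarray, alpha: float = 0.97,
                     silence_thresh: float = 1e-3) -> np.ndarray:
    """Silence removal, FIR high-pass pre-emphasis, then z-score normalization."""
    voiced = x[np.abs(x) > silence_thresh]            # crude silence removal
    if voiced.size < 2:
        return voiced
    # y[t] = x[t] - alpha * x[t-1]  (finite-impulse-response high-pass filter)
    emphasized = np.append(voiced[0], voiced[1:] - alpha * voiced[:-1])
    return (emphasized - emphasized.mean()) / (emphasized.std() + 1e-8)
```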
In an embodiment, the system further includes an encryption module, where the encryption module is configured to extract frames in the video stream with the face identified, encrypt the extracted frames of the video stream with the face, and return the encrypted frames to the video stream to obtain the encrypted video stream.
Although the above embodiment has the edge video recognition module analyze the video stream locally and store it locally, which avoids the privacy leakage that could occur if the video stream were intercepted after being uploaded to a server, there are still situations in which video must be exported from the mobile device as evidence, such as disputes with the employer over damaged items, property theft or service review. Encrypting the faces in the video stream effectively prevents privacy leakage in these cases.
The encryption module specifically comprises a face region detection unit, a virtual face feature generation unit, a virtual face feature vector and a data generation unit. Specifically, the face region detection unit performs face detection on frames of the video stream in which faces are recognized, and sequentially selects the video streams in which faces are detected to form an image dataset.
The virtual face feature generation unit is used for generating a virtual random number for the image data set based on a random number generation function, the virtual random number corresponds to the recording sequence appointed by the face region detection unit, and the virtual face feature generation unit is used for generating virtual face feature parameters according to original face feature parameters of real face information extracted from a face region by adding a noise value based on the virtual random number.
The virtual random number can be expressed as P = R(S), where R is the random number generation function, S is a seed value or a sequence of random numbers that may be software-simulated, and, given the same predefined seed value and the same generation range, R generates the same series of random numbers, yielding the virtual random number. Suppose the random number generation range is 0 to 100 and two rounds of generation are performed, with seed values "S" of 0.2 and 0.5 for the first and second round respectively. In this embodiment, each round first generates a random number in the range 0 to 100 and then multiplies it by the fixed seed to bring it into the range expected by the user: the random number of the first round is multiplied by 0.2 and thus limited to 0 to 20, and the random number of the second round is limited to 0 to 50. The obtained virtual random number is used to generate the noise value, which may be done by adding the offset of the virtual random number from the median of the predefined generation range to a predefined similarity value:

η = Sim + (P − median(G)),

where η is the generated noise value, Sim is the predefined similarity value, P is the virtual random number (pseudo-random number), and G is the preset generation range of the pseudo-random number. The higher the similarity value, the more closely the original data can be reproduced; conversely, the lower the similarity value, the further the reproduction departs from the original data.
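A small sketch of this step, under the assumption that the noise value is the similarity value plus the offset of the scaled random number from the middle of its range (the constants are purely illustrative):

```python
import random

def virtual_random(seed: float, low: float = 0.0, high: float = 100.0) -> float:
    """Reproducible draw in [low, high], scaled by the seed into the
    user-expected sub-range (same seed and range -> same number)."""
    rng = random.Random(repr((seed, low, high)))   # software-simulated generator
    return rng.uniform(low, high) * seed

def noise_value(virtual: float, low: float, high: float, similarity: float) -> float:
    """Noise = similarity value + (virtual random number - median of range)."""
    return similarity + (virtual - (low + high) / 2.0)

p = virtual_random(seed=0.2)                  # first round: limited to 0..20
eta = noise_value(p, 0.0, 100.0, similarity=5.0)
```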
The virtual face feature vector and data generating unit is used for searching virtual face information from a database based on principal component analysis and linear discriminant analysis technology after extracting the virtual face feature vector from the virtual face feature parameters, inserting the virtual face information into a face region of an original image corresponding to the image dataset to obtain an encrypted face image, encrypting the real face information and recording the encrypted real face information in a database of a blockchain.
The system further comprises a decryption module, wherein the decryption module comprises a face feature recovery unit and a face image recovery unit and is used for decrypting the encrypted face image.
Specifically, the face feature recovery unit is configured to add a recovery random number obtained based on a recovery random generation function to a predefined similarity value, so as to reversely calculate the noise value used to generate the virtual face feature parameter, and after subtracting the reversely calculated noise value, generate a recovery face feature parameter corresponding to the original face feature parameter.
The recovery random number can be expressed as P′ = R′(S′), where R′ is the random number generation function, S′ is the predefined seed value, and the sequence of random numbers may likewise be software-simulated. The noise value is recovered by adding the similarity value used when generating the encrypted face image to the recovery random number; because the random number function used for encryption and the recovery function use the same seed value and generation range, the recovery random number P′ has the same value as the virtual random number P generated during encryption. The restored face feature parameters are then obtained by removing this noise value from the virtual face feature parameters. The noise value is reverse-calculated with the same relation used at encryption time:

η′ = Sim + (P′ − median(G)),

where η′ is the recovered face noise value.
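A matching sketch of the recovery side, assuming the same relation and the same seed and range as at encryption time (all names are illustrative):

```python
def recovered_noise(recovery_random: float, low: float, high: float,
                    similarity: float) -> float:
    """Same relation as at encryption time, so the noise value is identical
    when the seed value and generation range are identical."""
    return similarity + (recovery_random - (low + high) / 2.0)

def restore_face_params(virtual_params, noise: float):
    """Subtract the recovered noise to get back the original feature parameters."""
    return [v - noise for v in virtual_params]
```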
The face image restoration unit vectorizes the restored face feature parameters to generate restored face feature vectors (i.e. the vectorization result of the restored face feature parameters), searches the blockchain database for the real face information according to the restored face feature vectors, and inserts the search result (the real face information) into the face region of the encrypted face image to obtain the restored original image.
The encryption module and the decryption module can be deployed on the mobile device, and the encryption module encrypts the face of the video stream on the storage module and replaces the original video stream. The encryption module and the decryption module can also be deployed on the server, and encrypt the face in the server when the server receives the video stream. When a user or other third party needs to access the video stream on the server, what is seen is the video after face encryption. If the video needs to be subjected to face decryption, the video is decrypted after being authorized by an administrator.
Based on the encryption module and the decryption module, the face in the video stream can be encrypted, privacy disclosure can be effectively prevented, and in the occasion that the true identity is required to be provided, the face is decrypted and restored, so that the privacy of a customer can be protected, and the authenticity of the video is not lost.
The beneficial effects are that:
The invention is based on the edge video recognition module and the voice recognition module, so video and audio can be analyzed at the edge locally without being uploaded, which greatly protects video security. Meanwhile, by analyzing both video and audio, face analysis can prevent fraud by technicians and audio analysis can monitor the technician's service attitude, while the prompting module corrects the technician in time, greatly improving service quality. For face recognition, the simplified custom MobileNet network greatly reduces the computational load of the device, so the device carrying the edge video recognition module and the voice recognition module can be smaller and more portable, and is thus better suited to portable mobile on-site monitoring.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow its general principles and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
Claims (9)
1. A digital monitoring on-site prompting system, characterized in that the system comprises a camera module, a microphone module, a control module, a positioning module, a wireless module, a prompting module, a storage module, an edge video recognition module and a voice recognition module, the positioning module being used for acquiring the current location and the wireless module being used for connecting to an external server, wherein:
The control module is used for acquiring a video stream from the camera module and an audio stream of a user from the microphone module, preprocessing the video stream and the audio stream and then respectively sending the preprocessed video stream and the preprocessed audio stream to the edge video recognition module and the voice recognition module;
the storage module is used for locally storing the video stream and the audio stream;
The edge video identification module is used for: based on a local built-in pre-trained custom MobileNet network, recognizing the human face in the preprocessed video stream in real time, wherein the custom MobileNet network comprises a mixed depth convolution unit, a full convolution layer, a global average pooling layer, a full connection layer and a classification layer formed by an activation function, the mixed depth convolution unit consists of two point-by-point convolutions and one depth convolution, and the activation function of the classification layer comprises Mish, tanhEXp and ReLU;
The voice recognition module is used for: performing voice recognition and emotion classification on mixed audio features in real time based on a pre-trained classification model, obtaining and recording the voice content and emotion of the audio, wherein the mixed audio features are formed by fusing MFCC features extracted from the preprocessed audio with the corresponding time-domain features, the MFCC features are obtained by performing frame segmentation, Hamming-window processing and discrete Fourier transformation on the preprocessed audio, obtaining a Mel spectrum and applying a discrete cosine transformation, and the time-domain features are statistics (maximum, minimum, mean, median, mode, standard deviation, variance, covariance, root mean square and percentiles) of partial regions of the matrix formed by the MFCC features;
The prompting module is used for: interactively reminding the user when triggered by the control module after the edge video recognition module recognizes the face and the voice recognition module does not recognize the first specific voice, or recognizes the second specific voice, or detects the bad emotion.
2. The digital surveillance field prompt system of claim 1, wherein the preprocessing of the video stream by the control module comprises: adjusting the frame rate of the video stream, adjusting each frame of the video stream to n x m pixels, and normalizing the pixel values of each frame of the video stream.
3. The digital monitoring field prompting system according to claim 1, wherein said control module preprocesses said audio stream, comprising:
Removing a mute portion in the audio stream;
Enhancing high frequency components in the audio stream based on a finite impulse response high pass filter;
and carrying out signal normalization on the audio stream.
4. The digital monitoring field prompting system according to claim 1, wherein the custom MobileNet network has fewer layers than a standard MobileNet V1 network.
5. The digital monitoring on-site prompting system according to claim 1, further comprising an encryption module, wherein the encryption module is configured to extract frames in the video stream with the face identified, encrypt the extracted frames of the video stream with the face, and return the frames to the video stream to obtain the encrypted video stream.
6. The digital monitoring field prompting system according to claim 5, wherein said encryption module specifically comprises:
A face region detection unit that performs face detection on frames of the video stream in which faces are recognized, and sequentially selects the video streams in which faces are detected to form an image dataset;
the virtual face feature generation unit is used for generating virtual random numbers for the image data set based on a random number generation function, the virtual random numbers correspond to the recording sequence appointed by the face region detection unit, and the virtual face feature generation unit is used for generating virtual face feature parameters according to original face feature parameters of real face information extracted from a face region by adding noise values based on the virtual random numbers;
The virtual face feature vector and data generation unit is used for searching virtual face information from a database based on principal component analysis and linear discriminant analysis technology after the virtual face feature vector is extracted from the virtual face feature parameters, inserting the virtual face information into a face area of an original image corresponding to the image dataset to obtain an encrypted face image, encrypting the real face information and recording the encrypted real face information in a database of a blockchain.
7. The digital monitoring field prompting system according to claim 6, further comprising a decryption module, said decryption module comprising:
The human face feature recovery unit is used for adding the recovery random number obtained based on the recovery random generation function and a predefined similarity value to reversely calculate the noise value used for generating the virtual human face feature parameter, and generating the recovery human face feature parameter corresponding to the original human face feature parameter after deducting the reversely calculated noise value;
The face image restoration unit is used for carrying out vectorization on the restoration face characteristic parameters, generating restoration face characteristic vectors by utilizing the vectorized values, searching the real face information in a database of a block chain according to the restoration face characteristic vectors, and inserting the real face information into a face area of the encrypted face image to obtain the restored original image.
8. The digital monitoring field prompting system according to claim 1, wherein said classification model is a CNN model.
9. The digital monitoring field prompting system according to claim 1, wherein said prompting module, when interactively prompting a user, is based on one or more of:
broadcasting by a loudspeaker;
sending a short message to a mobile phone of a user;
sending an alert to an administrator platform of the external server to cause the administrator to alert the user;
sending prompt information to a headset connected in a wired or wireless way;
sending prompt information to a display screen;
Driving the vibrator to vibrate;
the LED is driven to flash.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410747730.1A CN118570979A (en) | 2024-06-11 | 2024-06-11 | Digital monitoring on-site prompt system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410747730.1A CN118570979A (en) | 2024-06-11 | 2024-06-11 | Digital monitoring on-site prompt system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118570979A true CN118570979A (en) | 2024-08-30 |
Family
ID=92470749
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410747730.1A Pending CN118570979A (en) | 2024-06-11 | 2024-06-11 | Digital monitoring on-site prompt system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118570979A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111860046A (en) * | 2019-04-26 | 2020-10-30 | 四川大学 | Facial expression recognition method for improving MobileNet model |
US20230410519A1 (en) * | 2021-03-08 | 2023-12-21 | i-PRO Co., Ltd. | Suspicious person alarm notification system and suspicious person alarm notification method |
CN115497509A (en) * | 2022-08-26 | 2022-12-20 | 昆明理工大学 | Speech emotion recognition method based on MFCC differential mixed frequency spectrum |
CN115601904A (en) * | 2022-09-30 | 2023-01-13 | 中通服和信科技有限公司(Cn) | AI intelligent edge computing system and method based on 5G |
CN115690653A (en) * | 2022-10-28 | 2023-02-03 | 福寿康智慧(上海)医疗养老服务有限公司 | Monitoring and early warning for realizing abnormal nursing behaviors of nursing staff based on AI behavior recognition |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |