
CN117456405A - Audio and video processing method and device for handheld terminal and law enforcement recorder - Google Patents

Audio and video processing method and device for handheld terminal and law enforcement recorder

Info

Publication number
CN117456405A
CN117456405A
Authority
CN
China
Prior art keywords
audio
video
recorded
typical
video processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311262828.XA
Other languages
Chinese (zh)
Inventor
张志达
张科伟
黄鹏
毛翔宇
孙永文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aisino Corp
Original Assignee
Aisino Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aisino Corp filed Critical Aisino Corp
Priority to CN202311262828.XA priority Critical patent/CN117456405A/en
Publication of CN117456405A publication Critical patent/CN117456405A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F 3/0656 Data buffering arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/76 Television signal recording
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Alarm Systems (AREA)

Abstract

The invention discloses an audio and video processing method and device for a handheld terminal and a law enforcement recorder. The audio and video processing method comprises the following steps: acquiring audio and storing the audio in a temporary buffer area; recognizing the audio using a speech recognition model based on an audio self-attention model; when typical voice characteristics are recorded in the audio, storing the audio in a safe storage area and recording the recognized typical voice characteristics in the file name corresponding to the audio; and/or acquiring video and storing it in the temporary buffer area, and extracting at least one frame of image from the acquired video; recognizing the at least one frame of image using a face recognition model based on the ArcFace model; when typical face features are recorded in the at least one frame of image, storing the acquired video in a safe storage area and recording the recognized typical face features in the file name corresponding to the video. The method thus helps improve law enforcement efficiency and reduce the difficulty of law enforcement for law enforcement personnel.

Description

Audio and video processing method and device for handheld terminal and law enforcement recorder
Technical Field
The invention belongs to the technical field of audio and video processing, and particularly relates to an audio and video processing method and device for a handheld terminal and a law enforcement recorder.
Background
In law enforcement fields such as police and customs, a law enforcement recorder is generally used for recording the situations of on-site duty or law enforcement of staff.
In order to facilitate use and simplify the operating procedure, the recorder is usually provided with a one-key start function: pressing the switch key starts recording, and pressing it again pauses or stops recording. Once a misoperation occurs, video may not be recorded and critical evidence may be missed.
Usually, after duty or law enforcement ends, the recorded video or audio needs to be exported, manually checked and audited, and the file names modified or the audio and video files cut, so processing efficiency is low and errors occur easily. Moreover, once a misoperation occurs, key evidence may be lost or missed.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an audio and video processing method, a handheld terminal and an audio and video processing system, so as to solve the above problems of law enforcement recorders in the prior art.
In a first aspect, the present invention provides an audio/video processing method, including: acquiring audio and storing the audio in a temporary buffer area;
Identifying the audio using a speech recognition model based on the audio self-attention model;
when the typical voice characteristics are recorded in the audio, storing the audio in a safe storage area, and recording the recognized typical voice characteristics in a file name corresponding to the audio; and/or
Acquiring a video, storing the video in a temporary buffer area, and extracting at least one frame of image from the acquired video;
identifying the at least one frame of image using an ArcFace model based face recognition model;
when the characteristic of the typical face is recorded in the at least one frame of image, storing the acquired video in a safe storage area, and recording the characteristic of the identified typical face in a file name corresponding to the video;
the temporary buffer area is arranged at the handheld terminal, and the safe storage area is arranged at the handheld terminal or the non-portable device.
Further, the method further comprises the following steps: identifying the video based on a human motion classification model of ST-GCN;
when the typical human body action is recorded in the video, the acquired video is stored in a safe storage area, and the recognized typical human body action is recorded in a file name corresponding to the video.
Further, the method further comprises the following steps: extracting at least one frame of image from the acquired video;
detecting the at least one frame of image using a YOLOv8-based typical item detection model;
when the characteristic of the typical object is recorded in the at least one frame of image, the acquired video is stored in a safe storage area, and the detected characteristic of the typical object is recorded in a file name corresponding to the video.
Further, the ST-GCN based human action classification model identifies the video, including:
using an RTMPose model to detect key points of a human body on a whole body image of a person recorded in the acquired video, and generating a key point time-space diagram;
inputting the key point space-time diagram into the human body action classification model based on ST-GCN, and identifying whether typical human body actions are recorded in the video.
Further, the method further comprises the following steps: acquiring identification information identified by an environmental information identification unit, wherein the environmental information identification unit comprises a wireless communication tag reading and writing module;
when the identification information is detected to belong to a preset target identification, the acquired video or audio is stored in a safe storage area, and the information of the detected preset target identification is recorded in a file name corresponding to the video or the audio.
Further, the method further comprises the following steps: acquiring geographic position information identified by a geographic position determining unit;
when the geographic position information is detected to belong to a preset target position, the acquired video or audio is stored in a safe storage area, and the detected geographic position information is recorded in a file name corresponding to the video.
Further, the method further comprises the following steps: with an encoder-decoder architecture, summary information is extracted from the acquired video using a Video Swin self-attention model.
Further, the method further comprises the following steps: broadcasting the typical face features, the typical object features, the geographic position information or the information of the preset target mark.
In a second aspect, the present invention provides a handheld terminal comprising: the system comprises an audio and video processing device, an environment information identification unit and a geographic position determination unit;
the audio/video processing device is connected with the environment information identification unit and the geographic position determining unit and is used for executing the audio/video processing method as described in the first aspect.
In a third aspect, the present invention provides an audio/video processing system, including: a handheld terminal, a non-portable device;
the handheld terminal is communicatively coupled to the non-portable device,
The handheld terminal is used for executing the audio and video processing method as described in the first aspect;
the non-portable device is configured to perform the audio-video processing method as described in the first aspect.
The invention is further described below with reference to the drawings and examples.
Drawings
Exemplary embodiments of the present invention may be more completely understood in consideration of the following drawings:
fig. 1 is a flow chart of an audio/video processing method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a law enforcement recorder according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of processing video by the audio/video processing method according to the embodiment of the invention;
fig. 4 is a schematic diagram of a human body key point marker extracted in the audio/video processing method according to the embodiment of the present invention;
fig. 5 is a schematic diagram of a video motion classification flow in an audio/video processing method according to an embodiment of the present invention;
fig. 6 is a flow chart of processing audio by the audio/video processing method according to the embodiment of the invention;
FIG. 7 is a schematic diagram of the components of a handheld terminal according to an embodiment of the present invention;
fig. 8 is a schematic diagram of an audio/video processing system according to an embodiment of the invention.
Detailed Description
The exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, however, the present invention may be embodied in many different forms and is not limited to the examples described herein, which are provided to fully and completely disclose the present invention and fully convey the scope of the invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, like elements/components are referred to by like reference numerals.
Unless otherwise indicated, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. In addition, it will be understood that terms defined in commonly used dictionaries should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
The technical scheme of the invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, 3, 6 and 7, the audio/video processing method for a handheld terminal 1000 according to an embodiment of the present invention includes:
s11: acquiring audio and storing the audio in a temporary buffer area;
s12: identifying the audio using a speech recognition model based on the audio self-attention model;
s13: when the typical voice characteristics are recorded in the audio, storing the audio in a safe storage area, and recording the recognized typical voice characteristics in a file name corresponding to the audio; and/or
S21: acquiring a video, storing the video in a temporary buffer area, and extracting at least one frame of image from the acquired video;
s22: identifying the at least one frame of image using an ArcFace model based face recognition model;
S23: when the characteristic of the typical face is recorded in the at least one frame of image, storing the acquired video in a safe storage area, and recording the characteristic of the identified typical face in a file name corresponding to the video;
the temporary buffer area is arranged at the handheld terminal, the safe storage area is arranged at the handheld terminal or the non-portable equipment, and the non-portable equipment can be a background master control center or a workstation which is in communication connection with a plurality of handheld terminals and is provided with a server.
Specifically, the handheld terminal is provided with a communication unit, such as a 4G/5G communication module for outdoor communication, or a WiFi or Bluetooth communication module for indoor communication, and transmits the stored audio or video to the background master control center through the communication unit.
Specifically, the temporary buffer area is created on demand and dynamically refreshed by the audio/video processing device, and can store audio or video of a preset duration in real time to realize quick storage. The safe storage area is created by the audio/video processing device or the non-portable device and can only be refreshed after a preset permission is acquired, which prevents it from being deleted or overwritten by mistake; it is used to store the audio or video safely and reliably, avoiding data loss and realizing secure storage.
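For illustration only, the following minimal Python sketch shows one possible way to implement a fixed-duration temporary buffer and a separate safe storage area; the chunk duration, directory layout and function names are assumptions and are not specified by the present disclosure.

```python
import os
import shutil
import time
from collections import deque

CHUNK_SECONDS = 1        # each buffered chunk covers one second of audio or video (assumed)
BUFFER_SECONDS = 30      # preset duration held by the temporary buffer area (assumed)

# Temporary buffer area: dynamically refreshed, oldest chunks drop off automatically.
temp_buffer = deque(maxlen=BUFFER_SECONDS // CHUNK_SECONDS)

def buffer_chunk(chunk_path):
    """Store the path of the latest recorded chunk in the temporary buffer area."""
    temp_buffer.append(chunk_path)

def save_to_secure_area(secure_dir, tag):
    """Concatenate the buffered chunks into the safe storage area; the tag goes into the file name."""
    os.makedirs(secure_dir, exist_ok=True)
    target = os.path.join(secure_dir, time.strftime("%Y%m%d_%H%M%S") + "_" + tag + ".bin")
    with open(target, "wb") as out:
        for chunk in list(temp_buffer):
            with open(chunk, "rb") as f:
                shutil.copyfileobj(f, out)
    return target
```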
Thus, after the handheld terminal is started, the audio recording unit records the live sound, and the audio and video processing device acquires the audio recorded by the audio recording unit and stores it in the temporary buffer area; the audio/video processing device recognizes the audio using the speech recognition model based on the audio self-attention model, and when it recognizes that typical speech features are recorded in the audio, stores the audio in the safe storage area and records the recognized typical speech features in the file name corresponding to the audio.
Specifically, the typical speech features include: screams, calls for help, or gunshots.
When the handheld terminal is applied in fields such as public security and customs, after entering a law enforcement site the handheld terminal can automatically record audio once an abnormal sound is recognized, that is, the audio is stored in the safe storage area and the recognized typical voice characteristics are recorded in the file name corresponding to the audio. This avoids misoperation by personnel and missed law enforcement evidence, improves law enforcement efficiency, and reduces the difficulty of law enforcement for law enforcement personnel.
Thus, after the handheld terminal is started, an image acquisition unit, such as an RGB camera or a high-definition camera, records live video; the audio and video processing device acquires videos recorded by the image acquisition unit and stores the videos in the temporary buffer area, extracts at least one frame of images from the acquired videos, identifies the at least one frame of images by utilizing a face recognition model based on an ArcFace model, stores the acquired videos in the safe storage area when the characteristic of a typical face is recorded in the at least one frame of images, and records the identified characteristic of the typical face in a file name corresponding to the videos.
Specifically, the representative face features correspond to face features of key persons described later.
As described above, the handheld terminal has a face recognition function and can detect blacklisted personnel. When applied in fields such as public security and customs, after entering a law enforcement site, video is automatically recorded once a key person is recognized, that is, the acquired video is stored in the safe storage area and the recognized typical face features are recorded in the file name corresponding to the video. This avoids misoperation by personnel and missed law enforcement evidence, improves law enforcement efficiency, and reduces the difficulty of law enforcement for law enforcement personnel.
In some embodiments, the audio/video processing method for the handheld terminal 1000 further includes:
extracting at least one frame of image from the acquired video;
detecting the at least one frame of image using a YOLOv8-based typical item detection model;
when the characteristic of the typical object is recorded in the at least one frame of image, the acquired video is stored in a safe storage area, and the detected characteristic of the typical object is recorded in a file name corresponding to the video.
Thus, after the handheld terminal is started, the image acquisition unit records live video; the audio and video processing device acquires the video recorded by the image acquisition unit, stores it in the temporary buffer area, extracts at least one frame of image from the acquired video, and detects the at least one frame of image using the YOLOv8-based typical item detection model; when a typical item feature is recorded in the at least one frame of image, the acquired video is stored in the safe storage area and the detected typical item feature is recorded in the file name corresponding to the video.
Specifically, typical items include contraband and special tags. Generally, contraband refers to items that public security or customs authorities prohibit or control, such as controlled knives, guns, or ivory. A special tag refers to a special luggage tag used by the customs department: it is stuck or attached to the outside of luggage during the primary inspection step and carries a mark, such as text or a picture, indicating that the luggage requires additional processing.
As described above, the handheld terminal has a contraband and special tag recognition function and can detect contraband or special tags. When applied in fields such as public security and customs, after entering a law enforcement site, video is automatically recorded once contraband or a special tag is recognized, that is, the acquired video is stored in the safe storage area and the detected typical item feature is recorded in the file name corresponding to the video. This avoids misoperation by personnel and missed law enforcement evidence, improves law enforcement efficiency, and reduces the difficulty of law enforcement for law enforcement personnel.
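As a hedged illustration of the frame-level detection step, the sketch below assumes the open-source ultralytics YOLOv8 API and a hypothetical weights file trained on contraband and special-tag classes; the file name and label set are assumptions, not part of the disclosure.

```python
from ultralytics import YOLO

# Assumed class names for contraband and customs special tags.
TYPICAL_ITEM_LABELS = {"controlled_knife", "gun", "ivory", "special_tag"}

model = YOLO("contraband_yolov8.pt")   # hypothetical custom-trained weights

def detect_typical_items(frame):
    """Return the set of typical-item labels detected in one extracted video frame."""
    result = model(frame, verbose=False)[0]
    detected = {result.names[int(cls_id)] for cls_id in result.boxes.cls}
    return detected & TYPICAL_ITEM_LABELS
```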
In some embodiments, the audio/video processing method for the handheld terminal 1000 further includes:
identifying the video based on a human motion classification model of ST-GCN;
When the typical human body action is recorded in the video, the acquired video is stored in a safe storage area, and the recognized typical human body action is recorded in a file name corresponding to the video.
Thus, after the handheld terminal is started, the image acquisition unit records live video; the audio and video processing device acquires the video recorded by the image acquisition unit, and identifies the video based on the human body action classification model of the ST-GCN; when the typical human body action is recorded in the video, the acquired video is stored in a safe storage area, and the recognized typical human body action is recorded in a file name corresponding to the video.
As described above, the handheld terminal has an abnormal personnel action recognition function and can recognize abnormal actions of personnel. When applied in fields such as public security and customs, after entering a law enforcement site, video is automatically recorded once an abnormal action of a person is recognized, that is, the acquired video is stored in the safe storage area and the recognized typical human action is recorded in the file name corresponding to the video. This avoids misoperation by personnel and missed law enforcement evidence, improves law enforcement efficiency, and reduces the difficulty of law enforcement for law enforcement personnel.
In some embodiments, in the audio/video processing method for the handheld terminal 1000, the classifying model for human motion based on ST-GCN identifies the video, including:
using an RTMPose model to detect key points of a human body on a whole body image of a person recorded in the acquired video, and generating a key point time-space diagram;
inputting the key point space-time diagram into the human body action classification model based on ST-GCN, and identifying whether typical human body actions are recorded in the video.
Specifically, the typical human actions include: fighting, falling, or loitering.
In some embodiments, the audio/video processing method for the handheld terminal 1000 further includes:
acquiring identification information identified by an environmental information identification unit, wherein the environmental information identification unit comprises a wireless communication tag reading and writing module;
when the identification information is detected to belong to a preset target identification, the acquired video or audio is stored in a safe storage area, and the information of the detected preset target identification is recorded in a file name corresponding to the video or the audio.
Specifically, the identification information includes: identification information recorded by the RFID tag or identification information recorded by the NFC tag.
Thus, after the handheld terminal is started, the environment information identification unit identifies the identification information in the environment; the audio/video processing device acquires the identification information identified by the environment information identification unit, stores the acquired video or audio in a safe storage area when detecting that the identification information belongs to a preset target identification, and records the information of the detected preset target identification in a file name corresponding to the video or the audio.
The handheld terminal has a special tag reading function and can read the identification information recorded by a special tag. When applied in fields such as public security and customs, after entering a law enforcement site, when such a tag comes close, the terminal can detect that the identification information belongs to a preset target identification and automatically record video, that is, the acquired video or audio is stored in the safe storage area and the information of the detected preset target identification is recorded in the file name corresponding to the video or audio. Therefore, recording is started automatically when the tag is nearby, which avoids misoperation by personnel and missed law enforcement evidence, improves law enforcement efficiency, and reduces the difficulty of law enforcement for law enforcement personnel.
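A minimal sketch of this tag-triggered saving logic follows; the tag identifiers, the directory path and the save_to_secure_area helper are hypothetical.

```python
# Preset target identifications; real values would come from the law enforcement configuration.
PRESET_TARGET_IDS = {"CUSTOMS-TAG-0001", "CUSTOMS-TAG-0002"}

def on_tag_read(tag_id, save_to_secure_area):
    """Called by the environmental information identification unit for every tag it reads."""
    if tag_id in PRESET_TARGET_IDS:
        # Move the buffered video/audio to the safe storage area and
        # record the tag information in the file name.
        return save_to_secure_area(secure_dir="/secure", tag="target_tag_" + tag_id)
    return None
```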
In some embodiments, the audio/video processing method for the handheld terminal 1000 further includes:
Acquiring geographic position information identified by a geographic position determining unit;
when the geographic position information is detected to belong to a preset target position, the acquired video or audio is stored in a safe storage area, and the detected geographic position information is recorded in a file name corresponding to the video.
Specifically, the geographic location information includes: the absolute geographic position of the device expressed by longitude, latitude and elevation, or the semantic geographic position of the device expressed by a semantic tag, such as a specific inspection room.
Thus, after the handheld terminal is started, the geographic position determining unit identifies geographic position information; the audio and video processing device acquires the geographic position information identified by the geographic position determining unit, stores the acquired video or audio in a safe storage area when detecting that the geographic position information belongs to a preset target position, and records the detected geographic position information in a file name corresponding to the video.
As described above, the handheld terminal has a positioning function. When applied in fields such as public security and customs, upon entering a special position (such as an electronic fence), if the geographic position information is detected to belong to a preset target position, the acquired video or audio is stored in the safe storage area and the detected geographic position information is recorded in the file name corresponding to the video. Therefore, recording is started automatically when entering a special position (such as an electronic fence), which avoids misoperation by personnel and missed law enforcement evidence, improves law enforcement efficiency, and reduces the difficulty of law enforcement for law enforcement personnel.
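The following sketch illustrates one possible form of the preset-target-position check, assuming the target positions are given as circular areas around a latitude/longitude centre; the coordinates, radius and labels are placeholders.

```python
import math

# Preset target positions as (latitude, longitude, radius in metres, semantic label).
TARGET_AREAS = [
    (39.9042, 116.4074, 200.0, "inspection_room_3"),
]

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude fixes."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def matched_target_position(lat, lon):
    """Return the label of the preset target position that contains the current fix, if any."""
    for t_lat, t_lon, radius, label in TARGET_AREAS:
        if haversine_m(lat, lon, t_lat, t_lon) <= radius:
            return label
    return None
```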
In some embodiments, the audio/video processing method for the handheld terminal 1000 further includes:
broadcasting the typical face features, the typical object features, the geographic position information or the information of the preset target mark.
It should be appreciated that the handheld terminal 1000 is provided with an audio playing unit, such as a speaker or loudspeaker, and is capable of voice intercom with the background master control center.
Accordingly, when the handheld terminal is applied to the fields of public security, customs and the like, the corresponding sound can be played in time to prompt related personnel when abnormal sound is identified, contraband or special labels are detected, key personnel are identified, abnormal actions of the personnel are identified or special geographic positions are identified.
Therefore, when the handheld terminal is applied to the fields of public security, customs and the like, the video recording function is automatically started when abnormal actions, abnormal sounds, forbidden articles and special labels of personnel are detected and the handheld terminal enters a special position, so that the loss of key evidence can be avoided.
Therefore, when the handheld terminal is applied in fields such as public security and customs, the recognized abnormal sound information, the detected contraband or special tag information, the recognized key personnel information, the recognized abnormal personnel action information, the recognized special geographic position, and the like can be added into the file name of the audio file or video file. This reduces the workload of manually checking video content in order to name the video, makes it convenient for law enforcement personnel to search for video or audio as evidence, and can effectively improve the efficiency of checking and searching video.
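As an illustration of recording recognized information in file names, the sketch below composes a time-stamped name from the detected features; the exact naming convention is not specified in this disclosure, so the format shown is an assumption.

```python
import re
import time

def build_evidence_filename(kind, features, location=None):
    """Compose a file name from a timestamp, the media kind and the recognized features."""
    parts = [time.strftime("%Y%m%d_%H%M%S"), kind] + list(features)
    if location:
        parts.append(location)
    safe = [re.sub(r"[^\w-]+", "_", p, flags=re.UNICODE) for p in parts]
    return "_".join(safe) + (".mp4" if kind == "video" else ".wav")

# Example: build_evidence_filename("video", ["key_person", "controlled_knife"], "inspection_room_3")
```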
In some embodiments, the audio/video processing method for the handheld terminal 1000 further includes:
with an encoder-decoder architecture, summary information is extracted from the acquired video using a Video Swin self-attention model.
As described above, the handheld terminal can summarize the content of a video file and rename the video file accordingly. This reduces the workload of manually checking video content in order to name the video, makes it convenient for law enforcement personnel to search for video or audio as evidence, and can effectively improve the efficiency of checking and searching video.
As shown in fig. 2, a law enforcement recorder for executing the foregoing audio/video processing method according to an embodiment of the present invention includes:
a camera; a memory, which may be a random access memory (RAM) or a read-only memory (ROM); a voice intercom module, which comprises an audio recording unit for recording on-site sound and can automatically start recording in time when a special situation occurs; an audio playing unit (not shown), such as a speaker or loudspeaker, for playing an alarm or warning sound; an environmental information identification unit, such as a wireless communication tag reader; a communication unit, such as a 4G/5G communication module or Bluetooth module for outdoor use, or a WiFi communication module or Bluetooth module for indoor use, used for transmitting video from the law enforcement recorder to the background master control center, acquiring key personnel information (including face information, identity information and the like) from the background master control center, reporting recognized key personnel in real time, or reporting the recognized position of personnel in real time; a positioning module, which comprises the aforementioned geographic position determining unit and acquires the absolute geographic position of the device, for example via a Beidou module; the audio and video processing device, which is used for detecting abnormal sounds, adding positioning information to video file names, and automatically starting video recording when entering a special position; and a light alarm unit (not shown in the figure) for alarming according to a preset light flashing strategy, where the light flashing strategy may be a police law enforcement flashing scheme or a customs law enforcement flashing scheme.
Referring to fig. 7, the foregoing non-portable device 2000 may be used as a background master control center to form an audio/video processing system with the handheld terminals 1000, such as law enforcement recorders, dispersed in various law enforcement sites, that is, a law enforcement platform based on audio/video processing.
Specifically, the YOLOv8-based typical item detection model, namely the YOLOv8 multi-target detection model described later, is used for item detection to identify contraband or customs special tags; the face recognition model based on the ArcFace model, namely the ArcFace model described later, is used for face recognition, accurately identifying key personnel, comparing them with the key personnel recorded in the database, and verifying personnel identity; the RTMPose model is used to detect human body key points, and the ST-GCN-based human action classification model classifies actions from the detected key points, distinguishing typical human actions such as fighting, falling and loitering, and starts automatic video recording; the recognized typical human actions may also be added to the summary information of the video file; and the summary information is generated for video and audio respectively by using self-attention models with an encoder-decoder architecture.
Specifically, the aforementioned non-portable device 2000 extracts summary information from the video files obtained by the law enforcement recorders at each law enforcement site, and cuts the videos according to the differences in the summary information. For example, the video segments containing contraband, abnormal personnel actions or abnormal sounds are extracted to generate new videos, and the videos are renamed according to summary information such as contraband, abnormal personnel actions and abnormal sounds, which makes it convenient for law enforcement personnel to find evidence.
As shown in fig. 3, after extracting the video image as a frame image, the following processing steps may be performed: target detection and face recognition, human body key point detection and human body action classification.
During target detection, contraband in luggage or a special tag attached to the luggage is detected from the acquired image of the luggage carried by a person, or contraband carried on the body is detected from the acquired image of the person.
Specifically, the handheld terminal or law enforcement recorder stores a pre-trained YOLOv8 multi-target detection model locally and can detect images captured by the camera. When it is detected that the camera has captured contraband or a special tag, the handheld terminal or law enforcement recorder automatically starts video recording, and when the video file is saved, the contraband label and the early warning or alarm are recorded in the file name.
Naming the acquired video file only after opening and reviewing the video may lead to missed law enforcement evidence, as may misoperation by personnel. The handheld terminal or law enforcement recorder of the present application names the acquired video file when video recording ends, which helps improve law enforcement efficiency and reduces the difficulty of law enforcement for law enforcement personnel.
The pre-warning or alarm may include: the light alarm unit arranged on the machine is controlled to carry out light alarm; controlling an audio playing unit arranged on the machine to carry out sound alarm (such as beeping sound and voice prompt); and sending a message of detecting the contraband or the special tag to a background master control center.
Specifically, when training the YOLOv8 multi-target detection model, the number of network output channels of the YOLOv8 model is adjusted according to the number of detection targets, such as the number of contraband types and/or the number of special tag types. After the training set is recognized with a preset accuracy, the YOLOv8 model with the current parameters and network structure is determined as the pre-trained YOLOv8 multi-target detection model.
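A hedged sketch of this training step with the ultralytics package is shown below; the dataset file "contraband.yaml" (whose nc entry would set the number of detection targets and hence the output channels) and the hyperparameters are assumptions.

```python
from ultralytics import YOLO

# Start from pretrained weights; the detection head is rebuilt for the number of
# classes (nc) declared in the hypothetical dataset file "contraband.yaml".
model = YOLO("yolov8n.pt")
model.train(data="contraband.yaml", epochs=100, imgsz=640)

# Keep the current parameters and network structure once validation reaches the preset accuracy.
metrics = model.val()
```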
During face recognition, the acquired face image of a person and the pre-trained ArcFace face recognition model stored locally are used to verify the person's identity, and when the person is determined to be a key person, video recording is started and an alarm or early warning is issued.
Specifically, the pre-trained ArcFace face recognition model is used to perform face feature recognition on the acquired face image of the person, and the face feature recognition result is compared with the key person features recorded in the key personnel list database; when a key person is determined, video recording is started and an early warning or alarm is issued, or the key person information is generated.
As described above, key personnel include pedestrians and passengers in monitored areas, such as subways or intersections controlled by public security departments, and pedestrians or passengers requiring clearance controlled by customs departments. The key person information may include text information such as a name or a special code.
The pre-warning or alarm may include: the light alarm unit arranged on the machine is controlled to carry out light alarm; controlling an audio playing unit arranged on the machine to carry out sound alarm (such as beeping sound and voice prompt); and sending a message of detecting the key personnel to a background master control center.
The video recording starting method comprises the following steps: recording the video with preset duration, and recording the identified key personnel characteristics or key personnel information into the file name of the recorded video file; or after receiving the instruction of finishing video recording, recording the identified key personnel information into the file name of the video file obtained by recording.
When training the ArcFace face recognition model in advance, the method comprises: acquiring a face feature database in which M key persons are registered, the key-person face feature database being determined according to a key personnel list; and training with the face feature database of the M key persons as the training set to obtain the face recognition model based on the ArcFace model. Specifically, the network architecture of the ArcFace face recognition model may use the residual neural network ResNet.
After a face image is input, the pre-trained ArcFace face recognition model can recognize whether the face image records one of the face features of the M key persons; it can accurately identify key personnel, with high recognition accuracy and small recognition error.
Specifically, when the ArcFace face recognition model is trained in advance, several face feature values are extracted by the ArcFace network for each input face image and compared with the key personnel face feature database; if the comparison similarity is higher than a threshold value, a key person feature is determined. In particular, the key personnel face feature database may be stored in the law enforcement recorder or in the background master control center. The law enforcement recorder recognizes the face features and then sends them to the background master control center, and the background master control center compares whether the face features are in the face feature database, generates the key person information, and determines whether the person is recorded in the key personnel list.
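For illustration, the comparison step can be sketched as a cosine-similarity search over the registered key-personnel feature database; the threshold value, data layout and function names below are assumptions.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.5   # assumed value; tuned on validation data in practice

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_key_person(face_embedding, key_person_db):
    """key_person_db maps person id -> registered ArcFace-style feature vector.
    Returns (person_id, similarity) of the best match above the threshold, else None."""
    best_id, best_sim = None, -1.0
    for person_id, feature in key_person_db.items():
        sim = cosine_similarity(face_embedding, feature)
        if sim > best_sim:
            best_id, best_sim = person_id, sim
    if best_id is not None and best_sim >= SIMILARITY_THRESHOLD:
        return best_id, best_sim
    return None
```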
When detecting human body key points, a Real-Time Multi-Person Pose Estimation (RTMPose) model is used to detect the human body key points in the acquired whole-body images of persons.
Whole-body pose estimation involves locating key points of the hands, face, trunk, limbs and feet of the human body, that is, the skeleton points described later, in the image. Because it involves body parts at multiple scales, fine-grained localization in low-resolution regions, and data scarcity, whole-body pose estimation is very challenging: it must deal with hierarchical body structure, the low resolution of hands and faces, and complex body part matching in multi-person images (especially occlusion and complex hand gestures).
In order to improve recognition speed and meet real-time requirements, in the embodiment of the invention the backbone network of the RTMPose model uses the lightweight MobileNetV2. The MobileNet family is designed as a backbone network for mobile or embedded systems. Compared with the MobileNetV1 network, MobileNetV2 adopts an inverted residual structure (an inverted residual block with residual connections: a pointwise expansion convolution first increases the dimension, a depthwise convolution then filters the features, and a final pointwise projection convolution reduces the dimension), so the model is smaller and the recognition accuracy is higher.
Specifically, when detecting human body key points, at least one human body in the image is identified and its key points are identified; or, the respective key points of at least one human body recorded in the image are identified. Thus, for a whole-body image of any person, the key points of a single person can be recognized, and the key points of several persons can also be recognized. Fig. 4 shows a schematic diagram of the detected human body key points for a person facing toward or away from the observer or the image acquisition unit; there are 17 key points in total, numbered 0 to 16, where 0 represents the nose, 1 the right eye, 2 the left eye, 3 the right ear, 4 the left ear, 5 the right shoulder, 6 the left shoulder, 7 the right elbow, 8 the left elbow, 9 the right wrist, 10 the left wrist, 11 the right hip, 12 the left hip, 13 the right knee, 14 the left knee, 15 the right ankle, and 16 the left ankle.
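The 17-point layout enumerated above can be written as a simple lookup table for reference; this mirrors the enumeration in the description rather than any external standard.

```python
# Index -> body part, as enumerated in the description above.
KEYPOINT_NAMES = {
    0: "nose", 1: "right_eye", 2: "left_eye", 3: "right_ear", 4: "left_ear",
    5: "right_shoulder", 6: "left_shoulder", 7: "right_elbow", 8: "left_elbow",
    9: "right_wrist", 10: "left_wrist", 11: "right_hip", 12: "left_hip",
    13: "right_knee", 14: "left_knee", 15: "right_ankle", 16: "left_ankle",
}
```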
As shown in fig. 3, after the human body key points are detected, an attention mechanism is incorporated and a key point spatio-temporal graph is constructed. The skeleton sequence is thus represented by the key point spatio-temporal graph, which addresses the localization of the human body key points.
The detected human body key points can then be used, in combination with the time-ordered video, to analyse and classify human actions. Specifically, constructing a key point spatio-temporal graph for a skeleton sequence (i.e. a sequence of key points with skeleton connection relationships, or of the identified joint points) containing N joint points and T frames can be done in two steps. First, according to the physical connection relationships of the human body structure, joint points belonging to the two ends of the same bone in the same frame are connected by edges, which serve as spatial key point edges; then the same joint point is tracked across consecutive frames and connected, completing the construction of the key point spatio-temporal graph.
Correspondingly, constructing the key point spatio-temporal graph for the skeleton sequence comprises: spatially, for each human skeleton point in a single frame, two adjacent skeleton points are connected to form a spatial key point edge; temporally, for each human skeleton point in two adjacent frames, the same skeleton point is connected across the frames to form a temporal key point edge. For example, a key point spatio-temporal graph G = (V, E) can be constructed from a key point sequence of N = 17 skeleton points over T frames. The set of skeleton points is V = {v_ti | t = 1, ..., T; i = 1, ..., N}, consisting of the coordinate information of all skeleton points in the human key point sequence. E consists of the key point edges in the spatio-temporal graph and comprises two subsets: the first is the set of key point edges formed by two spatially adjacent skeleton points, written Es = {v_ti v_tj | (i, j) ∈ H}, where H is the set of connected human joints; the second is the set of key point edges formed by the same skeleton point in two temporally adjacent frames, written EF = {v_ti v_(t+1)i}. In this way, the constructed key point spatio-temporal graph captures the skeleton point change information in an action sequence. It should be understood that, for any skeleton point i, its temporal trajectory consists of all its key point edges in EF.
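A minimal sketch of constructing the edge sets Es and EF for the key point spatio-temporal graph follows; the joint connection set H is an illustrative assumption based on the keypoint layout above, and the value of T is an example.

```python
N = 17   # skeleton points per frame
T = 64   # frames in the example sequence (illustrative value)

# Illustrative set H of physically connected joints, following the keypoint layout above.
H = [
    (0, 1), (0, 2), (1, 3), (2, 4), (0, 5), (0, 6), (5, 7), (7, 9),
    (6, 8), (8, 10), (5, 11), (6, 12), (11, 13), (13, 15), (12, 14), (14, 16),
]

def build_spatiotemporal_edges(num_joints=N, num_frames=T):
    """Return Es (spatial edges within each frame) plus EF (temporal edges across frames)."""
    spatial_edges = [((t, i), (t, j)) for t in range(num_frames) for (i, j) in H]
    temporal_edges = [((t, i), (t + 1, i)) for t in range(num_frames - 1) for i in range(num_joints)]
    return spatial_edges + temporal_edges
```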
Further, an ST-GCN (Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition) model can be constructed by applying spatio-temporal graph convolutions to the key point spatio-temporal graph. The network structure of the constructed ST-GCN model is shown in fig. 5. The ST-GCN network structure comprises a batch normalization layer, 9 ST-GCN units, a pooling layer, a fully connected layer and a K-Means classifier; fig. 5 schematically shows the 1st, 2nd and 9th ST-GCN units, while the 3rd to 8th ST-GCN units are omitted.
As shown in fig. 5, each ST-GCN unit uses feature residual fusion, which enables cross-region fusion of features and improves the learning capacity of the model. Each ST-GCN unit is provided with an attention mechanism module, a graph neural network module and a temporal convolution network module to increase the attention of the network. A dropout probability of 0.4 is used after each ST-GCN unit to reduce the risk of model overfitting. In addition, the features in the network are downsampled using a pooling operation with a stride of 2 after the 4th and 7th convolutional layers.
Thus, when classifying human actions, the detected human body key points are obtained and the human actions are classified using the pre-trained ST-GCN action classification model, distinguishing typical human postures and actions such as fighting, falling and loitering.
For the aforementioned key point spatio-temporal graph constructed from the key point sequence of N = 17 skeleton points over T frames, the 1st to 3rd ST-GCN units of the ST-GCN network have 64 output dimensions, the 4th to 6th units have 128 output dimensions, and the 7th to 9th units have 256 output dimensions. On the final 256-dimensional output, the K-Means algorithm is further used to classify the input key point sequence into actions. Thus, using the constructed ST-GCN action classification model, typical human actions such as fighting, falling and loitering can be recognized.
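The channel plan described above can be summarized in a small configuration helper, sketched below; the output dimensions, stride positions and action list follow the description, while the helper itself and the input channel count are illustrative.

```python
# Output dimensions of the nine ST-GCN units as described: 64/64/64, 128/128/128, 256/256/256.
UNIT_CHANNELS = [64, 64, 64, 128, 128, 128, 256, 256, 256]
DOWNSAMPLE_AT = {3, 6}          # 0-based indices of the 4th and 7th units (stride-2 pooling)
TYPICAL_ACTIONS = ["fighting", "falling", "loitering"]

def stgcn_unit_config(in_channels=3):
    """Return (in_channels, out_channels, temporal_stride) for each of the nine ST-GCN units."""
    config, previous = [], in_channels
    for index, out_channels in enumerate(UNIT_CHANNELS):
        stride = 2 if index in DOWNSAMPLE_AT else 1
        config.append((previous, out_channels, stride))
        previous = out_channels
    return config
```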
Specifically, the key point spatio-temporal graph construction and the ST-GCN action classification can be executed by the law enforcement recorder or by the background master control center. For example, the law enforcement recorder can send the constructed key point spatio-temporal graph to the background master control center; the background master control center uses the ST-GCN action classification model to recognize typical human actions such as fighting, falling and loitering, and then sends the result back to the law enforcement recorder, which starts video recording and raises an alarm or early warning. For starting video recording and raising an alarm or early warning, refer to the foregoing description, which is not repeated here.
In some embodiments, obtaining identification information identified by an environmental information identification unit, the environmental information identification unit comprising a wireless communication tag read-write module;
when the identification information is detected to belong to a preset target identification, the acquired video or audio is stored in a safe storage area, and the information of the detected preset target identification is recorded in a file name corresponding to the video or the audio.
Specifically, the identification information includes: identification information recorded by the RFID tag or identification information recorded by the NFC tag. It should be understood that the identification information recorded by the RFID tag or the identification information recorded by the NFC tag is customized according to a law enforcement scenario of customs or public security, and will not be described again.
During tag reading, when the environmental information identification unit or tag reader-writer recognizes or detects that a special tag is present in the surrounding environment (for example, a special tag attached to luggage, using an active or passive tag communicating in a specific frequency band), the tag information is acquired, video recording is started, and an alarm or early warning is issued, for example notifying the user of the handheld terminal by sound and light.
Further, the file name of the video recorded is recorded with the information of the extracted special tag.
Specifically, the active or passive tag for the specific frequency band communication may be a 433M active RFID tag, or a 868/915MHz/2.4G/923M RFID tag, or the like. The active RFID tag may be an active tag capable of actively signaling, with a small transmit power, and a long identification distance, on the order of 100 meters to 1500 meters. The active RFID tag can be provided with a power supply battery, and has good reliability and compatibility and large transmission data volume. For application scenarios with a short identification distance, the tag may also be a near field communication (Near Field Communication, NFC) tag or other short-range high frequency wireless communication tag.
In some embodiments, the audio/video processing method for the handheld terminal 1000 further includes:
Acquiring geographic position information identified by a geographic position determining unit;
when the geographic position information is detected to belong to a preset target position, the acquired video or audio is stored in a safe storage area, and the detected geographic position information is recorded in a file name corresponding to the video.
Specifically, the geographic location information includes: the absolute geographic position of the device expressed by longitude, latitude and elevation, or the semantic geographic position of the device expressed by a semantic tag, such as a specific inspection room.
And when the positioning information is identified, the local machine is positioned according to the absolute geographic position of the local machine acquired by the geographic position determining unit so as to determine the shooting position of the video.
Specifically, in order to achieve accurate indoor area positioning, RFID tags such as 433 MHz tags can be placed at key indoor and outdoor locations; the 433 MHz tags are located by a signal strength method, and the device is thereby positioned. To achieve accurate outdoor area positioning, the device can be positioned according to the absolute geographic position generated by the Beidou module.
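A minimal sketch of the signal-strength idea follows: the device is assigned the semantic place of the known 433 MHz tag received with the strongest signal. Tag identifiers, RSSI values and place names are hypothetical.

```python
# Known 433 MHz tags and the key locations where they are installed (hypothetical values).
PLACE_BY_TAG = {
    "TAG-433-A1": "machine_inspection_room",
    "TAG-433-B2": "manual_inspection_desk",
}

def locate_by_rssi(readings):
    """readings maps tag id -> RSSI in dBm; return the place of the strongest known tag, if any."""
    known = {tag: rssi for tag, rssi in readings.items() if tag in PLACE_BY_TAG}
    if not known:
        return None
    strongest = max(known, key=known.get)
    return PLACE_BY_TAG[strongest]
```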
Specifically, the important indoor and outdoor places may include: indoor or outdoor areas provided for customs inspection work, such as machine check rooms and manual check tables; an imported automobile inspection area, an animal and plant product (including food) inspection area, an imported waste material inspection area, a fresh and live product inspection area for supply to Hong Kong and Macao, a health quarantine inspection area, or a highway port bus inspection area.
When it is determined from the identified positioning information that the local machine is located at, or has entered, a pre-marked position, for example an indoor area such as a checking room, the generated video is acquired and stored, that is, video recording is started and an alarm or early warning is issued, for example by notifying the user of the handheld terminal in an acousto-optic manner. Further, the identified location information is added to the file name of the recorded video. In particular, the indication may be audible and visual, such as a warning light or a message presented on a display device.
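For illustration, a signal-strength positioning step of the kind described above might look like the following minimal sketch. The log-distance path-loss constants, the beacon place labels and the strongest-beacon rule are assumptions, since the disclosure does not specify the signal-strength algorithm.

```python
# Illustrative sketch only: tx_power_dbm and the path-loss exponent n are assumed
# calibration constants, and the beacon place labels are made-up examples.
def rssi_to_distance(rssi_dbm, tx_power_dbm=-40.0, n=2.5):
    """Rough distance estimate (m) to a 433 MHz beacon using a log-distance path-loss model."""
    return 10 ** ((tx_power_dbm - rssi_dbm) / (10 * n))

def locate_by_strongest_beacon(readings, beacon_places):
    """readings: {beacon_id: rssi_dbm}; return the semantic place label of the strongest beacon."""
    best = max(readings, key=readings.get)
    return beacon_places[best]

# Example: resolve the semantic position, then use it to decide whether recording
# should be triggered at a pre-marked place.
place = locate_by_strongest_beacon(
    {"tag_A": -62, "tag_B": -48},
    {"tag_A": "manual check table 1", "tag_B": "machine check room 3"})
print(place)  # machine check room 3
```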
In some embodiments, an audio self-attention model is used to identify sounds, and when a scream, a call for help or a gunshot is detected, video recording is turned on and an alarm or early warning is issued.
In some embodiments, the background master control center uses an audio self-attention model to identify sounds, and when a scream, a call for help or a gunshot is detected, information such as "scream", "call for help" or "gunshot" is added to the summary information of the audio file or video file.
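For illustration, turning detected sound events into recorder actions and summary tags might look like the following minimal sketch. The label names, the 0.8 threshold and the recorder/alarm interfaces are assumptions made for the example.

```python
# Assumed label names for the sound events of interest; not taken from the disclosure.
SOUND_EVENTS_OF_INTEREST = {"scream", "call for help", "gunshot"}

def handle_sound_events(event_probs, recorder, alarm, threshold=0.8):
    """event_probs: {label: probability} produced by the audio self-attention model."""
    detected = [e for e in SOUND_EVENTS_OF_INTEREST if event_probs.get(e, 0.0) >= threshold]
    if detected:
        recorder.start_recording()                          # handheld terminal side
        alarm.notify("abnormal sound: " + ", ".join(detected))
        recorder.add_summary_tags(detected)                 # added to the audio/video summary info
```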
In some embodiments, the handheld terminal or the background master control center uses self-attention models for video and audio respectively to generate summary information, for example with an encoder-decoder architecture, and uses a video Swin self-attention model to extract summary information from the acquired video.
Specifically, when summary information is generated for a video using a self-attention model, an encoder-decoder structure is employed. In the encoder-decoder architecture, audio and video are processed by separate encoders and decoders, namely a video encoder and an audio encoder, and a video decoder and an audio decoder. The video encoder models the spatial information of the video, uses an additional local temporal modeling module to model the temporal relations of the video-related task, passes the result through a spatial self-attention network layer, and finally the decoder generates the summary information of the audio and video.
Specifically, the video encoder adopts a video Swin self-attention model and is pre-trained on the BBC Pose dataset. The audio encoder adopts an audio self-attention model and is pre-trained on the AudioSet dataset. Accordingly, the trained audio encoder and the trained video encoder are used to generate summary information for the audio and the video, respectively.
Specifically, the audio information is defined as an input sequence of fixed length T, from which a sequence of the same length but of a chosen dimension E, the size of the latent space, is generated. In this way, a sequence x = (x_1; x_2; ...; x_T) is mapped to a sequence of the same length T, namely z = (z_1; z_2; ...; z_T), where the dimension of each z_t is the chosen hyper-parameter E, i.e., the size of the embedding layer, which is 64. The network structure of the audio self-attention model is shown in fig. 6. In this network structure, the self-attention model has 6 layers, a 64-dimensional embedding layer, feed-forward layers of 128 neurons (3 layers), and 8 attention heads. The audio is processed second by second: each second is split into 10 slices of 100 milliseconds and input to the front-end module, which includes a fully connected layer of 2048 neurons followed by a layer of 64 neurons to match the size of the embedding layer of the self-attention module, so that the front-end representation contains 10 time steps, each of 64 dimensions (64 being the hyper-parameter described above). Specifically, the convolution kernel of the average pooling layer is 3. For a self-attention encoder model with L layers, each layer is stacked on top of the previous one. Each self-attention module comprises an attention sub-module F_a and a feed-forward sub-module F_ff. The output of each sub-module passes through a normalization layer and a residual connection. Thus, if the input of a sub-module (the attention sub-module F_a or the feed-forward sub-module F_ff) is a sequence x_b, its output is not passed directly to the next module or sub-module; instead, the output x_bo of the layer normalization and residual connection is passed on, i.e., x_bo = LayerNorm(x_b + F_a/ff(x_b)). In this way, the normalization layer helps the model converge better and improves performance. Specifically, the output dimension of the last fully connected layer is 200, the same as the number of items of the output audio summary information, from which the audio summary information is finally obtained.
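A minimal PyTorch sketch of this audio encoder is given below for illustration. The per-slice feature size (1024), the mean pooling over the 10 time steps before the final fully connected layer, and the standard two-layer feed-forward block inside each encoder layer are assumptions where the text is ambiguous; the 6 layers, 64-dimensional embedding, 8 heads, 128-neuron feed-forward width, 2048-then-64 front end and 200-dimensional output follow the description above.

```python
import torch
import torch.nn as nn

class AudioSelfAttentionEncoder(nn.Module):
    """Sketch of the audio self-attention encoder; stated dimensions follow the text, the rest is assumed."""
    def __init__(self, slice_dim=1024, embed_dim=64, n_layers=6, n_heads=8,
                 ff_dim=128, n_outputs=200):
        super().__init__()
        # Front end: a 2048-neuron FC layer, then a 64-neuron FC layer matching the embedding size.
        self.front_end = nn.Sequential(
            nn.Linear(slice_dim, 2048), nn.ReLU(),
            nn.Linear(2048, embed_dim), nn.ReLU(),
        )
        # Post-norm encoder layers: each sub-module output is LayerNorm(x_b + F(x_b)).
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, dim_feedforward=ff_dim,
            batch_first=True, norm_first=False)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(embed_dim, n_outputs)   # 200 summary-item outputs

    def forward(self, slices):            # slices: (batch, 10, slice_dim), ten 100 ms slices per second
        x = self.front_end(slices)        # (batch, 10, 64)
        z = self.encoder(x)               # (batch, 10, 64)
        return self.head(z.mean(dim=1))   # pool over time steps (assumed), project to 200 items

# Example: one second of audio as ten 100 ms slices of assumed 1024-dimensional features.
logits = AudioSelfAttentionEncoder()(torch.randn(2, 10, 1024))
print(logits.shape)  # torch.Size([2, 200])
```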
Specifically, the video encoder first extracts the image features of each video frame and computes representative features of the shots through the video Swin self-attention model; the importance of the different shots is then dynamically adjusted by an attention mechanism. The attention consists of two parts: the first part computes representative features from the image features of each shot and computes feature-based spatial attention; then, the influence of the local features of the current video on the whole video shot is determined, and temporal features are extracted with a recursive long short-term memory network (ReLSTM). A shot-sequence storage structure is first set up; before the ReLSTM learning, the current shot sequence, led by the representative feature of its first shot, is stored in the storage structure; an attention mechanism computes the attention distribution over the shots in the storage structure and weights them; and the weighted shot features are finally input into the ReLSTM network to complete single-layer temporal feature extraction. Similarly, each layer adds the representative feature of the first shot of its shot sequence to the storage structure and extracts temporal features according to the above steps, until all shots have been input into the storage structure. Finally, a fusion mechanism fuses the representative features, the shot features and the global temporal features, the score of each shot is computed, and the video summary is finally obtained.
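The shot-scoring stage can be pictured with the simplified sketch below. It is not the disclosed implementation: a single standard LSTM and one multi-head attention layer stand in for the ReLSTM and the shot-sequence storage structure, and the 256-dimensional shot features and top-5 shot selection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ShotScorer(nn.Module):
    """Simplified shot scorer: attention over shot features, temporal features, fusion, per-shot score."""
    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)          # stands in for the ReLSTM
        self.score = nn.Sequential(nn.Linear(feat_dim + hidden, 128), nn.ReLU(),
                                   nn.Linear(128, 1))                    # per-shot importance score

    def forward(self, shot_feats):                                  # (batch, n_shots, feat_dim)
        weighted, _ = self.attn(shot_feats, shot_feats, shot_feats)  # re-weight shots by attention
        temporal, _ = self.lstm(weighted)                            # temporal features over shots
        fused = torch.cat([weighted, temporal], dim=-1)              # simple fusion mechanism
        return self.score(fused).squeeze(-1)                         # (batch, n_shots)

scores = ShotScorer()(torch.randn(1, 20, 256))       # 20 shots with assumed 256-dim features
summary_shots = scores.topk(5, dim=1).indices        # keep the 5 highest-scoring shots
```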
In some embodiments, the background master control center generates the summary information of the audio file or the video file, where the summary information may include the aforementioned textual information, such as semantic descriptions or special codes, corresponding respectively to the typical human action, the typical voice feature, the typical face feature, the typical object feature, the identification information and the geographic location information.
In conclusion, the law enforcement recorder has a target detection function based on the YOLOv8 model and can automatically detect contraband. The law enforcement recorder can also use the ST-GCN model to classify actions, automatically identify abnormal actions of personnel, automatically identify abnormal sounds, and prompt the user in time. The law enforcement recorder or the background master control center can also automatically clip the video segments containing contraband, abnormal personnel actions or abnormal sounds to generate new videos, and rename the clipped videos according to summary information such as the contraband, the abnormal actions and the abnormal sounds, which makes it convenient for users to search videos and find special events, that is, it helps personnel find law enforcement evidence and effectively improves the efficiency of inspection and of video retrieval.
In one embodiment, as shown in fig. 3, after power-on the law enforcement recorder continuously acquires image frames through a camera or high-definition camera but does not save them, i.e., it does not record video. After a target is detected by the audio/video detection method described above, the video is recorded automatically and summary information or a tag is generated; after recording ends, the recorded video is automatically named with several pieces of summary information, which makes searching convenient.
When the law enforcement recorder is in use, the scene is photographed by the high-definition camera of the recorder, and a YOLOv8 multi-target detection model is used to detect contraband targets; the ArcFace model is combined to perform face recognition on passengers and other personnel, accurately identify pre-recorded key persons, and verify passenger identity; the RTMPose model is used, with an attention mechanism fused into the algorithm to focus on key parts such as the hands and upper limbs, to construct a space-time graph of human key points, so that the key points can be detected better and the human posture estimated; human actions are classified based on the ST-GCN model, action types such as falling and fighting are recognized, and an alarm is given in time; finally, summary information and position information are automatically extracted from the video information and used for naming, which reduces the workload of law enforcement personnel in searching and reviewing videos and effectively improves their working efficiency.
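For illustration, a simplified orchestration of this pipeline might look like the following sketch. The model wrapper methods (detect_contraband, recognize_key_persons, estimate_pose, classify_action) and the recorder and alarm interfaces are assumed stand-ins for the YOLOv8, ArcFace, RTMPose and ST-GCN components, not actual APIs from those projects.

```python
from datetime import datetime

def process_frame(frame, models, recorder, alarm):
    """Run the detectors on one frame and start recording when anything of interest appears."""
    tags = []
    tags += models.detect_contraband(frame)          # YOLOv8 multi-target detection (assumed wrapper)
    tags += models.recognize_key_persons(frame)      # ArcFace match against pre-recorded key persons
    keypoints = models.estimate_pose(frame)          # RTMPose key-point space-time graph
    action = models.classify_action(keypoints)       # ST-GCN action classification
    if action in ("falling", "fighting"):            # assumed label names
        tags.append(action)
        alarm.notify(action)
    if tags and not recorder.is_recording():
        recorder.start_recording()                   # move buffered frames into the safe storage area
    return tags

def make_filename(tags, location):
    """Name the saved clip from summary and location info so that later search is easy."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{stamp}_{location}_{'_'.join(sorted(set(tags)))}.mp4"
```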
As shown in fig. 6 and 7, the handheld terminal according to the embodiment of the present invention includes: an audio/video processing apparatus 100, an environmental information identification unit 300, and a geographic position determination unit 200;
the audio and video processing device is connected with the environment information identification unit and the geographic position determination unit and is used for executing the audio and video processing method.
As shown in fig. 6 and 7, an audio/video processing system according to an embodiment of the present invention includes:
a handheld terminal 1000, a non-portable device 2000;
the handheld terminal is communicatively coupled to the non-portable device,
the handheld terminal is used for executing the audio and video processing method;
the non-portable device is used for executing the audio and video processing method.
Since they are based on the same inventive concept, the audio and video processing device, the handheld terminal and the audio and video processing system of the embodiments of the present invention have the same beneficial effects as the methods they adopt, run or implement; their implementation may refer to the foregoing description of the methods and is not repeated here.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
The above description is illustrative of the invention and is not to be construed as limiting, and it will be understood by those skilled in the art that many modifications, variations or equivalents may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. An audio/video processing method, comprising:
Acquiring audio and storing the audio in a temporary buffer area;
identifying the audio using a speech recognition model based on the audio self-attention model;
when the typical voice characteristics are recorded in the audio, storing the audio in a safe storage area, and recording the recognized typical voice characteristics in a file name corresponding to the audio; and/or
Acquiring a video, storing the video in a temporary buffer area, and extracting at least one frame of image from the acquired video;
identifying the at least one frame of image using an ArcFace model based face recognition model;
when the characteristic of the typical face is recorded in the at least one frame of image, storing the acquired video in a safe storage area, and recording the characteristic of the identified typical face in a file name corresponding to the video;
the temporary buffer area is arranged at the handheld terminal, and the safe storage area is arranged at the handheld terminal or the non-portable device.
2. The audio-video processing method according to claim 1, further comprising:
extracting at least one frame of image from the acquired video;
detecting the at least one frame of image using a YOLOV8-based typical item detection model;
when the characteristic of the typical object is recorded in the at least one frame of image, the acquired video is stored in a safe storage area, and the detected characteristic of the typical object is recorded in a file name corresponding to the video.
3. The audio-video processing method according to claim 1, further comprising:
identifying the video based on a human motion classification model of ST-GCN;
when the typical human body action is recorded in the video, the acquired video is stored in a safe storage area, and the recognized typical human body action is recorded in a file name corresponding to the video.
4. The audio-video processing method according to claim 3, wherein identifying the video by the ST-GCN-based human action classification model comprises:
using an RTMPose model to detect key points of a human body on a whole body image of a person recorded in the acquired video, and generating a key point time-space diagram;
inputting the key point space-time diagram into the human body action classification model based on ST-GCN, and identifying whether typical human body actions are recorded in the video.
5. The audio-video processing method according to claim 1, further comprising:
acquiring identification information identified by an environmental information identification unit, wherein the environmental information identification unit comprises a wireless communication tag reading and writing module;
when the identification information is detected to belong to a preset target identification, the acquired video or audio is stored in a safe storage area, and the information of the detected preset target identification is recorded in a file name corresponding to the video or the audio.
6. The audio-video processing method according to claim 1, further comprising:
acquiring geographic position information identified by a geographic position determining unit;
when the geographic position information is detected to belong to a preset target position, the acquired video or audio is stored in a safe storage area, and the detected geographic position information is recorded in a file name corresponding to the video.
7. The audio-video processing method according to any one of claims 1 to 6, characterized by further comprising:
extracting, with an encoder-decoder architecture, summary information from the acquired video using a video Swin self-attention model.
8. The audio-video processing method according to any one of claims 1 to 6, characterized by further comprising:
broadcasting the typical face features, the typical object features, the geographic position information or the information of the preset target identification.
9. A hand-held terminal, comprising: the system comprises an audio and video processing device, an environment information identification unit and a geographic position determination unit;
the audio/video processing device is connected to the environmental information identifying unit and the geographic position determining unit, and is configured to perform the audio/video processing method according to any one of claims 1 to 8.
10. An audio video processing system, comprising:
a handheld terminal, a non-portable device;
the handheld terminal is communicatively coupled to the non-portable device,
the handheld terminal is used for executing the audio and video processing method according to any one of claims 1 to 8;
the non-portable device is configured to perform the audio-video processing method according to any one of claims 1 to 8.
CN202311262828.XA 2023-09-27 2023-09-27 Audio and video processing method and device for handheld terminal and law enforcement recorder Pending CN117456405A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311262828.XA CN117456405A (en) 2023-09-27 2023-09-27 Audio and video processing method and device for handheld terminal and law enforcement recorder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311262828.XA CN117456405A (en) 2023-09-27 2023-09-27 Audio and video processing method and device for handheld terminal and law enforcement recorder

Publications (1)

Publication Number Publication Date
CN117456405A true CN117456405A (en) 2024-01-26

Family

ID=89584438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311262828.XA Pending CN117456405A (en) 2023-09-27 2023-09-27 Audio and video processing method and device for handheld terminal and law enforcement recorder

Country Status (1)

Country Link
CN (1) CN117456405A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118317124A (en) * 2024-03-29 2024-07-09 重庆赛力斯凤凰智创科技有限公司 Video data storage method and device

Legal Events

Date Code Title Description
PB01 Publication