Audio and video synchronization method for a network monitoring system
Technical field
The present invention relates to the field of network monitoring, and in particular to an audio and video synchronization method for a network monitoring system.
Background art
With the advance of video surveillance technology and its wide application in many fields, audio surveillance has begun to attract attention in many settings. Public security organs as well as key facilities such as airports, railways, and banks increasingly require clear, faithful audio-visual monitoring in high-grade security engineering, and audio surveillance has become a new focus of the security industry. The addition of audio surveillance ends the "silent movie" era of pure video monitoring and supports comprehensive control over incidents as well as their accurate assessment and handling. Applying audio to monitoring fills a large gap in the security field and has been a major direction of network monitoring in recent years. However, objects such as the audio and video in multimedia data have a strict temporal relationship, and network transmission destroys this original time-domain relationship, so that sound and image in the system cannot be played back synchronously. Research on and realization of audio and video synchronization technology in network monitoring systems is therefore particularly important. Existing synchronization techniques, however, suffer from problems such as complicated implementation and low synchronization accuracy.
Summary of the invention
To overcome the complicated implementation and low synchronization accuracy of audio and video synchronization in existing network monitoring systems, the present invention provides an audio and video synchronization method for a network monitoring system that is simple to implement and has high synchronization accuracy.
The technical solution adopted by the present invention to solve the technical problem is as follows:
An audio and video synchronization method in a network monitoring system, the method comprising the following steps:
(1) The video is compressed by the hardware compression built into the TW2835 chip to generate H.264 data, and the audio is compressed in software into the G.729 data format; both are then passed to the RTP library for encapsulation and transmission.
The RTP header contains a sequence number and a timestamp. During transmission, the sequence number in each RTP packet sent increases by one, and the timestamp identifies the capture instant of the audio or video data.
(2) After the IP header and UDP header are stripped from a packet received from the network, the payload type in the RTP header first determines whether the RTP packet is placed in the audio buffer or the video buffer; the payload data in the RTP packet is then inserted into the correct position in the buffer according to the order of the sequence-number field.
After network reception starts, the audio and video buffers each pre-store an appropriate number of packets; when the data of the audio and video streams have filled the pre-store region, playback of both starts simultaneously. Synchronization takes the audio stream as the time reference, and audio-video synchronization is achieved by adjusting the video.
Further, in step (2), the timestamp of the audio is used as the relative reference time. After playback starts, audio data is taken from the buffer and fed into the decoder at a constant rate, and the timestamp A_t of the first block of data in the buffer is recorded; A_t is then compared with the timestamp V_t of the first block of data in the video buffer, and the difference A_t - V_t determines the push rate of the video data and the playback rate of the video, as follows:
2.1) When -100 ms ≤ A_t - V_t ≤ 100 ms, the audio and video buffers push data at the normal rate, and the playback rate remains unchanged;
2.2) When 100 ms ≤ |A_t - V_t| ≤ 160 ms, a synchronization adjustment is required:
① If 100 ms ≤ A_t - V_t ≤ 160 ms, the audio leads the video: the data in the video buffer is pushed faster and the video playback rate is increased, so that the audio and video timestamps converge;
② If -160 ms ≤ A_t - V_t ≤ -100 ms, the audio lags the video: the push rate of the video buffer is slowed and the video playback rate is reduced, so that the audio and video timestamps converge;
2.3) When |A_t - V_t| ≥ 160 ms, resynchronization is required:
① If A_t - V_t ≥ 160 ms, the audio leads the video severely: the oldest video packets in the video buffer are discarded until A_t = V_t, after which playback restarts at the normal rate;
② If A_t - V_t ≤ -160 ms, the audio lags the video severely: the oldest audio packets in the audio buffer are discarded until A_t = V_t, after which playback restarts at the normal rate.
Still further, the buffer is a linked list of data nodes. Each media stream has two kinds of data nodes: idle data nodes (FreeDatanode) and in-use data nodes (BusyDatanode). When a new RTP packet is received, a FreeDatanode is requested and becomes a BusyDatanode; the media payload and the sequence number of the RTP packet are written into it, and according to this sequence number the BusyDatanode is inserted into the correct position in the buffer, thereby restoring the original time relationship of the media data in the buffer. After the data in a BusyDatanode has been fed into the decoder and played, the BusyDatanode becomes a FreeDatanode again. When all FreeDatanodes are in use, i.e. the buffer is full, the data in the oldest BusyDatanode is deleted and that node automatically reverts to a FreeDatanode.
The technical concept of the present invention is as follows: a complete network monitoring system comprises the capture and compression of audio and video data, packing and sending, network transmission, network reception, and real-time synchronization. According to their locations, the system can be divided into three parts: the device end (capture, compression, packing, and sending), the network server (network transmission), and the receiving end (network reception and synchronization), as shown in Fig. 1. The basic procedure is as follows: the receiving end sends a control command notifying the server to forward the audio and video data of the device end; upon receiving the instruction, the sending end captures audio and video data through the audio-video capture chips, compresses them separately (H.264 for video, G.729 for audio), and packs them into RTP packets according to the RTP protocol standard for transmission to the network server; the network server forwards the data to the receiving end, where dynamic buffering, DirectShow, and related techniques realize real-time synchronized playback of the audio and video data from the monitoring site.
The beneficial effects of the present invention are mainly as follows: 1) video and audio are compressed with H.264 and G.729 respectively, two widely used high-compression-rate algorithms that save bandwidth effectively and improve network transmission efficiency; 2) the RTP transport protocol gives a strong live effect, reproducing the remote scene over the network in real time; 3) the audio-video synchronization adjustment time is short and the synchronization accuracy is high, with a maximum out-of-sync interval of 160 ms; 4) the implementation complexity of the synchronization is low, which greatly simplifies the audio-video synchronization algorithm and improves efficiency.
Brief description of the drawings
Fig. 1 is an architecture diagram of the network monitoring system.
Fig. 2 is a schematic diagram of the audio and video data encapsulation process.
Fig. 3 is a schematic diagram of the dynamic buffer.
Fig. 4 is a schematic diagram of the audio playback thread.
Fig. 5 is a schematic diagram of the video playback thread.
Embodiment
The invention is further described below with reference to the accompanying drawings.
Referring to Figs. 1 to 5, an audio and video synchronization method in a network monitoring system comprises the following steps:
(1) The video is compressed by the hardware compression built into the TW2835 chip to generate H.264 data, and the audio is compressed in software into the G.729 data format; both are then passed to the RTP library for encapsulation and transmission.
The RTP header contains a sequence number and a timestamp. During transmission, the sequence number in each RTP packet sent increases by one, and the timestamp identifies the capture instant of the audio or video data.
(2) After the IP header and UDP header are stripped from a packet received from the network, the payload type in the RTP header first determines whether the RTP packet is placed in the audio buffer or the video buffer; the payload data in the RTP packet is then inserted into the correct position in the buffer according to the order of the sequence-number field.
After network reception starts, the audio and video buffers each pre-store an appropriate number of packets; when the data of the audio and video streams have filled the pre-store region, playback of both starts simultaneously. Synchronization takes the audio stream as the time reference, and audio-video synchronization is achieved by adjusting the video.
At the monitoring site, the audio and video data of the device end are captured by a WM8731 chip and a TW2835 chip respectively. The video is then compressed by the hardware compression built into the TW2835 chip to generate H.264 data, and the audio is compressed in software into the G.729 data format. The data are then passed to the RTP library for encapsulation and transmission; the encapsulation process is shown in Fig. 2.
In the synchronization process, the sequence number and timestamp fields of the RTP header are particularly important. During transmission, the sequence number in each RTP packet sent increases by one, which allows the receiving end to sort the packets, restore their original time relationship, and thus overcome the effects of network congestion, server delay, and similar disturbances. The timestamp is even more important: it identifies the capture instant of the audio or video data and is the most important quantity in synchronization control. All of this information is written while the audio and video data are encapsulated into RTP packets.
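For reference, both fields sit in the fixed RTP header defined in RFC 3550. A minimal C sketch of that layout is given below; note that C bit-field ordering is compiler-dependent, so the struct illustrates the fields rather than serving as a portable wire parser:

```c
#include <stdint.h>

/* Fixed RTP header (RFC 3550). All fields travel in network byte
 * order; the bit-field layout here is illustrative, not portable. */
typedef struct {
    uint8_t  version:2;      /* RTP version, always 2                 */
    uint8_t  padding:1;      /* padding flag                          */
    uint8_t  extension:1;    /* header-extension flag                 */
    uint8_t  csrc_count:4;   /* number of contributing-source IDs     */
    uint8_t  marker:1;       /* marker bit                            */
    uint8_t  payload_type:7; /* PT: routes the packet to the audio or
                              * video buffer at the receiving end     */
    uint16_t seq_num;        /* increases by one per packet sent      */
    uint32_t timestamp;      /* capture instant of the media data     */
    uint32_t ssrc;           /* synchronization source identifier     */
} rtp_header_t;
```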
The main role of the network server in the synchronization process is to relay signaling and data.
Under the influence of network delay, packet loss, and other factors, RTP packets may arrive at the receiving end out of order; a video packet may even arrive while its corresponding audio packet is still in transit. For this reason, a dynamic buffer is set up at the receiving end for audio and for video respectively, as shown in Fig. 3. After the IP header and UDP header are stripped from a packet received from the network, the payload type (PT value) in the RTP header first determines whether the RTP packet is placed in the audio buffer or the video buffer; the payload data in the RTP packet is then inserted into the correct position in the buffer according to the order of the sequence-number field. In the actual design, the buffer is a linked list of nodes, implemented as follows: 1) each media stream has two kinds of data nodes, idle data nodes (FreeDatanode) and in-use data nodes (BusyDatanode); 2) when a new RTP packet is received, a FreeDatanode is requested and becomes a BusyDatanode; the media payload and the sequence number of the RTP packet are written into it, and according to this sequence number the BusyDatanode is inserted into the correct position in the buffer, thereby restoring the original time relationship of the media data; 3) after the data in a BusyDatanode has been fed into the decoder and played, the BusyDatanode becomes a FreeDatanode again; when all FreeDatanodes are in use, i.e. the buffer is full, the data in the oldest BusyDatanode is deleted and that node automatically reverts to a FreeDatanode. Fig. 3 also shows the concrete structure of a Datanode: besides the data field, it has four further fields, Len, Key, SequNum, and Timestamp, which respectively represent the length of the data segment, whether the data belongs to a video key frame, the sequence number of the packet, and the timestamp.
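As an illustration, the buffer node and the sequence-number-ordered insertion might be sketched in C as follows. The field names mirror the Datanode fields described above, while the list handling (MAX_PAYLOAD, the busy-list head passed in) is an assumption of this sketch:

```c
#include <stdint.h>

#define MAX_PAYLOAD 1500          /* assumed maximum RTP payload size */

typedef struct Datanode {
    int      Len;                 /* length of the payload in data[]      */
    int      Key;                 /* nonzero if part of a video key frame */
    uint16_t SequNum;             /* RTP sequence number                  */
    uint32_t Timestamp;           /* RTP timestamp (capture instant)      */
    uint8_t  data[MAX_PAYLOAD];   /* media payload of the RTP packet      */
    struct Datanode *next;
} Datanode;

/* Insert a node taken from the free list into the busy list, keeping
 * the busy list ordered by sequence number so that the original time
 * relationship of the media data is restored. */
static void insert_busy_sorted(Datanode **busy, Datanode *node)
{
    Datanode **p = busy;
    /* the (int16_t) cast keeps the comparison correct across the
     * 16-bit sequence-number wrap-around */
    while (*p && (int16_t)((*p)->SequNum - node->SequNum) < 0)
        p = &(*p)->next;
    node->next = *p;
    *p = node;
}
```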
The design of the dynamic buffer is the key step in achieving synchronization at the receiving end: it not only restores the normal playback order within each media stream, but also controls the synchronization between the media. After network reception starts, the audio and video buffers each pre-store an appropriate number of packets. The length of the pre-store region must be large enough to compensate for jitter in the media streams, yet small enough to meet the real-time requirement without delaying the data too long; it is therefore generally kept within 500 ms. When the data of the audio and video streams have filled the pre-store region, playback of both starts simultaneously. For synchronized playback, a suitable reference stream must be chosen. Human hearing is more sensitive than vision: pauses and speed changes in steadily playing sound are hard for listeners to accept, and the audio stream also occupies far less bandwidth than the video. Synchronization therefore takes the audio stream as the time reference and adjusts the video to achieve audio-video synchronization. Here the audio timestamp serves as the relative reference time. After playback starts, audio data is taken from the buffer and fed into the decoder at a constant rate, and the timestamp A_t of the first block of data in the buffer is recorded. A_t is then compared with the timestamp V_t of the first block of data in the video buffer, and the difference A_t - V_t determines the push rate of the video data and the playback rate of the video, as follows:
1) When -100 ms ≤ A_t - V_t ≤ 100 ms, viewers cannot perceive any audio-video asynchrony; this is the in-sync region. In this case the audio and video buffers push data at the normal rate, and the playback rate remains unchanged.
2) When 100 ms ≤ |A_t - V_t| ≤ 160 ms, this is the critical synchronization region, and a synchronization adjustment is required.
① If 100 ms ≤ A_t - V_t ≤ 160 ms, the audio leads the video: the data in the video buffer is pushed faster and the video playback rate is increased, so that the audio and video timestamps converge.
② If -160 ms ≤ A_t - V_t ≤ -100 ms, the audio lags the video: the push rate of the video buffer is slowed and the video playback rate is reduced, so that the audio and video timestamps converge.
3) When |A_t - V_t| ≥ 160 ms, viewers clearly perceive the audio-video asynchrony; this is the out-of-sync region, and resynchronization is required.
① If A_t - V_t ≥ 160 ms, the audio leads the video severely: the oldest video packets in the video buffer are discarded until A_t = V_t, after which playback restarts at the normal rate.
② If A_t - V_t ≤ -160 ms, the audio lags the video severely: the oldest audio packets in the audio buffer are discarded until A_t = V_t, after which playback restarts at the normal rate.
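The three regions reduce to a small decision function. A minimal C sketch is given below; the enum and function names are ours, while the thresholds in milliseconds are taken directly from the rules above:

```c
typedef enum {
    SYNC_NORMAL,      /* in-sync region: play at the normal rate         */
    SYNC_SPEED_UP,    /* audio leads: push and play video faster         */
    SYNC_SLOW_DOWN,   /* audio lags: push and play video slower          */
    SYNC_DROP_VIDEO,  /* audio leads severely: drop oldest video packets */
    SYNC_DROP_AUDIO   /* audio lags severely: drop oldest audio packets  */
} sync_action_t;

/* Map the difference diff_ms = At - Vt (in milliseconds) to the video
 * adjustment prescribed by the three regions above. */
static sync_action_t sync_decide(int diff_ms)
{
    if (diff_ms >= 160)  return SYNC_DROP_VIDEO;   /* out-of-sync region */
    if (diff_ms <= -160) return SYNC_DROP_AUDIO;
    if (diff_ms >= 100)  return SYNC_SPEED_UP;     /* critical region    */
    if (diff_ms <= -100) return SYNC_SLOW_DOWN;
    return SYNC_NORMAL;                            /* in-sync region     */
}
```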
In operation, the above synchronization control method maintains the time relationship between the audio stream and the video stream well: even during long-running playback, the audio neither leads nor lags, and neither the audio stream nor the video stream plays discontinuously.
The audio playback thread is shown in Fig. 4. When the audio buffer reaches the pre-store length, the audio playback flag is set to true and the program starts reading data from the buffer. Because the audio capture frequency at the sending end of this system is 8 kHz with 16-bit quantization, audio data is read at a rate of 16000 bytes per second; the timestamp of each piece of audio data is assigned to the variable A_t for comparison against the video timestamp, and the audio data itself is fed directly into the decoder for playback.
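A condensed C sketch of this loop is shown below. The 16000 bytes per second follows from the 8 kHz, 16-bit capture; the 20 ms block size and the helper functions (audio_pop, decoder_feed, sleep_ms) are assumptions of the sketch:

```c
#include <stdint.h>

#define AUDIO_BYTES_PER_SEC 16000   /* 8 kHz x 16-bit samples          */
#define AUDIO_BLOCK_MS      20      /* assumed pacing interval         */
#define AUDIO_BLOCK_BYTES   (AUDIO_BYTES_PER_SEC * AUDIO_BLOCK_MS / 1000)

extern volatile int      audio_playing;  /* set once the pre-store fills */
extern volatile uint32_t g_audio_ts;     /* At, read by the video thread */
int  audio_pop(uint8_t *dst, int len, uint32_t *ts); /* buffer read      */
void decoder_feed(const uint8_t *buf, int len);      /* decode and play  */
void sleep_ms(int ms);

/* Audio playback thread: feed the decoder at a constant rate and
 * publish the timestamp At of the block currently being played. */
static void audio_thread(void)
{
    uint8_t  block[AUDIO_BLOCK_BYTES];
    uint32_t At;

    while (audio_playing) {
        if (audio_pop(block, AUDIO_BLOCK_BYTES, &At)) {
            g_audio_ts = At;          /* reference time for video sync */
            decoder_feed(block, AUDIO_BLOCK_BYTES);
        }
        sleep_ms(AUDIO_BLOCK_MS);     /* constant-rate pacing          */
    }
}
```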
The flow of the video playback thread is similar to that of the audio playback thread. The difference is that audio playback only needs adjustment during resynchronization, whereas the adjustment of video playback runs through the entire synchronized playback process: whenever the time difference between the audio stream and the video stream changes, the playback rate of the video stream must be adjusted to achieve audio-video synchronization.
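Correspondingly, a sketch of the video thread could re-evaluate the difference on every frame, reusing sync_decide and sync_action_t from the earlier sketch. The 5 ms rate step and all helper functions are assumptions, and ts_diff_ms stands in for the conversion from RTP timestamp units to milliseconds:

```c
#include <stdint.h>

extern volatile int      video_playing;
extern volatile uint32_t g_audio_ts;       /* At, published by the audio thread */
int  video_peek_ts(uint32_t *ts);          /* timestamp Vt of the next frame    */
void play_next_frame(int interval_ms);     /* decode and display one frame      */
void drop_oldest_video(void);
void drop_oldest_audio(void);
void sleep_ms(int ms);
int  ts_diff_ms(uint32_t at, uint32_t vt); /* timestamp units -> milliseconds   */

/* Video playback thread: stretch or shrink the frame interval on every
 * pass according to the current At - Vt difference. */
static void video_thread(int frame_interval_ms)
{
    while (video_playing) {
        uint32_t Vt;
        if (!video_peek_ts(&Vt)) { sleep_ms(5); continue; }
        switch (sync_decide(ts_diff_ms(g_audio_ts, Vt))) {
        case SYNC_NORMAL:     play_next_frame(frame_interval_ms);     break;
        case SYNC_SPEED_UP:   play_next_frame(frame_interval_ms - 5); break;
        case SYNC_SLOW_DOWN:  play_next_frame(frame_interval_ms + 5); break;
        case SYNC_DROP_VIDEO: drop_oldest_video();                    break;
        case SYNC_DROP_AUDIO: drop_oldest_audio();                    break;
        }
    }
}
```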